Tuesday, August 7, 2007

Those Dark Hiding Places: The Invisible Web Revealed

- Robert J. Lackie, Associate Professor-Librarian, Rider University

"If only I had known!" was the bitter cry of the searcher who relied only on search engines to search the Web. Although many popular search engines boast about their ability to index information on the Web, more and more of it (dynamically generated pages, certain file formats, and information held within numerous databases) has become invisible to their searching spiders. Much of the Web is hiding information from us, but we can access this hidden content! Learn how you can reveal the secrets of these dark hiding places.

"The Web," according to Chris Sherman, Internet search expert and Associate Editor of SearchEngineWatch.com, "is increasingly moving away from being a collection of documents and becoming a multidimensional repository for sounds, images, audio, and other formats." Because much of this information is not accessible to many general search engines' software spiders, we need to look for specific search tools that will lead us to this hidden content. Some of these tools include directories, searchable sites, free Web databases, and a few general and many specialized search engines. Begin searching with...

  • Directories and Portals when you:
    • have a broad topic
    • want selected, evaluated, and annotated collections
    • prefer quality over quantity
  • Invisible or Deep Web [searchable sites and databases] when you:
    • are looking for information that is likely in a database
    • are looking for information that dynamically changes in content
  • Search engines [general and specialized] when you:
    • have a narrow topic
    • want to take advantage of the newer retrieval technologies

Directories are Web sites that provide a large collection of links, arranged according to a classification scheme that enables browsing by subject area. I really like directories, but what I want to point out right away is that I am not against using search engines. I consider directories to be complements to search engines, not their replacements. However, there is a trend developing toward the use of directories because, in addition to their classification, their content is pre-screened, evaluated, and annotated by humans. Sometimes, though, this annotation and classification process makes the information not as timely as it could be. This is usually true in very large directories, so look at several, large and small. Let's look at a few smaller, more selective directories that can also lead you to some of the Web's hidden content.

  • Librarians' Internet Index (http://lii.org/) - Websites You Can Trust: LII offers a searchable and browsable collection of over 20,000 quality websites, "maintained by librarians and organized into 14 main topics and nearly 300 related topics," in addition to an excellent weekly newsletter [they have over 40,000 subscribers in many countries], available by email or RSS, of high-quality Websites related to current events, holidays, and popular and important issues. New features added with their Fall 2005 upgrade include icons following the titles that allow you to view more details about, comment on, or e-mail the site. Of course, you can also use LII to find Invisible Web databases by typing in a broad topic and adding the words "and databases" (e.g., biology and databases).
  • FindLaw (http://www.findlaw.com/) - "The highest-trafficked legal Web site," FindLaw provides "the most comprehensive set of legal resources on the Internet for legal professions, businesses, students and individuals." To find an annotated list of free databases on many law-related topics, from their main page, click on the "For Legal Professionals" tab at the top, click on the "Practice Areas" link under the "Research the Law" section, pick a practice area/legal subject heading (e.g., "Health Law"), and then look for "Databases" under the Web Guide for that legal subject heading.
  • InfoMine (http://infomine.ucr.edu) - This scholarly resource collection includes tens of thousands of sites, grouped into 9 annotated, indexed categories (databases) for easy retrieval. This librarian-built "virtual library of Internet resources [is] relevant to faculty, students, and research staff at the university level," and it is also very useful for advanced high school students and professionals.
  • About.com (http://www.about.com/) - This portal, visited each month by more than 29 million people, neatly organizes thousands of topics, including the Invisible Web, with good news and commentary. Try typing "Invisible Web" as a phrase in quotes to find many links to hidden content on the Web, including "Invisible Web: The Cloaked Internet," "Visible versus Invisible Web," their new "The 'Cloaked' or 'Deep' Web, Explained," from their Internet for Beginners guide, and "Invisible Web Gateways." You will see links to other pertinent articles, too--all worth reading & exploring.

Chris Sherman states that "vast expanses of the Web are completely invisible to general purpose search engines," but there are ways "to find the hidden gems search engines can't see."

Some Recommended Links to Invisible Web Databases:
  • ResourceShelf (http://www.resourceshelf.com/) - Gary Price, MLIS, of Gary Price Library & Internet Research Consulting, one of the foremost authorities on invaluable Invisible Web resources, has assembled a massive collection at his Direct Search (http://www.freepint.com/gary/direct.htm) found on his "ResourceShelf" Weblog & Newsletter site for information professionals and online researchers. Other well-known Web research tools, including "Price's List of Lists," are included on the left-hand list of links (below the list of dates).
  • CompletePlanet (http://www.completeplanet.com/) - BrightPlanet Corporation's site contains "70,000+ searchable databases & specialty search engines." Check out their Help/FAQs tab (http://aip.completeplanet.com/aip-engines/help/index.jsp), too.

Although there are thousands of Invisible Web databases available to us for free on the Web, below I have listed a few of my favorites:

  • AnimalSearch (http://animalsearch.net/) - A database of family-safe animal-related sites that you can also search by group, type, and geographic region.
  • Educator's Reference Desk (http://www.eduref.org/) - This site contains 2000+ lesson plans, 3000+ links to value-added online education information, and an archive of 200+ questions collected on the award-winning AskERIC site during the past decade. This site also provides access to the ERIC database--the world's largest source of information on education research & practice, including free, full-text expert digest reports, and it also links you to the Gateway to Educational Materials (GEM), which "provides quick and easy access to over 40,000 educational resources found on various federal, state, university, non-profit and commercial Internet sites."
  • NatureServe Explorer (http://www.natureserve.org/explorer) - This online encyclopedia provides authoritative "information on more than 65,000 plants, animals, and ecosystems of the United States and Canada. Explorer includes particularly in-depth coverage for rare and endangered species."
  • Nuclear Explosions Database (http://www.ga.gov.au/oracle/nukexp_form.jsp) - Geoscience Australia's database provides location, time, & size of explosions worldwide since 1945. Click on "databases" under "Online Tools" to see a list of other searchable online mapping tools & databases.
  • On-Line Encyclopedia of Integer Sequences (http://www.research.att.com/~njas/sequences/) - "Type in a series of numbers and this database will complete the sequence and provide the sequence name, along with its mathematical formula, structure, references, and links."
  • PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) - Provides access to 16 million+ MEDLINE citations, including links to full text articles & related resources. You will also want to explore PubMed Central (PMC), an e-archive of free, full text articles from 190+ life sciences journals, as well as Bookshelf, "a growing collection of [full text] biomedical books (40+) that can be searched directly." They now offer a "new global NCBI 'Entrez' search engine" where you can search across their many life sciences databases, too.
  • FindArticles (http://www.findarticles.com/) - The FindArticles database is an updated replacement of their original free, searchable article Web archive, with the current service now searching 10 million+ articles from "leading academic, industry and general interest publications. We give you free access to information you can trust, from a collection you'll only find here." You can also find magazines and articles by topic, and you can explore all publications by title or limit your search to "free articles only."
  • MagPortal.com (http://magportal.com/) - MagPortal.com is another site for finding freely available magazine articles on the Web, using keyword searching or category browsing methods. Indexing a little over 200 magazines, their focused content allows them to update with new articles within days of them becoming available. The material is of good quality, and their Hot Neuron Similarity software package allows them to measure the similarity between articles, linking similar articles to each other.
  • Directory of Open Access Journals (http://www.doaj.org/) - Hosted by the Head Office of Sweden's Lund University Libraries and launched in May 2003, this "one-stop shopping" open access directory provides no-cost access to the full text of over 2,400 journals, with over 700 journals searchable on the article level (over 110,000 articles available)--in the sciences and humanities/social sciences--and the directory is continually growing in size.
  • HighWire Press: Free Online Full-text Articles (http://highwire.stanford.edu/) - Launched in early 1995, Stanford University Libraries' HighWire Press hosts what it claims to be the "world's largest and fastest growing archive of free full-text life science articles in the world, with over 1.4 million free, full-text articles [almost 1,000 journals are free in some way] online." I like how it also provides very quick full-text access to your institution's journal subscriptions to HighWire-affiliated journals via IP address recognition when using a computer workstation within your library/institution--journals to which you probably did not even know that you had access! (Click on the "For Institutions" tab at the top and follow the directions.) You can also browse by topic or alphabetically on this page--you will be impressed!

By the way, if you like viewing accompanying Web sites from excellent books on Web research, you may also want to visit the Super Searchers Web Page (http://www.infotoday.com/supersearchers/), which "features a growing collection of links to subject-specific Web resources recommended by the world’s leading online searchers" in global business, primary research, mergers/acquisitions, news, writing, health/medicine, investment, business, entrepreneurial research, & legal information resources. The books and their Web sites can lead researchers to a wealth of hidden resources.

Some general and specialized search engines, like those listed below, can help you locate specific information or certain file formats, so I like to go to them first. I do use several search engines for research, but they are not all created equal when it comes to uncovering data in the Invisible Web domain. A great site for keeping up-to-date on search engines is Search Engine Watch (http://www.searchenginewatch.com/). Another great site on search engines is Search Engine Showdown (http://www.searchengineshowdown.com/). Let's explore these two sites and general & specialized search engines that allow us to find some Invisible Web data. Immediately below are a few interesting specialized search engine services/sites.

  • Singingfish (http://www.singingfish.com) - A superior audio/video search engine, Singingfish "only indexes multimedia formats, including Windows Media, Real, QuickTime, and mp3s." Their content is free, and you can search for both audio and video or for just one type of media.
  • Google News (http://news.google.com/) - This award-winning automated service (no Google editors) scours the Web every 15 minutes, capturing news from 4,500+ sources. Recently, Google News added a new feature: a "Top Stories" drop-down menu that allows us to select the top news stories from several different countries. Note: Yahoo! News, Topix.net, and Daypop are also impressive news-aggregating services with special features.
  • Scirus (http://www.scirus.com/srsapp/) - This science search engine, with over 250 million science-specific Web pages, offers excellent advanced search options for a wide variety of information types and sources of materials on the Web, including journals. Scirus has become pretty successful at pinpointing science-specific data, reports, articles, and relevant scholarly Web pages--a considerable recent improvement. Check out their Advanced Search page, as well as their About Us links.
  • UFOSeek: The UFO and Paranormal Search Engine (http://www.ufoseek.com/) - "Yes, Mulder, the truth is really, um, out there, and you can find it using this paranormal/UFO search engine," currently indexing over 110,000 Paranormal, Spiritual and UFO sites in their system.

We know that information on some sites is presented in formats other than static HTML, which gives search engines a problem. Adobe Portable Document Format (PDF) has been an example of this. If HTML text that accompanies the PDF file describes the file well, you may find the site, but if the site provides unhelpful headings or titles, then the file is pretty much "invisible." This is also true for Flash files, for instance. Fortunately for us, a few general search engines are more easily bringing some PDF, Flash, and other non-HTML files to our desktops.

  • Google (http://www.google.com/) - Still the most popular general purpose search engine on the Web, Google allows you to go to the page as it is currently on the Web, or go to a cached copy Google stored when it retrieved the page (nice when the current page won't connect). In addition, Google allows you to find those Invisible Web documents: PDF files. You can also view them in HTML (nice when you have a slow connection or the PDF is so large that you don't want to wait for it to display). From Google's Advanced Search, you will see that in addition to allowing you to limit your search to finding PDF files, you can limit or exclude other file formats, such as Postscript; Microsoft Word, Excel, or PowerPoint; & Rich Text formats. Check out their "Google Web Search Features" and "Google Labs" for other interesting items (like "Google Maps" with their satellite imaging), and visit the Google Scholar (Beta) site (http://scholar.google.com/) to search for some "articles from a wide variety of academic publishers, professional societies, preprint repositories and universities, as well as scholarly articles available across the web." Note: Google claimed (in August 2005) to track 11.3 billion objects--which consist of some 8.2 billion Web pages and 2.1 billion images, as well as material from its group discussions--but it no longer lists figures on its main pages.
  • Yahoo! Search (http://www.yahoo.com/) - Google's biggest competitor since dropping them as a partner, Yahoo! (selected in spring 2005 by Search Engine Watch as the "2004 Outstanding Search Service Winner") also provides cached copies and locates Word, Excel, PowerPoint, PDF, and RSS/XML files. Yahoo! also has full Boolean searching capability after purchasing the AlltheWeb and AltaVista search engines, so it looks like Google is going to be keeping an eye on Yahoo!'s continued aggressive progress. Check out their interesting "Search Shortcuts" for fun ways to quickly find everyday information, as well as their new "Yahoo! Search Subscriptions" in beta (http://search.yahoo.com/subscriptions), which enables you to search access-restricted content such as news and reference sites that are normally not accessible to search engines. Note: Yahoo! (in August 2005) stated that its index covered 20.8 billion online objects, made up of about 19.2 billion documents and 1.6 billion images--partly because of a 2005 upgrade--but, like Google, Yahoo! does not list figures on its main pages.
  • Gigablast (http://www.gigablast.com/) - An interesting up-and-coming search engine, Gigablast also locates Word, Excel, PDF, and other non-HTML files, and like Google and Yahoo!, it provides cached copies (the most recent "archived copy") of these files. It also links you to multiple "older copies" via The Internet Archive Wayback Machine. In addition, it provides full Boolean searching, so keep an eye on Gigablast, too.
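
The file-format limits that Google's Advanced Search offers can usually be typed straight into the search box as well. The short Python sketch below is not part of the original workshop; it only illustrates the idea. It assumes the `filetype:` query operator (with a leading minus sign to exclude a format) and the `https://www.google.com/search?q=` URL pattern, and the helper names `build_filetype_query` and `build_search_url` are hypothetical, made up for this example.

```python
from urllib.parse import urlencode


def build_filetype_query(terms, include=None, exclude=()):
    """Build a query string that limits or excludes file formats.

    Assumption: the engine understands the `filetype:` operator and a
    leading minus sign for exclusion.
    """
    parts = [terms.strip()]
    if include:
        # Limit results to one file format, e.g. PDF documents.
        parts.append(f"filetype:{include}")
    for ext in exclude:
        # Exclude formats you do not want to see.
        parts.append(f"-filetype:{ext}")
    return " ".join(parts)


def build_search_url(query, base="https://www.google.com/search"):
    """Percent-encode the query and attach it to an assumed search URL."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    query = build_filetype_query('"invisible web"', include="pdf")
    print(query)
    print(build_search_url(query))
```

Pasting the printed URL into a browser should run the same kind of search you would otherwise set up by hand on the Advanced Search page. Other engines may use a different operator, so check each engine's help pages before relying on it.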

Below are a few of my recent articles on the invisible/hidden web (and other education-related topics) for your review; other articles/presentations can be found at my Robert J. Lackie's Selected Online Materials (http://www.robertlackie.com/rlackieepub.html) page:

Send comments or questions about this workshop and/or Web site to Robert J. Lackie (rlackie@rider.edu). Please also write if you would like permission to link to Those Dark Hiding Places: The Invisible Web Revealed (http://www.robertlackie.com/invisible/index.html).

Source: http://www.robertlackie.com
