
A Needle in a Haystack
by David Forrest

© David Forrest, May 2001

Anyone who looks for information on the Internet has to marvel at its ability to provide content on almost any subject. Search for ‘beekeeping’ or ‘atomic physics’ or ‘Tibetan cooking’ and you will be sure to find something. But the quality and reliability of the results vary widely, and how is one to sort through 50,000 references?

What the Internet has gained in breadth compared to traditional information sources, it appears to have lost in depth. Definitive reference sites vie for attention with the online musings of millions of web users. And while more often than not a search is fruitful, there is more luck than certainty in success. As the web grows exponentially in size, the ability to find authoritative material will be as significant as the opportunity to give everyone a voice.

Cyveillance, Inc., of Washington, D.C., estimated in July 2000 that there were approximately 2 billion publicly accessible web pages, and projected that this number would double by early 2001. Yet the majority of these pages remain unindexed by online search engines. An NEC study published in Nature found that the most comprehensive search engine, Northern Light, indexed only about 16% of the web. Broken links are becoming more common as the Internet grows and changes, and search engines are increasingly unable to keep up.

BrightPlanet.com LLC of Sioux Falls, S.D., has identified a much more extensive collection of online resources, a ‘deep web’ that is unreachable by conventional search engines. It estimated last summer that this hidden resource was about 500 times the size of the visible Internet, comprising more than 500 billion documents. Much of the information on the Internet today is stored in databases, where it is inaccessible to the software used by search engines to build their indexes. This content, BrightPlanet says, is “hidden in plain sight.”

“Searching on the Internet today,” the company says, “can be compared to dragging a net across the surface of the ocean. There is a wealth of information that is deep and therefore missed. The reason is simple: basic search methodology and technology have not evolved significantly since the inception of the Internet.”

BrightPlanet has developed new search technology for identifying, retrieving, qualifying, classifying and organizing ‘deep’ and ‘surface’ content from the web. It completed a study of the hidden Internet last year. Among the most significant findings:

  • There are more than 100,000 deep web sites
  • About 60 of the largest sites collectively contain about 750 terabytes of information, roughly 40 times the size of the visible Internet
  • On average, deep web sites receive about 50% greater monthly traffic than surface sites, and are more highly linked; however, the typical deep web site is not well known to the public
  • The deep web is the fastest-growing category of new information on the Internet
  • Deep web sites tend to be narrower, with deeper content than conventional surface sites
  • Total quality content of the deep web is at least 1,000 to 2,000 times greater than that of the surface web
  • Deep web content is highly relevant to every information need, market and domain
  • More than half of the deep web content resides in topic-specific databases
  • A full 95% of the deep web is publicly accessible information that is not subject to fees or subscriptions

BrightPlanet points out that if the most comprehensive search engine indexes only 16% of the surface web, Internet searchers are currently accessing only 0.03% of available Internet resources.
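
The 0.03% figure follows from two estimates quoted earlier in this article. The short sketch below works through the arithmetic, assuming that the deep web is about 500 times the size of the surface web and that a conventional search engine indexes none of it. The function and variable names are illustrative only and do not come from BrightPlanet.

```python
def indexed_share(surface_coverage: float, deep_to_surface_ratio: float) -> float:
    """Fraction of the whole web (surface plus deep) reached by an engine that
    indexes `surface_coverage` of the surface web and none of the deep web."""
    total_size = 1.0 + deep_to_surface_ratio  # surface web counted as 1 unit
    return surface_coverage / total_size

# Assumed inputs: 16% surface coverage, deep web about 500 times the surface web.
share = indexed_share(surface_coverage=0.16, deep_to_surface_ratio=500)
print(f"{share:.2%}")  # prints 0.03%
```

Counting the surface web in the total or leaving it out makes little difference here: 0.16 divided by 500 and 0.16 divided by 501 both round to 0.03%.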

BrightPlanet has created a portal (www.completeplanet.com) where users can search for relevant deep web sites. The portal also provides a comprehensive directory of some 40,000 sites on the invisible web, organized under 4,000 subject headings. Other online directories of deep web resources include Direct Search, Invisible Web, Lycos Invisible Web Catalog and Web Data.

While these portals can take the user to the ‘front door’ of a promising deep web site, they do not provide the capability of a traditional Internet search engine to search multiple sites at the same time. BrightPlanet has addressed this need by developing LexiBot – a new software program that can search up to 60 deep web sites simultaneously. LexiBot is currently configured to access 600 deep web sites. The software can be downloaded from the company’s web site for a 30-day free trial, after which the user must purchase it for US$89.95.
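
This article does not describe how LexiBot works internally, but the general idea of sending one query to several searchable sources at once and merging the answers can be sketched briefly. The example below is a hypothetical illustration, not BrightPlanet's method: the site names, the in-memory `SITES` table standing in for real databases, and the `search_site` and `search_all` helpers are all assumptions made for the sake of the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for searchable databases; a real tool would submit
# the query to each site's own search form over the network.
SITES = {
    "bee-archive": ["Beekeeping basics", "Hive health records"],
    "physics-db": ["Atomic physics primer", "Beekeeping and pollination data"],
    "recipes-db": ["Tibetan cooking notes"],
}

def search_site(site: str, query: str) -> list[tuple[str, str]]:
    """Return (site, title) pairs whose title contains the query, ignoring case."""
    return [(site, title) for title in SITES[site] if query.lower() in title.lower()]

def search_all(query: str, max_workers: int = 60) -> list[tuple[str, str]]:
    """Query every site concurrently and merge the hits into one sorted list.

    The default of 60 workers simply echoes the figure quoted in the article.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_site = pool.map(lambda site: search_site(site, query), SITES)
        return sorted(hit for hits in per_site for hit in hits)

results = search_all("beekeeping")
for site, title in results:
    print(f"{site}: {title}")
```

A directory portal corresponds to looking up one entry in `SITES` and visiting it by hand. The concurrent version asks every source the same question and returns a single combined list.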

While technology has made it possible for us to create an unprecedented global information resource, it currently stands in the way of our making effective use of it. Since the lure of good information is irresistible, one hopes that it is only a matter of time until there are new standards for information sharing, and a new generation of tools, like LexiBot, that will allow us to access the Internet’s hidden resources. Until then, it seems, we’re still destined to keep looking for a needle in a haystack.

RESOURCES:

BrightPlanet white paper describing the Deep Web (Adobe Acrobat file) –
http://128.121.227.57/download/deepwebwhitepaper.pdf

BrightPlanet largest deep web sites –
http://www.completeplanet.com/topsites/topsites_largest.asp

BrightPlanet LexiBot software –
http://www.lexibot.com/download/transition.asp

CompletePlanet portal for searching the Deep Web –
http://www.completeplanet.com

Cyveillance estimates of Internet size –
http://www.cyveillance.com/web/us/newsroom/releases/2000/2000-07-10.htm

Direct Search –
http://gwis2.circ.gwu.edu/%7Egprice/direct.htm

Invisible Web –
http://www.invisibleweb.com/

Ken Wiseman, Apple Distinguished Educator,
“The Invisible Web for Educators” –
http://www3.dist214.k12.il.us/invisible/article/invisiblearticle.html

Librarians’ Index to the Internet –
http://www.lii.org/

Lycos Invisible Web Catalog –
http://dir.lycos.com/Reference/Searchable%5FDatabases/

The Standard, “Diving Into the Deep Web” –
http://www.thestandard.com/article/0,1902,18134,00.html

Web Data –
http://www.webdata.com/webdata.htm

Copyright © 2001-2008. Innovation Watch is a registered trademark.