Abstract
 

The rapidly growing number of documents available on the World Wide Web makes it difficult both to find useful information and to organize searchable index hierarchies for the contents of the Web. Two main types of search tools have been introduced, but both have major weaknesses that in many cases prevent users from finding what they want. Search engines offer searches across more than 100 million Web pages, but the search they offer is a simple one, based purely on matching automatically indexed keywords. Manually constructed directories let users base their search on context combined with keywords, but cover only a small portion of all available Web pages. These weaknesses are caused by the capacity problem of storing and manipulating extremely large amounts of data and by the high cost of manual indexing. Our goal is to find a way to build a searchable, high-quality index, able to describe all the hundreds of millions of Web pages on the Internet accurately, and to create an efficient, user-friendly way for people to access the index.

Our work begins with a literature study of how the Web is searched and indexed today, and of the computer-supported mechanisms library science has developed over the last century for searching large document collections. We then look at intelligent agents and the possibilities this technology offers for information handling.

The solution we propose is based on reducing the size of index entries and on using 1) autonomous information agents to support classification, and 2) user interface agents to support the search process. We introduce a compact, numeric, hierarchical metadata code system, capable of accurately describing the subject as well as other important properties of Web pages. Autonomous information agents retrieve information about pages on the Web, analyze the contents of what they gather and suggest a code describing each page. The agents' analyses are based on a combination of traditional text analysis techniques and a new technique we suggest. The novel idea of our approach is to take advantage of the hyperlink structure of the Web: we compare the URL of the page we want to classify, and the URLs of pages linking to and linked to by this page, with information already stored in our index. If the index contains information about these or similar URLs, this can be used as an indication of the contents and context of the new page. The agents' analyses of Web pages are checked and corrected by human classification personnel. Our position is that this support of manual work will help classification personnel create a high-quality index efficiently.
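To make the URL comparison concrete, the following sketch shows one way such an agent could turn the link neighbourhood of a new page into a code suggestion. The index contents, the prefix-based URL matching and the numeric codes used here are illustrative assumptions made for this example, not the actual system.

    # Illustrative sketch (not the thesis implementation): suggest a classification
    # code for a new page by comparing its URL, and the URLs of pages linking to it
    # and linked to by it, against URLs already present in the index.

    from collections import Counter
    from urllib.parse import urlparse

    # Hypothetical index: maps already-classified URL prefixes to hierarchical codes.
    index = {
        "www.example-university.edu/physics": "5.2.1",
        "www.example-university.edu/physics/quantum": "5.2.1.3",
        "www.example-journal.org/astronomy": "5.2.4",
    }

    def url_prefixes(url):
        """Yield successively shorter host/path prefixes of a URL."""
        parts = urlparse("//" + url.split("//")[-1])
        segments = [s for s in parts.path.split("/") if s]
        for i in range(len(segments), -1, -1):
            yield "/".join([parts.netloc] + segments[:i])

    def lookup(url):
        """Return the code of the longest indexed prefix of a URL, if any."""
        for prefix in url_prefixes(url):
            if prefix in index:
                return index[prefix]
        return None

    def suggest_code(page_url, linking_urls, linked_urls):
        """Combine evidence from the page's own URL and its link neighbourhood."""
        votes = Counter()
        for url in [page_url] + list(linking_urls) + list(linked_urls):
            code = lookup(url)
            if code:
                votes[code] += 1
        return votes.most_common(1)[0][0] if votes else None

    # A new, unclassified page and its link neighbourhood:
    print(suggest_code(
        "http://www.example-university.edu/physics/quantum/intro.html",
        linking_urls=["http://www.example-journal.org/astronomy/links.html"],
        linked_urls=["http://www.example-university.edu/physics/index.html"],
    ))

A human classifier would then accept or correct the suggested code before it enters the index.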

We suggest having user interface agents assist users with searching in the index. These agents let novice users formulate advanced queries by answering questions posed by the agent, designed to narrow down the search scope efficiently. By combining the document properties the index contains, a strong notion of context can be created, so that searching can be performed in a more accurate and user-friendly fashion than current Internet search tools offer. Expert users are offered access to the index through a user interface based on search forms, but these users are also assisted by agents in locating and ranking the findings. The code hierarchy is navigable, so that all the search types we identify are supported.
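As an illustration of the question-based dialogue, the sketch below shows an interface agent narrowing the subject part of a query by walking down the code hierarchy one level at a time. The hierarchy fragment, the code values and the console dialogue are assumptions made for the example only.

    # Illustrative sketch (assumed structures, not the thesis implementation): a
    # user interface agent that narrows the subject part of a query by asking the
    # user to choose among the children of the current node in the code hierarchy.

    # Hypothetical fragment of the numeric subject hierarchy.
    hierarchy = {
        "": {"1": "Arts", "5": "Science"},
        "5": {"5.2": "Physics", "5.7": "Biology"},
        "5.2": {"5.2.1": "Quantum physics", "5.2.4": "Astronomy"},
    }

    def ask_user(question, options):
        """Stand-in for the agent's dialogue with the user; reads from the console."""
        print(question)
        for code, label in options.items():
            print(f"  {code}: {label}")
        return input("Enter a code (or press return to stop): ").strip()

    def narrow_subject():
        """Descend the code hierarchy until the user is satisfied with the scope."""
        current = ""
        while current in hierarchy:
            choice = ask_user("Which of these best describes your topic?", hierarchy[current])
            if choice not in hierarchy[current]:
                break
            current = choice
        return current or None

    if __name__ == "__main__":
        print("Selected subject code:", narrow_subject())

The resulting code, for example "5.2.1", can then be combined with keywords and the other indexed document properties to form the final query against the index.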

Our position is that, compared to existing search tools, implementing the suggested system will result in two main achievements:

  1. We can build larger high-quality indexes, thanks to agent support of manual work.
  2. We can offer more flexible and powerful navigation and search mechanisms, thanks to how we classify and sort Web pages by subject and certain other properties.
  