2 Searching On the Internet - Background



 

In December 1997, according to an announcement from the HotBot search engine [Sullivan, 1998], there were at least 175 million Web pages that could be reached from any computer connected to the Internet. The number is estimated to pass 1,000 million by the year 2000. In other words, in a couple of years there will be a billion Web pages to choose between when someone wants to find the information they need, fast. As of today, we do not have the tools to handle such amounts of information.
 
 

2.1 The Tools of Today

Already there are several options to choose from when looking for a particular piece of information on the Web. For many, the most efficient tools are link pages, where someone is responsible for maintaining the link list, adding new, interesting links as they appear on the Net and removing links to sites that have ceased to exist. Such link pages most often appear when someone realizes that the collection of links they use in their daily work may be of interest to others, and offers the list to friends and colleagues. Thanks to e-mail, Usenet and word of mouth, soon many people with the same interest know of the link page. Eventually someone volunteers to maintain the links, and a special interest link page is born.

These pages are kept up to date by people with limited resources, especially when it comes to available time. Hence, a link page will only cover a fraction of the information available on its topic. In addition, people who need the information are in many cases not aware that these link pages exist. Therefore, while link pages like these may be the most efficient way to find information, finding the link page you need may be just as difficult as finding the actual information itself.

To cope with this problem, the most useful general search tools are search engines, directories and hybrids between the two [Sullivan, 1998].
 

Hence, there are two main methods for searching for information on the Internet. If you use a directory, you look up information by locating the area of the directory hierarchy that covers your field of interest. The other way is to use a search engine. Here your input is a set of keywords related to the topic you want information about, which the search engine uses in a search through its index. The search results in a number of "hits", meaning information the search engine decides may be of interest to you.

The main advantage of directories is that the user is directly in charge of what kind of pages the search tool offers. On Yahoo!, if you're looking for serious information about the White House, you will find that and nothing else, provided you have navigated to the sub-hierarchy of

Government/Executive Branch/The White House.

If you are looking for less serious bits of information in connection with the White House, you will find that, and nothing else, in the sub-hierarchy

Entertainment/Humor/Jokes and fun/Internet Humor/Web site Parodies/The White House.

If you use a search engine, on the other hand, and you tell it to look for information about "The White House", you will get all kinds of information that mentions the White House (224,506 documents matched "the white house" on Alta Vista, February 7, 1998). You may even innocently be exposed to, for example, adult material on Web pages that contain the text string "The White House", perhaps included just to lure people into the pages.

However, most often search engines are capable of coming up with a sensible set of suggestions for where to find information of relevance to the keywords the user provides for the search. There are two different approaches to text search:

  1. Free-text search, a word-by-word matching against the complete contents of all documents in the index. A search like this is very accurate, and automated software agents can easily collect the raw material for the search, which makes the index much cheaper to create and to keep updated. The disadvantage of free-text searching is that it can never describe the context of a search very accurately (although logical combinations of several text string patterns can give a somewhat accurate search), and the computing cost of traversing large amounts of data looking for a specific pattern can be tremendous.
  2. Keyword searching, where Web pages are related to keywords that either are quoted as keywords by the author of the Web pages, or are words that are repeated so often on the Web page that they can be classified as keywords by an automated text analysis. Unless the user who wants to perform a search knows exactly what keywords to look for, and the author and the searcher share the same idea of what good keywords are, this way of searching can give poor results. Still, keyword searching most often performs better than free-text search. Both approaches are illustrated in the sketch below.
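
To make the difference between the two approaches concrete, here is a minimal Python sketch; it is not the implementation of any actual search engine, and the documents and queries are invented for illustration. The free-text function scans all stored text for a pattern, while the keyword function looks the query up in a pre-built inverted index.

import re
from collections import defaultdict

# A handful of invented documents standing in for an index of Web pages.
documents = {
    "doc1": "The White House is the official residence of the US president.",
    "doc2": "A parody page about the White House.",
    "doc3": "Butterflies of Sardinia, with photographs.",
}

# Free-text search: scan the complete text of every document for a pattern.
# Very accurate, but the cost grows with the total amount of text stored.
def free_text_search(pattern):
    return [doc_id for doc_id, text in documents.items()
            if re.search(pattern, text, re.IGNORECASE)]

# Keyword search: build an inverted index from words to documents once,
# then answer each query with a cheap dictionary lookup.
keyword_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in re.findall(r"\w+", text.lower()):
        keyword_index[word].add(doc_id)

def keyword_search(keyword):
    return sorted(keyword_index.get(keyword.lower(), set()))

print(free_text_search("white house"))   # ['doc1', 'doc2']
print(keyword_search("butterflies"))     # ['doc3']
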
Regardless of the particular search approach, the "hits" presented to the user are ranked according to certain criteria. These ranking criteria have been exploited by Web publishers to make search engine robots report their Web sites as being as interesting and as highly ranked as possible, in order to attract as many visitors as possible. This has turned into quite a problem, as advertising has entered the Net and advertisers pay a small amount of money for each person that is exposed to their ads. The more people a Web page owner can trick into visiting his pages, the more money he makes.
 

2.1.1 Clouds on the Horizon

Until about a year ago, people were quite happy about the search engine situation and most often were able to find a page containing the information they were looking for through the search engines. The largest search engines had practically all the text on all the Web pages in store, and the robots roaming the Net were able to keep the index pretty much up to date. That was then, this is now.

In the summer of 1996, there were an estimated 50-60 million Web pages available on the Internet. Following the steadily increasing growth rate of the Internet, the number of Web pages has since more than tripled. Nicholas Negroponte, head of MIT's Media Lab, has on several occasions claimed that the Web is doubling in size and number of users every fifty days. Unfortunately, the search engines' indexes haven't been able to keep up with this development, and even the largest ones are still only able to index up to about 100 million Web pages. This has led to a situation where users are no longer guaranteed to find the page they are looking for, even if they have enough time to look through all the pages suggested by the search engine of their choice. Actually, if they are looking for one particular page, given the number of pages on the Web compared to the number of pages indexed by most search engines, there is a more than 50% probability that they will not find that page.

Search engine representatives will argue that even though "some" (today roughly one half) of all Web pages cannot be found through their search engines, interesting and related Web pages will be found, and the user should be just as happy not getting too many hits to look through. This sounds, somehow, quite sensible. There is a major catch, though: this reality brings the search engine companies into a situation where they can profit from selectively choosing which pages are to be indexed and which pages are to be kept out of the indexes. In a way, we may come to a situation where the search engine owners decide who gets to exercise their freedom of speech.

Alta Vista changed its slogan from "We index it all!" to "We index the best!" sometime during the winter/spring of 1997. The directory of Yahoo! is still growing, but the percentage of all available pages it covers is shrinking dramatically every hour. In an increasingly important Internet marketplace, being listed in Yahoo! may come to mean life or death to a Web-based business. People have reported submitting their Web pages for review at Yahoo! up to 30 times without actually getting a review and a listing in the directory. Even the robots on the Net, covering several million Web pages every day, may take a long time to discover new sites, due to the size and complexity of the World Wide Web.

The reason for Yahoo!'s popularity is mainly its hierarchical, well-maintained, easy-to-navigate directory. To keep its position as the search tool market leader, Yahoo! will have to "keep up the good work", meaning that the directory must be manually looked after. They cannot automate the submit process, as that would mean that the directory would be garbled by misplaced links and links leading nowhere, submitted by people who either haven't understood how to do it properly or who enjoy messing up systems. Some search engines allow "instant submitting", meaning that robots will be sent directly to the submitted Web site to index it within a matter of hours or a few days. Directories, whose strength is their ability to very accurately classify Web pages in a hierarchy, cannot do this, as the effort required to review and classify a page exceeds the effort of submitting it. Hence, the directories' strength becomes their weakness as soon as there are more people submitting pages to the directory than there are people to handle the submissions.

There is a need for a solution to this problem, so that we can create a World Wide Web where everyone has an equal opportunity to have their Web pages found through some search mechanism. It is not a healthy thing for the Internet community if the creators and owners of Web catalogues and search engines are to decide who gets to present their information. If the book is not on the bookshelves, it cannot be borrowed and read. This is a major issue for this thesis.
 
 

2.2 Providing Context to Information

As we have just seen, a directory's ability to provide a hierarchy of topics for the classification of Web pages is what makes directories easier to use and more popular than search engines. This is mainly because it simply makes it easier to find what you need; directories let you search within a context, removing irrelevant information from the landscape along your search path.

Other ways to create contextual directories than through manual, human indexing have been suggested. Two projects in particular have been met with enthusiasm in the international research community: the Dublin Core and the Meta Content Framework. The general idea in both of these projects is that all documents/objects should be equipped with a "tag", a container for information describing the context in which the document/object was created. This kind of information is called metadata. Both concepts are meant to be used with all kinds of information, not only information that can be reached through the World Wide Web today. A meta-description has to be easy to understand and to compute, and should generally demand as few resources to handle as possible.

The use of metadata is not new [Sølvberg, 1997]. In the database world different "schemes" are used to describe information elements and relations between them. The term "metadata" has lately been given a more specific meaning, mainly in the Digital Library community, where it is used to denote formats for describing online information resources. Here the concept "metadata" has been given several definitions:
 

A Web document can have a lot of meta information, spanning from information about the subject of the document to its file format and size. For our use, we will mainly be interested in meta information that describes the document properties that can be most useful to allow for searching within a context.
 

2.2.1 The Dublin Core

The Dublin Core (DC) is a very general description of a metadata set which has been further developed through additional workshops, mainly in the Warwick Framework [Warwick, 1996]. The goal of DC was to "provide a minimal set of descriptive elements that facilitate description and automated indexing of document-like networked objects". It is also a goal that these elements should be simple enough to be understood and used, with no extensive training, by anyone who might be interested in supplying their own "document-like objects" with DC element codes. The elements suggested in the Dublin Core (with modifications to make it less text-centered) are:
 
 
Title: The name of the object
Author/Creator: The person(s) primarily responsible for the intellectual content of the object
Subject/Keywords: The topic of the object, or keywords, phrases or classification descriptors that describe the subject or content of the object
Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources
Publisher: The agent or agency responsible for making the object available
Other Contributors: The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work
Date: The date of publication
Object Type: The genre of the object, such as novel, poem, dictionary, etc.
Format: The data representation of the object, such as a PostScript file
Identifier: A string or number used to uniquely identify the object
Relation: The relationship to other objects
Source: The objects, either print or electronic, from which this object is derived
Language: The language of the intellectual content
Coverage: The spatial locations and temporal duration characteristic of the object
Rights Management: Intended to be a link (a URL or other suitable URI) to a copyright notice, a rights-management server, etc.
Table 2-1, Dublin Core descriptive elements
 

All elements can be multi-valued. For example, a document may have several author elements or subject elements. Also, all elements are optional and can be modified by one or more qualifiers.
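
Purely as an illustration (the Dublin Core does not prescribe any particular encoding or data structure), a record with multi-valued and omitted elements could be held in memory as in the following Python sketch; the record contents are loosely borrowed from the MARC example later in this chapter.

# Hypothetical in-memory representation of one Dublin Core record.
# Every element may carry several values, so each maps to a list;
# elements that are not supplied are simply left out, since all are optional.
dc_record = {
    "Title": ["Developing intelligent agents for distributed systems"],
    "Author/Creator": ["Knapik, Michael", "Johnson, Jay"],        # multi-valued
    "Subject/Keywords": ["Intelligent agents", "Distributed processing"],
    "Date": ["1997"],
    "Language": ["eng"],
}

def element_values(record, element):
    # Return all values of an element, or an empty list if it is absent.
    return record.get(element, [])

print(element_values(dc_record, "Author/Creator"))   # two creators
print(element_values(dc_record, "Coverage"))         # [] -- element omitted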

This table is basically the result of the March 1995 Dublin Metadata Workshop, and was only intended as an initial step towards defining a core descriptive metadata set. It has been criticized on several points, but it does provide a basis for further discussion concerning metadata. For now, I would like to remark that the 15 elements, although each has a purpose, cover more meta-information than is useful for the majority of Web pages in existence today, while at the same time leaving out areas of metadata that would be of interest for certain Web pages.

The Warwick Workshop, building on the Dublin Core, suggests that new metadata sets will develop as the networked information infrastructure matures. As more proprietary information is made available for purchase and delivery on the Internet, the need for a suitable metadata set will push development in this area forward. Metadata that may be of special interest for Web objects in these cases are [Warwick, 1996]:
 

The result of the Warwick Workshop, the Warwick Framework, is a proposal for a container architecture, a mechanism for aggregating distinct packages of metadata. This means a modularization of the metadata issue, so that designers from different areas of interest can choose a set of metadata that suits their information objects best, and have it included in the Warwick Framework for various operations. Even future metadata sets can be incorporated in the framework.

An implementation in HTML, the common formatting language used on the WWW, is among the implementations outlined in the workshop papers. To make it as easy as possible to introduce a new metadata format to the World Wide Web, it should be possible to start using it without requiring any changes to either Web browsers or HTML editors. A solution that follows this precaution and conforms to HTML 2.0 was proposed at the May 1996 W3C-sponsored Distributed Indexing/Searching Workshop in Cambridge, Massachusetts. The implementation takes advantage of two tags: <META> and <LINK>.
 

An example of this implementation can look like this:

<HTML>
<HEAD>
<TITLE>Example Document with Metadata </TITLE>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
<LINK REL="Schema.Meta" HREF="http://meta.idi.ntnu.no/meta.html">
<LINK REL="META.FORMAT" HREF="http://meta.idi.ntnu.no/metadefinition/">
</HEAD>
<BODY>
Insert the document with contents as described in the metadata above here.
</BODY>
</HTML>

A Web spider familiar with the Warwick Framework will, in addition to gathering the contents of the body of this HTML document, also be able to index the title, author and creation date of the document. This is done according to the metadata scheme Meta, which can be found at meta.idi.ntnu.no. There is also a pointer to where human readers can find a description of the metadata schema used here.
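
As a rough sketch of how a spider could pick up these elements, the following Python fragment extracts the Meta.* META tags from the example document using the standard library's html.parser module. It is not based on any particular spider's actual code, and the scheme prefix "Meta." is simply taken from the example above.

from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    # Collects NAME/CONTENT pairs from <META> tags belonging to the "Meta" scheme.
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":                      # html.parser lower-cases tag names
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content and name.startswith("Meta."):
                self.metadata[name] = content

html_document = """
<HTML><HEAD>
<TITLE>Example Document with Metadata</TITLE>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
</HEAD><BODY>...</BODY></HTML>
"""

collector = MetaCollector()
collector.feed(html_document)
print(collector.metadata)
# {'Meta.Title': 'Example document', 'Meta.Author': 'BC Torrissen',
#  'Meta.DateCreated': '26111997'}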
 

2.2.2 The Meta Content Framework (MCF)

The goal of MCF is to provide a common, basic way to abstract and standardize the representation of the structures we use for organizing information. Today we have e-mail applications for keeping track of our e-mail, we have word processors/viewers to handle our documents, we have Web browsers for handling Web pages and a few other Internet protocols, and so on. What Guha [Guha, 1997], who has developed MCF almost single-handedly at Apple Research, is trying to come up with is a way to give all these objects a meta description. With a set of metadata like this, we can gather information from files in various formats and keep track of them within the same application environment, or within the same "information management and communications application" (IMA), as he writes. After all, the reason information is accessed through different applications is not that the information itself is different from application to application, but that it is formatted and organized by different protocols.

The main applications of the MCF project have so far been Apple's HotSauce and ProjectX, which provide a new way of visualizing and navigating through hierarchically stored information, whether it resides on the Web or on a single computer. Apple has officially dropped the research on MCF, but the concept has gained many enthusiastic followers and seems to live on. One of the largest experiments has been to convert the Yahoo! directory to MCF, making it possible to navigate Yahoo! by "flying" through a three-dimensional information space, as shown in Figure 2-2. By moving in close to a category, the category opens and sub-categories and actual documents appear.
 

Figure 2-2, Screenshot of the Yahoo! Web site as visualized by Apple's HotSauce
 

The core of the MCF is the .mcf-file, containing meta information about the contents of the documents that the file is to cover. These files are generated from data manually produced by human users. MCF provides an SQL-ish language for accessing and manipulating meta content descriptions, as well as a standard vocabulary for terms to describe the document's attributes, such as "author" and "fileSize". Users can choose to use their own terms if they like. If they do, however, integrating their content information with others' will be more difficult.

MCF is fully scalable, meaning that the same architecture is to be used, whether it is for holding meta content information for a single computer or for the whole Internet. It is also designed to minimize the up-front cost of introducing the new technology for developers of existing applications. MCF does not aim to replace existing formats for exchanging and storing meta content. Instead, information in existing formats can be assimilated into richer MCF structures, thus making it possible to combine information from several formats into a larger MCF-based index.
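
The sketch below is only meant to illustrate this idea of assimilating records from different source formats into one combined index; the field names and mappings are invented for the example and are not MCF's actual vocabulary or file syntax.

# Invented, simplified converters from two existing metadata formats
# into one common record structure -- not MCF's real vocabulary.
def from_dublin_core(record):
    return {"title": record.get("Title", [""])[0],
            "authors": record.get("Author/Creator", [])}

def from_marc_fields(fields):
    # NORMARC-style field codes: 100 = main entry (author), 245 = title.
    return {"title": fields.get("245", ""),
            "authors": [fields.get("100", "")]}

combined_index = [
    from_dublin_core({"Title": ["Example document"],
                      "Author/Creator": ["BC Torrissen"]}),
    from_marc_fields({"100": "Knapik, Michael",
                      "245": "Developing intelligent agents for distributed systems"}),
]

for entry in combined_index:
    print(entry["title"], "--", ", ".join(entry["authors"]))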
 
 

2.3 How It Used to Be Done

How to quickly locate relevant information in a large body of information is not a new challenge. Archives and libraries of different kinds have existed for thousands of years, and suggestions for how to categorize and index books, magazines and other publications have been many. For paper-based information, librarians today seem to be satisfied with the systems most widely in use: Dewey- and MARC-based cataloguing tools.
 

2.3.1 The MARCs

The MARC (MAchine-Readable Cataloguing) format was developed in the 1960s as a standard format for the exchange of library catalogue records. The Library of Congress in Washington, D.C. maintains the most widely used MARC system, USMARC, in consultation with various user communities. This format is a very detailed set of codes and content designators defined for encoding machine-readable records, well suited for computer processing.

A number of additional dialects of MARC exist, both for national and international communities, but the basic idea remains the same in all MARCs. In USMARC, formats are defined for five types of data: Bibliographic, Authority, Holdings, Classification and Community information. Within these types a number of fields are defined, and may contain all kinds of information about the documents. For example, for bibliographic data, [MARC, 1996] codes are assigned like this:

0XX = Control information, numbers, codes
1XX = Main entry
2XX = Titles, edition, imprint
3XX = Physical description, etc.
4XX = Series statements
5XX = Notes
6XX = Subject access fields
7XX = Name, etc. added entries or series; linking
8XX = Series added entries; holdings and locations
9XX = Reserved for local implementation

As the name indicates, the MARC system is intended for the interchange of bibliographic information between computer systems. However, creating a high-quality MARC record requires skilled personnel experienced in the use of cataloguing rules. The motivation for creating MARC records is that if every library keeps a list of its resources in this format, information from several libraries can be collected and an index of all available resources from all libraries in a specific region can be created. This provides a great tool for locating information from wherever in the region it is available. Below is an example of a typical MARC record, taken from the Norwegian MARC dialect, NORMARC:

*001972095632
*008   eng
*015   $alc97024364
*020   $a0-07-035011-6
*082   $c006.3
*100   $aKnapik, Michael
*245   $aDeveloping intelligent agents for distributed systems
       $bexploring architecture, technologies, and applications
       $cMichael Knapik, Jay Johnson
*260   $aNew York$bMcGraw-Hill$cc1997
*300   $ap. cm.
*650   $aIntelligent agents (Computer software)
*650   $aElectronic data processing$xDistributed processing
*650   $aComputer software$xDevelopment
*700   $aJohnson, Jay$d1957
*096c  $aRMH$n97c016905

The three-digit codes indicate what kind of information follows on the line, each $ marks the start of a subfield, and then the actual information is given. In the example above, the line reading "*082 $c006.3" tells us that the book is classified under 006.3 in the Dewey classification system.
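
For illustration only, here is a rough Python sketch that reads a listing in exactly the layout shown above (an asterisk, a field code, then $-prefixed subfields, with continuation lines indented); real MARC processing is based on the ISO 2709 exchange structure and is considerably more involved.

# Parse the human-readable NORMARC listing shown above into (field, subfields) pairs.
def parse_normarc_lines(lines):
    fields = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("*"):                 # new field, e.g. "*245   $a..."
            code, _, body = line[1:].partition(" ")
            fields.append((code, body.strip()))
        elif fields:                             # continuation line, e.g. "$b..."
            code, body = fields[-1]
            fields[-1] = (code, body + line)
    # Split each field body on "$" into subfield code / value pairs.
    return [(code, [(part[0], part[1:]) for part in body.split("$") if part])
            for code, body in fields]

record = [
    "*100   $aKnapik, Michael",
    "*245   $aDeveloping intelligent agents for distributed systems",
    "       $bexploring architecture, technologies, and applications",
    "*082   $c006.3",
]

for code, subfields in parse_normarc_lines(record):
    print(code, subfields)
# 100 [('a', 'Knapik, Michael')]
# 245 [('a', 'Developing intelligent agents for distributed systems'),
#      ('b', 'exploring architecture, technologies, and applications')]
# 082 [('c', '006.3')]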

As we see, the MARC format is similar to the aforementioned Dublin Core.
 

2.3.2 The Dewey Decimal Classification

The Dewey Decimal Classification (DDC) is not an alternative to MARC, but a tool for categorizing information into different topics. In this way, it is just a subset of the information contained in the MARC system. Still, the DDC is a very much appreciated tool in the library world. For reasons that will become apparent, we will take a closer look at the way DDC is organized.

The DDC was invented and published by the American librarian Melvil Dewey in the mid-1870s [Dewey, 1994]. It was originally devised as a system for small libraries to catalogue books, but has since also been used in larger settings than the local and school libraries that were the first to adopt the system extensively. The system is based on ten main classes of subjects (000-999), which in turn are further subdivided. The main classes, which are meant to cover all human knowledge, are:

000-099 Generalities
100-199 Philosophy & Psychology
200-299 Religion
300-399 Social Science
400-499 Language
500-599 Natural Sciences & Mathematics
600-699 Technology (Applied Sciences)
700-799 Arts & Entertainment
800-899 Literature
900-999 Geography & History

Each of the ten classes is further divided into ten subclasses. For example, the 700's are divided into these subclasses:

700-709 The Arts
710-719 Civic & Landscape art
720-729 Architecture
730-739 Plastic Arts, Sculpture
740-749 Drawing & Decorative arts
750-759 Painting & Paintings
760-769 Graphic Arts, Prints
770-779 Photography & Photographs
780-789 Music
790-799 Recreational & Performing Arts

The subdivision of classes continues for as long as necessary to describe very precise topics. An example is "butterflies", which ends up as Dewey Decimal Classification 595.789, deduced along the path: Natural Sciences (500) --> Zoological Sciences (590) --> Other Invertebrates (595) --> Insects (595.7) --> Lepidoptera (595.78) --> Butterflies (595.789).

Dewey has several advantages that make it easy to use. The codes are uniformly constructed, with room for describing any particular topic. "Birds found in Italy" can be constructed as 598.0945, since 598 means Aves/Birds, the .09 indicates geographical treatment, and 45 is the geographical Dewey code for Italy. Adding an additional 9 at the end, making it 598.09459, limits the topic to Sardinian birds. This composition is illustrated in the sketch below.
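
The composition of such a number can be shown in a tiny Python sketch. The codes below are restricted to the worked example in the text (598, the .09 subdivision and the area codes for Italy and, following the example above, Sardinia); they are not a general table of Dewey codes.

# Compose a Dewey number the way described in the text: base class,
# then the geographic-treatment subdivision, then an area code.
BIRDS = "598"                    # Aves / Birds
GEOGRAPHIC_TREATMENT = "09"      # standard subdivision for geographic treatment
AREA_CODES = {"Italy": "45", "Sardinia": "459"}   # only the codes used in the text

def dewey_for_birds_of(place):
    return BIRDS + "." + GEOGRAPHIC_TREATMENT + AREA_CODES[place]

print(dewey_for_birds_of("Italy"))      # 598.0945  -- birds found in Italy
print(dewey_for_birds_of("Sardinia"))   # 598.09459 -- Sardinian birds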

Although one may at first be somewhat sceptical about the idea, since it seems to require remembering a mess of numbers before the system can be used for anything at all, the system is actually very easy to use. The main reason for this is that the codes are conceptually well organized, so you do not need to know any codes at all to start searching for the information you need. Knowing the codes simply makes gathering information from the system faster. The codes are well maintained centrally by Library of Congress personnel, and one code will never have more than one meaning.

As of today, DDC numbers appear in MARC records issued by countries throughout the world and are used in a multitude of national bibliographies, including those of Australia, Botswana, Brazil, Canada, India, Italy, Norway, Pakistan, Papua New Guinea, Turkey, the United Kingdom, Venezuela and other countries. Relatively up-to-date comprehensive translations of the DDC are available in Arabic, French, Greek, Hebrew, Italian, Persian, Russian, Spanish and Turkish. Together with English, these languages are spoken by more than 1.1 billion people today [WABF, 1996]. Less comprehensive but still fairly detailed translations exist in a large number of other languages. A number of tutorials and guides to the DDC are also available.

Some general advantages of the use of the Dewey Decimal Classification are:

A similar classification system, the Universal Decimal Classification (UDC), was designed in the late 1890s by the International Federation for Information and Documentation. This was a mainly European effort, which had permission from Dewey to translate and adapt the DDC for the purpose of preparing a universal bibliography. The principles behind the two code systems remain the same, and both UDC and DDC are in use in many libraries worldwide today.
 
 

2.4 Problem Summary

So far in this background chapter we have identified some of the main problems that need to be addressed in order to come up with a solution to the problem of how to perform high-quality information searches on the Internet, within a context. The problems are summarized here, not necessarily in any order of importance:

In this thesis the focus will be on the first of these problems, the problem of context. The other problems will not be ignored, but will serve as guidance when making decisions concerning the suggested solution to the problem of context.

From the nature of some of these partial problems, it is obvious that the job that needs to be done is too large for a relatively small group of people to handle. It also seems as if manual work of human quality is required. This is not a new situation for mankind; we have always come up with technology for doing things we could not do ourselves before. Let us take a closer look at what it is, exactly, that we want to have done.
 

