8 Implementation Aspects



 

Throughout the chapters, we have presented many aspects of the technology behind the kind of search tool we outline in this thesis. Before we can conclude the thesis, some loose threads need to be tied up, strengthened or cut off.
 
 

8.1 Webrarians are Agents

We have claimed that the webrarians are agents. Our agency employs two main types of agents: webrarians that retrieve and pre-classify Web pages, and webrarians that assist the user at the search interface. We will now show that the webrarians meet the agent definition from Chapter 4.1, and explain what possibilities the use of agent technology opens up for our system.
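
The choice of implementation language is discussed in 8.1.3 below, but the property all the webrarian types share, an autonomous sense-act loop that no outside process drives, can already be sketched. The sketch below is in Python, and every name in it is illustrative rather than part of the EDDIC design:

    import time

    class Webrarian:
        """Minimal agent skeleton: each webrarian runs on its own,
        senses its environment (the Web, the index or a user) and
        acts on what it finds. All names here are hypothetical."""

        def __init__(self, name):
            self.name = name
            self.alive = True

        def perceive(self):
            """Gather input, e.g. fetch a Web page or read a query."""
            raise NotImplementedError

        def act(self, percept):
            """React, e.g. pre-classify the page or answer the query."""
            raise NotImplementedError

        def run(self, poll_interval=60):
            # The agent loop: the webrarian decides for itself when and
            # how to act; nothing outside it drives the loop.
            while self.alive:
                percept = self.perceive()
                if percept is not None:
                    self.act(percept)
                time.sleep(poll_interval)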
 

8.1.1 Retrieval and Pre-classification Agents

The information retrieval webrarians:

The Web page pre-classification webrarians:

8.1.2 Search Interface Agents

The search interface webrarians:

8.1.3 Choosing A Webrarian Language

This report has concentrated on the search tool application only. This may give the impression that the agents do not need to communicate much with each other, since they can get the information they need directly from the Web and from the index that is built. In the future, however, it will be important that the webrarians are able to communicate with other agents and applications, so we need to base an agent system implementation on a widely "spoken" agent language. There are several reasons for this.

Most importantly, the architecture we have described is based on one central index covering all Web pages on the entire Internet. This is perhaps not the most realistic approach in practice. An alternative is to create a number of indexes, where each index covers either Web pages from a certain geographical area, or Web pages from all over the world on a more or less specific subject. Each index will have retrieval agents trained to report back only information about pages that are of interest to that particular index. Distributing the indexing system like this has many advantages, as we shall see later. Such a situation requires that the webrarians can communicate with the agents of other indexes, so that they can tell the user where to look for the requested information.

Another very important reason for using agents that can communicate with other agents is that in the future, more and more operations, especially in information handling, will be performed by more or less intelligent agents. Personal news agents should be able to look up breaking news from all kinds of categories on the Web for the user through our index. Broker agents may also be introduced: instead of the user doing the search assisted by a webrarian, a personal agent that knows its owner's desires and interests well may talk directly to our webrarians and locate interesting information for the user. Web publishing software may come equipped with publishing agents, which at publication generate the necessary metadata and go straight to our index to register new Web pages quickly and correctly.

To make all this possible, we must base our agent system on the de facto standard agent communication language, ACL, as described in Chapter 4.5. When an actual standard agent communication language is agreed upon, that must be our choice as well. Communication with other agent systems is vital to the success of the webrarians.
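
To give a flavour of such communication, the following Python sketch composes a KQML-style message, KQML being the closest thing to a de facto standard at the time of writing. The performative, the agent names, the ontology name and the content syntax are invented for the example; the actual message set would be dictated by the standard finally chosen.

    def kqml_message(performative, sender, receiver, content,
                     language="Prolog", ontology="eddic-index"):
        """Compose a KQML-style message. The field names follow common
        KQML usage; the exact message set would be dictated by the
        agent communication standard finally agreed upon."""
        return (f"({performative}\n"
                f"  :sender   {sender}\n"
                f"  :receiver {receiver}\n"
                f"  :language {language}\n"
                f"  :ontology {ontology}\n"
                f"  :content  \"{content}\")")

    # A personal broker agent asking a search interface webrarian for
    # pages classified under Dewey 025.04 (information storage and
    # retrieval); both agent names and the content syntax are invented.
    print(kqml_message("ask-all", "broker-agent-17", "webrarian-search-3",
                       "pages(dewey('025.04'), language(en))"))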
 
 

8.2 Centralized vs. Distributed Indexing

As we have already briefly mentioned, an alternative to having one large “Mother of all indexes” is to divide the search tool’s functionality and contents into several separate but agent-connected units. An index in the distributed model can cover either a geographical area, a specific topic, a certain context or a combination of these. The main advantages of distributing the indexing and search system are:
 
  1. A faster search process over more up-to-date data is made possible. As soon as a user finds an index that covers his or her most important information needs, the user will get faster service and higher quality data.
  2. Smaller indexes mean that less raw computing and networking power is required, as the workload is divided among several computers. Instead of buying expensive equipment capable of handling massive amounts of requests and transactions and storing enormous amounts of data, we can use standard computers and Internet connections with a reasonable bandwidth.
  3. With the workload distributed, the chance of a total system breakdown is reduced dramatically, which automatically makes the system more robust and flexible. If one index computer or index site goes out of service, temporarily or permanently, the requests from users to this index can be routed on to another index which holds information that may be of interest to the user (see the sketch below).
These advantages are so significant that as soon as a prototype of the search system has been built, the next step should be to start distributing the index.
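
As an illustration of the routing idea in advantage 3 above, consider the following Python sketch. The index names, their coverage and the registry itself are all hypothetical; in a real deployment this knowledge would be exchanged between the indexes' agents in the common agent language.

    import random

    # Hypothetical registry of index sites: which Dewey top-level
    # classes each site covers, kept up to date by the agents.
    INDEXES = {
        "eddic-nordic": ["0", "3", "6"],
        "eddic-europe": ["0", "1", "2", "3"],
    }

    def route_query(dewey_code, down=()):
        """Pick an index covering the query's Dewey top class, skipping
        sites that are out of service (advantage 3 above)."""
        top = dewey_code[0]
        candidates = [name for name, classes in INDEXES.items()
                      if top in classes and name not in down]
        if not candidates:
            return None                       # no covering index reachable
        return random.choice(candidates)      # simple load spreading

    print(route_query("025.04"))                          # either site
    print(route_query("025.04", down=("eddic-nordic",)))  # failover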
 
 

8.3 Preparations For System Start-Up

In addition to the technical part, the programming and hardware setup, the EDDIC system depends on a number of organizational and human factors. To create and maintain our search tool, we need to team up with a number of partners. These partners will provide the code framework for our indexing system, help improve the quality of its contents, and offer comments and suggestions for general system improvements. Who these partners should be, and how we can cooperate with them, must be clear before we start building the system.
 

8.3.1 The Codes

The main fundament of the EDDIC system suggested in this report is the Dewey Decimal Classification code, which provides us with codes for describing the context of Web pages. The DDC code is maintained by the Library of Congress (LOC), so the LOC will be an important partner for us. To ensure international popularity, we also need to cooperate with the organizations in various countries that are responsible for translations of the DDC. These organizations are already looking for new ways to ensure easy, public access to all the information that is put on-line on the Internet. Because of this, it should be possible to convince them of the importance of the EDDIC system and of the value of cooperating with us.

To develop a Web version of the Dewey codes, similar to the Dewey for Windows software, we must cooperate with the Online Computer Library Center (OCLC). In addition to letting people navigate the hierarchy of codes as described in Chapter 7, we must also define rules for the agents to use, based on the notes from LOC on how to classify information using Dewey. This partnership should come naturally, as the idea behind the EDDIC system coincides well with OCLC's purpose, namely to be "…a nonprofit, membership, library computer service and research organization dedicated to the public purposes of furthering access to the world's information and reducing information costs" [OCLC, 1998].
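
As a toy illustration of such navigation, the following Python fragment walks a miniature, hand-made piece of the Dewey hierarchy; the real Web version would of course be generated from the complete DDC tables provided by our partners.

    # A hand-made fragment of the Dewey hierarchy, for illustration
    # only; the real Web version would be generated from the full,
    # licensed DDC tables.
    DDC = {
        "0":      "Generalities",
        "02":     "Library and information sciences",
        "025":    "Operations of libraries and archives",
        "025.04": "Information storage and retrieval systems",
    }

    def narrower(code):
        """The codes one level below `code` in the hierarchy."""
        return sorted(c for c in DDC
                      if c.startswith(code) and len(c) == len(code) + 1)

    def path_to(code):
        """All ancestors of a code, e.g. for a navigation trail."""
        prefixes = (code[:i] for i in range(1, len(code) + 1))
        return [(c, DDC[c]) for c in prefixes if c in DDC]

    print(narrower("02"))    # ['025']
    print(path_to("025"))    # [('0', ...), ('02', ...), ('025', ...)]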

Finally, we must cooperate with other developers and suppliers of various popular rating and code systems for Web pages. For now, that first of all means the World Wide Web Consortium's PICS group. From these partners we need thorough descriptions of the code formats we are to include in our index, so that our system can help new code formats gain acceptance faster and make it easier to search and filter information from the Internet. These two main effects should be motivation enough for the various rating and code format "owners" to cooperate with us.
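
One way to keep the index open to new code formats is to store all codes assigned to a page as (system, value) pairs, as in this hypothetical sketch. The record layout is our assumption and the values shown are placeholders; the actual formats are those the partners describe.

    # Hypothetical index record: codes assigned to one page are kept
    # as (system, value) pairs, so a new rating or code format can be
    # added without changing the index structure.
    page = {
        "url": "http://example.org/astronomy/",
        "codes": [("DDC", "520"),
                  ("PICS", "label-as-issued-by-a-rating-service")],
    }

    def codes_for(page, system):
        """All codes a given system has assigned to the page."""
        return [value for (sys_name, value) in page["codes"]
                if sys_name == system]

    print(codes_for(page, "DDC"))    # ['520']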
 

8.3.2 The Co-workers

To run the EDDIC service we need a staff. The staff's tasks can roughly be divided into technical maintenance and development work on one side, and manual classification work to maintain the contents of the index on the other.

For the technical part, we should work together with people who already have experience in creating Web spiders for information retrieval. This means that most search engine providers are candidates. However, it will be best to start with a smaller domain, and to cooperate with the people behind a search engine that covers only one part of the Internet but has comprehensive data about that part. A very interesting partner would be the Nordic Net Centre [NNC, 1998], the main actor in a joint effort by several large educational institutions in the Nordic countries. The NNC runs the Nordic Metadata Project, scheduled to end in the spring of 1998, which studies metadata and Web pages and whose results are used in, for example, the Nordic Web Index [NWI, 1998]. Throughout this project, a database of tens of thousands of pages with Dublin Core metadata has been built, as well as an index of 5-6 million fully indexed Web pages, including link structure, from all over the Nordic region. The experience and data from this workgroup can be very useful to us, both as help in creating the software and for testing the hypothesis about the importance of using URLs to decide the context of a Web page.
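
As a hint of what the retrieval webrarians would do with such data, the following Python sketch collects Dublin Core metadata embedded in HTML META tags, following the name="DC.xxx" convention used by the Nordic Metadata Project; the example input is invented.

    from html.parser import HTMLParser

    class DublinCoreParser(HTMLParser):
        """Collect Dublin Core metadata embedded in HTML META tags,
        following the name="DC.xxx" convention used by the Nordic
        Metadata Project."""

        def __init__(self):
            super().__init__()
            self.dc = {}

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            a = dict(attrs)
            name = a.get("name") or ""
            if name.lower().startswith("dc."):
                self.dc[name[3:].lower()] = a.get("content", "")

    parser = DublinCoreParser()
    parser.feed('<meta name="DC.Title" content="EDDIC">'
                '<meta name="DC.Subject" content="search tools">')
    print(parser.dc)    # {'title': 'EDDIC', 'subject': 'search tools'}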

In the long run, we will need a number of technical people working on keeping the search tool running and improving it. These should be full-time employees, working directly for the institution or company offering the search service.

For the manual work, we must employ people with classification experience, and the natural choice is to use librarians. The number of librarians to employ depends on how ambitious we want to be. In the beginning we do not need specialists in any field, just general classification personnel with a good knowledge of the languages we want to cover Web pages in. Later, we may need specialists on some of the subjects that have a large number of Web pages devoted to them. To find suitable librarians, we should contact a university library or a technical library which has the resources for, and an interest in, participating in projects such as ours.
 

8.3.3 The Contents

To ensure a high quality database from the start, we should seed it with the contents of an existing directory, divided into topics in a well-designed hierarchy. The most comprehensive directory suitable for the purpose is the Yahoo! service. Compared to building our index from scratch, it is an easier task to map the categories from Yahoo! into Dewey codes and, using the mapping, move Web page entries from the Yahoo! hierarchy into the EDDIC hierarchy. The amount of data covered by the Yahoo! index should be sufficient to give our webrarians a good basis for their classification work.
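
A sketch of the mapping step follows, with a hypothetical fragment of the mapping table; building the complete table is the manual part of this bootstrap, and the category names and Dewey codes shown are illustrative.

    # A hypothetical fragment of the category mapping table; building
    # the complete table is the manual part of this bootstrap.
    YAHOO_TO_DDC = {
        "Science/Astronomy":               "520",
        "Arts/Music":                      "780",
        "Computers_and_Internet/Software": "005",
    }

    def convert_entry(yahoo_category, url, title):
        """Turn one Yahoo! directory entry into an EDDIC index entry."""
        ddc = YAHOO_TO_DDC.get(yahoo_category)
        if ddc is None:
            return None          # no mapping yet: queue for a librarian
        return {"url": url, "title": title, "ddc": ddc}

    print(convert_entry("Science/Astronomy",
                        "http://example.org/stars/", "Star charts"))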

In return for providing their index information, Yahoo! can later be given information useful to them, such as information about expired links and the possibility of copying information about high quality Web pages from our index.

Another option is to use the part of the Nordic Web Index that contains Web pages with Dublin Core metadata to build a base set of index entries. These roughly 80,000 entries contain information about title, author, subject, description and publication date, which should make a suitable training environment for the webrarian agents and the human classification personnel.

To reduce the amount of network traffic in the future, we may want to cooperate with one or more of the major search engine companies. As long as we verify that a page still exists, we may as well classify the copy of the page stored in a search engine's index instead of fetching the page itself.
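
A minimal sketch of the existence check, using a HEAD request so that only the page headers, not the page itself, cross the network. The URL handling is standard library code, but the policy (treat any status below 400 as "exists") is our assumption.

    import urllib.error
    import urllib.request

    def page_exists(url, timeout=10):
        """Check that a page is still there before classifying the
        copy stored in a partner search engine's index. A HEAD request
        moves only the headers, not the page, across the network."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as reply:
                return reply.status < 400    # our policy: 2xx/3xx = exists
        except (urllib.error.URLError, OSError):
            return False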
 

8.3.4 The Comments

The EDDIC system must be built iteratively, with a prototype for test users to use and give feedback on as soon as possible. We need test users both on the classification side and on the search side. The test users may or may not have experience with existing systems for organizing and searching the Web or other large information spaces. Since our system introduces ways of doing both that are quite unlike any existing system, the most important thing is not the test users' knowledge and experience, but that they represent the exact kind of users the system is intended for: everyone.

If the system behaves as we hope it will and actually makes it easier to locate the information you want, finding people to volunteer as test users should not be difficult. All feedback received from the users should be considered and rapidly result in changes to the system if this seems called for. Successful modifications are kept as part of the system.

The comments from users are a very important part of the rapid development of a functional system. Because of the way the Web is growing, it is necessary to get the system up and running as soon as possible. Having many satisfied test users is also important in the process of making the search tool known to the world. In addition to advertisements, word of mouth is the best way to spread the news about our system.
 
 

8.4 Financing the Service

Starting and running the EDDIC system will be expensive, due to the cost of developing the software, buying the necessary hardware, employing classification personnel and making the world aware of the system's existence. We must consider how all this can be financed.

The EDDIC project can be implemented either within the university world or by a purely commercial company or coalition. In any case it is important that the actors understand that this service must be realized as soon as possible, and are willing to cover the costs as they appear. The system is more or less ready to be implemented, based on the specifications and references contained in this report. Possible financial supporters may be national research councils, funds reserved for major university projects, or one of the main actors in the Internet/search engine world who wants to be profiled as innovative.

Independently of who the investor behind the system is, we have several options when it comes to returning the investor's money, and more:

First, there are the traditional solutions, which are easy to implement:

With the arrival of e-cash, we can start charging for the use of our service even in very small units of money. Selling access to the whole search tool, or just to our index, to other search tools and Web sites may also be a way to finance the service.
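
How the per-query charging could work depends entirely on the e-cash scheme chosen. As a sketch only, with an invented price and a hypothetical settlement step, the metering side might look like this:

    # Sketch of the metering side of per-query charging; the invented
    # PRICE_PER_QUERY and the settlement step stand in for whatever
    # e-cash scheme is actually adopted.
    PRICE_PER_QUERY = 0.002    # hypothetical price per search

    class QueryMeter:
        def __init__(self):
            self.owed = {}

        def charge(self, user):
            """Record one query; tiny amounts accumulate per user."""
            self.owed[user] = self.owed.get(user, 0.0) + PRICE_PER_QUERY

        def settle(self, user):
            """Hand the accumulated amount to the e-cash system."""
            return self.owed.pop(user, 0.0)

    meter = QueryMeter()
    meter.charge("user-42")
    meter.charge("user-42")
    print(round(meter.settle("user-42"), 3))    # 0.004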
 
 

8.5 Dissecting the Monster

To build a functional, robust and scalable system like the one suggested in this report, a number of subdisciplines from several research areas must be brought together. Presented here is an outline of what problem areas our system construction task consists of, based on a similar, more general analysis for electronic commerce / digital library systems in [Adam & Yesha, 1996].

Area 1: Acquiring and storing information.

Area 2: Finding and filtering information.

Area 3: Securing information and auditing access.

Area 4: Universal access.

Area 5: Cost management and financial instruments.

Area 6: Socioeconomic impact.

As we see, numerous fields from computer science, library science, mathematics, graphical design, social sciences, economy and psychology must be combined to successfully implement the EDDIC system. A broad cross-disciplinary effort is required.
 
