6 The EDDIC Code Format



 

This chapter contains a discussion about the requirements for the Extended Dewey Decimal Internet Classification (EDDIC) code. We start by defining the factors that must be considered when designing the format, and present a suggestion for how the index entries should be built and what they will look like. An EBNF-like definition of the EDDIC code concludes the chapter.
 
 

6.1 Critical Factors For The Code Design

When designing the format for a code to be used for our index entries, there are a number of factors we have to keep in mind.  

6.2 Important Document Properties

To be able to offer a flexible narrowing of search scope, we need to decide which document properties are most important to store information about in our index. We must pick properties that can be automatically detected in as many cases as possible. The most important properties are probably the kind of properties that people quickly perceive themselves when they read and look at Web pages, as these are the ones people are most likely to use when they search for a particular Web page.
 

6.2.1 Unique Identifier

The only attribute of a Web document that is guaranteed to be unique to that particular page is the URL. We therefore use the URL as the identifier. The Web search agents collect it automatically, together with the contents of the HTML <TITLE> tag, which is a very short description of the page provided by its creator.

Using the URL as the identifier will result in index entries that differ greatly in length and contents, but we have no other options. It is necessary to provide robust mechanisms for handling the situations where a URL ceases to exist and where a totally new page appears at the same URL as a page that is already in the index.
 

6.2.2 Topic / Category

The single most important document property for searching among and distinguishing between home pages is the extended Dewey code that provides the context for the Web page. As a basis for the hierarchy of interest area codes, we have chosen the Dewey Decimal Classification (DDC) system.

One criticism that may be raised against the DDC is that it was created a long time ago and does not cover all newer areas very well. As described earlier, we are convinced that the DDC has enough useful properties to justify using it as a basis. An interesting addition to the 000-999 codes is to introduce new codes for areas such as electronic commerce and specific kinds of Internet services. These should be kept in a format similar to the original Dewey codes, for example E00-E99. Most users of the system will not see the actual codes anyway. Instead, they see the full title of the topic that each code covers, automatically translated from the code numbers by the system.
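The translation from code to title can be sketched as a simple table lookup. The following Python sketch is only an illustration: the function name `title_for`, the tiny title table and the E10 entry are hypothetical, and falling back to the nearest broader class when a code has no title of its own is our assumption, not a rule stated above.

```python
from typing import Dict, Optional

# Illustrative title table. The three Dewey titles are standard DDC classes;
# the E10 entry is a made-up example of a new "E" code.
TITLES: Dict[str, str] = {
    "300": "Social sciences",
    "370": "Education",
    "378": "Higher education",
    "E10": "Electronic commerce",
}


def title_for(code: str, titles: Dict[str, str] = TITLES) -> Optional[str]:
    """Return the title for a code, trying broader classes if needed."""
    main = code.split(".")[0]  # "378.05" -> "378"
    # Try the exact code first, then progressively broader classes.
    candidates = [code, main, main[:2] + "0", main[:1] + "00"]
    for candidate in candidates:
        if candidate in titles:
            return titles[candidate]
    return None  # no title known at any level


print(title_for("378.05"))
```

With this table, the code 378.05 is shown to the user as "Higher education", because no title is registered for the subdivision itself.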
 

6.2.3 Web Page Class

As discussed in Chapter 3.3, Web pages can be classified not only by what topic they cover, but also by what kind of page they are. Examples of Web page classes are personal home pages, fan pages (pages dedicated to someone or something), major link pages, news pages, sports news pages, public information pages and so on. This is a document property that may be very useful to include in the index, particularly for pages that carry contents of high quality. “Breaking news” could also be a page class, making it easy to find news about very recent happenings. When the news is no longer new, the code can be removed from the page’s index entry. A two-digit code, 00-99, should be sufficient for indicating Web page classes.
 

6.2.4 Language

To indicate what language is used in a Web page, we use the Z39.53 standard codes from the National Information Standards Organization (NISO). These are three-character text codes, covering some 400 different languages as of today.
 

6.2.5 Contents Ratings

Several formats for rating the contents of Web pages in different ways have appeared. There is no de facto standard for this rating system yet, but it seems as if the PICS project, described in Chapter 3.5, will be a major system in the near future. We shall support new systems as they appear on the scene, but for now we should start out with incorporating PICS codes in the EDDIC code where available. This means ratings level codes for violence, nudity, sex and profane language on the Web pages.
 

6.2.6 Graphics Use

To people using modems to access the Internet, an important factor when choosing what page to look at is how much graphics, that is, pictures and animation, the page contains. If you are looking for textual information, you will probably be more interested in pages with a lot of text than in pages using a lot of graphics, which may make retrieval of the page slow. On the other hand, if you are looking for pictures of someone or something, you may want to concentrate your search on pages containing a lot of graphics.

The amount of graphics on a page may be measured by how many <IMG> tags the HTML-formatted page contains. It may be fairer to large Web documents to calculate a “graphics value” by counting how many <IMG> tags there are per 1,000 words. Whichever approach is chosen, some kind of numeric value for the level of graphical content should be calculated and included in the index.
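The per-1,000-words measure can be sketched in a few lines of Python. This is only a rough illustration under our own assumptions: the function name `graphics_value` is hypothetical, words are counted by stripping tags and splitting on whitespace, and a page with no text at all simply falls back to the raw tag count.

```python
import re

IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)  # matches <IMG ...> and <img ...>
ANY_TAG = re.compile(r"<[^>]*>")                # crude: any HTML tag


def graphics_value(html: str) -> float:
    """Return the number of <IMG> tags per 1,000 words of text."""
    img_count = len(IMG_TAG.findall(html))
    # Crude word count: replace every tag with a space, split on whitespace.
    words = ANY_TAG.sub(" ", html).split()
    if not words:
        # Assumption: a page without any text is scored by its raw tag count.
        return float(img_count)
    return 1000.0 * img_count / len(words)


page = "<html><body><IMG SRC='a.gif'><p>" + "word " * 500 + "</p><img src=b.gif></body></html>"
print(graphics_value(page))
```

How the resulting number is mapped to the code stored in the GR field is left open here, as it is in the text.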
 

6.2.7 Periodicity

How often a Web page is updated may be of interest in many situations. Therefore the index should contain information about how often the page is believed to change contents in one way or another. Values may be codes indicating if it changes for example “All the time”, “Daily”, “Weekly”, “Bi-weekly”, “Monthly”, “Annually” or “Never”.
 

6.2.8 Keywords

Since we do not want to store whole Web pages in the index, it is necessary to offer a mechanism for allowing some kind of search using keywords and search strings within the context the user is interested in. Each page can have up to ten keywords for this purpose. The words should be as distinctive for that page as possible. Typical examples of good keywords are names of people and geographic locations that do not have their own special code within the DDC system, registered trademarks, brand names and other words that are given sufficiently high values in the word weighting procedure described in Chapter 5.2.
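Picking the keywords for a page can be sketched as a ranking step, assuming the word weights have already been computed. In the Python sketch below, the function name `pick_keywords`, the `min_weight` threshold and the example weights are hypothetical; the weighting procedure itself is the one referred to in Chapter 5.2 and is not reproduced here.

```python
from typing import Dict, List


def pick_keywords(weights: Dict[str, float],
                  max_keywords: int = 10,
                  min_weight: float = 0.0) -> List[str]:
    """Return up to max_keywords words, highest weight first."""
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    heavy_enough = [word for word, weight in ranked if weight >= min_weight]
    return heavy_enough[:max_keywords]


# Made-up weights for illustration only.
example = {"NTNU": 9.0, "university": 4.0, "the": 0.1, "Trondheim": 7.5}
print(pick_keywords(example, max_keywords=3, min_weight=1.0))
```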

Setting the maximum number of keywords to 10 is merely a preliminary suggestion; the ideal number should be settled by future research. An alternative to including keywords in the index as suggested here is to first use the other document properties covered by the EDDIC system to limit the number of possibly interesting URLs, and then perform a free-text search on a “normal” search engine, restricting the search to return hits only from the Web sites that EDDIC assumes to be of most interest.
 

6.2.9 Future Extensions

To prevent our system from becoming obsolete within a few years, it must be possible to add and remove Web document search properties. When new formats and standards for including meta information are introduced and turn out to be in significant use, they should be incorporated into our system. If the system has enough capacity, additional Web document qualities, such as Dublin Core elements, Java scripts and different plug-ins, should be detected and registered by the EDDIC index.

One extension we may already consider including is expiration dates for Web pages. For certain happenings and business offers, the Web page may contain information on when the page will be taken off the Net, most often because the information will no longer have any value. The agents should be able to perceive such meta information and make sure the index entry is removed on the same date as the page itself is taken off the Web, rather than waiting for the system to detect the removal on its own.

Other interesting extensions may be for describing any costs related to accessing certain Web pages, what plug-ins or other technology is required to fully experience a Web page, codes for pages that offer special security for money transactions and whether the page contains advertisements or not.
 

6.2.10 Index Maintenance Data

Unfortunately, even though we get a Web page classified and indexed, the work is not over. Due to the dynamic nature of the Web, pages sometimes move from one URL to another, or they may even be taken off the Web completely. To prevent our index from containing links to non-existing pages, we need to check on every single link every now and then. This can be done automatically, but to do it systematically, we introduce the expiration check field to the index entries.

Each page must be checked at least once every month. If the expiration value is set to 30 when the page is indexed, and the value is decreased by 1 every day, an expiration check can be initiated when the value reaches 0 (zero). If the page seems to be OK, the counter is reset to 30. If the page cannot be found, a new existence check must be done a few days later to make sure a page is not prematurely removed from the index. Because of high load, server failures and temporary malfunctions in computers, networks and cables, a page is sometimes off the Web for a short period. Because pages are indexed on different days, their expiration checks fall on different days as well, so the work will soon be divided evenly from day to day. The value of 30 is only a preliminary suggestion that guarantees that “dead” links are removed from the index after just over a month in the worst case.
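The countdown described above can be sketched as a small daily routine. The Python sketch below is an illustration under stated assumptions: the names `IndexEntry`, `daily_tick` and `page_exists` are hypothetical, the retry delay of 3 days stands in for “a few days”, and removing an entry after two failed checks in a row is our choice, not something the text prescribes.

```python
from dataclasses import dataclass
from typing import Callable

CHECK_INTERVAL_DAYS = 30  # preliminary value suggested in the text
RETRY_DELAY_DAYS = 3      # assumption: "a few days"
MAX_FAILED_CHECKS = 2     # assumption: remove after two failed checks in a row


@dataclass
class IndexEntry:
    url: str
    days_left: int = CHECK_INTERVAL_DAYS
    failed_checks: int = 0


def daily_tick(entry: IndexEntry, page_exists: Callable[[str], bool]) -> bool:
    """Advance the entry by one day. Return False if it should be removed."""
    entry.days_left -= 1
    if entry.days_left > 0:
        return True  # no check due today
    if page_exists(entry.url):
        # Page is fine: restart the countdown.
        entry.days_left = CHECK_INTERVAL_DAYS
        entry.failed_checks = 0
        return True
    entry.failed_checks += 1
    if entry.failed_checks >= MAX_FAILED_CHECKS:
        return False  # page still missing on the re-check: drop the entry
    entry.days_left = RETRY_DELAY_DAYS  # schedule a re-check in a few days
    return True
```

With these assumed values, a page that disappears right after being indexed is removed on day 33, which agrees with “just over a month in the worst case”.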
 
 

6.3 Automatic and Analytic Meta Information

The document properties we have decided to include in the index entries for Web pages can either be found automatically by the agents or they need to be settled by manual classification personnel, supported by the agents’ preparatory work. In general, we can say that the properties will be found like this:

Automatically: URL, Title, Language, Ratings, Graphics, future extensions and expiration data.
Through agent-supported manual analysis: DDC code, Page Class, Periodicity, Keywords

In some cases it may be impossible to find suitable information to put into each of the possible fields of the index entry of a Web page. Instead of having an “Undecided” value for each field, we will leave these fields empty for that particular Web page. This means that each field must be headed by a short code saying what kind of information the field contains. This is also the most practical way of doing it when considering that we may also add new fields to index entries in the future. The suggested field codes and some additional information are shown in Table 6-1.

Field codes for which the “Required” column is ticked must be part of every index entry. The other field codes are optional. An index entry can contain several values for the field codes where the “Multivalue” column is ticked.
 
 
Field                         Field Code  Short for                       Required  Optional  Multivalue
URL                           AD          Address                         *
Title                         HD          Header                                    *
Dewey Decimal Classification  DC          Dewey Classification            *                   *
Page Class                    PC          Page Class                                *         *
Language, Z39.53              LC          Language Code                             *
Ratings                       RL          Ratings Levels                            *         *
Graphics Use                  GR          Graphics                        *
Periodicity                   PE          Periodicity                               *
Keywords                      Kn          Keyword n                       *                   *
Extensions                    Enn         Extension nn                              *         *
Expiration check              Xnn         Page eXistence due in nn days   *

Table 6-1, Field codes for use in the EDDIC index

Based on the table, this is what an index entry may look like, using only ad-hoc codes for now:

ADhttp://www.ntnu.no/indexe.html;;HDWelcome to NTNU;;DC378.05;;
PC41;;LCENG;;GR2;;PE0;;K1NTNU;;K2university;;K3Trondheim;; K4faculties;;X10;;

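Explanation: The entry tells us that the page at http://www.ntnu.no/indexe.html has the title “Welcome to NTNU”, is classified under Dewey code 378.05, belongs to page class 41, is written in English (ENG), has graphics use code 2 and periodicity code 0, carries the keywords “NTNU”, “university”, “Trondheim” and “faculties”, and is due for its next existence check in 10 days.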

Including the field delimiter “;;”, chosen because it is a string that is unlikely to occur in the fields themselves, all this information is contained in a 142-character entry. The codes used for page class, graphics use and periodicity are not actual codes; they were created for the example only.
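Reading such an entry back into its fields can be sketched as follows. This Python sketch is a reading of the example above rather than a definitive parser: the names `parse_entry` and `EXAMPLE` are hypothetical, the expiration field is treated as the code X followed by the number of days, and spaces or line breaks around a field are ignored.

```python
import re
from typing import Dict, List

DELIMITER = ";;"

# Field codes from Table 6-1: two-letter codes, Kn, Enn, and X (followed by days).
FIELD = re.compile(r"(AD|HD|DC|PC|LC|RL|GR|PE|K\d|E\d\d|X)(.*)", re.DOTALL)

EXAMPLE = (
    "ADhttp://www.ntnu.no/indexe.html;;HDWelcome to NTNU;;DC378.05;;\n"
    "PC41;;LCENG;;GR2;;PE0;;K1NTNU;;K2university;;K3Trondheim;; K4faculties;;X10;;"
)


def parse_entry(entry: str) -> Dict[str, List[str]]:
    """Split one index entry into a mapping from field code to its values."""
    fields: Dict[str, List[str]] = {}
    for chunk in entry.split(DELIMITER):
        chunk = chunk.strip()  # tolerate line breaks and stray spaces
        if not chunk:
            continue           # the text after the final delimiter is empty
        match = FIELD.fullmatch(chunk)
        if match is None:
            raise ValueError(f"unrecognised field: {chunk!r}")
        code, value = match.groups()
        # Multivalue fields such as DC simply collect several values.
        fields.setdefault(code, []).append(value)
    return fields


for code, values in parse_entry(EXAMPLE).items():
    print(code, values)
```

Keeping every value in a list, even for single-valued fields, means that the multivalue fields of Table 6-1 need no special handling at this stage.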

Because we do not index the actual contents of the Web page, but only its “properties”, including topic/context, we also avoid any copyright infringements. This point has become more important as lawsuits have appeared between search engine companies and Web page owners concerning the rights to offer “second-hand” information.
 
 

6.4 EDDIC – The Code

To describe the EDDIC index entry in its general form, we use an EBNF-like (Extended Backus-Naur Form) syntax [Backus, 1959], where “[ .. ]” means optional field(s), “{ .. }” means “one or more entry fields”, “< .. >” means “all required fields” and “|” indicates alternatives:

IndexEntry          ::= RequiredProperties [ OptionalProperties ]
RequiredProperties  ::= < ReqProperty >
OptionalProperties  ::= { OptProperty }
ReqProperty         ::= ReqPropertyCode Value Delimiter
OptProperty         ::= OptPropertyCode Value Delimiter
ReqPropertyCode     ::= AD | { DC } | GR | { Kn } | Xnn
OptPropertyCode     ::= HD | { PC } | LC | { RL } | PE | { Enn }
Delimiter           ::= ;;
Value               ::= LegalString
LegalString         ::= { LegalCharacter }
n                   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
LegalCharacter      ::= Any character except “;” when following a “;”

Since standard EBNF has no simple way of indicating which terms must be part of an expression, let us stress that the above means that the index entry of any Web page shall contain at least the fields containing the URL (AD), one or more Dewey codes (DC), a graphics value (GR), one or more keywords (Kn) and an expiry code (Xnn). Optionally, an index entry may also include a title field (HD), one or more page class codes (PC), a language code (LC), one or more ratings for the page (RL), the periodicity of the updating of the page (PE) and one or more extensions (Enn).
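These rules can be checked mechanically once an entry has been split into fields. The Python sketch below assumes the fields are available as a mapping from field code to a list of values; the names `family` and `validate` and the wording of the messages are hypothetical.

```python
import re
from typing import Dict, List

REQUIRED = ("AD", "DC", "GR", "K", "X")   # K stands for Kn, X for Xnn
MULTIVALUE = ("DC", "PC", "RL", "K", "E")  # E stands for Enn


def family(code: str) -> str:
    """Map K1 -> K, E07 -> E, X10 -> X; other codes map to themselves."""
    return code[0] if re.fullmatch(r"K\d|E\d\d|X\d*", code) else code


def validate(fields: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the entry is valid."""
    problems: List[str] = []
    counts: Dict[str, int] = {}
    for code, values in fields.items():
        fam = family(code)
        counts[fam] = counts.get(fam, 0) + len(values)
        # The delimiter must never occur inside a value.
        if any(";;" in value for value in values):
            problems.append(f"{code}: value contains the delimiter")
    for fam in REQUIRED:
        if counts.get(fam, 0) == 0:
            problems.append(f"missing required field {fam}")
    for fam, count in counts.items():
        if count > 1 and fam not in MULTIVALUE:
            problems.append(f"{fam}: {count} values, but only one is allowed")
    return problems


entry = {"AD": ["http://www.ntnu.no/indexe.html"], "DC": ["378.05"],
         "GR": ["2"], "K1": ["NTNU"], "K2": ["university"], "X": ["10"]}
print(validate(entry))
```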

The string “;;AD” will indicate the start of a new index entry in a file containing several entries.
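Splitting a file of several entries at this marker can be sketched as below. The function name `split_entries` and the example URLs are hypothetical, and allowing white space or line breaks between the delimiter and the following AD is our assumption; the sketch relies on the rule that “;;” never occurs inside a value.

```python
import re
from typing import List

# Split right after a ";;" that is followed (possibly after white space) by "AD".
ENTRY_BOUNDARY = re.compile(r"(?<=;;)\s*(?=AD)")


def split_entries(text: str) -> List[str]:
    """Split the contents of an index file into separate index entries."""
    return [part for part in ENTRY_BOUNDARY.split(text.strip()) if part]


index_file = (
    "ADhttp://a.example/;;DC378;;GR1;;K1alpha;;X30;;\n"
    "ADhttp://b.example/;;DC004;;GR0;;K1beta;;X12;;"
)
for entry in split_entries(index_file):
    print(entry)
```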
 

6.4.1 Meeting the Requirements

Our code meets the requirements from Chapter 6.1, where we listed a number of important factors when designing a code framework for indexing use. The parts of the EDDIC code that are directly related to describing contextual information about a Web page also answer to the factors mentioned in Chapter 3.2.

Hence, we can conclude the chapter feeling satisfied with the outline of the code, and go on to look at how our index can be used to offer new ways to search the Internet.
 
 
Visit the author's homepage : http://www.pvv.org/~bct/
E-mail the author, Bjørn Christian Tørrissen:  bct@pvv.org