
Introduction

In our society, ever-increasing amounts of information are stored electronically in so-called databases. This information, called data, has been collected from sales registers, patient journals, scientific observations, office work, public services and many other areas. It describes details such as a person's height, weight, age and disease symptoms, which goods were sold to whom and when, which observation of the night sky contained a comet, and much more. Common to most of this data is its enormous potential for containing valuable information that is yet to be discovered. Unfortunately, inferring information from these databases is beyond human capabilities simply because of their size.

A computer system capable of analyzing such data and presenting new knowledge in a form and amount understandable to human beings would be very useful. The process of doing so is called data mining. If the computer system could also learn from its own discoveries and use this knowledge in the future, we would have another important aspect of data mining.

The knowledge available in a database can be used in two different ways to gain new knowledge. Deductive learning combines information already in the database to derive information that was, in effect, already present there. Inductive learning finds hidden patterns that exist among the data, which leads to more interesting and more general knowledge.

An example may help to show the potential of data mining. Consider a hospital which performs a special kind of surgery. Sometimes this operation should be performed and sometimes not, and it is often not clear to the doctors in advance which is the case. If patient data is collected, along with a classification of whether each patient really should have been taken to surgery or not, a data mining system might be helpful. The system could create rules saying when a patient would have a successful surgery, and could predict this for new patients based on previous data. For such a system to be used, it is important that it is able to justify its decision, for instance by saying that the rule it used was that all female patients between 50 and 60 with the symptoms of the current patient had a successful surgery. This kind of analysis would be done faster than by any doctor, but should by no means be the only criterion for an operation.

What we are mainly interested in in this project report is classification rules. These rules say that, given some preconditions, an object belongs to a certain class. This class could be that an operation would be successful, or that a particular person should be given a loan. The other kind of rules produced by data mining are those which represent general patterns in the database, and which are not necessarily usable for classification.
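As a rough illustration of what a classification rule looks like in practice, the following minimal Python sketch represents a rule as a set of preconditions plus a class. The names `Rule` and `classify`, and the attribute names `sex`, `age_group` and `symptom`, are hypothetical and chosen only for this example; they are not taken from the system described in this report.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A classification rule: if all preconditions hold, assign the class."""
    conditions: dict   # attribute name -> required value (hypothetical attributes)
    decision: str      # the class assigned when the rule applies

    def matches(self, obj: dict) -> bool:
        # The rule applies when every precondition agrees with the object.
        return all(obj.get(attr) == value
                   for attr, value in self.conditions.items())


def classify(rules, obj):
    """Return (class, rule used) for the first matching rule, else (None, None)."""
    for rule in rules:
        if rule.matches(obj):
            return rule.decision, rule
    return None, None


# Illustrative rule, loosely following the surgery example in the text.
rules = [
    Rule({"sex": "female", "age_group": "50-60", "symptom": "A"},
         "surgery successful"),
]

patient = {"sex": "female", "age_group": "50-60", "symptom": "A"}
decision, used_rule = classify(rules, patient)
```

Returning the rule together with the class is one simple way to let such a system justify its decision, since the preconditions that fired can be shown to the user.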

Sometimes it is not possible to create rules which are 100% correct for all objects in the database. This could be because attributes that could have described the data are missing, or because of noise and errors. Because of this, we are interested in learning default rules, which describe the most common situations. A default rule would for instance say that most birds fly. Because there are very often regular exceptions to common patterns, these can be included in the knowledge gained, for instance by saying that penguins do not fly.
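The bird example can be made concrete with a small Python sketch. It is only an illustration of the idea of a default rule with an exception, not the rule generation algorithm investigated in this project; the toy data set and the names `matches`, `accuracy` and `predict_flies` are assumptions made for the example.

```python
def matches(obj, conditions):
    """True when the object agrees with every attribute/value precondition."""
    return all(obj.get(attr) == value for attr, value in conditions.items())


def accuracy(objects, conditions, decision_attr, decision_value):
    """Fraction of the objects covered by the preconditions that have the class."""
    covered = [o for o in objects if matches(o, conditions)]
    if not covered:
        return 0.0
    correct = sum(1 for o in covered if o[decision_attr] == decision_value)
    return correct / len(covered)


# Hypothetical toy data: most birds fly, but the penguin does not.
animals = [
    {"kind": "bird", "species": "sparrow", "flies": "yes"},
    {"kind": "bird", "species": "gull", "flies": "yes"},
    {"kind": "bird", "species": "eagle", "flies": "yes"},
    {"kind": "bird", "species": "penguin", "flies": "no"},
]

# Default rule: birds fly. It is not correct for every covered object.
default_rule = ({"kind": "bird"}, "yes")
# Exception: penguins do not fly. It is more specific than the default.
exception_rule = ({"kind": "bird", "species": "penguin"}, "no")

default_accuracy = accuracy(animals, default_rule[0], "flies", "yes")
exception_accuracy = accuracy(animals, exception_rule[0], "flies", "no")


def predict_flies(obj):
    """Try the more specific exception first, then fall back to the default."""
    for conditions, decision in (exception_rule, default_rule):
        if matches(obj, conditions):
            return decision
    return None
```

In this toy data the default rule holds for three of the four birds, while the exception holds for every object it covers. Checking the exception before the default is one simple design choice for combining the two.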

In this project, an algorithm for creating such default rules was investigated. Our main goal was to create a complete system showing the usability of a data mining system based on default rules. The algorithm for generating default rules was proposed by our supervisor, Torulf Mollestad [MS96, Mol95, Mol96]. It is based on Rough Set theory, a mathematical concept introduced by Pawlak.

A rule generation program using the principles of Mollestad's algorithm was created by Jon Petter Hjulstad [Hju96]. The program generates many definite and default rules, which should be able to classify many new objects correctly. Still, the number of such rules is very large even for small data sets, so for practical purposes the rule set is useless to a human reader.

We have continued the work of Hjulstad and created a system which uses the generated default rules. By doing so, we have gained a better understanding of the generated rules and of the algorithm itself. In the next section, we present the project assignment text and explain some of our work.





Helge Grenager Solheim
Sat May 4 03:30:02 MET DST 1996