
Summary of Important Features

The algorithm described in an earlier chapter has been shown to give good results on noisy data sets (see the test results presented earlier). The best results are achieved when rules from all parts of the lattice are used, but these are only slightly better than the results obtained with the top node alone. The weight of evidence method performed slightly better than linear voting. Below we sum up what we consider to be the most important aspects of our system, both good and bad:
Time Complexity: One of the most serious problems with RGEN is its time complexity during learning. On large data sets, learning takes a long time. The time complexity is a function of both the number of objects and the number of attributes. If all nodes in the lattice are visited, the complexity with regard to the number of attributes is . Thus, there is a rather low limit on the number of attributes that can be handled in practice, and this limitation should be addressed.

Applying the generated rules is rather fast, and has a complexity of for n objects and r rules. This could only be improved with some kind of search method that finds the classification faster. Since classification time depends linearly on the number of objects, classification is highly scalable, in contrast to the rule generation program.
Diverse Rules: The reason RGEN uses so much time is that it creates a very complete set of default rules. These rules are very good in the sense that some of the defaults will almost always match new objects. They are also efficient to use and simple to understand. If attributes that distinguish between objects with different classifications are missing, default rules that describe this situation will be created.
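To illustrate why applying rules scales linearly with the number of objects, here is a minimal sketch of a plain linear scan. The `Rule` format, the `matches` helper, and the "first matching rule decides" policy are assumptions made for illustration only; they are not taken from RCLASS. Under this scheme every object is tested against every rule, so n objects and r rules cost n times r rule checks.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    # Hypothetical rule format: attribute/value conditions and a predicted class.
    conditions: dict
    decision: str

def matches(rule, obj):
    # A rule matches when every one of its conditions holds for the object.
    return all(obj.get(attr) == value for attr, value in rule.conditions.items())

def classify_all(objects, rules):
    """Return (decisions, number_of_rule_checks) using a plain linear scan."""
    checks = 0
    decisions = []
    for obj in objects:
        matched = []
        for rule in rules:
            checks += 1
            if matches(rule, obj):
                matched.append(rule.decision)
        # Illustrative policy: the first matching rule decides; None if nothing matches.
        decisions.append(matched[0] if matched else None)
    return decisions, checks

rules = [
    Rule({"outlook": "sunny", "windy": "no"}, "play"),
    Rule({"outlook": "rain"}, "stay"),
    Rule({}, "play"),  # a default rule with no conditions matches every object
]
objects = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
decisions, checks = classify_all(objects, rules)
```

Doubling the number of objects doubles `checks`, which is the linear dependence the text refers to. An index over rule conditions would be one possible form of the faster search mentioned above.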

A possible drawback is that the large set of rules may be unnecessarily big and somewhat difficult to use. With a low threshold value, many poor rules could cause the voting-based methods to reach a different decision than they would with a higher threshold value.
Noise Handling: The default rules give the system a good basis for dealing with noise. First of all, RGEN associates an accuracy measure with each rule, and RCLASS then uses this measure to select the most plausible classification. In this way, noise in both the training and test sets is dealt with in an orderly manner. Noise in the form of missing attribute values is not allowed in the training set, but is handled by the classification program if the value ``unknown'' or similar is inserted for the missing value.
Classification Accuracy: Perhaps the most important feature of a data mining system is whether it classifies correctly or not. If it does not, it will be no good no matter how fast or user friendly it is. Our system classifies data sets rather well, and its main strength is noisy data. The number of correct decisions does not fall drastically as noise is added to the database.
Lattice Information: The rules created by RGEN at the top node of the lattice correspond to the rules of RSES, and the rules at the bottom node are equal to the distribution of the training set. Interestingly, the combined rules from all parts of the lattice do not seem to give noticeably better classification results than the rules at the top node alone. This could be because the different classification methods do not consider the lattice information when choosing a classification.
Different Methods for Classification: RCLASS contains six different methods for finding the correct classification. The three voting methods and the weight of evidence method give very similar results, but the weight of evidence method seems to be slightly better than the others.
Visualization: RCLASS has some visualization facilities which allow the user to select rules from only some parts of the lattice. In our experiments, this has not proved very useful, as using all rules usually gives the best results. Still, if some attributes are of no significance to the classifications, one way of discovering this is to avoid using rules which are based on that attribute's value. This is easy to do with the graphical user interface. Selecting different parts of the lattice is also interesting if one wants to improve the rule generation algorithm.
Areas of Use: The system is best suited to problems with the following characteristics:
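The effect of the threshold on voting can be illustrated with a small sketch. The rule accuracies below are invented for the example, and the `vote` function is a hypothetical "one rule, one vote" scheme, not one of the voting methods actually implemented in RCLASS.

```python
from collections import Counter

# Hypothetical (decision, accuracy) pairs for the rules that match one object.
matching_rules = [
    ("A", 0.95),
    ("A", 0.90),
    ("B", 0.55),
    ("B", 0.52),
    ("B", 0.51),
]

def vote(rules, threshold):
    """Give one vote to each rule whose accuracy is at least `threshold`."""
    counts = Counter(decision for decision, acc in rules if acc >= threshold)
    return counts.most_common(1)[0][0] if counts else None

low = vote(matching_rules, threshold=0.5)   # three weak rules outvote two strong ones
high = vote(matching_rules, threshold=0.8)  # only the two strong rules remain
```

With the low threshold the three weak rules win and the decision is "B"; with the higher threshold only the strong rules vote and the decision is "A".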
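As a rough sketch of how a per-rule accuracy measure can drive the choice of classification, and of how a missing value can be replaced by ``unknown'' before classification, consider the following. Summing the accuracies of the matching rules is an assumption chosen for simplicity; it is not claimed to be the measure RCLASS uses, and all names and data are hypothetical.

```python
from collections import defaultdict

UNKNOWN = "unknown"  # placeholder inserted for missing attribute values

def fill_missing(obj, attributes):
    # Replace absent or None values with the placeholder value.
    return {a: (obj[a] if obj.get(a) is not None else UNKNOWN) for a in attributes}

def most_plausible(obj, rules):
    """Pick the class whose matching rules have the largest summed accuracy."""
    scores = defaultdict(float)
    for conditions, decision, accuracy in rules:
        if all(obj.get(a) == v for a, v in conditions.items()):
            scores[decision] += accuracy
    return max(scores, key=scores.get) if scores else None

# Hypothetical rules: (conditions, decision, accuracy).
rules = [
    ({"fever": "yes", "cough": "yes"}, "flu", 0.9),
    ({"fever": "yes"}, "flu", 0.8),
    ({}, "healthy", 0.7),  # default rule, matches every object
]

incomplete = fill_missing({"fever": "yes", "cough": None}, ["fever", "cough"])
decision = most_plausible(incomplete, rules)
```

Here the object with a missing `cough` value no longer matches the most specific rule, but the more general rule on `fever` still applies and outweighs the default rule, so the object is classified as "flu".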
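The idea of checking an attribute's significance by leaving out the rules that depend on it can be sketched as follows. The rules, the test set, and the simple majority vote are all hypothetical; the sketch only mirrors the procedure described above and does not reproduce the graphical user interface or RCLASS itself.

```python
from collections import Counter

# Hypothetical rules: (conditions, decision).
rules = [
    ({"size": "big", "colour": "red"}, "ripe"),
    ({"size": "big"}, "ripe"),
    ({"size": "small"}, "unripe"),
    ({"colour": "green"}, "unripe"),
]
# Hypothetical labelled test objects: (object, correct class).
test_set = [
    ({"size": "big", "colour": "red"}, "ripe"),
    ({"size": "small", "colour": "green"}, "unripe"),
    ({"size": "small", "colour": "red"}, "unripe"),
]

def classify(obj, rules):
    # Simple majority vote among the rules that match the object.
    votes = Counter(d for cond, d in rules
                    if all(obj.get(a) == v for a, v in cond.items()))
    return votes.most_common(1)[0][0] if votes else None

def accuracy(rules, test_set):
    correct = sum(classify(obj, rules) == label for obj, label in test_set)
    return correct / len(test_set)

def without_attribute(rules, attribute):
    # Keep only the rules whose conditions do not mention `attribute`.
    return [(cond, d) for cond, d in rules if attribute not in cond]

all_acc = accuracy(rules, test_set)
no_colour_acc = accuracy(without_attribute(rules, "colour"), test_set)
```

In this toy example the accuracy is the same with and without the rules that mention `colour`, which would suggest that `colour` carries little information for the classification.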

If one or more of the items above are not satisfied, other systems may be more appropriate. For instance, RSES will probably do just as well as our system on consistent data without noise.



Helge Grenager Solheim
Sat May 4 03:30:02 MET DST 1996