The use of the algorithm described in Chapter
has been shown to give good results in noisy data sets (see
Chapter
for the test results). The best results
are achieved when rules from all parts of the lattice are used, but
this is only slightly better than when only the top node is used. The
weight of evidence measure method performed slightly better than the
method of linear voting. Below we summarize what we consider to be the
most important aspects of our system, both good and bad:
Time Complexity:
One of the most serious problems with RGEN is the time
complexity during learning. When faced with large data sets, learning
takes a long time. The time complexity is a function of both the
number of objects and the number of attributes. If all nodes in
the lattice are visited, the complexity with regard to the number of
attributes is . Thus, there is a rather low limit on the
number of attributes, and this limitation needs to be addressed.
Applying the generated rules is rather fast, and has a complexity of
for n objects and r rules. This could only be
improved with some kind of search method that finds the classification
faster. Since classification time depends linearly on the number of
objects, it scales well, in contrast to the rule generation
program.
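To give a feel for why the number of attributes is the limiting
factor, the sketch below counts lattice nodes under the assumption
that each node corresponds to one subset of the condition attributes
(top node: all attributes, bottom node: none). This is an illustration
only; the function names are hypothetical and the code is not taken
from RGEN.

```python
from itertools import combinations


def lattice_nodes(attributes):
    """Yield every attribute subset, from the full set down to the empty set.

    Assumption: one lattice node per subset of the condition attributes.
    """
    for size in range(len(attributes), -1, -1):
        for subset in combinations(attributes, size):
            yield subset


def count_nodes(num_attributes):
    """Count the lattice nodes for a given number of attributes."""
    return sum(1 for _ in lattice_nodes(list(range(num_attributes))))


# Number of nodes to visit for a few attribute counts.
node_counts = {a: count_nodes(a) for a in (4, 8, 12, 16)}
print(node_counts)
```

Under this assumption every additional attribute doubles the number of
nodes to visit, which is consistent with the low practical limit on
the number of attributes noted above.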
Diverse Rules:
RGEN uses so much time because it creates a very
complete set of default rules. These rules are very good in the sense
that some of the defaults will almost always match new objects. They
are also efficient to use and simple to understand. If attributes
that distinguish between certain objects with different
classifications are missing, default rules that describe this
situation will be created.
A possible drawback is that the set of rules becomes unnecessarily
large and somewhat difficult to use. With a low threshold value, many
poor rules can sway the voting-based methods toward a different
decision than they would reach with a higher threshold value.
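The effect of the threshold can be illustrated with a minimal sketch
of rule generation for a single lattice node. It assumes that a node
is an attribute subset, that a rule is a (conditions, classification,
accuracy) triple, and that accuracy is the fraction of matching
training objects with that classification. The function name, the
data, and the representation are hypothetical and do not reproduce
RGEN itself.

```python
from collections import Counter, defaultdict


def rules_for_node(objects, attributes, decision, threshold=0.0):
    """Generate rules for one attribute subset.

    Returns a list of (conditions, classification, accuracy) triples.
    Rules whose accuracy is below `threshold` are dropped.
    """
    # Group the training objects by their values on the chosen attributes
    # and count the classifications seen in each group.
    groups = defaultdict(Counter)
    for obj in objects:
        key = tuple((a, obj[a]) for a in attributes)
        groups[key][obj[decision]] += 1

    rules = []
    for key, counts in groups.items():
        total = sum(counts.values())
        for cls, n in counts.items():
            accuracy = n / total
            if accuracy >= threshold:
                rules.append((dict(key), cls, accuracy))
    return rules


# Hypothetical training set with two condition attributes.
training = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]

all_rules = rules_for_node(training, ["outlook"], "play", threshold=0.0)
strong_rules = rules_for_node(training, ["outlook"], "play", threshold=0.5)
# With no attributes the rules reduce to the class distribution.
bottom_rules = rules_for_node(training, [], "play")
```

In this toy example the low threshold keeps the weak rule "sunny
implies no" (accuracy 1/3), which then takes part in any vote, while
the higher threshold removes it.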
Noise Handling:
The default rules give the system a good basis for
dealing with noise. First of all, RGEN associates an accuracy
measure with each rule, and RCLASS then uses this to select the
most plausible classification. In this way, noise in both the training
and test sets is dealt with in an orderly manner. Noise in the form
of missing attribute values is not allowed in the training set, but
is handled by the classification program if the value ``unknown''
or similar is inserted for the missing value.
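One plausible reading of this behaviour is sketched below. It assumes
rules are (conditions, classification, accuracy) triples and uses a
simple accuracy-weighted vote as a stand-in; it is not claimed to be
any of the methods actually implemented in RCLASS. An attribute set to
"unknown" simply fails to match rules that test it, so the more
general default rules decide.

```python
from collections import defaultdict


def matches(conditions, obj):
    """True if the object satisfies every attribute test of the rule."""
    return all(obj.get(attr) == value for attr, value in conditions.items())


def classify(rules, obj):
    """Pick the classification with the largest summed rule accuracy.

    Returns None when no rule matches. The accuracy-weighted vote is an
    assumption made for this illustration.
    """
    scores = defaultdict(float)
    for conditions, cls, accuracy in rules:
        if matches(conditions, obj):
            scores[cls] += accuracy
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical rules from three lattice nodes, most specific first.
rules = [
    ({"outlook": "sunny", "windy": "no"}, "yes", 1.0),
    ({"outlook": "sunny"}, "yes", 0.67),
    ({"outlook": "sunny"}, "no", 0.33),
    ({}, "no", 0.6),
    ({}, "yes", 0.4),
]

# The missing "windy" value is entered as "unknown".
partly_known = classify(rules, {"outlook": "sunny", "windy": "unknown"})
# With nothing known, only the rules without conditions apply.
nothing_known = classify(rules, {"outlook": "unknown", "windy": "unknown"})
```

Here the object with an unknown "windy" value is still classified by
the rules that test only "outlook" together with the rules that have
no conditions.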
Classification Accuracy:
Perhaps the most important feature of a data mining system is whether
it classifies correctly or not. If it does not, it will be no good no
matter how fast or user friendly it is. Our system classifies data
sets rather well, and its main strength lies in handling noisy
data. The number of correct decisions does not fall drastically as
noise is added to the database.
Lattice Information:
The rules created by RGEN at the top node of the lattice correspond
to the rules of RSES, and the rules at the bottom node are equal to
the distribution of the training set. Interestingly, combining the
rules from all parts of the lattice does not seem to give noticeably
better classification results than the rules at the
top node alone. This may be because the classification
methods do not take the lattice information into account when choosing
a classification.
Different Methods for Classification:
RCLASS contains six different methods for finding the correct
classification. The three methods of voting and the method of weight
of evidence give very similar results, but it seems that the weight of
evidence measure method is slightly better than the others.
Visualization:
RCLASS has some visualization facilities which allow the user to
select rules from only some parts of the lattice. In our
experiments, this has not proven very useful, as using all rules
usually gives the best results. Still, if some attribute is of no
significance to the classification, one way of discovering this is to
avoid using rules which are based on that attribute's value. This is
easy to do by using the graphical user interface. Selecting different
parts of the lattice is also interesting if one wants to improve the
rule generation algorithm.
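The attribute test described above amounts to filtering the rule set
before classification. A minimal sketch, assuming rules are
(conditions, classification, accuracy) triples with conditions stored
as a dictionary; the function name and the rules are hypothetical and
not part of RCLASS:

```python
def without_attribute(rules, attribute):
    """Keep only the rules whose conditions do not test `attribute`."""
    return [rule for rule in rules if attribute not in rule[0]]


# Hypothetical rule set.
rules = [
    ({"outlook": "sunny", "windy": "no"}, "yes", 1.0),
    ({"outlook": "sunny"}, "yes", 0.67),
    ({"windy": "yes"}, "no", 1.0),
    ({}, "no", 0.5),
]

# Rules that remain when "windy" is ignored.
reduced = without_attribute(rules, "windy")
```

If the classification results on a test set are about the same with
the reduced rule set as with the full one, this suggests that the
removed attribute contributes little to the classification.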
Areas of Use:
The best utilization of the program system would be a problem with the
following characteristics: