The first paragraphs give a couple of suggestions to the
traversal algorithm presented in Chapter
.
The remaining deal with more general aspects of data
mining with Rough Sets.
Traversing Lattice in Opposite Order:
The traversal in the lattice could be started at the bottom node
instead of at the top (Chapter
). As more
attributes are taken into consideration while ``climbing'' the
lattice, the rules generated cover fewer and fewer objects. A criteria
for when to stop ``climbing'' the lattice could be when less objects
than a certain threshold value are described by the rules at a node in
the lattice.
Using this approach, one could avoid visiting many of the time
consuming nodes (the more attributes involved in the discernibility
matrix, the more time it takes to generate rules) towards the top
of the lattice. Also, many of the specific rules found near the top is
likely to be very specific for the training set, and thus
less applicable for classifying unseen objects.
Traversing a Smaller Part of the Lattice:
The number of nodes in the lattice is exponential with respect to the
number of condition attributes in the data set. Only a computed subset of
these are visited, but still, many nodes are used. If ways to exclude
more nodes without diminishing the total quality of the rules
generated could be found, this could greatly reduce the execution
speed of RGEN.
Optimizing the code:
The storage space for generated rules during execution sometimes gets
very large. Rules from a node in the lattice are not used after all rules
from this node has been calculated. Still, they are not removed from memory
but occupy valuable memory. An attempt to fix this by calling the
destructor for the rule list, did not free all
memory occupied by this, presumably due to a bug in the
implementation. If nothing gets done, this is a major obstacle against
generating rules for anything but relatively small data sets.
Other areas of concern is mentioned in [Hju96].
Validity Measure On Blocks:
As previously stated, a block is generated when an exception to a
rule is found. A block is generated only when it is a 100% block. A
validity measure on the blocks could perhaps give rise to more diverse
classification methods than those of today.
To explain what is meant by a 100% block, consider a data set from which a rule saying that birds can fly is generated. Then again, in the data set there are 99 penguins. All of these are birds, and none of them can fly. A 100% block is created for this data set. Consider now the additional existence of a super-penguin with flying abilities in the data set. The 100% block would no longer be valid. With a validity measure, this block could still be represented as a 99% block.
In addition to the validity, a value giving the number of objects
a block was created from could be helpful. A block created from 10
objects could be ``stronger'' in a sense than a block
created from only one object.
Collecting Equal Objects in Classes:
Nothing is done for handling multiple appearances of the same object in
the data set. Not only does this lead to a slower rule generation, it
also increases the degree of duplicate rules in the rule file
output by RGEN.
In order to comply with this problem, objects could be given a
significance measure. For instance, 3 equal objects could be
represented as one with a significance measure of 3.
Using Genetic Algorithms to Calculate Reducts:
Generation of reducts is one of the most time consuming part of
generating rules. As of today this is done in an
rather inefficient way. A faster method is the use of genetic
algorithms for finding approximate reducts [Wro95]. Approximate
reducts may in fact work equally well as mathematically correct reducts.
Discretization:
When the data set to be processed contains real-valued data, the
number of attribute values for each attribute is likely to be very
high. Generating rules for this set would most likely lead to rules very
specific to this dataset. When a similar data set is to be classified,
the degree of attribute values not represented in the rules will be
high, thus leading to a poor classification.
In order to cope with this problem, the interval domain can be
partitioned, or discretized, to reduce the number of possible values.
A method for interval partitioning of the condition attributes
is presented in [LP]. The main goal of their approach is to
minimize the number of decision rules under the
assumption that the rules are deterministic. Another approach to
discretization is
described in [SN95]. Either of these
methods could be tried implemented as a preprocessing routine before rules are
generated using RGEN.
Joining of Attributes:
The number of attributes in a data set strongly affects the
computation time of rules. To reduce the number of attributes, two or
more attributes could be joined into one. Consider for
instance a data set with cars.
For an object, the two attributes brand and model have the
values Ford and Escort, respectively. These two could be
joined into a new attribute brand_model with value
Ford_Escort.
Binarization Technique:
Symbolic learning systems, like RGEN, is usually not concerned with
the order of discretized values. However, if it is of interest, a
binarization technique can be applied to derive a new set of
attributes from an original attribute. This is best explained with an example.
Attribute A1 has values ``low'', ``medium'' and ``high''. A set of new attributes can be constructed instead of the original attribute A1:
has binary value 0/1 : īf means ``low'', then means ``not low''has binary value 0/1 : if means ``low, medium'', then means
''not (low, medium)'', in other words ``high''
After the new set of attributes is constructed and replaced with the original
attribute, the rule generation can be applied to the new table. Of
course, one later needs to translate the binary values to the original value.
The binarization method is sometimes used in the decision tree method.
Dynamically Changing Rules:
It is a well known fact that databases are dynamic, new
information is added and no longer relevant information is
removed. Regeneration of the rule base for each alteration is an
expensive, and in many cases impossible job.
What can be done to keep the rule base updated in a dynamic database? When a object is added or removed, the number of objects fitting a rule must be updated for every rule matching the object. In a similar manner the accuracy of each rule can be updated.
The problem of creating new rules and deleting old one which have
become invalid from the rule base is a lot more complex, and will not
be discussed here. In RGEN a new object might also lead to a
different lattice traversal, with rules generated based on different
attributes than earlier.
Dealing With Unknown Values:
RGEN has no special functions to deal with unknown attribute values
in the data set. By recognizing these and taking proper actions,
better classification results could result.
The problem can be
approached in a number of ways. Some of these which could be
investigated are the following: