next up previous contents
Next: Improvements to RCLASS Up: Conclusions and Future Work Previous: Conclusion

Improvements to RGEN

The first paragraphs give a couple of suggestions to the traversal algorithm presented in Chapter gif. The remaining deal with more general aspects of data mining with Rough Sets.
Traversing Lattice in Opposite Order: The traversal in the lattice could be started at the bottom node instead of at the top (Chapter gif). As more attributes are taken into consideration while ``climbing'' the lattice, the rules generated cover fewer and fewer objects. A criteria for when to stop ``climbing'' the lattice could be when less objects than a certain threshold value are described by the rules at a node in the lattice.

Using this approach, one could avoid visiting many of the time consuming nodes (the more attributes involved in the discernibility matrix, the more time it takes to generate rules) towards the top of the lattice. Also, many of the specific rules found near the top is likely to be very specific for the training set, and thus less applicable for classifying unseen objects.
Traversing a Smaller Part of the Lattice: The number of nodes in the lattice is exponential with respect to the number of condition attributes in the data set. Only a computed subset of these are visited, but still, many nodes are used. If ways to exclude more nodes without diminishing the total quality of the rules generated could be found, this could greatly reduce the execution speed of RGEN.
Optimizing the code: The storage space for generated rules during execution sometimes gets very large. Rules from a node in the lattice are not used after all rules from this node has been calculated. Still, they are not removed from memory but occupy valuable memory. An attempt to fix this by calling the destructor for the rule list, did not free all memory occupied by this, presumably due to a bug in the implementation. If nothing gets done, this is a major obstacle against generating rules for anything but relatively small data sets.

Other areas of concern is mentioned in [Hju96].
Validity Measure On Blocks: As previously stated, a block is generated when an exception to a rule is found. A block is generated only when it is a 100% block. A validity measure on the blocks could perhaps give rise to more diverse classification methods than those of today.

To explain what is meant by a 100% block, consider a data set from which a rule saying that birds can fly is generated. Then again, in the data set there are 99 penguins. All of these are birds, and none of them can fly. A 100% block is created for this data set. Consider now the additional existence of a super-penguin with flying abilities in the data set. The 100% block would no longer be valid. With a validity measure, this block could still be represented as a 99% block.

In addition to the validity, a value giving the number of objects a block was created from could be helpful. A block created from 10 objects could be ``stronger'' in a sense than a block created from only one object.
Collecting Equal Objects in Classes: Nothing is done for handling multiple appearances of the same object in the data set. Not only does this lead to a slower rule generation, it also increases the degree of duplicate rules in the rule file output by RGEN.

In order to comply with this problem, objects could be given a significance measure. For instance, 3 equal objects could be represented as one with a significance measure of 3.
Using Genetic Algorithms to Calculate Reducts: Generation of reducts is one of the most time consuming part of generating rules. As of today this is done in an rather inefficient way. A faster method is the use of genetic algorithms for finding approximate reducts [Wro95]. Approximate reducts may in fact work equally well as mathematically correct reducts.
Discretization: When the data set to be processed contains real-valued data, the number of attribute values for each attribute is likely to be very high. Generating rules for this set would most likely lead to rules very specific to this dataset. When a similar data set is to be classified, the degree of attribute values not represented in the rules will be high, thus leading to a poor classification.

In order to cope with this problem, the interval domain can be partitioned, or discretized, to reduce the number of possible values. A method for interval partitioning of the condition attributes is presented in [LP]. The main goal of their approach is to minimize the number of decision rules under the assumption that the rules are deterministic. Another approach to discretization is described in [SN95]. Either of these methods could be tried implemented as a preprocessing routine before rules are generated using RGEN.
Joining of Attributes: The number of attributes in a data set strongly affects the computation time of rules. To reduce the number of attributes, two or more attributes could be joined into one. Consider for instance a data set with cars. For an object, the two attributes brand and model have the values Ford and Escort, respectively. These two could be joined into a new attribute brand_model with value Ford_Escort.
Binarization Technique: Symbolic learning systems, like RGEN, is usually not concerned with the order of discretized values. However, if it is of interest, a binarization technique can be applied to derive a new set of attributes from an original attribute. This is best explained with an example.

Attribute A1 has values ``low'', ``medium'' and ``high''. A set of new attributes can be constructed instead of the original attribute A1:

   has binary value 0/1 : īf  means ``low'', then  means
``not low''

has binary value 0/1 : if means ``low, medium'', then means

''not (low, medium)'', in other words ``high''

After the new set of attributes is constructed and replaced with the original attribute, the rule generation can be applied to the new table. Of course, one later needs to translate the binary values to the original value. The binarization method is sometimes used in the decision tree method.
Dynamically Changing Rules: It is a well known fact that databases are dynamic, new information is added and no longer relevant information is removed. Regeneration of the rule base for each alteration is an expensive, and in many cases impossible job.

What can be done to keep the rule base updated in a dynamic database? When a object is added or removed, the number of objects fitting a rule must be updated for every rule matching the object. In a similar manner the accuracy of each rule can be updated.

The problem of creating new rules and deleting old one which have become invalid from the rule base is a lot more complex, and will not be discussed here. In RGEN a new object might also lead to a different lattice traversal, with rules generated based on different attributes than earlier.
Dealing With Unknown Values: RGEN has no special functions to deal with unknown attribute values in the data set. By recognizing these and taking proper actions, better classification results could result. The problem can be approached in a number of ways. Some of these which could be investigated are the following:


next up previous contents
Next: Improvements to RCLASS Up: Conclusions and Future Work Previous: Conclusion

Helge Grenager Solheim
Sat May 4 03:30:02 MET DST 1996