
Experimenting with RGEN and RCLASS

 

Learning from examples (LFE) algorithms are commonly compared by testing them on real-world data sets. A number of data sets are freely available, and some are more popular than others for testing. In particular, the data sets stored in the UCI (University of California, Irvine) repository of machine learning databases [MM96] have been used in many experiments and have more or less set a ``standard'' for data sets. The repository makes it easy for researchers working with LFE algorithms to compare new algorithms with previous work.

However, as Steven S. Salzberg points out in [Sal96], comparative studies of classification algorithms can easily lead to statistically invalid conclusions unless they are done very carefully. He describes several pitfalls that arise when many different experiments are run on a moderate number of data sets. The repository holds about 140 data sets with varying characteristics. Of these, only a few may be applicable to experiments with the system under test.

Another way to test LFE algorithms is to use well-understood synthetic data sets. Because the concept underlying a synthetic data set is known, classification accuracy can be assessed more precisely. The SCDS (Synthetic Classification Data Sets) program [Mel96] was created to generate such data sets. A wide range of parameters can be given to produce tailor-made data sets.
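The idea of testing against a synthetic data set can be illustrated with a minimal sketch. It does not reproduce the SCDS program or its parameters; the concept (class 1 when x + y > 1), the noise parameter, and all function names are assumptions made purely for illustration. Because the generating concept is known, a classifier can be scored against the true labels rather than the noisy ones.

```python
import random

def make_synthetic(n, noise=0.0, seed=0):
    """Generate n examples with two numeric attributes in [0, 1).

    Hypothetical true concept: class 1 if x + y > 1, otherwise class 0.
    With probability `noise` the observed label is flipped.
    Each example is (x, y, true_label, observed_label).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        true_label = 1 if x + y > 1.0 else 0
        flipped = rng.random() < noise
        observed = 1 - true_label if flipped else true_label
        data.append((x, y, true_label, observed))
    return data

def accuracy(classifier, data):
    """Fraction of examples on which the classifier matches the true concept."""
    correct = sum(1 for x, y, true_label, _ in data
                  if classifier(x, y) == true_label)
    return correct / len(data)

def threshold_classifier(x, y):
    # Deliberately imperfect stand-in for a learned classifier: it ignores y.
    return 1 if x > 0.5 else 0

if __name__ == "__main__":
    data = make_synthetic(1000, noise=0.1, seed=42)
    print("accuracy against the true concept:",
          accuracy(threshold_classifier, data))
```

With a real-world data set only the observed, possibly noisy, labels are available, so the true error of a classifier can only be estimated; here the generator keeps both labels, which is what makes the more precise assessment possible.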

A combination of data from the UCI repository and data generated with the SCDS program was used during the experiments.





Helge Grenager Solheim
Sat May 4 03:30:02 MET DST 1996