In practice, the training set is often not an ideal one, for
several reasons. These are the following:
Non-Representative Data:
If the data in the training set is not representative of the objects
in the domain, we have a problem. If rules for diagnosing patients are
being created and only elderly people are registered in the training
set, the result of diagnosing a child based on these data will probably
not be good. Even though this may have serious consequences, we
would say that non-representative data is mainly a problem in machine
learning when the learning is based on few examples. When large data
sets are used, the rules created are probably representative, as long
as the data being classified belongs to the same domain as the data in
the training set.
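
The patient example can be illustrated with a simple coverage check. The
following is only a minimal sketch: the function names, the attribute
name "age", and the patient records are hypothetical and made up for
illustration. It tests whether a new object's attribute value lies
inside the range observed in the training set, which is one crude
signal that the training data may not be representative of that object.

```python
def attribute_range(training_set, attribute):
    """Return the (min, max) of a numeric attribute in the training set."""
    values = [row[attribute] for row in training_set]
    return min(values), max(values)


def is_covered(training_set, attribute, value):
    """True if value lies within the range seen during training."""
    low, high = attribute_range(training_set, attribute)
    return low <= value <= high


# Hypothetical training set: only elderly patients are registered.
patients = [
    {"age": 67, "diagnosis": "flu"},
    {"age": 72, "diagnosis": "cold"},
    {"age": 81, "diagnosis": "flu"},
]

elderly_covered = is_covered(patients, "age", 75)  # inside the observed range
child_covered = is_covered(patients, "age", 8)     # far outside the observed range
```

A range check of this kind says nothing about how densely the range is
populated, so it can only flag obvious mismatches such as the child in
the example.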
No Boundary Cases:
To find the real differences between two classes, some boundary cases
should be present. If, for instance, a data mining system is to classify
animals, the property that characterizes a bird might be that it has
wings, not that it can fly. This kind of fine distinction is only
possible if, for example, penguins are registered.
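
The effect of a boundary case can be sketched in a few lines. The
example below is an illustration under assumed data: the attribute
names "has_wings" and "can_fly", the animal records, and the helper
perfect_predictors are all hypothetical. The helper lists the attributes
whose value alone determines the class in the training set. Without a
penguin, both attributes look equally good; adding the penguin rules out
"can_fly".

```python
def perfect_predictors(training_set, attributes, decision):
    """Return the attributes whose value alone determines the decision."""
    result = []
    for attribute in attributes:
        seen = {}  # attribute value -> decision first observed with it
        consistent = True
        for row in training_set:
            first_decision = seen.setdefault(row[attribute], row[decision])
            if first_decision != row[decision]:
                consistent = False
                break
        if consistent:
            result.append(attribute)
    return result


# Hypothetical training set without any boundary case.
animals = [
    {"has_wings": True, "can_fly": True, "is_bird": True},     # sparrow
    {"has_wings": True, "can_fly": True, "is_bird": True},     # eagle
    {"has_wings": False, "can_fly": False, "is_bird": False},  # dog
]
# Boundary case: a bird that has wings but cannot fly.
penguin = {"has_wings": True, "can_fly": False, "is_bird": True}

attributes = ["has_wings", "can_fly"]
without_penguin = perfect_predictors(animals, attributes, "is_bird")
with_penguin = perfect_predictors(animals + [penguin], attributes, "is_bird")
```

Here without_penguin contains both attributes, while with_penguin
contains only "has_wings".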
Limited Information:
To assign an object to a specific class, a set of condition
attributes is examined. Sometimes two objects with the same
values for all condition attributes have different classifications. In
that case, the objects differ in some properties that are not among the
attributes in the training set but still make a difference. This is a
problem for the system, which has no way of distinguishing these two
types of objects.
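
Such conflicting objects can be detected by grouping the training set on
its condition attributes. The following is a minimal sketch, not a
definitive implementation: the function name find_inconsistent_groups,
the attribute names, and the records are assumptions made for
illustration. It returns every combination of condition values that
occurs with more than one classification.

```python
from collections import defaultdict


def find_inconsistent_groups(training_set, condition_attributes, decision):
    """Map each conflicting tuple of condition values to its set of decisions."""
    groups = defaultdict(set)
    for row in training_set:
        key = tuple(row[a] for a in condition_attributes)
        groups[key].add(row[decision])
    # Keep only the groups that carry more than one classification.
    return {key: decisions for key, decisions in groups.items()
            if len(decisions) > 1}


# Hypothetical records: the first two agree on every condition attribute
# but are classified differently.
records = [
    {"fever": True, "cough": True, "diagnosis": "flu"},
    {"fever": True, "cough": True, "diagnosis": "cold"},
    {"fever": False, "cough": True, "diagnosis": "cold"},
]

conflicts = find_inconsistent_groups(records, ["fever", "cough"], "diagnosis")
```

In this example, conflicts has a single entry for the condition values
(True, True), with the decisions "flu" and "cold". Telling these two
objects apart would require an attribute that is not in the training
set.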