Top 10 Data Mining Algorithms
1. C4.5 and Beyond
- Constructs a classifier in the form of a decision tree or a ruleset.
- Choosing a test attribute is commonly done using either information
gain or gain ratio (information gain divided by the split information,
i.e., the entropy of the attribute's own value distribution); see the
first sketch after this list.
- Attributes can be numeric or nominal.
- Trees are pruned bottom-up to avoid overfitting: a subtree is replaced
by a leaf when a pessimistic error estimate, based on an upper confidence
limit of the binomial distribution, does not increase.
- Missing values can be distributed probabilistically.
- Rulesets are constructed from the unpruned tree using a
hill-climbing algorithm that drops conditions from each rule as long
as the pessimistic error rate decreases. The pessimistic error is an
upper confidence bound on the true error, derived from the observed
rate E / N, where E of the N training cases covered by the rule do not
belong to the rule's class (see the second sketch after this list).
Subsets of simplified rules for each class are then formed and ordered.
- Ruleset construction takes considerably more time than building the tree.
- C5.0 (1997) is a commercial successor that adds boosting,
new data types (e.g., dates and "not applicable" values),
variable misclassification costs, unordered rulesets (all applicable
rules are found and vote), and better scalability via multi-threading.
- Research issues: Can stable trees be constructed? Can a complex
tree be decomposed?
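
A minimal Python sketch of the gain-ratio computation (the helper names
and the toy data are mine, for illustration; C4.5 itself is written in C):

    import math
    from collections import Counter

    def entropy(items):
        # Shannon entropy (in bits) of a sequence of discrete values.
        n = len(items)
        return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

    def gain_ratio(attr_values, labels):
        # Information gain of splitting `labels` on the attribute, divided
        # by the split information (entropy of the attribute's own values).
        n = len(labels)
        subsets = {}
        for v, y in zip(attr_values, labels):
            subsets.setdefault(v, []).append(y)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        gain = entropy(labels) - remainder
        split_info = entropy(attr_values)
        return gain / split_info if split_info > 0 else 0.0

    # Toy example: how well does "outlook" predict "play"?
    outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
    play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
    print(gain_ratio(outlook, play))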
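The pessimistic error estimate can be illustrated the same way. C4.5
uses its own binomial approximation with a default 25% confidence level;
the exact Clopper-Pearson bound below (via SciPy) is a stand-in that
captures the idea of inflating the observed error E / N:

    from scipy.stats import beta

    def pessimistic_error(E, N, cf=0.25):
        # Upper confidence bound on the true error rate given E observed
        # errors among N covered cases (Clopper-Pearson form). C4.5's own
        # formula is an approximation of this bound.
        if E >= N:
            return 1.0
        return beta.ppf(1.0 - cf, E + 1, N - E)

    # A rule covering 12 cases with 1 error: the observed rate is
    # 1/12 = 0.083, but the estimate used for pruning is noticeably higher.
    print(pessimistic_error(1, 12))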
2. K-Means Algorithm
- Finds k clusters.
- Algorithm: (1) choose k initial points as centroids,
(2) assign each data point to its nearest centroid,
(3) recalculate each centroid as the mean of its assigned points,
(4) repeat from step 2 until the centroids no longer change
(see the sketch after this list).
- To assign a data point to a centroid, Euclidean distance is typically used.
- Convergence to a local optimum is guaranteed.
- Doesn't work well when the data are not well described by reasonably
separated, roughly spherical clusters.
- What should k be?
- Sometimes preprocessing (to remove outliers) or postprocessing
(to merge nearby clusters) is useful.
- It is the most widely used partitional clustering algorithm. It
is simple, easily understood, reasonably scalable and can deal with
streaming data.
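
A minimal NumPy sketch of the loop above. The random initialization and
the toy data are placeholders; real implementations add smarter seeding
(e.g., k-means++) and run multiple restarts:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # (1) choose k initial points as centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # (2) assign each point to its nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # (3) recalculate each centroid as the mean of its cluster
            new_centroids = centroids.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    new_centroids[j] = members.mean(axis=0)
            # (4) stop once the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    # Two well-separated blobs around (0, 0) and (5, 5)
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    centroids, labels = kmeans(X, k=2)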
3. Support Vector Machines (SVM)
- Advantages: robust, accurate, has a sound theoretical basis,
requires few training examples, and is insensitive to the number of
dimensions.
- In two-class learning, an SVM finds the hyperplane that maximizes the
margin, i.e., the distance to the nearest training points of either class.
- Question 1: Can we understand the meaning of the SVM through
a solid theoretical foundation?
Yes, it relates to the VC dimension.
- Question 2: Can we extend the SVM formulation to handle cases
where errors exist and the best hyperplane must admit some
errors in the training data? Yes: introduce slack variables
that allow some misclassification, penalized by a cost parameter C
(see the first sketch after this list).
- Question 3: Can we extend the SVM formulation so that it works in
situations where the training data are not linearly separable?
Yes: use a kernel function to implicitly map the data into a
higher-dimensional space where a separating hyperplane exists.
- Question 4: Can we extend the SVM formulation so that the task is to predict
numerical values, or to rank instances by the likelihood of being
a positive class member, rather than to classify?
Yes: use SVR (support vector regression), which ignores any difference
between the actual and predicted values smaller than a tolerance ε
(see the second sketch after this list).
- Question 5: Can we scale up the algorithm for finding the maximum
margin hyperplanes to thousands and millions of instances?
Yes: break the large optimization problem into a series of smaller
ones, e.g., via decomposition methods such as SMO (sequential minimal
optimization).
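
To make Questions 2 and 3 concrete, here is a short sketch using
scikit-learn's SVM implementation (my choice of library, not something
the formulation prescribes). The cost parameter C prices the slack
variables, and the RBF kernel supplies the implicit mapping to a
higher-dimensional space:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Concentric circles: not linearly separable in the original 2-D space.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

    # C penalizes slack/misclassification (Question 2); kernel="rbf"
    # handles the nonlinear boundary (Question 3).
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))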
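And for Question 4, scikit-learn's SVR exposes the tolerance ε directly:
any prediction within ε of the target incurs no loss (the ε-insensitive
loss). The toy sine-curve data below is illustrative only:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

    # Differences smaller than epsilon between actual and predicted
    # values are ignored by the loss.
    reg = SVR(kernel="rbf", epsilon=0.1, C=1.0)
    reg.fit(X, y)
    print(reg.predict([[2.5]]))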