Top 10 Data Mining Algorithms
1. C4.5 and Beyond
- Constructs a classifier in the form of a decision tree or a ruleset.
- Choosing a test attribute is commonly done using either information
gain or gain ratio (information gain divided by the split information,
i.e., the entropy of the attribute's own value distribution); see the
first sketch after this list.
- Attributes can be numeric or nominal.
- Trees are pruned bottom-up to avoid overfitting: a subtree is replaced
by a leaf when a pessimistic error estimate, based on an upper confidence
limit of the binomial distribution, does not increase.
- Missing values can be distributed probabilistically.
- Rulesets are constructed from the unpruned tree using a
hill-climbing algorithm that drops conditions from each rule as long
as the pessimistic error rate decreases. The pessimistic error is an
upper confidence bound on the true error, derived from the observed
rate E / N, where E of the N training cases covered by the rule do not
belong to the rule's class (see the second sketch after this list).
Subsets of simplified rules for each class are then formed and ordered.
- Ruleset construction takes considerably more time than building the tree.
- C5.0 (1997) is a commercial successor that adds boosting,
new data types (e.g., dates and "not applicable" values),
variable misclassification costs, unordered rulesets (all applicable
rules are found and vote), and better scalability via multi-threading.
- Research issues: Can stable trees be constructed? Can a complex
tree be decomposed?
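
A minimal Python sketch of the gain-ratio computation (the helper names
and the toy data are mine, for illustration; C4.5 itself is written in C):

    import math
    from collections import Counter

    def entropy(items):
        # Shannon entropy (in bits) of a sequence of discrete values.
        n = len(items)
        return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

    def gain_ratio(attr_values, labels):
        # Information gain of splitting `labels` on the attribute, divided
        # by the split information (entropy of the attribute's own values).
        n = len(labels)
        subsets = {}
        for v, y in zip(attr_values, labels):
            subsets.setdefault(v, []).append(y)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        gain = entropy(labels) - remainder
        split_info = entropy(attr_values)
        return gain / split_info if split_info > 0 else 0.0

    # Toy example: how well does "outlook" predict "play"?
    outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
    play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
    print(gain_ratio(outlook, play))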
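The pessimistic error estimate can be illustrated the same way. C4.5
uses its own binomial approximation with a default 25% confidence level;
the exact Clopper-Pearson bound below (via SciPy) is a stand-in that
captures the idea of inflating the observed error E / N:

    from scipy.stats import beta

    def pessimistic_error(E, N, cf=0.25):
        # Upper confidence bound on the true error rate given E observed
        # errors among N covered cases (Clopper-Pearson form). C4.5's own
        # formula is an approximation of this bound.
        if E >= N:
            return 1.0
        return beta.ppf(1.0 - cf, E + 1, N - E)

    # A rule covering 12 cases with 1 error: the observed rate is
    # 1/12 = 0.083, but the estimate used for pruning is noticeably higher.
    print(pessimistic_error(1, 12))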
2. K-Means Algorithm
- Finds k clusters.
- Algorithm: (1) choose k initial points as centroids,
(2) assign each data point to its nearest centroid,
(3) recalculate each centroid as the mean of its assigned points,
(4) repeat from step 2 until the centroids no longer change
(see the sketch after this list).
- To assign a data point to a centroid, Euclidean distance is typically used.
- Convergence to a local optimum is guaranteed.
- Doesn't work well when the data are not well described by reasonably
separated, roughly spherical clusters.
- What should k be?
- Sometimes preprocessing (to remove outliers) or postprocessing
(to merge nearby clusters) is useful.
- It is the most widely used partitional clustering algorithm. It
is simple, easily understood, reasonably scalable and can deal with
streaming data.
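
A minimal NumPy sketch of the loop above. The random initialization and
the toy data are placeholders; real implementations add smarter seeding
(e.g., k-means++) and run multiple restarts:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # (1) choose k initial points as centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # (2) assign each point to its nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # (3) recalculate each centroid as the mean of its cluster
            new_centroids = centroids.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    new_centroids[j] = members.mean(axis=0)
            # (4) stop once the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    # Two well-separated blobs around (0, 0) and (5, 5)
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    centroids, labels = kmeans(X, k=2)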
3. Support Vector Machines (SVM)
- Advantages: robust, accurate, has a sound theoretical basis,
requires few training examples, and is insensitive to the number of
dimensions.
- In two-class learning, an SVM finds the hyperplane that maximizes the
margin, i.e., the distance to the nearest training points of either class.
- Question 1: Can we understand the meaning of the SVM through
a solid theoretical foundation?
Yes, it relates to the VC dimension.
- Question 2: Can we extend the SVM formulation to handle cases
where errors exist and the best hyperplane must admit some
errors in the training data? Yes: introduce slack variables
that allow some misclassification, penalized by a cost parameter C
(see the first sketch after this list).
- Question 3: Can we extend the SVM formulation so that it works in
situations where the training data are not linearly separable?
Yes: use a kernel function to implicitly map the data into a
higher-dimensional space where a separating hyperplane exists.
- Question 4: Can we extend the SVM formulation so that the task is to predict
numerical values, or to rank instances by the likelihood of being
a positive class member, rather than to classify?
Yes: use SVR (support vector regression), which ignores any difference
between the actual and predicted values smaller than a tolerance ε
(see the second sketch after this list).
- Question 5: Can we scale up the algorithm for finding the maximum
margin hyperplanes to thousands and millions of instances?
Yes: break the large optimization problem into a series of smaller
ones, e.g., via decomposition methods such as SMO (sequential minimal
optimization).
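
To make Questions 2 and 3 concrete, here is a short sketch using
scikit-learn's SVM implementation (my choice of library, not something
the formulation prescribes). The cost parameter C prices the slack
variables, and the RBF kernel supplies the implicit mapping to a
higher-dimensional space:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Concentric circles: not linearly separable in the original 2-D space.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

    # C penalizes slack/misclassification (Question 2); kernel="rbf"
    # handles the nonlinear boundary (Question 3).
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))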
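And for Question 4, scikit-learn's SVR exposes the tolerance ε directly:
any prediction within ε of the target incurs no loss (the ε-insensitive
loss). The toy sine-curve data below is illustrative only:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

    # Differences smaller than epsilon between actual and predicted
    # values are ignored by the loss.
    reg = SVR(kernel="rbf", epsilon=0.1, C=1.0)
    reg.fit(X, y)
    print(reg.predict([[2.5]]))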