
Fundamental concepts covered in this module:
Exemplary techniques:
After completing this module, students should be able to:
In the following pages we will see different ways to use mathematical functions to achieve classification.
A quick review of classification trees:
Decision boundaries are perpendicular to the axes. Is there another way we can classify this data?
By drawing a line that is still straight, but not perpendicular to an axis, we can divide the space more effectively:
The new decision boundary is essentially a weighted sum of the values of the various attributes:
Most function-fitting techniques are based on this one model: it takes multiple attributes into account through a single mathematical formula, known as a parameterized model. The general formula is: \(f(x) = w_0 + w_1x_1 + w_2x_2 + \ldots\), where \(w_0\) is the intercept, the \(w_i\) are the weights (parameters), and the \(x_i\) are the individual component features.
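As a minimal sketch of this formula in R (the weight and attribute values below are made up purely for illustration):

```r
# Hypothetical weights, for illustration only
w0 <- -1.2   # intercept
w1 <-  0.8   # weight on attribute x1
w2 <-  0.5   # weight on attribute x2

# One example's attribute values
x1 <- 2.0
x2 <- 3.5

# f(x) = w0 + w1*x1 + w2*x2
f_x <- w0 + w1 * x1 + w2 * x2
f_x  # the sign (or magnitude) of f(x) drives the classification decision
```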
Depending on the value of the weights/parameters, however, the results (below) can be quite different:
The task of the data mining procedure is to “fit” the model to the data by finding the best set of parameters, in some sense of “best.”
How can we choose a suitable objective function?
First step: Determine the goal or objective of choosing parameters – this helps determine the best objective function.
Objective Function examples:
Then, find the optimal values of the weights by maximizing or minimizing the objective function.
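A minimal sketch of this fitting step in R, using base R's optim() to minimize a squared-error objective on a small synthetic dataset (all values below are made up for illustration):

```r
# Synthetic data: the "true" weights are 1, 2, -3
set.seed(1)
x1 <- runif(50)
x2 <- runif(50)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(50, sd = 0.1)

# Objective function: squared error of f(x) = w0 + w1*x1 + w2*x2
squared_error <- function(w) sum((y - (w[1] + w[2] * x1 + w[3] * x2))^2)

# Find the weights that minimize the objective
fit <- optim(par = c(0, 0, 0), fn = squared_error)
fit$par  # estimated w0, w1, w2 should be close to 1, 2, -3
```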
Iris Example
Below is a classification example using the Iris Dataset (available as data(iris) in R).
Mining a linear discriminant from data. Two different objectives are shown. Note how they divide and classify data differently.
The goal is to maximize the margin, i.e., the space between the two dashed lines. The linear discriminant (solid line below) is the center line between the two dashed lines.
If data cannot be linearly separated, the best fit is a balance between a fat margin and a low total error penalty.
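A minimal sketch of mining a linear discriminant from the iris data under two different objectives, assuming the e1071 package for the support vector machine (the particular pair of species and features below is an illustrative choice, not necessarily the one shown in the figures):

```r
library(e1071)

data(iris)
# Keep two species so the problem is binary, and two features
# so the boundary could be drawn in a plane
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Support vector machine with a linear kernel: fits by maximizing the margin
svm_fit <- svm(Species ~ Petal.Length + Sepal.Width, data = iris2,
               kernel = "linear")

# Logistic regression: a different objective over the same linear model
glm_fit <- glm(Species ~ Petal.Length + Sepal.Width, data = iris2,
               family = binomial)

coef(glm_fit)  # the learned weights w0, w1, w2
```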
Hinge Loss: incurs no penalty for an example that is not on the wrong side of the margin. The hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin. Loss then increases linearly with the example’s distance from the margin, thereby penalizing points more the farther they are from the separating boundary.
Zero-one loss: assigns a loss of zero for a correct decision and one for an incorrect decision.
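Both losses can be written as one-line R functions (here y is the true class coded as +1/-1 and f is the model output f(x)):

```r
# Hinge loss: zero when y*f(x) >= 1 (beyond the margin on the correct side),
# then grows linearly with the distance from the margin
hinge_loss <- function(y, f) pmax(0, 1 - y * f)

# Zero-one loss: 0 for a correct decision, 1 for an incorrect one
zero_one_loss <- function(y, f) as.numeric(y * f <= 0)

hinge_loss(c(1, 1, -1), c(2.0, 0.3, 0.5))     # 0.0 0.7 1.5
zero_one_loss(c(1, 1, -1), c(2.0, 0.3, 0.5))  # 0 0 1
```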
Objective Functions for Linear Models:
Absolute Error method:
Squared Error method:
Both have pros and cons; which to use depends on the business application.
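A sketch of the two error objectives as R functions (y is the true value, f the model's prediction; the numbers in the example calls are made up):

```r
# Absolute error: penalizes errors in proportion to their size
absolute_error <- function(y, f) sum(abs(y - f))

# Squared error: penalizes large errors much more heavily than small ones
squared_error  <- function(y, f) sum((y - f)^2)

y_true <- c(1, 0, 1, 1)
y_hat  <- c(0.9, 0.2, 0.4, 0.95)
absolute_error(y_true, y_hat)  # 0.95
squared_error(y_true, y_hat)   # 0.4125
```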
Logistic regression uses a linear model to estimate the probability of an event occurring (e.g., the probability of bank fraud).
To fit a linear model, the quantity being modeled (the distance from the decision boundary) must be able to range from \(-\infty\) to \(\infty\), but probabilities only range between zero and one.
The problem can be solved by modeling the logarithm of the odds (the log-odds): for any odds value between zero and \(\infty\), the log-odds lies between \(-\infty\) and \(\infty\).
See below:
Log Odds Linear Function:
\(\log\left(\frac{p_{+}(x)}{1-p_{+}(x)}\right) = f(x) = w_0 + w_1x_1 + w_2x_2 + \ldots\)
Where \(p_{+}(x)\) is the estimated probability of a particular event occurring.
The Logistic Function:
\(p_{+}(x) = \frac{1}{1+e^{-f(x)}}\)
Graph of a Logistic Function:
Note: the further an example is from the decision boundary (where \(f(x)=0\)), the more certain the model is about its class.
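The curve can be reproduced with a few lines of base R; plogis() is R's built-in logistic function, i.e., \(1/(1+e^{-f(x)})\):

```r
# Plot p+(x) = 1 / (1 + exp(-f(x))) over a range of f(x) values
fx <- seq(-6, 6, by = 0.1)
plot(fx, plogis(fx), type = "l",
     xlab = "f(x): distance from the decision boundary",
     ylab = "estimated probability p+(x)")
abline(v = 0, lty = 2)  # the decision boundary, where p+(x) = 0.5
```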
Logistic regression vs. Tree Induction
Models will likely have different levels of accuracy
It is difficult to determine which method is more effective at the beginning of the experiment
The data science team doesn't always have the ultimate say in which model is used; stakeholders must agree
Learned by Logistic Regression:
Classification Tree:
Results:
Linear equation learned by logistic regression: 98.9% accuracy, only 6 mistakes total
Classification tree model: 99.1% accuracy, one mistake out of 569 samples
Which is better?
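As a sketch of how such a comparison might be run in R (illustrated on two species of the iris data rather than the dataset behind the accuracy figures above; assumes the rpart package):

```r
library(rpart)

data(iris)
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Logistic regression (linear model)
glm_fit  <- glm(Species ~ Petal.Length + Petal.Width, data = iris2,
                family = binomial)
glm_pred <- ifelse(predict(glm_fit, type = "response") > 0.5,
                   levels(iris2$Species)[2], levels(iris2$Species)[1])

# Classification tree
tree_fit  <- rpart(Species ~ Petal.Length + Petal.Width, data = iris2,
                   method = "class")
tree_pred <- predict(tree_fit, type = "class")

# Training-set accuracy of each model (in practice, use holdout data
# or cross-validation rather than the training set)
mean(glm_pred == iris2$Species)
mean(tree_pred == iris2$Species)
```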
Nonlinear Functions: The two most common families of techniques that are based on fitting the parameters of complex, nonlinear functions are nonlinear support-vector machines and neural networks.
Nonlinear support vector machines are essentially a systematic way of adding more complex terms (e.g., Sepal width\(^2\)) and fitting a linear function to them.
Neural networks can be thought of as a “stack” of models. We could think of this very roughly as first creating a set of “experts” in different facets of the problem (the first-layer models), and then learning how to weight the opinions of these different experts (the second-layer model).
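A rough sketch of this idea with the nnet package, where a small hidden layer plays the role of the first-layer models and the output layer weights their outputs:

```r
library(nnet)

data(iris)
set.seed(42)
# 3 hidden units act as the first-layer "experts"; the output layer
# learns how to weight their opinions
nn_fit <- nnet(Species ~ ., data = iris, size = 3, trace = FALSE)
mean(predict(nn_fit, type = "class") == iris$Species)  # training accuracy
```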
UCI Iris Example, revisited with Sepal width\(^2\)

The Iris dataset with a nonlinear feature. In this figure, logistic regression and support vector machine—both linear models—are provided an additional feature, Sepal width\(^2\), which allows both the freedom to create more complex, nonlinear models (boundaries), as shown.
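A sketch of how the extra feature might be added in R (again assuming the e1071 package; the exact feature set in the figure is an assumption):

```r
library(e1071)

data(iris)
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Add the engineered nonlinear feature: Sepal width squared
iris2$Sepal.Width.Sq <- iris2$Sepal.Width^2

# Both models are still linear in their inputs, but the squared term lets
# the learned boundary be curved in the original feature space
glm_nl <- glm(Species ~ Petal.Length + Sepal.Width + Sepal.Width.Sq,
              data = iris2, family = binomial)
svm_nl <- svm(Species ~ Petal.Length + Sepal.Width + Sepal.Width.Sq,
              data = iris2, kernel = "linear")
```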
Note on Nonlinear Functions:
Be careful! As we increase the flexibility of the model to fit the data, we increase the risk of fitting the data too well. The concern is that the model will fit the training data well but not generalize to new data drawn from the same population or application.
In this chapter we: