
Fundamental concepts covered in this module: identifying informative attributes; segmenting data by progressive attribute selection.
Exemplary techniques: finding correlations; attribute/variable selection; tree induction.
After completing this module, students should be able to explain entropy and information gain, use them to select informative attributes, and build a basic classification tree.
Building on the churn example, let’s think of predictive modeling as supervised segmentation.
Question we want to ask: How can we segment the population into groups that differ from each other with respect to some quantity of interest?
The target could be something we would like to avoid (e.g., churn, write-offs, turbine failure) or something we would like to see (e.g., responding to an offer).
Key (part 1): is there a specific, quantifiable target that we are interested in or trying to predict?
Examples: Will this customer churn next quarter? Will this account be written off? Will this prospect respond to the offer?
Key (part 2): do we have data on this target?
Supervised data mining requires both parts 1 & 2
We don’t need the exact data (e.g., whether the current customers will leave), BUT we do need data on the same or a related phenomenon (e.g., from last quarter’s customers)
We will then use these data to build a model to predict the phenomenon of interest.
Question to ask: Is the phenomenon of “who left the company” last quarter the same as the phenomenon of “who will leave the company” next quarter?
Question to ask: Who might buy this completely new product I have never sold before?
Key (part 3): the result of supervised data mining is a MODEL that, given data, predicts some quantity
The model could be a simple rule:
IF (income < 50K) THEN no Life Insurance
ELSE Life Insurance
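As an illustration, the same rule written as a Python function (a minimal sketch; the function name and labels are ours):

def life_insurance_offer(income):
    # Threshold taken from the rule above: income < 50K means no Life Insurance
    return "no Life Insurance" if income < 50_000 else "Life Insurance"

print(life_insurance_offer(42_000))  # no Life Insurance
print(life_insurance_offer(85_000))  # Life Insurance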
Terminology for the supervised classification problem: the problem is supervised because it has a target attribute and some training data; classification because the target is a category (yes/no) rather than a numeric quantity.
Within supervised learning: classification vs. regression
The difference is the type of target available:
Classification \(\Rightarrow\) categorical target (in historical data)
Regression \(\Rightarrow\) numerical target
Supervised Segmentation
Intuition: How can we segment the population into subgroups that have different values of the target variable (and similar values within each group)?
Problem: How can we judge whether a variable contains important information about the target variable? How much?
Selecting Informative Attributes
Which attributes are best suited to segment the data according to the target values? In the people example, would “round body” be a good predictor of the target attribute? Why or why not? Is it sufficient?
What does the data look like?
Target: Two classes (yes & no)
Predictor attributes: head (round, square); body-type (rectangular, oval); body-color (white, grey)
Which of the attributes would be best to segment these people into groups?
Challenges in Attribute Selection
Attributes rarely split a set perfectly, so we need a measure of how impure a segment is with respect to the target. Entropy measures this disorder: for a set whose members fall into classes with proportions \(p_1, \ldots, p_n\),
\(\text{entropy} = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \ldots - p_n \log_2(p_n)\)
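A minimal Python sketch of this formula (the function name is ours; classes with \(p_i = 0\) are skipped, since \(p \log_2 p \to 0\)):

import math

def entropy(proportions):
    # proportions: class proportions p_1..p_n of a set, summing to 1
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: a maximally impure two-class set
print(entropy([1.0]))       # 0.0 (Python may print -0.0): a perfectly pure set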
Information Gain
Information gain measures how much an attribute reduces entropy over the whole segmentation it creates; i.e., information gain is the change in entropy between the parent set and the children it is split into.
\(IG(\text{parent}, \text{children}) = \text{entropy}(\text{parent}) - [p(c_1)\,\text{entropy}(c_1) + p(c_2)\,\text{entropy}(c_2) + \ldots]\), where \(p(c_i)\) is the proportion of instances belonging to child \(c_i\).
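Reusing the entropy() sketch above, information gain can be computed as follows (the child class counts in the example are hypothetical):

def information_gain(parent_counts, children_counts):
    # parent_counts: class counts in the parent set, e.g. [7, 3]
    # children_counts: one list of class counts per child, e.g. [[6, 1], [1, 2]]
    def ent(counts):
        total = sum(counts)
        return entropy([c / total for c in counts])  # entropy() from the sketch above
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * ent(child) for child in children_counts)
    return ent(parent_counts) - weighted

print(information_gain([7, 3], [[6, 1], [1, 2]]))  # about 0.19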
Write-off Example
Consider a set \(S\) of 10 people, 7 of the non-write-off class and 3 of the write-off class:
\(p(\text{non-write-off}) = 7/10 = 0.7\)
\(p(\text{write-off}) = 3/10 = 0.3\)
\(\text{entropy}(S) = -[0.7 \log_2(0.7) + 0.3 \log_2(0.3)]\)
\(\text{entropy}(S) \approx 0.88\)
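A quick numeric check of this result:

import math

e = -(0.7 * math.log2(0.7) + 0.3 * math.log2(0.3))
print(round(e, 2))  # 0.88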
This is the entropy (impurity) of the overall set, before any split.
What is the information gain (IG) from splitting the data on “Balance ≥ 50k”?
How about splitting the data into three groups based on the Residence attribute?
Example: Attribute Selection with Information Gain
Dataset: Mushroom. Target: edible vs. poisonous.
21 predictor variables
Which variable predicts the target? How well?
Intuition: Compute information gain from splitting the data based on different predictors.
Entropy charts (figures) compare the class distribution of the entire dataset (no split) with the distributions after each candidate split below.
Split by “Gill-Color”
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
Split by “Spore-Print-Color”
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
Split by “Odor”
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
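A sketch of this comparison in Python with pandas. It assumes a local CSV of the UCI Mushroom dataset whose target column is named "class" (e = edible, p = poisonous); the filename and column name are assumptions:

import numpy as np
import pandas as pd

df = pd.read_csv("mushrooms.csv")  # hypothetical local copy of the dataset

def entropy_of(series):
    # Entropy of the class distribution in one column
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def info_gain(df, attribute, target="class"):
    parent = entropy_of(df[target])
    weighted = sum(
        len(group) / len(df) * entropy_of(group[target])
        for _, group in df.groupby(attribute)
    )
    return parent - weighted

# Rank every predictor by information gain over the full dataset.
gains = {col: info_gain(df, col) for col in df.columns if col != "class"}
for col, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {gain:.3f}")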
A classification tree is drawn as an ‘upside-down tree’: the root at the top, internal nodes testing attribute values, and leaves at the bottom assigning classes.
Simple decision tree for credit example
Exercise: Given the above decision tree, how would we classify the observation ‘Claudio’: Balance=115k, Employed=No, Age=40?
Intuition: starting with the whole dataset, keep splitting the data on different attributes to create the purest leaves possible (using the idea of information gain from above).
First partitioning: splitting on body shape
Second partitioning: the oval body people sub-grouped by head type
Third partitioning: rectangular body people subgrouped by body color
The complete tree
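For reference, scikit-learn’s DecisionTreeClassifier performs this kind of recursive, entropy-based partitioning when criterion="entropy" is used. A minimal sketch on a made-up stand-in for the people data (the attribute names come from above, but every value and target here is invented for illustration):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

people = pd.DataFrame({
    "body_shape": ["rectangular", "oval", "oval", "rectangular", "oval"],
    "head_shape": ["round", "square", "round", "square", "round"],
    "body_color": ["white", "grey", "white", "grey", "grey"],
    "target":     ["yes", "no", "yes", "no", "no"],
})

X = pd.get_dummies(people.drop(columns="target"))  # one-hot encode categories
clf = DecisionTreeClassifier(criterion="entropy")  # split by information gain
clf.fit(X, people["target"])
print(export_text(clf, feature_names=list(X.columns)))  # the induced tree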
Back to credit example: Probability estimation
Instead of simply classifying an instance as yes/no or write-off/non-write-off, we can also produce probability estimates: divide the number of instances of the class of interest in a segment by the total number of instances in the segment, i.e., \(p(c) = n/(n+m)\) for a leaf with \(n\) instances of class \(c\) and \(m\) of the other class. (This is called frequency-based class membership probability estimation.)

One problem arises with segments containing very small numbers of instances. If a leaf happens to have only a single instance, should we be willing to say that there is a 100% probability that members of that segment will have the class that this one instance happens to have?
This phenomenon is one example of a fundamental issue in data science called “overfitting”: our predictions are very good but they are also tailored very closely to the specific dataset we were learning from. We will cover overfitting in more detail in a separate module.
One simple way to address it is the Laplace correction (see book p. 71 for details).
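For the binary case, the Laplace correction replaces the raw frequency \(n/(n+m)\) with
\(p(c) = \frac{n+1}{n+m+2}\),
where \(n\) is the number of instances of class \(c\) in the leaf and \(m\) the number of instances of the other class. A single-instance leaf then yields \(p = 2/3\) rather than 1, and the estimate approaches the raw frequency as the leaf grows.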
We can also visualize data observations according to the decision rules of our tree.

Legend: black dots = write-off; plus signs = non-write-off
Another way to think of classification trees: as a set of rules
IF (Balance < 50k) AND (Age < 50) THEN Class=Write-off
IF (Balance < 50k) AND (Age ≥ 50) THEN Class=No Write-off
IF (Balance ≥ 50k) AND (Age < 45) THEN Class=Write-off
IF (Balance ≥ 50k) AND (Age ≥ 45) THEN Class=No Write-off
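These four rules translate directly into code; a minimal sketch (the function name and example inputs are ours):

def classify(balance, age):
    # Direct translation of the four rules above.
    if balance < 50_000:
        return "Write-off" if age < 50 else "No Write-off"
    else:
        return "Write-off" if age < 45 else "No Write-off"

print(classify(balance=30_000, age=35))   # Write-off
print(classify(balance=115_000, age=52))  # No Write-off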