\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage[]{algorithm2e}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{array}
\usepackage{multirow}
\usepackage{enumitem}
%, amsthm, amssymb, bm, natbib, longtable}
\input{CS6140_Assignment_Style.tex}
\newcommand\MyBox[2]{
\fbox{\lower0.75cm
\vbox to 1.7cm{\vfil
\hbox to 2.5cm{\hfil\parbox{5.0cm}{#1\\#2}\hfil}
\vfil}%
}%
}
\begin{document}
% first argument is assignment number
% second argument is the release date
% third argument is the due date
% fourth argument is the assignment name
\myheader{1}{January, 12}{January, 25}{Decision Trees}
\section{Classification Trees With Numerical Features}
In this problem you will be working with two datasets:
\begin{itemize}
\item \textbf{Iris}: has three classes, and the task is to accurately predict one of the three sub-types of the Iris flower given four different physical features: the lengths and widths of the sepals and the petals. There are a total of $150$ instances, with each class having $50$ instances.
\item \textbf{Spambase}: is a binary classification task where the objective is to classify email messages as spam or not. To this end, the dataset uses fifty-seven text-based features to represent each email message. There are approximately $4600$ instances.
\end{itemize}
Since both datasets have continuous features, you will \textbf{implement} decision trees that use \textbf{binary splits}. To determine the optimal threshold for a split, you will need to search over all possible thresholds for a given feature (refer to the class notes and discussion for an efficient search strategy). Use \textbf{information gain} to measure node impurity in your implementation.
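One efficient strategy is to sort the feature values once and then sweep the candidate thresholds (midpoints between consecutive distinct values) while maintaining running class counts. A minimal sketch, assuming integer class labels $0,\ldots,k-1$ (the function names here are illustrative, not a required interface):

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a vector of per-class counts."""
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def best_threshold(x, y, n_classes):
    """Search all candidate thresholds for numeric feature x (midpoints
    between consecutive distinct sorted values) and return the threshold
    with the highest information gain, together with that gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    total = np.bincount(y, minlength=n_classes).astype(float)
    parent = entropy(total)
    left = np.zeros(n_classes)
    best_gain, best_thr = 0.0, None
    n = len(x)
    for i in range(n - 1):
        left[y[i]] += 1
        if x[i] == x[i + 1]:       # no decision boundary between equal values
            continue
        right = total - left
        child = ((i + 1) / n) * entropy(left) + ((n - i - 1) / n) * entropy(right)
        gain = parent - child
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_thr, best_gain
```

After the $O(n\log n)$ sort, this evaluates every candidate threshold in a single $O(n)$ pass, which is the kind of efficiency the search strategy discussed in class aims for.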
\subsection{Growing Decision Trees (\textit{60 points})} Instead of growing full trees, you will use an early stopping strategy. To this end, we will impose a limit on the minimum number of instances at a leaf node; let this threshold be denoted $\eta_{min}$, where $\eta_{min}$ is expressed as a percentage of the training-set size. For example, if the training set contains $150$ instances and $\eta_{min}=5$, then a node will only be split further if it contains more than $7.5$ instances, i.e., at least eight.
\begin{enumerate}[label=(\alph*)]
\item For the Iris dataset, use $\eta_{min} \in \{5,10,15,20\}$ and calculate the accuracy using ten-fold cross-validation for each value of $\eta_{min}$.
\item For the Spambase dataset, use $\eta_{min} \in \{5,10,15,20,25\}$ and calculate the accuracy using ten-fold cross-validation for each value of $\eta_{min}$.
\end{enumerate}
You can summarize your results in two separate tables, one for each dataset (report the average accuracy and standard deviation across the folds).
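The ten-fold protocol above can be sketched as follows; `fit` and `predict` stand in for whatever tree implementation you write and are hypothetical names, not a required interface:

```python
import numpy as np

def kfold_accuracy(X, y, fit, predict, k=10, seed=0):
    """Shuffle once, split the indices into k folds, and return the mean
    and standard deviation of the k test-fold accuracies."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])            # train on k-1 folds
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs)), float(np.std(accs))
```

Note that the tree (including its split thresholds) must be re-grown from scratch inside each fold, using only that fold's training portion.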
\subsection{Interpreting the results (\textit{Extra-credit})} Select the best value of $\eta_{min}$ for the Iris dataset, and create a class confusion matrix using ten-fold cross-validation (\textit{use only the test set for populating the confusion matrix}). How do you interpret the confusion matrix, and why?
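Populating the matrix is straightforward once the test-fold predictions are collected; a minimal sketch (accumulate the pairs from all ten test folds, never from the training portions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

The diagonal holds correct predictions; off-diagonal entry $(i,j)$ counts instances of class $i$ mistaken for class $j$.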
%%------------------------------------------------------------------%%
\section{Classification Trees With Categorical Features (\textit{Extra-credit})}
In this problem you will be working with a new dataset:
\begin{itemize}
\item \textbf{Mushroom}: is a binary classification dataset where the task is to accurately predict whether a mushroom is poisonous or edible, given $21$ different categorical (nominal) features for each mushroom. These features describe various physical properties of the mushrooms, such as length, diameter, etc. There are a total of $8124$ instances.
\end{itemize}
As this dataset has only categorical features, we will \textbf{implement} both binary and multiway decision trees. Use \textbf{information gain} to measure node impurity in your implementation.
\subsection{Multiway vs Binary Decision Trees (\textit{50 points})} In this problem we will grow both multiway and binary decision trees using categorical features. As in the previous problem, we will use an early stopping strategy to keep the decision tree from overfitting the data.
\begin{enumerate}[label=(\alph*)]
\item Grow a multiway decision tree using $\eta_{min} \in \{5,10,15\}$, and calculate the accuracy using ten-fold cross-validation for each value of $\eta_{min}$.
\item Replace each categorical feature $F$ that takes values in $\{ f_{1}, f_{2}, \ldots, f_{v} \}$ with $v$ binary features, one for each distinct value of the feature. For example, given the second feature \textit{cap-surface}, which takes values \{fibrous=f, grooves=g, scaly=y, smooth=s\}, we will replace the original feature with four new boolean features: \textit{\{is cap-surface=f\}}, \textit{\{is cap-surface=g\}}, \textit{\{is cap-surface=y\}} and \textit{\{is cap-surface=s\}}. Exactly one of these features will be $1$ for a given instance and the rest will be $0$. Grow a binary decision tree with this new extended feature set, using $\eta_{min} \in \{5,10,15\}$, and calculate the accuracy using ten-fold cross-validation for each value of $\eta_{min}$.
\end{enumerate}
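The feature expansion in part (b) can be sketched as follows (illustrative only; the value codes follow the \textit{cap-surface} example above):

```python
import numpy as np

def one_hot_expand(column, values):
    """Replace one categorical column by len(values) boolean columns,
    one per distinct value; exactly one is 1 for each instance."""
    return np.array([[1 if v == val else 0 for val in values] for v in column])
```

Applying this to every categorical column yields the extended feature set on which ordinary binary (is-value / is-not-value) splits can be made.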
You can summarize your results in two separate tables, one for each approach (report the average accuracy and standard deviation across the folds). Do you see a difference between the two approaches? Explain your answer briefly based on the results.
\subsection{Interpreting the results} Select the best value of $\eta_{min}$ for each of the two cases above, i.e., multiway and binary splits, and create a class confusion matrix using ten-fold cross-validation (\textit{use only the test set for populating the confusion matrix}). How do you interpret the confusion matrices, and why?
%%------------------------------------------------------------------%%
\section{Entropy (\textit{20 points})}
Consider training a binary decision tree using entropy splits.
\begin{enumerate}[label=(\alph*)]
\item Prove that the decrease in entropy resulting from a split on a binary yes/no feature can never be greater than $1$ bit.
\item Generalize this result to the case of arbitrary multiway branching.
\end{enumerate}
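For reference (notation only, not a solution sketch): with class proportions $p_1,\ldots,p_k$ at node $q$, the entropy and the entropy decrease of a split on feature $V$ are
\[
H(q) = -\sum_{c=1}^{k} p_c \log_2 p_c ,
\qquad
\Delta H(q,V) = H(q) - \sum_{i=1}^{|V|} \frac{N_i}{N_q}\, H(q_i),
\]
where $q_i$ is the $i^{th}$ child and $N_i$, $N_q$ are the corresponding instance counts; the logarithm is base $2$, so entropies are measured in bits.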
%%------------------------------------------------------------------%%
\section{Gain and Impurity Measures (\textit{20 points})}
Consider an arbitrary impurity measure $\iota(q)$, where $q$ is a node in the decision tree, and a feature $V$ that takes $|V|$ distinct values. For a split producing $|V|$ children, the gain under this impurity measure is:
\[
Gain(q,V) = \iota(q) - \sum_{i=1}^{|V|}{\frac{N_{i}}{N_{q}}\iota(i)}
\]
where $N_{i}$ is the number of instances in the $i^{th}$ child node and $N_{q}$ is the number of instances in $q$.
Show that maximizing $Gain(\cdot,\cdot)$ is equivalent to minimizing the impurity measure $\iota(\cdot)$ over the $|V|$ children.
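As a quick numeric illustration of the formula (not required for the proof), here entropy is used as one valid choice of $\iota$; the assignment leaves $\iota$ arbitrary:

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a vector of per-class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gain(parent_counts, child_counts):
    """Gain(q, V): parent impurity minus the size-weighted sum of the
    child impurities, with entropy standing in for iota."""
    n_q = sum(sum(c) for c in child_counts)
    weighted = sum(sum(c) / n_q * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted
```

A perfectly separating split of a balanced binary parent, e.g. children $[2,0]$ and $[0,2]$, gives a gain of $1$ bit, while a split that leaves both children at the parent's class proportions gives a gain of $0$.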
%%------------------------------------------------------------------%%
\section{Regression Trees (\textit{50 points})}
In this problem you will \textbf{implement} regression trees using a new dataset:
\begin{itemize}
\item \textbf{Housing}: This is a regression dataset where the task is to predict the value of houses in the suburbs of Boston based on thirteen features describing different aspects relevant to the value of a house, such as the number of rooms, the level of pollution in the area, etc.
\end{itemize}
\begin{enumerate}[label=(\alph*)]
\item As this dataset has only numerical features, we will grow decision trees using only binary splits. Use \textbf{mean squared error (MSE)} to define the splits. Use an early stopping strategy similar to the previous decision tree problems, with $\eta_{min} \in \{5,10,15,20\}$. Calculate the MSE using ten-fold cross-validation for each value of $\eta_{min}$, and report the average and standard deviation across the folds (summarize your results in a table) (\textit{40 points}).
%\item Implement the normal equations to calculate the optimal weight vector for linear regression (see the lecture notes for linear regression). Calculate the MSE using ten fold cross-validation and report the average and standard deviation across the folds, summarizing your results in a table. \emph{Remember to re-estimate the weights for each fold, before calculating the MSE on test data}.
\item Does $\eta_{min}$ impact the results significantly? Briefly explain your answer (\textit{10 points}).
\end{enumerate}
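The MSE split criterion for part (a) can be sketched as follows (a hypothetical helper, not a full implementation): the best threshold is the one that minimizes the size-weighted MSE of the two children, where each child predicts its mean target value.

```python
import numpy as np

def best_mse_split(x, y):
    """Return the threshold on numeric feature x that minimizes the
    size-weighted MSE (variance around the child mean) of the two
    children; candidate thresholds are midpoints between consecutive
    distinct sorted values."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_err, best_thr = np.inf, None
    for i in range(1, len(x)):
        if x[i - 1] == x[i]:       # no boundary between equal values
            continue
        left, right = y[:i], y[i:]
        err = len(left) * left.var() + len(right) * right.var()
        if err < best_err:
            best_err, best_thr = err, (x[i - 1] + x[i]) / 2
    return best_thr
```

This mirrors the information-gain search from the classification problems, with weighted child variance replacing weighted child entropy.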
\end{document}