CS6140 Machine Learning

HW5 - Features

Make sure you check the syllabus for the due date. Please use the notation adopted in class, even if the problem is stated in the book using a different notation.

SpamBase-Polluted dataset: the same datapoints as in the original Spambase dataset, but with many more columns (features): random values, somewhat loose features, or duplicated original features.

SpamBase-Polluted with missing values dataset: train, test. The same dataset, except that some values (picked at random) have been deleted.

PROBLEM 1 Adaboost with bad features [50 points]

A) Spambase (original dataset). Implement feature analysis for Adaboost as presented in the notes.
B) Spambase polluted dataset. Run Adaboost - why does it still work?
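
A minimal sketch of one possible feature analysis for part A, offered as an assumption rather than the method from the notes: given decision stumps already selected by Adaboost, credit each feature with the share of the total training margin contributed by the stumps that split on it. The function name feature_margin_share and the (feature, threshold, alpha) stump format are hypothetical; check the exact definition against the class notes.

```python
import numpy as np

def feature_margin_share(X, y, stumps):
    """Share of the total training margin contributed by each feature.

    stumps: list of (feature_index, threshold, alpha) tuples (assumed format).
    Each stump predicts +1 when x[feature_index] > threshold, else -1.
    y: labels in {-1, +1}.
    """
    contrib = np.zeros(X.shape[1])
    for feature, threshold, alpha in stumps:
        h = np.where(X[:, feature] > threshold, 1.0, -1.0)
        # Sum over datapoints of y_i * alpha_t * h_t(x_i) for this round.
        contrib[feature] += alpha * np.sum(y * h)
    total = contrib.sum()
    return contrib / total if total != 0 else contrib
```

Sorting the returned vector in decreasing order gives a ranking of features; on the polluted dataset, the same computation can be used to see how much margin goes to the added columns.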

PROBLEM 2 Regularized Regression for feature selection [50 points]

A) Spambase polluted dataset: run logistic regression (reference performance: about 0.85).

B) Run regularized logistic regression using either a LASSO (link, link) or a RIDGE (link, link) package for regularization (reference performance with LASSO: about 0.93).
Compare with the performance of plain logistic regression. For example, use the scikit-learn (Python) or Liblinear (C++) implementation of LASSO.

C) Implement your own RIDGE optimization for linear regression.
D) [Extra Credit] Implement your own LASSO optimization for linear regression.
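
For parts C and D, the following is a minimal sketch under stated assumptions, not a reference solution: no intercept term, features assumed centered (and ideally standardized), ridge objective ||y - Xw||^2 + lam * ||w||^2 solved in closed form, and LASSO objective 0.5 * ||y - Xw||^2 + lam * ||w||_1 solved by cyclic coordinate descent with soft-thresholding. The function names and the fixed iteration count are illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iters=200):
    """Cyclic coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot reduce the loss
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w
```

The soft-threshold step is what drives some LASSO coefficients to exactly zero, which is why it can act as a feature selector; ridge only shrinks coefficients toward zero.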

PROBLEM 3 PCA [50 points]

Spambase polluted dataset.
A) Train and test Naive Bayes (reference performance: about 0.62). Why the dramatic decrease in performance?

B) Run PCA first on the dataset in order to reduce dimensionality to about 100 features. You can use a PCA package (link, link) or library.
Then train and test Naive Bayes on the PCA features (reference performance: about 0.73). Explain the performance improvement.
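
If you prefer not to use a package, the projection step in part B can be sketched with a plain SVD. This is a minimal illustration, assuming dense NumPy arrays; the names pca_fit and pca_transform are hypothetical helpers, not part of any library.

```python
import numpy as np

def pca_fit(X_train, k):
    """Return the training mean and the top-k principal directions (as rows)."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, components):
    """Project data onto the principal directions found on the training set."""
    return (X - mean) @ components.T
```

Fit on the training split only (for example with k=100), then apply pca_transform with the same mean and components to both train and test, so that no test statistics leak into the projection.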

PROBLEM 4 Missing Values [50 points]

Spambase polluted dataset with missing values: train, test.
Run modified Naive Bayes to deal with the missing values (as described in notes following KMurphy 8.6.2). This essentially runs the independence product from Naive Bayes ignoring the factors corresponding to missing features.
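
The idea of dropping the factors for missing features can be sketched as follows. This is a minimal illustration that assumes Gaussian class-conditional densities and missing values encoded as NaN; the notes may use a different per-feature model, and the function names and the var_floor parameter are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-6):
    """Per-class priors, means and variances, using observed (non-NaN) entries only."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([np.nanmean(X[y == c], axis=0) for c in classes])
    variances = np.array([np.nanvar(X[y == c], axis=0) for c in classes]) + var_floor
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    observed = ~np.isnan(X)                   # shape (n, d)
    X_filled = np.where(observed, X, 0.0)     # placeholder values, masked out below
    # Gaussian log-density for every (sample, class, feature) triple.
    diff = X_filled[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    # A missing feature contributes no factor to the product (0 in log space).
    log_lik = np.where(observed[:, None, :], log_lik, 0.0)
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]
```

Working in log space turns the independence product into a sum, so ignoring a missing feature amounts to adding zero for it.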

PROBLEM 5 Image Features [50 points]

Extract Haar features for each image in the Digits Dataset (training data, labels; testing data, labels).
Train 10-class ECOC-Boosting on the extracted features and report performance.

(HINT: For parsing the MNIST dataset, see the Python code mnist.py or the MATLAB code MNIST_Dataset.)
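
Rectangle-difference features of this kind are usually computed from an integral image, so that each rectangle sum costs four lookups. The sketch below is one way to do it, assuming a 2D NumPy array per image; the function names and the choice of two-rectangle features are illustrative, and how many rectangles to sample per image is left to you.

```python
import numpy as np

def integral_image(img):
    """Summed-area table padded with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four table lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect(ii, top, left, bottom, right):
    """Horizontal and vertical two-rectangle features for one window."""
    mid_r = (top + bottom) // 2
    mid_c = (left + right) // 2
    # Left half minus right half.
    horizontal = rect_sum(ii, top, left, bottom, mid_c) - rect_sum(ii, top, mid_c, bottom, right)
    # Top half minus bottom half.
    vertical = rect_sum(ii, top, left, mid_r, right) - rect_sum(ii, mid_r, left, bottom, right)
    return horizontal, vertical
```

Calling haar_two_rect on a fixed set of windows for every image yields a feature vector of two values per window, which can then be fed to the boosting step.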

PROBLEM 6 Boosting with Dynamic Features [Extra Credit]

A) Run Boosting (Adaboost or Rankboost or Gradient Boosting) on text documents from 20 Newsgroups without extracting features in advance. Extract features for each round of boosting based on the current boosting weights.

B) Run Boosting (Adaboost or Rankboost or Gradient Boosting) on image datapoints from the Digit Dataset without extracting features in advance. Extract features for each round of boosting based on the current boosting weights. You can follow this paper.

PROBLEM 7 [Extra Credit]

Prove the harmonic function property discussed in class, based on this paper. Specifically, prove that to minimize the energy function

f must be harmonic, i.e. for all unlabeled datapoints j, it must satisfy
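
For reference, in graph-based semi-supervised learning the energy function and the harmonic condition are commonly written as below. The notation is an assumption here (symmetric nonnegative edge weights w_{ij}, degree d_j, and f fixed to the given labels on labeled points); check it against the paper and the class notes before using it in your proof.

```latex
% Energy function (assumed standard form)
E(f) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( f(i) - f(j) \bigr)^2

% Harmonic condition at every unlabeled datapoint j
f(j) = \frac{1}{d_j} \sum_{i} w_{ij} \, f(i), \qquad d_j = \sum_{i} w_{ij}
```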