Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

SpamBase-Polluted dataset:
the same datapoints as in the original Spambase dataset, only with
many more columns (features): either random values, weakly
informative features, or duplicates of the original features.

SpamBase-Polluted with missing values dataset: train,
test.
The same dataset, except that some values (picked at random) have been
deleted.

B) Spambase polluted dataset. Run Adaboost. Why does it still work?

B) Run regularized Logistic Regression using either a LASSO (link, link) or a RIDGE (link, link) package for regularization. 0.93

Compare with plain Logistic Regression performance. For example, use the scikit-learn (Python) or Liblinear (C++) implementation of LASSO.
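A sketch of the comparison with scikit-learn (C is the inverse regularization strength; the liblinear solver is one of the solvers that supports the l1 penalty):

```python
# Sketch: compare L1- (LASSO-style) and L2- (RIDGE-style) regularized
# logistic regression on the same train/test split.
import numpy as np
from sklearn.linear_model import LogisticRegression

def regularized_logreg(X_train, y_train, X_test, y_test, C=1.0):
    scores = {}
    for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
        clf = LogisticRegression(penalty=penalty, C=C, solver=solver,
                                 max_iter=5000)
        clf.fit(X_train, y_train)
        scores[penalty] = clf.score(X_test, y_test)
    return scores
```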

C) Implement your own RIDGE optimization for linear regression.
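One possible shape for this (a sketch, not the required solution): ridge has the closed form w = (X^T X + lam*I)^{-1} X^T y, here with an appended, unpenalized bias column.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge: minimize ||y - Xb w||^2 + lam * ||w[:-1]||^2,
    # where Xb has a trailing bias column whose weight is not penalized.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0                     # leave the bias unregularized
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def ridge_predict(X, w):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w
```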

D) [Extra Credit] Implement your own LASSO optimization for linear regression.
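A common route for this extra-credit problem is cyclic coordinate descent with soft-thresholding; a toy sketch (assumes columns of X are not all-zero):

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_fit(X, y, lam=0.1, iters=200):
    # Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1:
    # each pass updates one coordinate exactly via soft-thresholding.
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w
```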

A) Train and test Naive Bayes. Why the dramatic decrease in performance? 0.62

B) Run PCA first on the dataset in order to reduce dimensionality to about 100 features. You can use a PCA package (link, link) or library.

Then train/test Naive Bayes on the PCA features. 0.73 Explain the performance improvement.
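A sketch of the pipeline (k=100 for the assignment; the synthetic check below uses a smaller k). PCA removes the duplicated and noisy directions and decorrelates the features, which makes the Naive Bayes independence assumption far less wrong:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

def pca_naive_bayes(X_train, y_train, X_test, y_test, k=100):
    # Project to the top-k principal components, then fit Gaussian NB
    # on the (decorrelated) component scores.
    model = make_pipeline(PCA(n_components=k), GaussianNB())
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)
```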

Run a modified Naive Bayes that deals with the missing values (as described in the notes, following K. Murphy 8.6.2). This essentially evaluates the Naive Bayes independence product while ignoring the factors corresponding to missing features.
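A sketch of that idea for Gaussian Naive Bayes, with NaN marking a missing value (nanmean/nanvar also handle missing entries at training time):

```python
import numpy as np

def gaussian_nb_missing(X_train, y_train, X_test):
    # log p(y|x) ~ log p(y) + sum over OBSERVED features j of
    # log N(x_j; mu_jy, var_jy); missing (NaN) factors are skipped.
    classes = np.unique(y_train)
    priors, mus, vars_ = [], [], []
    for c in classes:
        Xc = X_train[y_train == c]
        priors.append(np.log(len(Xc) / len(X_train)))
        mus.append(np.nanmean(Xc, axis=0))
        vars_.append(np.nanvar(Xc, axis=0) + 1e-6)  # variance smoothing
    preds = []
    for x in X_test:
        obs = ~np.isnan(x)
        scores = []
        for p, mu, v in zip(priors, mus, vars_):
            ll = -0.5 * (np.log(2 * np.pi * v[obs])
                         + (x[obs] - mu[obs]) ** 2 / v[obs])
            scores.append(p + ll.sum())
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)
```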

Extract Haar features for each image in the Digits Dataset
(Training data,
labels.
Testing data,
labels).
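Haar features are rectangle-sum differences, computable in O(1) per rectangle from an integral image. A minimal sketch of one horizontal two-rectangle feature (the full assignment would enumerate many positions, sizes, and rectangle patterns):

```python
import numpy as np

def integral_image(img):
    # Summed-area table: ii[r, c] = sum of img[:r, :c].
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def rect_sum(ii, r, c, h, w):
    # Sum over the h-by-w rectangle with top-left corner (r, c),
    # using four lookups in the integral image.
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect(img, r, c, h, w):
    # Horizontal two-rectangle Haar feature: left half minus right half.
    ii = integral_image(img)
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)
```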

Train 10-class ECOC-Boosting on the extracted features and report
performance.

(HINT: For parsing MNIST dataset, please see python Code:mnist.py;
MATLAB code: MNIST_Dataset)

A) Run Boosting (Adaboost, Rankboost, or Gradient Boosting) on
text documents from 20 Newsgroups without extracting features in
advance. Extract features at each round of boosting based on the
current boosting weights.
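One way to read this requirement: never build the full document-term matrix up front; at each round, score candidate word-presence stumps under the current weights and materialize only the winner. A toy sketch with hypothetical helper names (`adaboost_text`, `predict_text` are not from the assignment):

```python
import numpy as np

def adaboost_text(docs, y, vocab, rounds=50):
    # docs: list of token sets; y in {-1, +1}. Each round "extracts" the
    # single word whose presence stump best fits the current weights.
    n = len(docs)
    w = np.full(n, 1.0 / n)
    model = []                                   # (word, polarity, alpha)
    for _ in range(rounds):
        best = None
        for word in vocab:
            pred = np.array([1.0 if word in d else -1.0 for d in docs])
            for pol in (1.0, -1.0):
                err = w[(pol * pred) != y].sum()
                if best is None or err < best[0]:
                    best = (err, word, pol, pol * pred)
        err, word, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # clamp for the log
        alpha = 0.5 * np.log((1 - err) / err)
        model.append((word, pol, alpha))
        w *= np.exp(-alpha * y * pred)           # standard AdaBoost update
        w /= w.sum()
    return model

def predict_text(model, doc):
    s = sum(alpha * pol * (1.0 if word in doc else -1.0)
            for word, pol, alpha in model)
    return 1 if s >= 0 else -1
```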

B) Run Boosting (Adaboost, Rankboost, or Gradient Boosting) on
image datapoints from the Digits Dataset without extracting features in
advance. Extract features at each round of boosting based on the
current boosting weights. You can follow this paper.

Prove the harmonic-functions property discussed in class, based on this paper. Specifically, prove that to minimize the energy function

E(f) = (1/2) * sum_{i,j} w_ij * (f(i) - f(j))^2,

f must be harmonic, i.e. for every unlabeled datapoint j it must satisfy

f(j) = (1/d_j) * sum_i w_ij * f(i), where d_j = sum_i w_ij.
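A sketch of the stationarity argument, assuming the standard quadratic energy from the harmonic-functions paper (symmetric weights w_ij = w_ji):

```latex
E(f) = \tfrac{1}{2}\sum_{i,j} w_{ij}\,\bigl(f(i)-f(j)\bigr)^2,
\qquad
\frac{\partial E}{\partial f(j)}
  = 2\sum_i w_{ij}\,\bigl(f(j)-f(i)\bigr) = 0
\;\Longrightarrow\;
f(j) = \frac{1}{d_j}\sum_i w_{ij}\, f(i),
\quad d_j = \sum_i w_{ij}.
```

Setting the partial derivative to zero at every unlabeled j (the labeled values are fixed) forces f(j) to be the weighted average of its neighbors, which is exactly the harmonic property.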