Multivariate Normal, Multivariate - marginal and conditional, Correlation Coefficient

You have to write your own code, but it is ok to discuss or look up specific algorithmic/language code details. Before starting coding, make sure to understand:

- how 2-dim Gaussians work

- Naive Bayes independence of features, thus the product of probabilities

- how languages like Python or Matlab or R can help immensely in dealing with multidimensional data, including the means and covariances in 2-dim

- the E and M steps of the EM algorithm. You can look into existing EM code for help.

Since you have 57 real-valued features, each of the two Gaussians (one for the + class and one for the - class) will have a mean vector with 57 components, and they will have

- either a common (shared) covariance matrix of size 57x57; this covariance is estimated from all training data (both classes)

- or two separate 57x57 covariance matrices (estimated separately for each class)
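The two covariance options above can be sketched as follows (a hypothetical NumPy helper; the small diagonal regularizer is an assumption added to keep the 57x57 matrices invertible):

```python
import numpy as np

def estimate_gaussian_params(X, y, shared_cov=True):
    """Per-class means and either a shared (pooled) or per-class
    covariance. X: (n, d) features, y: (n,) labels in {0, 1}."""
    means, covs = {}, {}
    for c in (0, 1):
        means[c] = X[y == c].mean(axis=0)
    if shared_cov:
        # pooled covariance from all training data, centered per class
        centered = np.vstack([X[y == c] - means[c] for c in (0, 1)])
        pooled = centered.T @ centered / len(X)
        covs = {0: pooled, 1: pooled}
    else:
        for c in (0, 1):
            covs[c] = np.cov(X[y == c], rowvar=False)
    # small diagonal regularizer (an assumption) to avoid singular matrices
    for c in (0, 1):
        covs[c] = covs[c] + 1e-6 * np.eye(X.shape[1])
    return means, covs
```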

Looking at the training and testing performance, does it appear that the Gaussian assumption (normally distributed data) holds for this particular dataset?

**Preliminary:** Download the Spambase dataset available from the UCI Machine Learning Repository. The Spambase data set consists of 4,601 e-mails, of which 1,813 are spam (39.4%). The data set archive contains a processed version of the e-mails wherein 57 real-valued features have been extracted and the spam/non-spam label has been assigned. You should work with this processed version of the data. The data set archive contains a description of the features extracted as well as some simple statistics over those features.

- Partition the data into 10 folds.

To estimate the generalization (testing) error of your classifier, you will perform cross-validation. In *k*-fold cross-validation, one would ordinarily partition the data set randomly into *k* groups of roughly equal size and perform *k* experiments (the "folds") wherein a model is *trained* on *k*-1 of the groups and *tested* on the remaining group, where each group is used for testing exactly once. (A common value of *k* is 10.) The generalization error of the classifier is estimated by the *average* of the performance across all *k* folds.

While one should perform cross-validation with random partitions, for consistency and comparability of your results, you should partition the data into 10 groups as follows: Consider the 4,601 data points in the order they appear in the processed data file. Each group will consist of every 10th data point in file order, i.e., Group 1 will consist of points {1,11,21,...}, Group 2 will consist of points {2,12,22,...}, ..., and Group 10 will consist of points {10,20,30,...}. Finally, Fold *k* will consist of *testing* on Group *k* a model obtained by *training* on the remaining groups.

**Step 1:** Create three Naive Bayes classifiers by modeling the features in three different ways. The 57 features are real-valued, and one can model the feature distributions in simple and complex ways. You will explore the effect of this modeling on overall performance.

- Model the features as simple Bernoulli (Boolean) random
  variables. Consider a threshold such as the overall mean
  value of the feature (available in the Spambase
  documentation), and simply compute the fraction of the
  time that the feature value is above or below the overall
  mean value for each class. In other words, for feature
  *f*_{i} with overall mean value *mu*_{i}, estimate
  - Pr[*f*_{i} <= *mu*_{i} | spam]
  - Pr[*f*_{i} > *mu*_{i} | spam]
  - Pr[*f*_{i} <= *mu*_{i} | non-spam]
  - Pr[*f*_{i} > *mu*_{i} | non-spam]

and use these estimated values in your Naive Bayes predictor, as appropriate. (I would suggest using Laplace or Laplace-like smoothing to avoid any issues with zero probabilities, and feel free to experiment with various threshold values, if you like.)
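A possible sketch of these Bernoulli estimates with Laplace smoothing (the helper name and alpha = 1 are assumptions):

```python
import numpy as np

def bernoulli_estimates(X, y, mu, alpha=1.0):
    """Estimate Pr[f_i > mu_i | class] with Laplace smoothing.
    X: (n, d) features, y: (n,) labels in {0, 1}, mu: (d,) overall
    feature means used as thresholds."""
    probs = {}
    for c in (0, 1):
        Xc = X[y == c]
        above = (Xc > mu).sum(axis=0)          # count of values above threshold
        probs[c] = (above + alpha) / (len(Xc) + 2 * alpha)
    return probs  # Pr[f_i > mu_i | c]; Pr[f_i <= mu_i | c] = 1 - probs[c]
```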

- Model the features as Gaussian random variables,
estimating the class conditional mean and variance as
appropriate. (Again, I would suggest smoothing your estimate
of the variance, to avoid any zero variance issues.)
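A sketch of the Gaussian model with a variance floor as the suggested smoothing (the eps value is an assumption):

```python
import numpy as np

def gaussian_log_likelihood(X, y, eps=1e-3):
    """Estimate class-conditional mean and variance per feature,
    flooring the variance at eps to avoid zero-variance issues.
    Returns a function computing the Naive Bayes log-likelihood
    of a point under class c."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + eps)

    def log_lik(x, c):
        m, v = params[c]
        # sum of per-feature Gaussian log-densities (independence assumption)
        return np.sum(-0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v))
    return log_lik
```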
- Model the feature distribution via a histogram. Bucket the
  data into a number of *bins* and estimate the class conditional
  probabilities of each bin. For example, for any feature, consider the following values: min-value, mean-value, max-value, and the class conditional (spam/non-spam) mean values, one of which will presumably be *lower* than the overall mean, and one of which will be higher than the overall mean. Order these values numerically, and create four feature value bins as follows:
  - [min-value, low-mean-value]
  - (low-mean-value, overall-mean-value]
  - (overall-mean-value, high-mean-value]
  - (high-mean-value, max-value]

and estimate the class conditional probabilities for each of these bins, in a manner similar to that used in the Bernoulli model above. (Again, use Laplace smoothing to avoid any issues with zero probabilities, and feel free to experiment with the number of bins and various threshold values, if you like.)
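A sketch of the 4-bin bucketing and smoothed bin probabilities for a single feature (helper names are assumptions):

```python
import numpy as np

def four_bin_edges(col, y):
    """Bin edges for one feature column: min, lower class mean,
    overall mean, higher class mean, max (the values listed above)."""
    m0, m1 = col[y == 0].mean(), col[y == 1].mean()
    lo, hi = min(m0, m1), max(m0, m1)
    return np.array([col.min(), lo, col.mean(), hi, col.max()])

def bin_probs(col, y, edges, alpha=1.0):
    """Class-conditional bin probabilities with Laplace smoothing."""
    # searchsorted maps each value to one of the four bins;
    # a value equal to an upper edge falls in the lower bin
    bins = np.clip(np.searchsorted(edges, col, side='left') - 1, 0, 3)
    probs = {}
    for c in (0, 1):
        counts = np.bincount(bins[y == c], minlength=4)
        probs[c] = (counts + alpha) / (counts.sum() + 4 * alpha)
    return probs
```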

- [optional, no credit] Use a 9-bin histogram instead of four bins.

**Step 2:** Evaluate your results.

*Error tables*: For each of your three classifiers, create a table with one row per fold showing your false positive, false negative, and overall error rates, and add one final row per table corresponding to the average error rates across all folds. For this problem, the false positive rate is the fraction of non-spam testing examples that are misclassified as spam, the false negative rate is the fraction of spam testing examples that are misclassified as non-spam, and the overall error rate is the fraction of overall examples that are misclassified.

*ROC curves*: In many situations, there is a different cost associated with false positive (Type I) and false negative (Type II) errors. In spam filtering, for example, a false positive error is a legitimate e-mail that is misclassified as spam (and perhaps automatically redirected to a "spam folder" or, worse, auto-deleted), while a false negative is a spam message that is misclassified as legitimate (and sent to one's inbox). Most users are willing to accept some false negative examples so long as very few legitimate e-mails are misclassified.

When using Naive Bayes, one can easily make such trade-offs. For example, in the usual Bayes formulation, one would predict "spam" if

Pr[spam | data] > Pr[non-spam | data]

or equivalently, in a log-odds formulation,

ln(Pr[spam | data] / Pr[non-spam | data]) > 0.

Note that one could choose to classify an e-mail as spam for any threshold tau:

ln(Pr[spam | data] / Pr[non-spam | data]) > tau,

where positive values of tau effectively reduce the number of spam classifications, thus decreasing false positives at the expense of increasing false negatives, while negative values of tau have the opposite effect.

One mechanism for visualizing this trade-off is through an ROC curve. (See, for example, the Wikipedia page on ROC curves.) An ROC curve effectively visualizes the true positive (detection) rate vs. the false positive (false alarm) rate for all possible thresholds. (The true positive rate is the fraction of spam messages above threshold; the false positive rate is the fraction of non-spam messages above threshold. Note that the true positive rate is 1 minus the false negative rate.)

One simple way to generate an ROC curve is to sort all *m* testing points by their log-odds, and for the top *k* testing points, for all *k* between 1 and *m*, calculate the true positive and false positive rates associated with the top *k* items; plot these points.

Create an ROC curve for each of your classifiers for Fold 1. Preferably, draw all three curves on the same plot so that you can compare them.
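The sort-and-prefix procedure above can be sketched as follows (a hypothetical helper; labels are 1 for spam, 0 for non-spam):

```python
import numpy as np

def roc_points(scores, labels):
    """Sort by descending log-odds score; for each top-k prefix
    compute (false positive rate, true positive rate)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tps = np.cumsum(labels)        # spam messages above threshold
    fps = np.cumsum(1 - labels)    # non-spam messages above threshold
    # prepend (0, 0) so the curve starts at the origin
    tpr = np.concatenate([[0], tps / n_pos])
    fpr = np.concatenate([[0], fps / n_neg])
    return fpr, tpr
```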

*AUC*: An ROC curve visualizes the tradeoff between false positive rate and true positive rate (which is 1 minus the false negative rate) at various operating points. One measure for summarizing this data is to compute the *area under the ROC curve*, or AUC. The AUC, as its name suggests, is simply the area under the ROC curve, which for a perfect curve is 1 and for a random predictor is 1/2. This area has an interesting probabilistic interpretation: it is the chance that the predictor will rank a randomly chosen positive example above a randomly chosen negative example.

The AUC can be calculated fairly simply using the trapezoidal rule for estimating the area under a curve: If our *m* ROC data points are

(x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{m}, y_{m})

in order, where (x_{1}, y_{1}) = (0,0) and (x_{m}, y_{m}) = (1,1), then the AUC is calculated as follows:

(1/2) sum_{k=2 to m} (x_{k} - x_{k-1}) (y_{k} + y_{k-1}).

Calculate the AUC of your three classifiers.
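The trapezoidal formula above can be transcribed directly:

```python
def auc_trapezoid(xs, ys):
    """(1/2) * sum_{k=2..m} (x_k - x_{k-1}) * (y_k + y_{k-1}),
    where (xs[0], ys[0]) = (0, 0) and (xs[-1], ys[-1]) = (1, 1)."""
    return 0.5 * sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1])
                     for k in range(1, len(xs)))

# the diagonal (random predictor) has area 1/2
assert auc_trapezoid([0, 1], [0, 1]) == 0.5
# a perfect curve (0,0) -> (0,1) -> (1,1) has area 1
assert auc_trapezoid([0, 0, 1], [0, 1, 1]) == 1.0
```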

Annotate the sections of your code on the right with an explanation of what the code does. Submit as PDF.

mean_1 = [3,3]; cov_1 = [[1,0],[0,3]]; n1=2000 points

mean_2 = [7,4]; cov_2 = [[1,0.5],[0.5,1]]; n2=4000 points

You should obtain a result visually like this (you don't necessarily have to plot it).

B) Same problem for the 2-dim data in file 3gaussian.txt, generated using a mixture of three Gaussians. Verify your findings against the true parameters, given below, that were used to generate the data.

mean_1 = [3,3] ; cov_1 = [[1,0],[0,3]]; n1=2000

mean_2 = [7,4] ; cov_2 = [[1,0.5],[0.5,1]] ; n2=3000

mean_3 = [5,7] ; cov_3 = [[1,0.2],[0.2,1]]; n3=5000
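A minimal EM sketch for a K-component 2-D Gaussian mixture, usable for both parts (the initialization scheme, iteration count, and seed are assumptions, not the required implementation):

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture. X: (n, d) data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]          # random data points as means
    cov = np.stack([np.cov(X, rowvar=False)] * K)    # full-data covariance init
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to w_k N(x_i; mu_k, cov_k)
        r = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(cov[k])
            det = np.linalg.det(cov[k])
            quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
            r[:, k] = w[k] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * det)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / nk[k]
    return w, mu, cov
```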

a) Prove that

b) You are given a coin which you know is either fair or double-headed. You believe that the a priori odds of it being fair are F to 1; i.e., you believe that the a priori probability of the coin being fair is F/(F+1). You now begin to flip the coin in order to learn more. Obviously, if you ever see a tail, you know immediately that the coin is fair. As a function of F, how many heads in a row would you need to see before becoming convinced that there is a better than even chance that the coin is double-headed?

A. Generate mixture data, aka coin flips [GR ONLY]. Pick K = number of coins; D = number of flips per datapoint; w = mixture coefficients; p = coin biases, as in the EM example discussed in class (or in the notes). Generate the outcome of the coin experiment by iterating i = 1:N:

- first picking a coin k (with probability w_k)

- then flipping that coin D times with probability of heads p_k, writing down a 1 each time a head is seen and a 0 for a tail

Repeat this N=1000 times or more, so your outcome is a stream of N sequences of D flips: (1001110001; 0001110001; 1010100101; etc.)
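The generation procedure above can be sketched as follows (the seed and helper name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption

def generate_flips(w, p, D=10, N=1000):
    """Each datapoint: pick coin k with probability w[k], then flip
    it D times with heads probability p[k]. Returns (N, D) booleans,
    True = head (a 1 in the stream)."""
    coins = rng.choice(len(w), size=N, p=w)
    return rng.random((N, D)) < np.array(p)[coins, None]

flips = generate_flips([.5, .3, .2], [.6, .2, .9])
```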

B. Infer parameters from data. Assume the generated coin flips of part A. We used K=3; w = [.5 .3 .2]; p = [.6 .2 .9]. Now, using the observed stream of 1s and 0s, recover the vectors (w, p) using the EM algorithm; K is known in advance. Report in a table the parameter values found, alongside the ones used to generate the data. Here are some useful notes, and other readings (1, 2, 3, 4) for the coin mixture.
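A sketch of the E and M steps for the coin mixture (random initialization and iteration count are assumptions; the sufficient statistic per datapoint is its number of heads):

```python
import numpy as np

def em_coins(flips, K, n_iter=200, seed=0):
    """EM for a mixture of K biased coins over D-flip sequences.
    flips: (N, D) booleans or 0/1. Returns (w, p) estimates."""
    rng = np.random.default_rng(seed)
    N, D = flips.shape
    h = flips.sum(axis=1)                 # heads per sequence
    w = np.full(K, 1.0 / K)
    p = rng.uniform(0.1, 0.9, K)          # random init (an assumption)
    for _ in range(n_iter):
        # E-step: responsibility of coin k for sequence i,
        # proportional to w_k * p_k^h_i * (1 - p_k)^(D - h_i)
        lik = (p[None, :] ** h[:, None]) * ((1 - p)[None, :] ** (D - h)[:, None])
        r = w * lik
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and coin biases
        w = r.mean(axis=0)
        p = (r * h[:, None]).sum(axis=0) / (r.sum(axis=0) * D)
    return w, p
```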

Testing: for each testing point, apply the Naive Bayes classifier as before: take the log-odds of the product of probabilities over the feature mixtures (and the prior), separately for the positive side and the negative side; use the overall probability ratio as an output score, and compute the AUC for the testing set. Do this in a 10-fold cross-validation setup. Is the overall 10-fold average AUC better than before, when each feature model was a single Gaussian?
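The per-point log-odds scoring described above can be sketched as follows (the `(w, mu, var)` per-feature mixture-parameter layout and helper names are assumptions):

```python
import numpy as np

def log_odds_scores(X, mix_pos, mix_neg, prior_pos=0.5):
    """Score each point by log prior ratio plus the sum over features
    of the log mixture-likelihood ratio. mix_pos[j] / mix_neg[j] are
    (weights, means, variances) of feature j's Gaussian mixture."""
    def feat_ll(x, params):
        w, mu, var = params
        # log of the 1-D mixture density at x
        return np.log(np.sum(w * np.exp(-(x - mu) ** 2 / (2 * var))
                             / np.sqrt(2 * np.pi * var)))
    scores = []
    for x in X:
        s = np.log(prior_pos / (1 - prior_pos))
        for j, xj in enumerate(x):
            s += feat_ll(xj, mix_pos[j]) - feat_ll(xj, mix_neg[j])
        scores.append(s)
    return np.array(scores)  # feed these into the ROC/AUC computation
```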

a) Somebody tosses a fair coin and if the result is heads, you get nothing; otherwise you get $5. How much would you pay to play this game? What if the win is $500 instead of $5?

b) Suppose you play instead the following game: At the beginning of each game you pay an entry fee of $100. A coin is tossed until a head appears, counting n = the number of tosses it took to see the first head. Your reward is $2^{n} (that is: if a head appears first at the 4th toss, you get $16). Would you be willing to play this game (why)?

c) Let's assume you answered "yes" at part b (if you did not, you need to fix your math on expected values). What is the probability that you make a profit in one game? How about in two games?

d) [difficult] After about how many games (estimate) does the probability of making an overall profit exceed 50%?

a) DHS CH2, Pb 45

The standard method for calculating the probability of a sequence in a given HMM is to use the forward probabilities α_{i}(t).