CS6220/DS5230 Unsupervised Data Mining

HW1 Data Features, Similarity, KNN

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.


DATASET : Kosarak : click-stream data of a Hungarian online news portal

(http://fimi.uantwerpen.be/data/kosarak.dat)

DATASET : Aminer : public citation dataset

(https://lfs.aminer.cn/lab-datasets/citation/acm.v9.zip)

DATASET : 20 NewsGroups : news articles

( http://qwone.com/~jason/20Newsgroups/)

DATASET : MNIST : digit images

(http://yann.lecun.com/exdb/mnist/)

PROBLEM 1: Aminer : basic dataset analysis

This is a large dataset (about 2 million publications – it takes about a minute just to parse!). While your notebook must run successfully on the entire dataset, you may find it useful to work on a subset while getting your code to work. https://en.wikipedia.org/wiki/IPython#Notebook
https://aminer.org
https://en.wikipedia.org/wiki/ECML_PKDD
https://en.wikipedia.org/wiki/Quartile
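If you write your own parser, the ArnetMiner citation files use a simple line-tagged text format: records separated by blank lines, with each line starting with a tag such as `#*` (title), `#@` (authors), `#t` (year), `#c` (venue), `#index` (paper id), and `#%` (a reference id). The tag set here is an assumption based on the commonly documented ArnetMiner format, so check the first few lines of acm.v9 yourself; `parse_aminer` and the field names are placeholders of mine. A minimal streaming sketch:

```python
# Sketch of a streaming parser for the Aminer/ACM citation file.
# Assumed format: blank-line-separated records, lines tagged
#   #* title, #@ authors, #t year, #c venue, #index id, #% reference id
# Streaming (a generator) keeps memory flat on the ~2M-record file.
def parse_aminer(path):
    """Yield one dict per publication record."""
    record = {"refs": []}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends a record
                if record.get("id") is not None:
                    yield record
                record = {"refs": []}
            elif line.startswith("#*"):
                record["title"] = line[2:].strip()
            elif line.startswith("#@"):
                record["authors"] = [a.strip() for a in line[2:].split(",") if a.strip()]
            elif line.startswith("#t"):
                record["year"] = line[2:].strip()
            elif line.startswith("#c"):
                record["venue"] = line[2:].strip()
            elif line.startswith("#index"):
                record["id"] = line[6:].strip()
            elif line.startswith("#%"):
                record["refs"].append(line[2:].strip())
    if record.get("id") is not None:          # last record (no trailing blank line)
        yield record
```

Unrecognized tags (e.g. abstracts) are simply skipped, which is usually fine for basic dataset statistics.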


PROBLEM 2 : Kosarak Association Rules

Your task is to take a dataset of nearly one million clicks on a news site and use the Weka Explorer to identify interesting association rules. Ordinarily this would be a point-and-click task; however, the input data format is a list of transactions (each line in the file includes a list of anonymized news item id’s), whereas Weka requires a tabular format. Specifically, each distinct news item id should be represented via a column/attribute, and each row/instance should be a sequence of binary values, indicating whether or not the user visited the corresponding news item.
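One way to script this conversion is to target Weka's *sparse* ARFF format, where each instance lists only the attributes equal to 1 and everything omitted defaults to the first nominal value (0 here); with tens of thousands of distinct item ids, a dense table would be enormous. A sketch under that assumption (`kosarak_to_sparse_arff` and the path arguments are placeholder names of mine):

```python
# Sketch: convert the Kosarak transaction file to Weka's sparse ARFF format.
# Each output instance lists only the item columns that are 1; omitted
# attributes default to the first nominal value, "0".
def kosarak_to_sparse_arff(in_path, out_path):
    transactions = []
    items = set()
    with open(in_path) as f:
        for line in f:
            ids = line.split()
            if ids:
                transactions.append(set(ids))
                items.update(ids)
    # fix a column order: one binary attribute per distinct item id
    cols = sorted(items, key=int)
    index = {item: i for i, item in enumerate(cols)}
    with open(out_path, "w") as out:
        out.write("@relation kosarak\n")
        for item in cols:
            out.write("@attribute item_%s {0,1}\n" % item)
        out.write("@data\n")
        for t in transactions:
            idxs = sorted(index[i] for i in t)
            out.write("{" + ", ".join("%d 1" % i for i in idxs) + "}\n")
```

Declaring each attribute as nominal `{0,1}` (rather than numeric) is what lets Weka's Apriori implementation treat the columns as items.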

PROBLEM 3 : MNIST, 20 NG Preprocessing

Your goal in this problem is to *parse*, *normalize*, and otherwise *prepare* two common data sets (MNIST + 20NG) for classification. In this problem, that includes prepping the datasets for **[dis]similarity** computations.
Your first task is parsing. As this is the first assignment and as the parsers are very different for the two datasets (images vs. text), you may use any library/package to aid in the parsing here, however you are encouraged to write your own.

Your second task is normalization. The type of normalization used depends on the task and dataset. Common types of normalization include:
  • Shift-and-scale normalization : subtract the minimum, then divide by the new maximum (i.e., by the range max − min). Now all values are between 0 and 1
  • Zero mean, unit variance : subtract the mean, then divide by the standard deviation to get variance = 1
  • Term-Frequency (TF) weighting : map each term in a document to its frequency (text only; see the wiki page)

It is up to you to determine the appropriate normalization.

    Your final task is to compute several types of pairwise similarities, for use in the next question. You are encouraged to write your own implementation of the pairwise similarity/distance matrix computation---but unless explicitly specified we will accept any code using a common library available in Matlab/Java/Python/R.
    Distance/similarity options to implement:
  • Euclidean distance (required, library)
  • Euclidean distance (required, your own - use batches if you run into memory issues)
  • edit distance (required for text) -or- cosine similarity (required for vectors)
  • Jaccard similarity (optional)
  • Manhattan distance (optional)

    Some tips / hints:
  • For 20NG, try TF normalization on, e.g., the rows. Note that for text it is critical to maintain a sparse format due to the large number of columns.
  • Make sure any value transformation retains the 0 values.
  • As MNIST consists of pixel images, the values typically come pre-scaled to the range [0, 255]; however, there are advantages to having zero mean.
  • Do not normalize labels.
  • When normalizing a column, make sure to normalize its values across all datapoints (train, test, validation, etc.) for consistency. Depending on what similarity/distance measure you use, computing the similarity might be easy, but the size of the similarity matrix might present a challenge.
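One simple way to satisfy both the sparsity and the "retain the 0 values" tips is to represent each document as a dict of nonzero term weights, so zero entries are never stored and thus never transformed. A sketch (representation and function name are my choice, not required by the assignment):

```python
from collections import Counter

# TF weighting on a sparse representation: each document becomes a
# dict {term: relative frequency}; absent terms are implicitly 0.
def tf_normalize(docs):
    """docs: list of token lists -> list of {term: relative frequency}."""
    out = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        out.append({t: c / total for t, c in counts.items()} if total else {})
    return out
```

The same idea carries over to `scipy.sparse` matrices if you prefer an array-based pipeline; the key point is that the transformation only ever touches stored (nonzero) entries.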

    Useful links include:
  • https://en.wikipedia.org/wiki/Feature_scaling
  • https://en.wikipedia.org/wiki/Distance_matrix
  • https://en.wikipedia.org/wiki/Distance
  • https://en.wikipedia.org/wiki/Category:Similarity_and_distance_measures
  • http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.8446&rep=rep1&type=pdf
  • http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/



    PROBLEM 4: MNIST, 20 NG : Train and test KNN classification (supervised)

    Your goal in this problem is to write your own K-nearest neighbor (KNN) classifier.

    For each of the two datasets, now in matrix format and with pairwise similarity computed, train and test a KNN classifier. You are required to implement the KNN classification model yourself, though you may use support libraries / data structures for the neighbor search.
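Given the precomputed distance matrix from Problem 3, the classifier itself reduces to selecting the k smallest entries in each test row and taking a majority vote over their training labels. A sketch of that core step (the function name is mine; ties are broken arbitrarily here, and how you break them is worth stating in your write-up):

```python
import numpy as np
from collections import Counter

# KNN prediction from a precomputed distance matrix:
# dist has shape (n_test, n_train); entry [i, j] is the distance
# from test point i to training point j.
def knn_predict(dist, train_labels, k=5):
    train_labels = np.asarray(train_labels)
    # argpartition finds the k smallest per row without a full sort
    nn = np.argpartition(dist, k - 1, axis=1)[:, :k]
    preds = []
    for row in nn:
        votes = Counter(train_labels[row])
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)
```

Using `np.argpartition` instead of `np.argsort` keeps the per-row cost linear in the number of training points, which matters at MNIST scale.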

    You should partition each dataset into (say) an 80/10/10 training/testing/validation split. Note that the actual "training" here consists simply of identifying nearest neighbors---unlike other common classifiers, there is no iterative or gradient-based procedure.

    Report both training performance and testing performance. If using Python, you are encouraged (but not required) to write a scikit-learn compatible *estimator* class supporting a common API interface, e.g. *.fit(), *.predict(), *.transform(), etc. See https://scikit-learn.org/stable/developers/develop.html for more details.