MODULE 1 : NORMAL-EQ-REGRESSION, DECISION TREES
STICKY POINTS
* FEATURE NORMALIZATION
make sure to normalize the entire column
a good idea to look at your values after normalization
same normalization for both training and test
save the normalization parameters (per-column mean/std) so the same transform can be reapplied to new data
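The points above can be sketched roughly as follows (a minimal NumPy sketch; the function names are illustrative, not from the course code):

```python
import numpy as np

def fit_normalization(X_train):
    """Compute per-column mean and std on the TRAINING data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return mu, sigma

def apply_normalization(X, mu, sigma):
    """Apply the saved training parameters to any split (train or test)."""
    return (X - mu) / sigma

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

mu, sigma = fit_normalization(X_train)
Xtr = apply_normalization(X_train, mu, sigma)
Xte = apply_normalization(X_test, mu, sigma)  # same saved parameters reused
```

Looking at `Xtr` after normalization (per the note above), each column should have mean 0 and std 1.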
* DATA PARTITION INTO TRAIN/TEST
randomly
typically 90/10 or 80/20
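A random 80/20 partition can be done with a shuffled index permutation (a sketch, assuming NumPy; the helper name is illustrative):

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Random partition: shuffle indices, then cut off a test slice."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
Xtr, Xte, ytr, yte = train_test_split(X, y)  # 8 train rows, 2 test rows
```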
* DATA PARTITION: CROSS VALIDATION
k-folds, use k=10
for each fold: train on the other k-1 folds, test on the omitted one
requires 10 separate training runs
average the 10 test errors
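The k-fold loop above can be sketched like this (assuming NumPy; `train_and_eval` is a hypothetical callback that trains a model and returns its held-out error):

```python
import numpy as np

def kfold_cv_error(X, y, train_and_eval, k=10, seed=0):
    """k separate training runs; return the average held-out error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_eval(X[train], y[train], X[test], y[test]))
    return float(np.mean(errors))

# toy evaluator: MSE of a constant (mean) predictor
def mean_model(Xtr, ytr, Xte, yte):
    return float(np.mean((yte - ytr.mean()) ** 2))

X = np.arange(40).reshape(20, 2).astype(float)
y = X[:, 0]
err = kfold_cv_error(X, y, mean_model, k=10)
```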
* DT: INFORMATION GAIN (OR ENTROPY BEFORE AND AFTER SPLIT)
entropy formula: H = -sum_i p_i log2(p_i), where p_i are the class proportions
entropy is a measure of randomness
less entropy = less randomness = more consistency = better prediction = less need for splitting the data at the node
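Entropy and information gain in code (a minimal sketch, assuming NumPy; gain = entropy before the split minus the size-weighted entropy after):

```python
import numpy as np

def entropy(labels):
    """H = -sum_i p_i log2(p_i) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, left, right):
    """Entropy before the split minus size-weighted entropy of the children."""
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after

y = np.array([0, 0, 1, 1])
gain = information_gain(y, y[:2], y[2:])  # a perfect split: gain = 1.0 bit
```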
* DT: REGRESSION SPLIT CRITERIA
- least squares, same as the regression error or objective; also the same as the empirical variance (up to the factor n)
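As a sketch of that equivalence (assuming NumPy; the function names are illustrative): the sum of squared errors around the node mean is n times the empirical variance, and the split criterion is the summed SSE of the two children.

```python
import numpy as np

def sse(y):
    """Least-squares error around the node mean; equals n * empirical variance."""
    return float(((y - y.mean()) ** 2).sum())

def split_score(y, mask):
    """Regression split criterion: summed SSE of the two children (lower = better)."""
    return sse(y[mask]) + sse(y[~mask])

y = np.array([1.0, 1.0, 5.0, 5.0])
mask = np.array([True, True, False, False])
# this split separates the two value groups, so each child has zero variance
```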
* DT: HOW TO LOOK FOR BEST THRESHOLD
instead of trying all possible feature values: sample candidate thresholds by buckets or ranges
but always try all features
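One way to realize "buckets" is to take quantiles of the feature values as candidate thresholds (a sketch under that assumption; names are illustrative):

```python
import numpy as np

def candidate_thresholds(values, n_buckets=10):
    """Instead of every distinct value, take evenly spaced interior quantiles."""
    qs = np.linspace(0, 1, n_buckets + 1)[1:-1]
    return np.unique(np.quantile(values, qs))

def best_split(X, y, n_buckets=10):
    """Try ALL features, but only bucketed thresholds within each feature."""
    best = (None, None, np.inf)
    for f in range(X.shape[1]):
        for t in candidate_thresholds(X[:, f], n_buckets):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():
                continue  # skip degenerate splits with an empty child
            score = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                    ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if score < best[2]:
                best = (f, t, score)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
f, t, score = best_split(X, y)  # a threshold between 2 and 3 gives score 0
```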
* DT: CACHE "BEST FEATURE/THRESHOLD" EVEN WHEN NOT ACTUALLY USED
if the same node is later reconsidered for splitting, the best feature/threshold is already computed
* DT: AVOID DEEP TREES
deep trees eventually operate on very small/focused sets of datapoints, making generalization difficult, i.e. overfitting
depth itself is not a problem if the dataset at the node is still reasonably large
* REGRESSION/NORMALEQ: DERIVATION OPTIONAL
no big worries if a student doesn't follow the matrix manipulations; it's really for math-loving people
* REGRESSION/NORMALEQ: HOW TO COMPUTE THE PSEUDOINVERSE
definitely use a library procedure/package; don't implement the pseudoinverse yourself
try running the pseudoinverse operation separately first, perhaps on smaller matrices
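For example, with NumPy's built-in `np.linalg.pinv` on a small matrix first (a sanity-check sketch; the data is made up):

```python
import numpy as np

# small toy problem: exact fit y = 1 + x (first column is the bias column)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# library pseudoinverse; equivalent to (X^T X)^{-1} X^T y when X^T X is invertible
w = np.linalg.pinv(X) @ y
```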
* REGRESSION/NORMALEQ: NUMERICAL STABILITY
look at the eigenvalues of X^T X: small eigenvalues mean a near-singular matrix and can cause trouble
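A quick way to inspect this (a sketch assuming NumPy; the helper name is illustrative): compute the eigenvalues of X^T X and their ratio, the condition number.

```python
import numpy as np

def gram_conditioning(X):
    """Eigenvalues of X^T X; tiny ones signal a near-singular normal equation."""
    eig = np.linalg.eigvalsh(X.T @ X)  # symmetric matrix, so eigvalsh applies
    cond = eig.max() / max(eig.min(), np.finfo(float).tiny)
    return eig, cond

# nearly duplicated columns -> one eigenvalue near zero, huge condition number
X = np.array([[1.0, 1.0001],
              [2.0, 2.0001],
              [3.0, 2.9999]])
eig, cond = gram_conditioning(X)
```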
* REGRESSION/NORMALEQ: ADD BIAS "1" COLUMN
this adds one more dimension for the "free" regression coefficient (the intercept): d+1 dimensions total, so d+1 regression coefficients
make sure to also add the 1 column to test sets (better: add it to all data before partitioning)
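Sketched in NumPy (the helper name is illustrative): prepend the 1-column to all data, then solve; the fitted vector has d+1 entries, with the intercept first.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so the model learns a free intercept term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# add the 1-column to ALL data first, then partition into train/test
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 1 + 2x
Xb = add_bias_column(X)              # shape (n, d+1)
w = np.linalg.pinv(Xb) @ y           # d+1 coefficients; w[0] is the intercept
```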