Classification and Decision Trees

What is machine learning?

Teaching a computer to find patterns in data so it can make predictions without being explicitly programmed for every case.

Classification: Predicting a category/label (spam vs. not spam, churn vs. no churn, etc.).

Data collection: Gather the right, representative data.
Preprocessing: Clean it (fix missing values, encode text/labels, scale if needed).
Data splitting: Split into train/validation/test sets so the model is scored on data it has not seen (so you don’t “grade your own homework”).
Model selection: Pick a simple baseline first, then try stronger models.
Training: Fit the model on the training set.
Evaluation: Check performance on the validation set while iterating, and on the test set once at the end.
Tuning: Adjust hyperparameters/features based on validation results to improve the model without overfitting.
(We already did parts of this in Project 1 when we cleaned, explored, and split data.)
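
The data splitting step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used in Project 1: the function name `train_val_test_split`, the 60/20/20 proportions, and the stand-in `rows` data are all assumptions made for the example.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]       # everything left over is for training
    return train, val, test

rows = list(range(100))                     # stand-in for 100 cleaned records
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))      # 60 20 20
```

Shuffling before cutting matters when the records are stored in some order (by date, by label), since otherwise one split could end up with only one kind of example.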

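Accuracy: Percent of predictions that are correct. Easy to read, but can be misleading with imbalanced classes: if 95% of emails are not spam, a model that always predicts "not spam" scores 95% accuracy and catches no spam.
F1 score (uses precision & recall): Precision = the fraction of predicted positives that were actually positive. Recall = the fraction of actual positives we caught. F1 is the harmonic mean of the two, 2 * precision * recall / (precision + recall), so it is high only when both are high. Useful when classes are imbalanced.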
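
To make the accuracy vs. F1 contrast concrete, here is a small sketch that computes both from scratch. The labels are made-up data (95 negatives, 5 positives), and the helper name `classification_metrics` is an assumption for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted/actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class.
always_negative = [0] * 100
# Model B raises 2 false alarms and catches 4 of the 5 positives.
some_positives = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1

metrics_a = classification_metrics(y_true, always_negative)
metrics_b = classification_metrics(y_true, some_positives)
print(metrics_a)  # accuracy 0.95, but precision, recall, and f1 are all 0.0
print(metrics_b)  # accuracy 0.97, f1 about 0.73
```

The two models differ by only two points of accuracy, while F1 separates them clearly (0.0 vs. roughly 0.73).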

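Logistic Regression: Predicts the probability of a class by passing a weighted sum of the features through a sigmoid, which gives a linear decision boundary. Works well when the features are clean and the classes are roughly linearly separable.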
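
A minimal sketch of the prediction step only: a weighted sum of the features goes through a sigmoid to give a probability. The weights and bias below are hypothetical numbers chosen for illustration; in practice they are learned during training, which is not shown here.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical, hand-picked parameters (a trained model would learn these).
weights = [1.5, -2.0]
bias = 0.5

# The decision boundary is where the score is 0, i.e. probability 0.5.
on_boundary = predict_proba([1.0, 1.0], weights, bias)     # score = 0.0
positive_side = predict_proba([3.0, 0.0], weights, bias)   # score = 5.0
negative_side = predict_proba([0.0, 3.0], weights, bias)   # score = -5.5

print(on_boundary)                                  # 0.5
print(positive_side > 0.5, negative_side < 0.5)     # True True
```

Because the score is linear in the features, the set of points with probability exactly 0.5 is a straight line (or a flat plane in higher dimensions), which is what "linear boundary" refers to.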

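Decision Trees: Split data by simple rules (if/else) on one feature at a time. Can overfit unless growth is limited, for example with a maximum depth or a minimum number of samples per leaf.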
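
The if/else splitting idea can be sketched as a tiny tree builder with a depth limit. This is a simplified illustration under stated assumptions: it uses Gini impurity to choose splits (one common criterion, not necessarily the one used in the course), the function names are hypothetical, and the six-row spam dataset (features: number of links, number of exclamation marks) is made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels in the node are the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)   # a split must beat the unsplit node
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue               # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, max_depth):
    """Return a leaf label (str) or a node tuple (feature, threshold, left, right)."""
    split = best_split(X, y) if max_depth > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]   # leaf: majority label
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            build_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1))

def predict(tree, row):
    """Follow the if/else rules down to a leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Made-up data: [number of links, number of exclamation marks]
X = [[0, 1], [1, 0], [2, 1], [5, 3], [6, 0], [7, 4]]
y = ["ham", "ham", "ham", "spam", "spam", "spam"]

tree = build_tree(X, y, max_depth=2)
print(tree)                    # (0, 2, 'ham', 'spam'): "if links <= 2 then ham else spam"
print(predict(tree, [9, 0]))   # spam
```

The `max_depth` argument is the "limit" that keeps the tree from growing a separate rule for every training example; with `max_depth=0` the tree is a single leaf that always predicts the majority label.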