Classification and Decision Trees
What is machine learning?
Teaching a computer to find patterns in data so it can make predictions without being explicitly coded for every case.
What is classification?
Predicting a category/label (spam vs. not spam, churn vs. no churn, etc.).
Steps in the ML process
- Data collection: Gather the right, representative data.
- Preprocessing: Clean it (fix missing values, encode text/labels, scale if needed).
- Data splitting: Separate train/validation/test sets so the model is never scored on data it learned from (you don’t “grade your own homework”).
- Model selection: Pick a simple baseline first, then try stronger models.
- Training: Fit the model on the training set.
- Evaluation: Check performance on the validation set while comparing models; save the test set for one final check.
- Tuning: Adjust hyperparameters/features using validation results to improve performance without overfitting.
(We already did parts of this in Project 1 when we cleaned, explored, and split data.)
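The splitting, baseline, and evaluation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Project 1 code: the toy churn data, the 60/20/20 split, and the helper names (split_dataset, majority_baseline, accuracy) are all assumptions made for the example.

```python
import random
from collections import Counter

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into train/validation/test lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # whatever is left over
    return train, val, test

def majority_baseline(train):
    """Simplest possible 'model': always predict the most common training label."""
    labels = [label for _, label in train]
    return Counter(labels).most_common(1)[0][0]

def accuracy(rows, predicted_label):
    """Fraction of rows whose true label matches the constant prediction."""
    return sum(label == predicted_label for _, label in rows) / len(rows)

# Hypothetical toy data: (customer_id, label) pairs, 30% "churn"
data = [(i, "churn" if i % 10 < 3 else "no churn") for i in range(100)]

train, val, test = split_dataset(data)
baseline = majority_baseline(train)          # fit on the training set only
print(len(train), len(val), len(test))
print(baseline, accuracy(val, baseline))     # score on data the baseline never saw
```

Any stronger model should beat this baseline on the validation set before it is worth tuning.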
How do we evaluate a classifier?
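- Accuracy: Percent of correct predictions. Easy to read, but can be misleading with imbalanced classes: if 95% of emails are not spam, a model that always says “not spam” scores 95% accuracy and catches nothing.
- F1 score (uses precision & recall): Precision = the share of predicted positives that were actually positive. Recall = the share of actual positives we caught. F1 is the harmonic mean of the two, F1 = 2 * precision * recall / (precision + recall), so it is high only when both are high. Useful when classes are imbalanced.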
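To see why accuracy alone can mislead, here is a small sketch that computes all four numbers by hand. The labels are made up for illustration (95 negatives, 5 positives), and classification_metrics is a hypothetical helper, not a library function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against dividing by zero when nothing was predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced labels: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Model A: always predicts the majority class
always_negative = [0] * 100

# Model B: 4 false alarms among the negatives, catches 4 of the 5 positives
some_effort = [0] * 91 + [1] * 4 + [1] * 4 + [0] * 1

lazy = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, some_effort)
print(lazy)
print(useful)
```

Both models get the same accuracy on this data, but only the second has a nonzero F1, which is the difference the F1 score is meant to expose.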
Examples of classification algorithms
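- Logistic Regression: Predicts the probability of a class using a linear decision boundary. Works well when the features are clean and the classes can be separated roughly by a straight line (or plane).
- Decision Trees: Split data with a sequence of simple if/else rules. Can overfit unless the tree is limited (for example, by capping its depth or requiring a minimum number of samples per leaf).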
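The two algorithms above make predictions in very different ways, which a rough sketch can show. Nothing here is trained: the weights, bias, thresholds, and the two features (hours_inactive, support_tickets) are hypothetical values picked by hand for illustration. In practice both models would learn these from the training set.

```python
import math

def logistic_predict_proba(x, weights, bias):
    """P(class = 1) = sigmoid(w . x + b). The boundary w . x + b = 0 is a line/plane."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def tree_predict(x):
    """A hand-written depth-2 tree: every split is one if/else rule on one feature."""
    hours_inactive, support_tickets = x
    if hours_inactive > 48:          # first split (assumed threshold)
        if support_tickets > 2:      # second split (assumed threshold)
            return 1                 # predict "churn"
        return 0
    return 0                         # predict "no churn"

# Hypothetical customer: 72 hours inactive, 3 support tickets
customer = (72, 3)

# Assumed weights and bias, not fitted to any data
prob = logistic_predict_proba(customer, weights=[0.05, 0.4], bias=-4.0)
label_from_logistic = 1 if prob >= 0.5 else 0   # 0.5 is the usual default cutoff

print(prob, label_from_logistic)
print(tree_predict(customer))
```

Logistic regression returns a probability that is then thresholded, while the tree returns a label directly from its rules. Each extra level of nesting lets a tree carve out smaller groups of training points, which is why unlimited depth tends to overfit.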
