Classification and Decision Trees

What is machine learning?


Teaching a computer to find patterns in data so it can make predictions without being explicitly programmed for every case.

What is classification?


Predicting a category/label (spam vs. not spam, churn vs. no churn, etc.).

Steps in the ML process

  1. Data collection: Gather the right, representative data.
  2. Preprocessing: Clean it (fix missing values, encode text/labels, scale if needed).
  3. Data splitting: Train/validation/test so you don’t “grade your own homework.”
  4. Model selection: Pick a simple baseline first, then try stronger models.
  5. Training: Fit the model on the training set.
  6. Evaluation: Check performance on validation/test.
  7. Tuning: Adjust hyperparameters/features to improve results without overfitting.
    (We already did parts of this in Project 1 when we cleaned, explored, and split data.)
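A minimal sketch of step 3 (data splitting), using only the Python standard library. The function name `train_val_test_split`, the 60/20/20 proportions, and the toy `rows` are assumptions made for illustration, not something from Project 1.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train/validation/test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything left over is for training
    return train, val, test

rows = list(range(10))                     # stand-in for 10 data rows
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))     # 6 2 2
```

The model is fit on `train`, tuned against `val`, and scored once at the end on `test`, so the final number comes from rows the model never saw.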

How do we evaluate a classifier?

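  • Accuracy: Percent of correct predictions. Easy to read, but can be misleading with imbalanced classes: if 95% of emails are not spam, a model that always predicts “not spam” scores 95% accuracy while catching no spam.
  • F1 score (uses precision & recall): Precision = of the predicted positives, the share that were actually positive. Recall = of the actual positives, the share we caught. F1 is their harmonic mean, 2 × precision × recall / (precision + recall), so it is high only when both are high. Useful when classes are imbalanced.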
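The metrics above can be sketched in plain Python. The function names and the toy labels are made up for illustration; returning 0.0 when a denominator is zero is one common convention, assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 instead of dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Imbalanced toy data: 9 negatives, 1 positive.
# The "model" here always predicts the negative class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(acc, f1)  # 0.9 0.0
```

Accuracy looks fine at 0.9, but F1 is 0.0 because the one actual positive was never caught.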

Examples of classification algorithms

Logistic Regression: Predicts the probability of a class using a linear decision boundary. Works well when the features are clean and the classes are close to linearly separable.
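A rough sketch of the idea, not a production implementation: a weighted sum of the features is passed through the sigmoid function to get a probability, and the weights are nudged by simple stochastic gradient descent. The learning rate, epoch count, helper names, and the one-feature toy data are all assumptions for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))  # linear part
    return sigmoid(z)

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights with per-sample gradient steps on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi  # gradient w.r.t. z
            weights = [w - lr * error * v for w, v in zip(weights, xi)]
            bias -= lr * error
    return weights, bias

# Toy data: one feature, the class flips from 0 to 1 between x=1 and x=2.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

weights, bias = fit_logistic(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
preds = [1 if p >= 0.5 else 0 for p in probs]
print(preds)
```

The decision boundary is the point where the probability crosses 0.5, which is where the linear part equals zero.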

Decision Trees: Split the data with a sequence of simple if/else rules. Can overfit unless growth is limited (e.g., a maximum depth or a minimum number of samples per leaf).
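To make "split by simple rules" concrete, here is a sketch of how a single split might be chosen. Gini impurity is one common criterion and is an assumption here, since these notes do not name one. The `ages` and `churned` values are made-up toy data.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule `value <= threshold`.

    Returns (None, impurity of the whole group) if no split improves on it.
    """
    best = (None, gini(labels))
    unique_values = sorted(set(values))
    # Candidate thresholds: midpoints between neighboring distinct values.
    for lo, hi in zip(unique_values, unique_values[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up toy data: customer age and whether the customer churned.
ages = [22, 25, 30, 41, 52, 60]
churned = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(ages, churned)
print(threshold, impurity)  # 35.5 0.0
```

A full tree repeats this search inside each resulting group. Left unlimited, it keeps splitting until every leaf is pure, which is how a tree ends up memorizing the training data.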