More Classification Models

High-Level Overview of the Algorithms

1. Naive Bayes

This is based on Bayes’ theorem, with the “naive” assumption that features are independent of one another given the class. It tends to train and predict very quickly and works great for text classification problems like spam detection (I’m using this for Project 2). The main difference is that it relies on probability distributions rather than distances or decision boundaries.

2. K-Nearest Neighbors

This is a learning algorithm that classifies a point based on the majority class of its “k” closest neighbors. Its performance depends heavily on the choice of distance metric. There is also no real training phase for this one: the model just stores the training data, and predictions are made directly from it.
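
To make the idea concrete, here is a minimal sketch of KNN in plain Python. The function name knn_predict, the toy points, and the choice of Euclidean distance are all my own assumptions for illustration, not taken from any particular library.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label appears.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.8, 6.4)))
```

Notice that there is no fit step at all: all of the work happens at prediction time, which is why KNN gets slow as the stored dataset grows.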

3. Support Vector Machine

This classification model finds the hyperplane that maximizes the margin between different classes. You can even use kernels to handle non-linear decision boundaries. The main difference is that this one focuses on finding an optimal boundary rather than voting among neighbors or estimating probabilities.
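
As a rough sketch of the margin idea, the code below trains a linear SVM (no kernel) by sub-gradient descent on the regularized hinge loss. This is just one simple way to fit it, not how production libraries do it, and the function names, toy data, and hyperparameter values (lam, lr, epochs) are assumptions I picked for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b with sub-gradient descent on the hinge loss.

    Labels must be +1 or -1. `lam` is the regularization strength.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin:
                # move the boundary away from it.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely outside the margin: only regularization
                # acts, shrinking w slightly (which widens the margin).
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy data (made up for illustration): two linearly separable groups.
points = [(-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
print(svm_predict(w, b, (2.5, 2.5)))
print(svm_predict(w, b, (-2.5, -2.5)))
```

Only the points with margin below 1 ever change the boundary, which matches the intuition that the points closest to the boundary (the support vectors) decide where it goes.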

4. Random Forest

This model builds many decision trees, each trained on a random sample of the data, and takes a majority vote of their predictions. It reduces overfitting compared to a single decision tree. Combining multiple models this way gives more stable results.
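
Here is a heavily simplified sketch of the voting idea. To keep it short, each "tree" is only a one-split decision stump on one randomly chosen feature, trained on a bootstrap sample; a real random forest grows full trees and picks random feature subsets at every split. All names and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    values = sorted({row[feature] for row in rows})
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback if the sample has only one distinct value: predict the majority.
    best = (len(rows) + 1, values[0], majority, majority)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def forest_predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated groups.
rows = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
        (8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (9.0, 9.0)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = fit_forest(rows, labels)
print(forest_predict(forest, (1.5, 1.5)))
print(forest_predict(forest, (8.5, 8.5)))
```

Any single stump can be thrown off by an unlucky bootstrap sample, but the vote across many of them is much harder to fool, which is where the stability comes from.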


More Detailed Explanation: Naive Bayes

I decided to look more into this algorithm because I’m using it for my Project 2. This algorithm works by calculating the probability that a data point belongs to each class and then picking the class with the highest probability. It does this by combining the prior probability of a class with the likelihood of observing the given features if that class were true. The “naive” assumption comes in because the model treats all features as though they are independent, which allows it to multiply their individual probabilities together rather than deal with complicated joint distributions.
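
The steps above (prior times per-feature likelihoods, then pick the most probable class) can be sketched for spam filtering like this. It is a minimal word-count version with add-one (Laplace) smoothing, working in log space so the multiplied probabilities do not underflow. The function names and the four tiny example messages are made up for illustration and are not my actual Project 2 code or data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(documents, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        # Start from the log prior P(class).
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # Ignore words never seen during training.
            # "Naive" step: each word contributes independently.
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training messages (made up for illustration).
documents = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "project notes for tomorrow meeting",
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(documents, labels)
print(predict_naive_bayes(model, "free money prize"))
print(predict_naive_bayes(model, "notes for the meeting"))
```

Adding the logs of the individual word probabilities is the same as multiplying the probabilities themselves, which is exactly what the independence assumption allows.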


Pros and Cons of Each Algorithm

Naive Bayes

  • Pros: Fast, works well for high-dimensional text data, requires little training data.
  • Cons: Strong independence assumption often doesn’t hold, so predicted probabilities can be skewed toward 0 or 1.
  • Use case: Spam filtering because it’s great with text data.

KNN

  • Pros: Simple to understand, no training phase, adaptable with different distance metrics.
  • Cons: Slow at prediction time, sensitive to irrelevant features and choice of k.
  • Use case: Small datasets, because every prediction has to search the stored training data and that gets slow as the dataset grows.

SVM

  • Pros: Effective in high-dimensional spaces, works well when margin separation exists, versatile with kernels.
  • Cons: Computationally expensive, not as interpretable, tricky to tune hyperparameters.
  • Use case: Image recognition, because image data is high-dimensional and SVM works well with that.

Random Forest

  • Pros: Handles non-linearity well, less overfitting, works with both categorical and numerical features.
  • Cons: Less interpretable than single trees, can be slower with very large datasets.
  • Use case: General-purpose, because it handles both categorical and numerical features as well as non-linear relationships.