What are Decision Trees?
A Decision Tree is a supervised machine learning algorithm that is used for both classification and regression tasks. It builds a flowchart-like tree structure where each internal node represents a "test" on an attribute (e.g., whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label or continuous value.
It is a highly interpretable model, meaning it is easy to understand exactly how the model arrived at a particular decision.
Visual Example
A simple decision tree to decide an activity based on the weather conditions.
Algorithm Strategy
1. Select the Best Attribute
Evaluate all features to find the one that best splits the data. This is typically done using metrics like Gini Impurity, Information Gain, or Variance Reduction.
2. Split the Dataset
Divide the dataset into subsets based on the selected attribute. Create a child node for each possible outcome of the condition.
3. Repeat Recursively
Repeat the process for each subset. Stop when the subset is pure (all examples belong to the same class) or a stopping criterion (like max depth) is reached.
Key Takeaways
- Decision Trees are prone to overfitting if they are allowed to grow too deep.
- Pruning is a technique used to remove sections of the tree that provide little power to classify instances, thus reducing overfitting.
- They handle both numerical and categorical data naturally.