Validation, Generalization, and Drift
Discover why perfect accuracy on your sample data isn't enough and how models fail when real-world patterns start to drift.
A 4-out-of-4 accuracy on a four-user sample sounds impressive. It isn't. The dangerous question is: does the model actually understand what makes a refund legitimate, or did it just memorize four people?
In the previous post, we trained a Decision Tree to 100% accuracy. Now we find out if it generalizes.
Validation: Testing on Data the Model Never Saw
The standard practice in ML is to hold back a portion of your labeled data during training, then test the trained model against it. This is called validation data: examples where you already know the right answer, but the model has never seen them.
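Holding data back can be sketched in a few lines. This is a minimal illustration, not the pipeline from this series: the function name `train_validation_split`, the 20% fraction, and the placeholder rows are all assumptions made for the example.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle labeled rows and hold back a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # The first slice trains the model; the second is never shown to it.
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder rows standing in for labeled refund requests (hypothetical).
labeled = [{"id": i} for i in range(10)]
train, validation = train_validation_split(labeled)
print(len(train), len(validation))
```

The important property is that the two lists never overlap: the model is fitted on `train` only, and `validation` stays unseen until it is time to test.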
Meet Sarah: 2 years active, a $50 refund request, 1 account on her IP. Her historic label is 1 (Refund).
Running Sarah through our model:
- Years (2.0 > 0.9)? Yes.
- Amount ($50 < $201)? Yes.
- IP Accounts (1 < 3)? Yes.
Model output: 1. Correct. One validation case, one pass.
In practice, the validation set would hold something like 20,000 examples, not one. But the logic is the same.
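Sarah's walkthrough can be written as code. This is a sketch that assumes the model is simply the three threshold checks above combined with the "2 of 3" rule; the function name `predict_refund` and the field names are my own labels for illustration.

```python
def predict_refund(years_active, amount, ip_accounts):
    """Return 1 (approve) if at least two of the three checks pass, else 0."""
    checks = [
        years_active > 0.9,  # account age threshold
        amount < 201,        # refund amount threshold
        ip_accounts < 3,     # accounts sharing the same IP
    ]
    return 1 if sum(checks) >= 2 else 0

# Sarah is a validation case: we already know her label.
sarah = {"years_active": 2.0, "amount": 50, "ip_accounts": 1}
sarah_label = 1  # historic label: refund granted

prediction = predict_refund(**sarah)
print("prediction:", prediction, "matches label:", prediction == sarah_label)
```

With a real validation set, the same comparison would run over every held-back row and report the fraction that match.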
Inference: Predicting Without a Label
Now the actual use case. Max just submitted a refund request. We don't have a historic label for him; we need to make a prediction.
Max has:
- 1 year active.
- $199 refund.
- 150 accounts on his IP.
Running Max through our model:
- Years (1.0 > 0.9)? Yes.
- Amount ($199 < $201)? Yes.
- IP Accounts (150 < 3)? No.
Two out of three pass. Our model's "2 of 3" rule approves Max. The model predicts: 1. Refund approved.
Max runs a bot farm. He has done this 150 times today. Our model just handed him a refund on each one.
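To see exactly where the vote goes wrong, here is the same rule with each check reported separately. As before, this assumes the model is just the three thresholds plus the "2 of 3" rule; `explain_refund` is a hypothetical helper name.

```python
def explain_refund(years_active, amount, ip_accounts):
    """Return the per-check results and the final 2-of-3 prediction."""
    checks = {
        "years_active": years_active > 0.9,
        "amount": amount < 201,
        "ip_accounts": ip_accounts < 3,
    }
    prediction = 1 if sum(checks.values()) >= 2 else 0
    return checks, prediction

# Max has no historic label, so this is inference, not validation.
checks, prediction = explain_refund(years_active=1.0, amount=199, ip_accounts=150)
for name, passed in checks.items():
    print(name, "pass" if passed else "FAIL")
print("prediction:", prediction)
```

The IP check fails, but the other two outvote it. Nothing in the rule treats 150 accounts on one IP as more alarming than 3.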
Figure: Training, validation, and inference pipeline showing the three stages of the ML model lifecycle.
Concept Drift
The failure has a name: Concept Drift. Our sample did not contain anyone who had 150 accounts on a single IP while still having a plausible account age and refund amount. Max deliberately engineered his requests to stay under our thresholds.
This is the gap between sample and population. Our sample encoded the patterns of legitimate users and obvious fraudsters. It did not encode the pattern of a sophisticated attacker who studies the boundary conditions.
Drift happens in two ways:
- Sample drift: The population was always this way, but our sample didn't reflect it.
- Concept drift: The population genuinely changes over time. Before the system was automated, people had to chat with a rep to get a refund. Once automation exists, fraudulent behaviour changes in response.
Both result in a model that is confidently wrong.
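One cheap guard against both kinds of drift is to flag inputs that fall outside anything the sample covered. The sketch below is a simple range check, not a full drift detector, and the training values in it are made up for illustration; they are not the sample from the previous post.

```python
def feature_ranges(rows):
    """Record the min and max seen for each feature in the training sample."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges

def out_of_range(row, ranges):
    """List the features whose value lies outside what the sample covered."""
    return [
        name
        for name, value in row.items()
        if name in ranges and not ranges[name][0] <= value <= ranges[name][1]
    ]

# Hypothetical training sample (illustrative values only).
training_sample = [
    {"years_active": 3.0, "amount": 40, "ip_accounts": 1},
    {"years_active": 0.5, "amount": 500, "ip_accounts": 2},
    {"years_active": 5.0, "amount": 120, "ip_accounts": 4},
    {"years_active": 1.5, "amount": 250, "ip_accounts": 1},
]
ranges = feature_ranges(training_sample)

max_request = {"years_active": 1.0, "amount": 199, "ip_accounts": 150}
flagged = out_of_range(max_request, ranges)
print(flagged)
```

A flagged request does not prove fraud. It only says the model is being asked about territory it never saw, which is a good reason to route the request to a human.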
Overfitting vs. Generalization
The other failure mode is overfitting: tuning the model so precisely to the sample that it memorizes the training data instead of learning the underlying pattern.
When I changed the threshold from > 1 to > 0.9 to help Peter, I was already at risk of overfitting. That tiny boundary shift was calibrated to one specific user, not to a real insight about what makes a refund legitimate.
A model that generalizes well will sometimes be slightly wrong on the training data, and that is fine. Trying to achieve 100% on the sample is usually a sign of overfitting.
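A practical way to spot overfitting is to compare accuracy on the training sample with accuracy on held-back data. The sketch below assumes the same three-threshold, "2 of 3" rule as before; every labeled row in it is hypothetical and exists only to show the comparison.

```python
def predict_refund(years_active, amount, ip_accounts):
    """Return 1 (approve) if at least two of the three checks pass, else 0."""
    checks = [years_active > 0.9, amount < 201, ip_accounts < 3]
    return 1 if sum(checks) >= 2 else 0

def accuracy(predict, rows):
    """Fraction of labeled rows where the prediction matches the label."""
    correct = sum(1 for features, label in rows if predict(*features) == label)
    return correct / len(rows)

# Hypothetical rows: ((years_active, amount, ip_accounts), label)
train_rows = [
    ((2.0, 50, 1), 1),
    ((0.95, 100, 1), 1),
    ((0.2, 900, 8), 0),
    ((3.0, 150, 2), 1),
]
validation_rows = [
    ((1.0, 199, 150), 0),  # a Max-style request, labeled as fraud
    ((4.0, 80, 1), 1),
    ((0.5, 300, 1), 0),
    ((2.5, 120, 2), 1),
]

train_acc = accuracy(predict_refund, train_rows)
val_acc = accuracy(predict_refund, validation_rows)
print("train:", train_acc, "validation:", val_acc, "gap:", train_acc - val_acc)
```

A perfect training score paired with a lower validation score is the pattern to watch for. The size of the gap matters more than either number on its own.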
In the next post, we zoom out and look at the three major strategies ML uses to avoid these problems at scale.
The Essentials
- Validation Data: Labeled examples held back from training, used to test whether the model generalizes.
- Inference: Using a trained model to predict an outcome where no label exists yet.
- Concept Drift: The real-world population changes over time, so the patterns the sample encoded no longer hold. When the sample never reflected the population in the first place, that is sample drift.
- Overfitting: Fitting the model too tightly to the sample, at the cost of accuracy on the population.