How can you make sure that the model you are using is not overfitting?
Imagine you’re planning on studying for your finals. Your course content (lecture slides, textbook) is your training data. There’s also last year’s exam. The ideal way to study would be to go over your training data, master it, and then attempt the past exam to check that you really know the content; the past exam plays the role of ‘cross validation’.
Some students might be tempted to just breeze through the training data and then try the past exam. They realize they don’t know many of the answers, so they look back at their training data to figure out the answers to exactly those questions. Once they’ve mastered those questions, they get a false sense of security that they know their content. But essentially they’ve overfit to the past exam. If the prof asks similar questions the next day, the students will do well; if not, they’re pretty screwed. This is why avoiding overfitting matters: you have to ensure that your algorithm generalizes well beyond the data it has already seen.
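The same gap shows up in code. A minimal sketch using scikit-learn (the dataset and model choices here are just illustrative): an unconstrained decision tree can memorize its training set, so its training accuracy is near-perfect while cross-validation on held-out folds reveals weaker generalization. The size of that gap is the overfitting signal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for "course content".
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# No depth limit, so the tree is free to memorize the training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Accuracy on the data the model has already seen ("the lecture slides").
train_acc = model.score(X, y)

# 5-fold cross-validation: average accuracy on held-out folds
# ("the past exam" the model never studied from).
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"train accuracy:     {train_acc:.2f}")
print(f"cross-val accuracy: {cv_acc:.2f}")
```

A large difference between the two scores means the model has learned the training set rather than the underlying pattern; constraining the tree (e.g. setting `max_depth`) typically narrows the gap.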