With Machine Learning becoming a buzzword, it is often the case that we lose sight of the need to be careful with data. It is not enough to simply split your data to get a test set and use the resulting performance to generalize. We need to be careful, really careful, about how we use our data. Let us look at the end goal of machine learning to explore a problem that crops up if we don't understand our data well. Any machine learning project aims to
- Learn a function that is able to predict the data at hand well
- Use this function to predict the data in real world
The salient point here is that we should be able to assume that the statistics we compute on our data will apply to the real world too. Unfortunately, we are limited by the data we have; there is no workaround. So we split the data into two sets, train and test, then hide the test set from everyone. Only after all the training and validation do we take out this data, compute the statistics, and proudly show them to the world. The idea is that since we have not seen the test data during any training phase, it is, to the trained model, essentially the same as the real-world data that will arrive on deployment. However, this assumption may not hold when we iterate over models in the course of development. Unfortunately, this kind of development is unavoidable in many cases. Let us look at one such example and see what we can do.
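As a minimal sketch of this discipline, here is the one-time split described above, written in plain Python with a toy dataset (the data, the seed, and the 80/20 ratio are all illustrative assumptions):

```python
import random

# Hypothetical sketch: split a dataset once into train and test,
# then lock the test set away until final reporting.
random.seed(0)
data = list(range(100))        # stand-in for real samples
random.shuffle(data)

split = int(0.8 * len(data))   # 80/20 split (an arbitrary but common choice)
train_data = data[:split]
test_data = data[split:]       # hidden from everyone until the very end
```

The key property is that the split happens exactly once: `test_data` is never consulted during training or model selection.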
Fig 1 Usual workflow of hyperparameter optimization and training
For any project, we need to define metrics so that we can compare different versions of the product as we develop it. To do so, we keep aside some of the target data. Let us call this data the test data. Every now and then, we build a new model, test it, get the accuracy, and choose the model accordingly.
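The benchmarking step itself is nothing more than scoring predictions against held-out labels. A minimal sketch, with toy predictions and labels (all values here are made up for illustration):

```python
# Hypothetical sketch of benchmarking: compare a model's predictions
# against the held-out test labels and report accuracy.
def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 4 of 5 predictions match the truth.
preds = [1, 0, 1, 1, 0]
truth = [1, 0, 0, 1, 0]
score = accuracy(preds, truth)   # -> 0.8
```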
Fig 2 Benchmarking of a model
Whenever we make a new model, we repeat this process and keep the model with the best performance. Let us reframe this process. If you have models m0, m1, m2, m3, the process can be shown as,
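The repeated pick-the-best loop can be sketched as follows. The scores and the `accuracy_on_test` helper are hypothetical stand-ins for running each model against the test set:

```python
# Hypothetical sketch: evaluate models m0..m3 on the SAME test set
# and keep whichever scores highest.
def accuracy_on_test(model_id):
    # Placeholder scores; in practice this would run model m<id>
    # on the held-out test data.
    scores = {0: 0.81, 1: 0.84, 2: 0.79, 3: 0.86}
    return scores[model_id]

best_model, best_acc = None, -1.0
for model_id in [0, 1, 2, 3]:          # m0, m1, m2, m3
    acc = accuracy_on_test(model_id)
    if acc > best_acc:
        best_model, best_acc = model_id, acc

# The test set has now been used to *choose* a model -- it is acting
# as validation data, so best_acc is an optimistic estimate.
```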
Fig 3 Benchmarking multiple models
Notice the similarity with Fig 1: we have in effect turned our testing process into a validation process. Sadly, along with it, we have converted our test data into validation data!
The frequent iterations have used our test data to search for a hyperparameter, in this case the model version. As a result, the results we report may not generalize to real-world data. This problem of bias in model selection illustrates one of the pitfalls of just following the recipe rather than understanding the statistics.
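The bias is easy to demonstrate with a toy experiment (entirely synthetic, constructed for this illustration): if the labels are pure coin flips, no model can truly beat 50% accuracy, yet the best of many random "models" evaluated on the same test set will look noticeably better than chance.

```python
import random

random.seed(42)
n_test = 50

# Labels are coin flips, so true accuracy of any model is 50%.
labels = [random.randint(0, 1) for _ in range(n_test)]

def random_model_accuracy():
    # A "model" that guesses at random, scored on the fixed test set.
    preds = [random.randint(0, 1) for _ in range(n_test)]
    return sum(p == y for p, y in zip(preds, labels)) / n_test

# Pick the best of 20 random models using the same test set.
best = max(random_model_accuracy() for _ in range(20))
# best exceeds 0.5 purely through selection, even though every
# model is guessing -- the reported number is biased upward.
```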
One way to remove this bias is to use a new dataset every time you evaluate your models. In our case, we have a steady stream of data coming in every day, so each evaluation can use fresh data, which we then fold into the training set. However, this is not feasible in many cases where fresh data is not available. Even if it is available, you may want to use it for training. Hence there are no readily available solutions that will work for everyone; we have to devise methods suited to the context of the whole project.
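The streaming scheme above can be sketched like this. The data values, batch size, and helper name are hypothetical; the point is only the flow: every evaluation draws never-seen samples, and the spent test batch then joins the training pool.

```python
from collections import deque

# Hypothetical sketch: fresh test data for each evaluation round.
train_pool = list(range(100))      # stand-in for the existing training data
stream = deque(range(100, 160))    # stand-in for the daily incoming data

def next_test_batch(batch_size=20):
    # Draw a batch of never-before-seen samples from the stream.
    return [stream.popleft() for _ in range(batch_size)]

test_batch = next_test_batch()
# ... evaluate the current model on test_batch here ...
train_pool.extend(test_batch)      # spent test data is reused for training
```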
Harinarayanan K K
M. R. Yousefi, J. Hua, and E. R. Dougherty, "Multiple-rule bias in the comparison of classification rules," Bioinformatics, vol. 27, no. 12, pp. 1675–1683, Jun. 2011.
A.-L. Boulesteix and C. Strobl, "Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction," BMC Medical Research Methodology, vol. 9, p. 85, 2009.