Many data science competitions suffer from a test set being markedly different from a training set (a violation of the “identically distributed” assumption). It is then difficult to make a representative validation set. We propose a method for selecting training examples most similar to test examples and using them as a validation set. The core of this idea is training a probabilistic classifier to distinguish train/test examples.
In part one, we inspect the ideal case: training and testing examples coming from the same distribution, so that the validation error should give good estimation of the test error and classifier should generalize well to unseen test examples.