Transfer Learning etc
I think this’d work best if I just updated my daily progress here rather than trying to give a comprehensive view of what I’m doing.
So you have data coming in that needs to be classified. Apparently the accuracy of most classifiers is abysmally low. We need to build a better classifier.
I took a month’s worth of data, ran a range of classifiers on it, and cross-validated the results. Accuracy was roughly 85-90%. While that’s not excellent, it’s not bad, given the small amount of training data.
So what’s this low accuracy everyone’s talking of?
It turns out that the new data coming in is very different from the data you train on. You’ll train on June’s data, but July’s data is going to look quite different. The same words will end up appearing under different class labels. Hence the low accuracy.
Also, you have not just one classifier, but many. It turns out that many classifiers, each trained on a subset of the data, perform better than one classifier trained on the entire data.
Learning will have to keep evolving. I first thought of Active Learning in this context, where you expect the user to label the stuff you are not sure about. But then, what if the classifier confidently assigns a label that is patently wrong? It would never ask the user about those items.
The many-classifiers part of the problem helps us visualize the training data in a different way: each category (class label) has many sub-categories. Now each classifier is trained on a month’s worth of data. It turns out that each month can be likened to a sub-category. You train on one sub-category, test on another sub-category, and expect the classifier to return the same class label. That’s like training a classifier on data that contains hockey-related terms for the Sports label, and then expecting it to recognize data that contains cricket-related terms as Sports too.
This would be transfer learning/domain adaptation: you learn on one distribution, and test on a different distribution. The class labels and features, however, remain the same.
This would more specifically be Transductive Transfer Learning – you have a training set, from distribution D1, and a test set, from distribution D2. You have this unlabelled test data available during training, and you somehow use this to tweak the model you’ll learn from the training data.
Many ways exist to do this. You can apply Expectation-Maximization to a Naive Bayes classifier trained on the training data, to maximize the likelihood of the test data while still doing well on the training data. Or you can train an SVM, assign pseudo-labels to the test data, add those to the next iteration of training, and repeat until you get reasonable confidence measures on the test data, again while still doing well on the training data.
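To make the EM idea concrete for myself, here is a minimal sketch in plain Python. It is not the algorithm from any particular paper, just the generic semi-supervised recipe: train Naive Bayes on the labelled training documents, then alternate between soft-labelling the unlabelled test documents (E-step) and re-estimating the model from both sets (M-step). The function names (fit_nb, posterior, em_adapt), the smoothing constant and the iteration count are all my own assumptions.

```python
import math
from collections import Counter

def fit_nb(docs, weights, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes from (possibly soft) labels.
    docs: list of token lists; weights: one {label: probability} dict per doc."""
    prior = {c: alpha for c in labels}
    counts = {c: Counter() for c in labels}
    for tokens, w in zip(docs, weights):
        for c in labels:
            prior[c] += w.get(c, 0.0)
            for t in tokens:
                counts[c][t] += w.get(c, 0.0)
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in labels}
    log_like = {}
    for c in labels:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like

def posterior(model, tokens, labels):
    """P(label | tokens) under the model; tokens outside the vocabulary are skipped."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
              for c in labels}
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_adapt(train_docs, train_labels, test_docs, n_iter=10):
    """Assumed setup: labelled training docs plus unlabelled test docs at training time."""
    labels = sorted(set(train_labels))
    vocab = {t for d in train_docs + test_docs for t in d}
    hard = [{y: 1.0} for y in train_labels]
    model = fit_nb(train_docs, hard, labels, vocab)
    for _ in range(n_iter):
        # E-step: soft-label the test documents with the current model.
        soft = [posterior(model, d, labels) for d in test_docs]
        # M-step: re-estimate from hard training labels and soft test labels.
        model = fit_nb(train_docs + test_docs, hard + soft, labels, vocab)
    return model, labels
```

On toy data where training mentions hockey and the test set mentions cricket, words shared between the two sets (like "team") are what let the Sport label spread to the cricket vocabulary. If there is no such overlap at all, this sketch has nothing to bootstrap from.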
All these approaches to Transductive Transfer Learning are fine. They assume that you have test data available during training time.
We have a slight twist on that. It might be too expensive to store the training data. Or you might have privacy concerns and hide your data, but just expose a classifier you’ve built on top of it. So, essentially, all you have is a classifier, and you need to tweak that when test data is available.
Let’s complicate it further. You have a set of classifiers. You can pick and choose classifiers you want to combine based on some criteria on the test data, create a superclassifier, and then try tweaking that based on the test data.
For starters, check out this paper by Dai. There, you have access to the training data. What if you don’t? Can you then tweak the classifier without knowing the data underlying it?
Let’s assume it’s possible.
Then, based on some criterion you pick, you choose a few classifiers from the many that you have, merge them, and then tweak that superclassifier. For example, say your test data is about hockey and Indian films [labels are Sport and Film]. You have one classifier trained on cricket and Indian films (C1), one on hockey and Persian films (C2), and another on football and Spanish films (C3). C1 and C2 are the classifiers closest to your data. You combine C1 and C2 so that you get a classifier equivalent to one trained on hockey, Persian films, cricket and Indian films. That is the best starting point you can build from what you have. And then you tweak it.
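Here is a rough sketch of the pick-and-merge step, just to pin the idea down. Everything in it is an assumption on my part: the selection criterion (average confidence on the unlabelled test data), the merge rule (averaging predicted probabilities), and the keyword-based toy classifiers that stand in for classifiers whose training data is hidden. Choosing the right criterion is exactly one of the open questions.

```python
def keyword_classifier(sport_words, film_words):
    """Hypothetical stand-in for a pre-trained classifier with hidden training data."""
    def clf(tokens):
        s = sum(t in sport_words for t in tokens)
        f = sum(t in film_words for t in tokens)
        p_sport = (s + 1) / (s + f + 2)  # smoothed vote between the two labels
        return {"Sport": p_sport, "Film": 1 - p_sport}
    return clf

def mean_confidence(clf, test_docs):
    """Assumed closeness criterion: average top-label probability on the test data."""
    return sum(max(clf(d).values()) for d in test_docs) / len(test_docs)

def select(classifiers, test_docs, k=2):
    """Return the names of the k classifiers that look closest to the test data."""
    ranked = sorted(classifiers,
                    key=lambda name: mean_confidence(classifiers[name], test_docs),
                    reverse=True)
    return ranked[:k]

def merge(chosen):
    """Assumed merge rule: average the chosen classifiers' probabilities."""
    def superclassifier(tokens):
        total = {}
        for clf in chosen:
            for label, p in clf(tokens).items():
                total[label] = total.get(label, 0.0) + p / len(chosen)
        return total
    return superclassifier

# Toy pool mirroring the example: C1 cricket/Indian films,
# C2 hockey/Persian films, C3 football/Spanish films.
pool = {
    "C1": keyword_classifier({"cricket", "wicket"}, {"bollywood", "mumbai"}),
    "C2": keyword_classifier({"hockey", "puck"}, {"tehran", "farsi"}),
    "C3": keyword_classifier({"football", "goalkeeper"}, {"madrid", "almodovar"}),
}
test_docs = [["hockey", "puck", "match"], ["bollywood", "mumbai", "premiere"]]
chosen = select(pool, test_docs, k=2)
superclassifier = merge([pool[name] for name in chosen])
```

Note that confidence is a shaky criterion: a classifier can be confident and wrong on unseen data, which is the same worry I had about Active Learning.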
That’s the architecture.
The questions we seek to answer are: how to choose which classifiers to merge, how to tweak the merged classifier given test data, and whether we’d need any extra data.
And a more fundamental question: given that the test data is going to be from a distribution that none of the classifiers in your ensemble has seen before, is it worthwhile to merge classifiers? One way to find out is to vary the KL-divergence between the training and test distributions and see how much having an ensemble helps.
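For the KL-divergence experiment, one simple way to put a number on how far a test month has drifted is to compare smoothed word distributions. This is a sketch under my own assumptions: unigram distributions over the union vocabulary, add-one smoothing so the divergence stays finite, and the direction KL(test || train). The names word_distribution, kl_divergence and drift are made up for illustration.

```python
import math
from collections import Counter

def word_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over vocab for a list of token lists."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must share the same keys and be strictly positive."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def drift(train_docs, test_docs):
    """KL(test || train) over the union vocabulary of the two document sets."""
    vocab = {t for d in train_docs + test_docs for t in d}
    train_dist = word_distribution(train_docs, vocab)
    test_dist = word_distribution(test_docs, vocab)
    return kl_divergence(test_dist, train_dist)
```

With a measure like this, you could bucket test sets by how far they are from the training data and compare a single classifier against the ensemble in each bucket.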
Nascent ideas so far.