Monthly Archives: July 2010
I think this’d work best if I just updated my daily progress here rather than trying to give comprehensive views of what I’m doing.
So you have data coming in that needs to be classified. Apparently the accuracy of most classifiers is abysmally low. We need to build a better classifier.
I took a month’s worth of data, applied all the classifiers I could to it, and cross-validated. Accuracy was roughly in the 85-90% range. While that’s not excellent, it’s not bad, given the small amount of training data.
So what’s this low accuracy everyone’s talking of?
Turns out, the new data coming in is very different from the data you train on. You’ll train on June’s data, but July’s data is going to be much different. The same words will end up reappearing under different class labels. Hence the low accuracy.
Also, you have not just one classifier, but many. It turns out that many classifiers, each trained on a subset of the data, perform better than one classifier trained on the entire data.
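To make the "same words, different labels" point concrete, here is a minimal sketch that lists the words whose majority label flips between two months. The two tiny corpora and the function names are made up for illustration; they are not the data from this project.

```python
from collections import Counter, defaultdict

def majority_label_per_word(labelled_docs):
    """Map each word to the label it most often appears under."""
    counts = defaultdict(Counter)
    for text, label in labelled_docs:
        for word in set(text.split()):
            counts[word][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def flipped_words(month_a, month_b):
    """Words present in both months whose majority label differs."""
    a = majority_label_per_word(month_a)
    b = majority_label_per_word(month_b)
    return sorted(w for w in a.keys() & b.keys() if a[w] != b[w])

# Toy, hypothetical data standing in for two months of labelled text.
june = [("star scores goal", "Sport"), ("actor wins award", "Film")]
july = [("star signs new film", "Film"), ("captain wins match", "Sport")]

print(flipped_words(june, july))
```

A long list of flipped words between consecutive months would be one rough sign that a classifier trained on the earlier month will struggle on the later one.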
Learning will have to keep evolving. I first thought of Active Learning in this context, where you’d expect the user to label the stuff you are not sure about. But then, what if you are confident about a label that is patently wrong?
The many-classifiers bit of the problem helps us visualize the training data in a different way: each category (class label) has many sub-categories. Now each classifier is trained on a month’s worth of data. It turns out that each month can be likened to a sub-category. You train on one sub-category, test on another sub-category, and expect it to return the same class label. That’s like training a classifier on data that contains hockey-related terms for the Sports label, and then expecting it to recognize data that contains cricket-related terms as Sports too.
This would be transfer learning/domain adaptation: you learn on one distribution, and test on a different distribution. The class labels and features, however, remain the same.
This would more specifically be Transductive Transfer Learning – you have a training set, from distribution D1, and a test set, from distribution D2. You have this unlabelled test data available during training, and you somehow use this to tweak the model you’ll learn from the training data.
Many ways exist to do this. You can apply Expectation-Maximization to a Naive Bayes classifier trained on the training data, to maximize the likelihood of the test data while still doing well on the training data. Or you can train an SVM, assign pseudo-labels to the test data, and add those to the next iteration of training, repeating until you get reasonable confidence measures on the test data while still doing well on the training data.
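Here is a rough sketch of the EM-on-Naive-Bayes idea, in plain Python. It is only an illustration under my own assumptions: a multinomial Naive Bayes with add-one smoothing, a fixed number of iterations, and toy hockey/cricket data invented for the example. The labelled training documents keep their hard labels in every M-step, which is how the model keeps doing well on the training data.

```python
import math
from collections import defaultdict

LABELS = ["Sport", "Film"]

def fit(weighted_docs, vocab, alpha=1.0):
    """M-step: smoothed log-priors and log-likelihoods from
    (tokens, {label: weight}) pairs; weights may be fractional."""
    class_w = {c: 0.0 for c in LABELS}
    word_w = {c: defaultdict(float) for c in LABELS}
    for tokens, weights in weighted_docs:
        for c, w in weights.items():
            class_w[c] += w
            for t in tokens:
                word_w[c][t] += w
    total = sum(class_w.values())
    prior = {c: math.log((class_w[c] + alpha) / (total + alpha * len(LABELS)))
             for c in LABELS}
    lik = {}
    for c in LABELS:
        denom = sum(word_w[c].values()) + alpha * len(vocab)
        lik[c] = {t: math.log((word_w[c][t] + alpha) / denom) for t in vocab}
    return prior, lik

def posterior(tokens, model):
    """E-step for one document: P(label | tokens) under the current model."""
    prior, lik = model
    scores = {c: prior[c] + sum(lik[c][t] for t in tokens if t in lik[c])
              for c in LABELS}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def predict(model, text):
    post = posterior(text.split(), model)
    return max(post, key=post.get)

def em_naive_bayes(train, test_docs, iterations=10):
    labelled = [(text.split(), {label: 1.0}) for text, label in train]
    unlabelled = [text.split() for text in test_docs]
    vocab = {t for toks, _ in labelled for t in toks}
    vocab |= {t for toks in unlabelled for t in toks}
    model = fit(labelled, vocab)  # start from the training data only
    for _ in range(iterations):
        soft = [(toks, posterior(toks, model)) for toks in unlabelled]  # E-step
        model = fit(labelled + soft, vocab)                             # M-step
    return model

# Toy data: labelled "June" (hockey), unlabelled "July" (drifting to cricket).
train = [("hockey puck goal match", "Sport"), ("film actor song dance", "Film")]
test_docs = ["match cricket wicket", "cricket wicket bat",
             "film actor director", "director script scene"]

model = em_naive_bayes(train, test_docs)
print(predict(model, "cricket wicket bat"), predict(model, "director script scene"))
```

In this toy setup, "cricket wicket bat" shares no words with the labelled data, so the starting model has nothing to go on. The bridging document "match cricket wicket" is what lets EM pull the cricket terms towards Sport.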
All these approaches to Transductive Transfer Learning are fine. They assume that you have test data available during training time.
We have a slight twist on that. It might be too expensive to store the training data. Or you might have privacy concerns and hide your data, but just expose a classifier you’ve built on top of it. So, essentially, all you have is a classifier, and you need to tweak that when test data becomes available.
Let’s complicate it further. You have a set of classifiers. You can pick and choose classifiers you want to combine based on some criteria on the test data, create a superclassifier, and then try tweaking that based on the test data.
For starters, check this paper by Dai out. Here, you have access to the training data. What if you don’t? Can you then tweak the classifier without knowing the data underlying it?
Let’s assume it’s possible.
Then, based on some criteria you pick, you choose a few classifiers from the many that you have, merge them, and tweak the resulting superclassifier. For example, say your test data contains data related to hockey and Indian films [labels are Sport and Film]. You have one classifier (C1) trained on cricket and Indian films, one (C2) on hockey and Persian films, and another (C3) on football and Spanish films. C1 and C2 are the classifiers closest to your data. You combine C1 and C2 to get a classifier equivalent to one trained on hockey, Persian films, cricket and Indian films. That is the best classifier you can build from what you have. And then you tweak it.
That’s the architecture.
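A toy sketch of the select-and-merge step, just to pin the idea down. Everything here is an assumption made for illustration: each classifier is reduced to per-label word counts (so no training data is needed), "closeness" is measured as the fraction of test tokens a classifier has seen, and merging simply adds the counts. The real selection criterion and merge rule are exactly the open questions.

```python
from collections import Counter

# Hypothetical classifiers: each exposes only per-label word counts.
classifiers = {
    "C1": {"Sport": Counter(cricket=5, wicket=4, bat=3),
           "Film": Counter(bollywood=5, song=4, dance=3)},
    "C2": {"Sport": Counter(hockey=5, puck=4, stick=3),
           "Film": Counter(persian=5, tehran=4, drama=3)},
    "C3": {"Sport": Counter(football=5, striker=4, penalty=3),
           "Film": Counter(spanish=5, madrid=4, drama=3)},
}

def coverage(classifier, test_docs):
    """Fraction of test tokens the classifier has seen under any label."""
    known = set().union(*classifier.values())
    tokens = [t for doc in test_docs for t in doc.split()]
    return sum(t in known for t in tokens) / len(tokens)

def select(classifiers, test_docs, k=2):
    """Names of the k classifiers that cover the test data best."""
    ranked = sorted(classifiers,
                    key=lambda name: coverage(classifiers[name], test_docs),
                    reverse=True)
    return ranked[:k]

def merge(selected):
    """Superclassifier: add up the per-label word counts."""
    merged = {}
    for clf in selected:
        for label, counts in clf.items():
            merged.setdefault(label, Counter()).update(counts)
    return merged

def predict(classifier, doc):
    scores = {label: sum(counts[t] for t in doc.split())
              for label, counts in classifier.items()}
    return max(scores, key=scores.get)

# Unlabelled test data about hockey and Indian films.
test_docs = ["hockey stick puck", "bollywood song dance",
             "hockey puck", "bollywood dance"]

chosen = select(classifiers, test_docs)
superclassifier = merge(classifiers[name] for name in chosen)
print(chosen, [predict(superclassifier, d) for d in test_docs])
```

The tweaking step would then start from `superclassifier` and the unlabelled test data.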
The questions we seek to answer are how to choose which classifiers to merge, how to tweak the merged classifier given the test data, and whether we’d need any extra data.
And a more fundamental question: given that the test data is going to come from a distribution that none of the classifiers in your ensemble has seen before, is it worth the while to merge classifiers at all? One way to find out is to vary the KL-divergence between the training and test distributions and see how much having an ensemble helps.
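One simple way to put a number on how far the test data has drifted is the KL-divergence between word distributions. The sketch below assumes unigram distributions with add-one smoothing over a shared vocabulary; the corpora are toy examples, and a real experiment might well measure the shift differently.

```python
import math
from collections import Counter

def word_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Toy corpora: one training month and two candidate test months.
june = ["hockey puck goal", "film actor song"]
july_near = ["hockey goal stick", "film song actor"]
july_far = ["cricket wicket bat", "opera stage aria"]

vocab = {w for docs in (june, july_near, july_far) for d in docs for w in d.split()}
p_june = word_distribution(june, vocab)
kl_near = kl_divergence(word_distribution(july_near, vocab), p_june)
kl_far = kl_divergence(word_distribution(july_far, vocab), p_june)
print(kl_near, kl_far)
```

Plotting ensemble accuracy against a measure like this, for test sets at increasing distance from the training months, is one way to see where merging stops paying off.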
Nascent ideas so far.
So this post will slightly deviate from the general tone of this blog. It is a tad more personal.
I’ve just come out of a phase of unstructured time in which I really wanted to fix short-term goals for myself, and failed miserably. At the end of it all, I watched Julie & Julia, where the protagonist uses her blog to set short-term goals for herself and to check off each one as she achieves it. I want to emulate that.
I am now interning at a research lab in the industry. I am deciding on a problem statement that will mostly involve some form of transductive transfer learning. I have a great work environment, and an awesome mentor who helps with the short-term goal setting.
In such a setting, I feel I should probably maintain a daily log of how things are progressing, so that I can refer back to these notes later when I want to know how to set goals and progress with research. I have a controlled environment now, and it’d be interesting as well as helpful to document my time here such that I can replicate it elsewhere.
Most of my work will involve previous work and data that’s in the public domain, so I don’t think it’ll be a breach of any contract or NDA to talk about them on a public forum. Still, I might choose to make the text unsearchable by hiding the posts behind a password, while keeping the password public. Not many know of the existence of this blog, so this, I guess, would be a sane decision. I’ll have to check with my superiors anyway #TODO.
Alrighty. Next post possibly coming in another hour.