Category Archives: machine translation

Looking for a dataset for Kannada Machine Translation

Someone who works for the Govt of India told me about the Indian Gazette, which published a summary of all the activities of the government in English and Hindi. And there are state gazettes as well, which I assumed did the same. I found that the central government puts out the gazette with the same content in both English and Hindi. As perfect a sentence-alignment as you can expect.

Unfortunately, it doesn’t seem like the Karnataka government does that. They publish everything in only Kannada. The Kerala government publishes in only English. And the Tamil Nadu government publishes some bullet points in English and some in Tamil.

I’d not checked on this earlier, unfortunately. Now I’m back to square one, looking for a dataset for Kannada machine translation. Know of any?

Ideas in machine translation for Indian languages

For a while now, I’ve been pondering the problem of machine translation for Indian languages.

Given India has fourteen official languages, that are pretty damn closely related, and given there are so many enthusiastic people in the NLP domain, we should be at the forefront of machine translation. Unfortunately that is not the case. Yet.

It also bothered me that the current leader in a working machine translation system is Google. Google, while having some of the best scientists and engineers, is American in soul and legality. There are several reasons to have homegrown machine translation systems that are made in India, and which have a more Indian focus.

In any case, I haven’t worked personally on machine translation systems, though I have worked with colleagues who have, and it gives me a vague understanding of how it works. From what I’ve seen, Google is great in the generic case. But if you have a very specific focus, say, in the financial domain, or you want the translations to be conversational, or if you want to restrict yourself to the legal domain, you would need to improve and tweak what translations Google throws at you.

I’ve also seen that most of the machine translation work in Indian languages has been very academic. This is welcome, but in practice, these things don’t usually make it to the market. In my experience, approaching a problem like this from an academic perspective is very different from approaching it as an engineer. In academia, I have largely seen the approaches be technique based. The problem is just a setting to explore new techniques of solving it. This works brilliantly in uncovering new approaches to solving a problem. When I was at UCI, this defined my approaches. To find a newer, more improved technique to do something. As an engineer however, you want to find and implement a ‘good enough’ solution. You want whatever works. You don’t care if you need to have humans in the loop, or if your training data isn’t perfect. I haven’t seen (m)any Indian language translation systems with this approach of using whatever works, getting humans in the loop, and giving imperfect outputs.

I want to try this.

There are so many cool questions I want answered. Like, how easy will it be to translate between closely-related languages like Tamil-malayalam, Hindi-Punjabi, Assamese-Oriya-Bengali, Kannada-Telugu…. and so on? How well will Jason Baldridge’s two-hour-tagging-required POS-tagger work on Indian languages? What happens if I use Sanskrit as interlingua?

I also found that the largest corpus of cross-language translation for Indian languages is the Gazette of India. It is a Govt of India communication, that is posted in English as well as other languages. I think Google uses this for its statistical machine translation heavily. Unsurprisingly, there is a very formal tilt to the translations. This is way more pronounced in Indian languages where formal style is very, very different from casual conversation. Detecting the formality level of an English sentence and translating it appropriately into an Indian language seems like an interesting problem.

Use cases for translation in India are also something I wonder very hard about.

The obvious use case is a generic translation app. This is not something I’m inclined to go head-to-head with Google on. Not right away at least 😉 But it ought to be something we keep coming back to.

The next obvious criterion is an API stack of some sort that others can use to build their apps and other regionalization needs. Google translate API seems to be a clear winner here as well. It will take a while to build something with that level of reliability and generic nature. But not too long, I’d wager.

A good start however would be a niche need. Like maybe translating legal documents from one language to another. Or to English (but then English is an Indian language too 😉 ).  I can use Google’s API to generate training data cheaply, and then tweak my built model around for my specific usecase.

Another niche need would be to translate from one Indian language to another in an app that tourists/visitors can use to navigate around town. The kicker here is, how much more useful would your app be as compared to a phrasebook? A more useful app in this context would be one that can read signboards and translate them for you.

Yet another is to help the diaspora and other Indians learn a language through simple translated sentences. Again, this falls into the trap of how much better this would be if it was done like a phrasebook or the app version of “Learn Kannada in 30 days” manuals.

Another idea is to make the Government of India your customer, and help them with their regionalization needs. But then, the government has more bilingual people in the IAS itself than they need, and simple translation is probably not at all an issue when you’re operating at the scale of the government of India.

The dark side of me is thinking up an exciting novel/movie, though. Two idealistic US-educated scientists get inspired by Make In India and go back to make a simple translation app. After a whole lot of failures in monetizing their work, they are suddenly approached by the Govt of India, by the same officials who laughed their idea off earlier. Picture a Paresh Rawal at his droll politician best telling these meek urban types how much their idea will never work in the ‘real India’, and right after the interval coming back with a more serious professional look and demeanor along with the head of R&AW. Now I want this guy to be played by someone who radiates quiet power. Maybe Atul Kulkarni, but he’s got to look a decade older, and a bit more better built. And they find they can instantly become rich if they sell their code to NTRO, to use on the NETRA program (kind of like PRISM). They say thanks but no thanks, but heck, the head of NTRO tells them it’s an offer they can’t refuse. This guy’s got to be a persuasive shades-of-grey sort of wizened spy who used to work in ATT and NASA before he got recruited and had to fake his death and everything and now works under a new name. The two protagonists have a ‘Gasp! It’s him!’ moment of recognition because they’ve actually used a lot of his research to make their software. This NTRO guy can easily be played by Madhavan. And the rest of the plot is about how they decide on what to do with their software, whether they join NTRO, and whether they can sleep at night knowing they are being used to spy on billions of little online conversations every day.

Hmm. I gotta write that.

%d bloggers like this: