For a while now, I’ve been pondering the problem of machine translation for Indian languages.
Given India has fourteen official languages that are pretty damn closely related, and given there are so many enthusiastic people in the NLP domain, we should be at the forefront of machine translation. Unfortunately that is not the case. Yet.
It also bothered me that the current leader in working machine translation systems is Google. Google, while having some of the best scientists and engineers, is American in soul and legality. There are several reasons to have homegrown machine translation systems that are made in India, and which have a more Indian focus.
In any case, I haven’t worked personally on machine translation systems, though I have worked with colleagues who have, and it gives me a vague understanding of how it works. From what I’ve seen, Google is great in the generic case. But if you have a very specific focus, say, in the financial domain, or you want the translations to be conversational, or if you want to restrict yourself to the legal domain, you would need to improve and tweak what translations Google throws at you.
I’ve also seen that most of the machine translation work in Indian languages has been very academic. This is welcome, but in practice, these things don’t usually make it to the market. In my experience, approaching a problem like this from an academic perspective is very different from approaching it as an engineer. In academia, I have largely seen the approaches be technique-based. The problem is just a setting to explore new techniques of solving it. This works brilliantly in uncovering new approaches to solving a problem. When I was at UCI, this defined my approach: to find a newer, improved technique to do something. As an engineer, however, you want to find and implement a ‘good enough’ solution. You want whatever works. You don’t care if you need to have humans in the loop, or if your training data isn’t perfect. I haven’t seen (m)any Indian language translation systems with this approach of using whatever works, getting humans in the loop, and giving imperfect outputs.
I want to try this.
There are so many cool questions I want answered. Like, how easy will it be to translate between closely related languages like Tamil-Malayalam, Hindi-Punjabi, Assamese-Oriya-Bengali, Kannada-Telugu, and so on? How well will Jason Baldridge’s two-hour-tagging-required POS-tagger work on Indian languages? What happens if I use Sanskrit as an interlingua?
I also found that the largest corpus of cross-language translation for Indian languages is the Gazette of India. It is a Govt of India communication that is posted in English as well as other languages. I think Google uses this heavily for its statistical machine translation. Unsurprisingly, there is a very formal tilt to the translations. This is way more pronounced in Indian languages, where formal style is very, very different from casual conversation. Detecting the formality level of an English sentence and translating it appropriately into an Indian language seems like an interesting problem.
Use cases for translation in India are also something I wonder very hard about.
The obvious use case is a generic translation app. This is not something I’m inclined to go head-to-head with Google on. Not right away at least 😉 But it ought to be something we keep coming back to.
The next obvious use case is an API stack of some sort that others can use to build their apps and meet other regionalization needs. The Google Translate API seems to be a clear winner here as well. It will take a while to build something with that level of reliability and generality. But not too long, I’d wager.
A good start, however, would be a niche need. Like maybe translating legal documents from one language to another. Or to English (but then English is an Indian language too 😉 ). I can use Google’s API to generate training data cheaply, and then tweak the model I build for my specific use case.
Another niche need would be to translate from one Indian language to another in an app that tourists/visitors can use to navigate around town. The kicker here is, how much more useful would your app be as compared to a phrasebook? A more useful app in this context would be one that can read signboards and translate them for you.
Yet another is to help the diaspora and other Indians learn a language through simple translated sentences. Again, this falls into the trap of how much better this would be if it was done like a phrasebook or the app version of “Learn Kannada in 30 days” manuals.
Another idea is to make the Government of India your customer, and help them with their regionalization needs. But then, the government has more bilingual people in the IAS itself than they need, and simple translation is probably not at all an issue when you’re operating at the scale of the government of India.
The dark side of me is thinking up an exciting novel/movie, though. Two idealistic US-educated scientists get inspired by Make In India and go back to make a simple translation app. After a whole lot of failures in monetizing their work, they are suddenly approached by the Govt of India, by the same officials who laughed their idea off earlier. Picture a Paresh Rawal at his droll politician best telling these meek urban types how their idea will never work in the ‘real India’, and right after the interval coming back with a more serious professional look and demeanor along with the head of R&AW. Now I want this guy to be played by someone who radiates quiet power. Maybe Atul Kulkarni, but he’s got to look a decade older, and a bit better built. And they find they can instantly become rich if they sell their code to NTRO, to use on the NETRA program (kind of like PRISM). They say thanks but no thanks, but heck, the head of NTRO tells them it’s an offer they can’t refuse. This guy’s got to be a persuasive shades-of-grey sort of wizened spy who used to work at AT&T and NASA before he got recruited and had to fake his death and everything and now works under a new name. The two protagonists have a ‘Gasp! It’s him!’ moment of recognition because they’ve actually used a lot of his research to make their software. This NTRO guy can easily be played by Madhavan. And the rest of the plot is about how they decide on what to do with their software, whether they join NTRO, and whether they can sleep at night knowing they are being used to spy on billions of little online conversations every day.
Hmm. I gotta write that.
I use Twitter quite a bit. A lot of the people I follow share quite a lot of links. When I browse Twitter on my mobile in the morning, I can’t check out all the links. I usually ‘Favorite’ the links that seem interesting and then browse them later. I’d actually prefer a better interface to this, one which enables me to tag these links privately so that I can look for them later as well.
I found one such webapp whose name I now forget. The problem with it was it had a sucky interface and didn’t let me preview all the links properly. Then there’s also Tweetree which offers previews of shared links. I also like the Google Reader/Gmail sort of interface which keeps track of new links and already read links. And also, when multiple people share the same link, I’d like to see it all collapsed as one with “X, Y and Z shared this” next to it. Or something.
So this is one thing I’d like to build using Google App Engine.
The steps to do so would be as follows:
- Find a nice Twitter API interface for Python which can preferably be integrated with Google App Engine.
- Write code to get tweets from your Twitter timeline.
  - Learn how to use Twitter OAuth.
- Detect tweets with links. When a tweet has one, extract the unshortened link.
- By now, you have a set of links, and can choose to display them as you wish.
- Use the App Engine datastore to store previously viewed links. Possible attributes to be stored along with link can include users who shared this link, timestamps of tweets which shared these links, viewed-or-not (when dropping into database after extraction, this attribute should have the value ‘No’), title of linked page. Also store time of last login.
- Workflow: On login, extract links from the timeline and drop them into the database until the timestamp of the tweet you’re reading is earlier than the time of last login. Then display the links whose ‘viewed-or-not’ value is ‘No’ as ‘Unread items’ and the rest as ‘Read items’. On clicking a link, mark it as read. Also provide checkboxes to mass-markAsRead.
- Basic interface: something like Gmail’s basic HTML view. Previews and stuff can come later.
- Link-extract-and-drop-in-database. This in turn includes Link extractor, unshortener, title-getter, database interface.
- Database queries to view links and mark them as read/unread.
- User interface.
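Here is a rough sketch of the link-extract-and-drop-in-database part, just to make the workflow concrete. It is not tied to any real Twitter client or to the App Engine datastore: a plain dict stands in for the database, tweets are assumed to arrive as `(user, posted_at, text)` tuples sorted newest first, and all the names (`LinkRecord`, `ingest`, `shared_by_label`, `unread`) are made up for illustration. Unshortening and title-fetching need network calls, so they are left out.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime

# Simple pattern for http(s) links in tweet text (an approximation, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class LinkRecord:
    """One stored link; hypothetical stand-in for a datastore entity."""
    url: str
    shared_by: list = field(default_factory=list)
    tweet_times: list = field(default_factory=list)
    viewed: bool = False  # the 'viewed-or-not' attribute, 'No' on first insert


def extract_links(text):
    """Return the URLs found in a tweet, with trailing punctuation trimmed."""
    return [m.rstrip(".,;:!?)") for m in URL_RE.findall(text)]


def ingest(store, tweets, last_login):
    """Drop links into the store until we reach a tweet older than last_login.

    `tweets` is assumed to be an iterable of (user, posted_at, text) tuples,
    newest first. `store` is a dict keyed by URL.
    """
    for user, posted_at, text in tweets:
        if posted_at <= last_login:
            break  # everything from here on was seen in an earlier session
        for url in extract_links(text):
            rec = store.setdefault(url, LinkRecord(url))
            if user not in rec.shared_by:
                rec.shared_by.append(user)
            rec.tweet_times.append(posted_at)
    return store


def shared_by_label(users):
    """Collapse sharers into an 'X, Y and Z shared this' label (needs >= 1 user)."""
    if len(users) == 1:
        return f"{users[0]} shared this"
    return f"{', '.join(users[:-1])} and {users[-1]} shared this"


def unread(store):
    """Links that have not been marked as read yet."""
    return [rec for rec in store.values() if not rec.viewed]
```

Marking a link as read is then just setting `viewed = True` on its record; in a real version that would be a datastore write instead.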
I’ve had an extremely nice two weeks, and have come back fully rejuvenated. Not once during the two weeks did I think of what’s on during my quarter. Now I guess I can restart all that.
Now I have an idea for a webapp. Something quite easily implementable on Google App Engine, I guess….
Let’s call it Don’tLiftMyContent!
It’s primarily supposed to be a service that checks if your blog’s or website’s content is being plagiarized elsewhere. Like, you give in your blog’s URL, and it gives you a list of pages that use your images and your text. And for this, it can use existing stuff like Google/Yahoo/Bing for text and TinEye for images. While the web search engines are reasonably good for text, TinEye doesn’t yet have such a comprehensive database of images, and this would probably be the limiting factor of the webapp.
I guess timestamps can be compared in order to eliminate sources your blog has plagiarized/borrowed from 🙂
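The timestamp comparison could look something like this minimal sketch. It assumes you somehow already have a publish time for each matching page, which is the hard part (many pages don’t expose one, and it can be faked). `Match` and `likely_copies` are hypothetical names, not part of any search engine’s or TinEye’s API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Match:
    """A page that shares text or images with my post (hypothetical shape)."""
    url: str
    published: datetime


def likely_copies(my_published, matches):
    """Keep only matches published after my post.

    Pages that appeared earlier are more likely sources I borrowed from
    than copies of my content, so they are dropped.
    """
    return [m for m in matches if m.published > my_published]
```

Anything that survives the filter is only a candidate; a human would still have to look at it.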
This idea occurred to me just a few minutes back, so I haven’t looked very far, but all the existing work I could find is websites which enable teachers to check if their students are plagiarizing. I haven’t yet found a website which does this for blogs, and will be very glad to know if there is one.
What say about the idea? Interested? We can code this together, if you want.
Food for thought: how will I know if someone gets the idea from this post of mine and goes on to create this webapp and not give me any credit at all? 🙂