
Ideas in machine translation for Indian languages

For a while now, I’ve been pondering the problem of machine translation for Indian languages.

Given that India has twenty-two scheduled languages, many of them pretty damn closely related, and given that there are so many enthusiastic people in the NLP domain, we should be at the forefront of machine translation. Unfortunately that is not the case. Yet.

It also bothered me that the current leader in a working machine translation system is Google. Google, while having some of the best scientists and engineers, is American in soul and legality. There are several reasons to have homegrown machine translation systems that are made in India, and which have a more Indian focus.

In any case, I haven’t worked personally on machine translation systems, though I have worked with colleagues who have, and it gives me a vague understanding of how it works. From what I’ve seen, Google is great in the generic case. But if you have a very specific focus, say, in the financial domain, or you want the translations to be conversational, or if you want to restrict yourself to the legal domain, you would need to improve and tweak what translations Google throws at you.

I’ve also seen that most of the machine translation work in Indian languages has been very academic. This is welcome, but in practice, these things don’t usually make it to the market. In my experience, approaching a problem like this from an academic perspective is very different from approaching it as an engineer. In academia, the approaches I have seen are largely technique-based. The problem is just a setting in which to explore new techniques for solving it. This works brilliantly for uncovering new approaches to a problem. When I was at UCI, this defined my approach: find a newer, improved technique to do something. As an engineer, however, you want to find and implement a ‘good enough’ solution. You want whatever works. You don’t care if you need to have humans in the loop, or if your training data isn’t perfect. I haven’t seen (m)any Indian language translation systems with this approach of using whatever works, getting humans in the loop, and giving imperfect outputs.

I want to try this.

There are so many cool questions I want answered. Like, how easy will it be to translate between closely related languages like Tamil-Malayalam, Hindi-Punjabi, Assamese-Oriya-Bengali, Kannada-Telugu… and so on? How well will Jason Baldridge’s two-hour-tagging-required POS-tagger work on Indian languages? What happens if I use Sanskrit as an interlingua?

I also found that the largest corpus of cross-language translation for Indian languages is the Gazette of India. It is a Govt of India communication, which is published in English as well as other languages. I think Google relies heavily on this for its statistical machine translation. Unsurprisingly, there is a very formal tilt to the translations. This is way more pronounced in Indian languages, where formal style is very, very different from casual conversation. Detecting the formality level of an English sentence and translating it appropriately into an Indian language seems like an interesting problem.

Use cases for translation in India are also something I wonder very hard about.

The obvious use case is a generic translation app. This is not something I’m inclined to go head-to-head with Google on. Not right away at least 😉 But it ought to be something we keep coming back to.

The next obvious use case is an API stack of some sort that others can use to build their apps and meet other regionalization needs. The Google Translate API seems to be a clear winner here as well. It will take a while to build something with that level of reliability and generality. But not too long, I’d wager.

A good start, however, would be a niche need. Like maybe translating legal documents from one language to another. Or to English (but then English is an Indian language too 😉 ). I can use Google’s API to generate training data cheaply, and then tweak my own model for my specific use case.

Another niche need would be to translate from one Indian language to another in an app that tourists/visitors can use to navigate around town. The kicker here is, how much more useful would your app be as compared to a phrasebook? A more useful app in this context would be one that can read signboards and translate them for you.

Yet another is to help the diaspora and other Indians learn a language through simple translated sentences. Again, this falls into the trap of how much better this would be if it was done like a phrasebook or the app version of “Learn Kannada in 30 days” manuals.

Another idea is to make the Government of India your customer, and help them with their regionalization needs. But then, the government has more bilingual people in the IAS itself than they need, and simple translation is probably not at all an issue when you’re operating at the scale of the government of India.

The dark side of me is thinking up an exciting novel/movie, though. Two idealistic US-educated scientists get inspired by Make In India and go back to make a simple translation app. After a whole lot of failures in monetizing their work, they are suddenly approached by the Govt of India, by the same officials who laughed their idea off earlier. Picture a Paresh Rawal at his droll politician best telling these meek urban types how much their idea will never work in the ‘real India’, and right after the interval coming back with a more serious professional look and demeanor along with the head of R&AW. Now I want this guy to be played by someone who radiates quiet power. Maybe Atul Kulkarni, but he’s got to look a decade older, and a bit better built. And they find they can instantly become rich if they sell their code to NTRO, to use on the NETRA program (kind of like PRISM). They say thanks but no thanks, but heck, the head of NTRO tells them it’s an offer they can’t refuse. This guy’s got to be a persuasive shades-of-grey sort of wizened spy who used to work at AT&T and NASA before he got recruited and had to fake his death and everything and now works under a new name. The two protagonists have a ‘Gasp! It’s him!’ moment of recognition because they’ve actually used a lot of his research to make their software. This NTRO guy can easily be played by Madhavan. And the rest of the plot is about how they decide on what to do with their software, whether they join NTRO, and whether they can sleep at night knowing they are being used to spy on billions of little online conversations every day.

Hmm. I gotta write that.


RIP, Reader

Yeah, this is yet another one of the funeral dirges for Google Reader. And I post it here instead of on my personal blog because I need to get into the habit of writing about technology here. Google Reader is hardly ‘technology’ as I intend it to be… I want to use this place for research updates and paper summaries.  But the anxiety about ‘not being good enough’ when it comes to all that is so much that I don’t want to write anything even remotely geeky. I need to snap out of that. And it’s NaNoWriMo, it’s about quantity more than quality. So here we go 🙂

So basically there are two main arguments against Google Reader’s integration with Google Plus. First is about how the user interface is sucky. And the second is about how the removal of sharing has killed the whole spirit of Reader. A third, if I may add, is that the platform/API is so bad, and everything is so messed up at first look that I can’t seem to wrap my mind around how to write a wrapper that makes things better.  Oh wait, there’s a fourth as well – the ‘stream’ format, as opposed to the folders-and-tags format, is the very antithesis of what Reader is supposed to be.

Let’s start with the appearance. Yes, white space is good. It makes things look ‘clean’. But that’s only when you have very specific things you want your user to see on your page. It works great for the Google.com homepage, for instance… all you want is a search bar. But when it’s a feed reader, it doesn’t work at all. When I log in, I don’t want to see half my screen space taken up by needless headers and whatnot. The buttons in the bar with ‘Refresh’, ‘Mark as read’ and ‘Feed settings’ are needlessly large and prominent when they could be smaller and take up far less space. They aren’t used often enough to justify their large font size, to start with. The focus here shouldn’t be on the options, but on the thing I’m reading. Fail.

Then everything’s gray, including links. If something’s not blue or purple, my mind doesn’t consider it a link. Sorry, but those are unwritten conventions on the Web. There’s no reason to change that now, and gray is a horrible color to show that something’s different from the rest of the black text. And the only spots of color on the page are a tiny dab of red to show the feed you’re currently reading, and a large button on the top left that says ‘Subscribe’. Dab of red, seriously? I much preferred having the entire line showing the current feed highlighted instead of that little red bar. And I don’t add new feeds every day, so I don’t need a large ‘Subscribe’ button. And when I do add feeds, I don’t add them using google.com/reader… I’m usually on the website I want to add, and I add feeds by clicking on the RSS icon, and then adding to Reader.

Then the UI for sharing. It’s a lot more clicks to share something now. And yeah, the gripe is that whatever I share will be shared only on G+, but we’ll get to that in a moment. My problem with having to pick which circles I share with each time I share an item is that it’s too much decision making too often. At least give me a set of check boxes for my circles so that all I need is two clicks instead of having to start typing my circle names.

It turned out, if you wanted to share something without publicly +1-ing something, you’d have to go to the top-right corner and click on ‘Share’. Well, how is that intuitive? And why would anyone design it that way, especially when the previous way to do that was by clicking on ‘share’ right below the post? Surely, it could have just had the Circles thing appear when you clicked the ‘share’ button, and +1-ing it could be a different button? And keep the top-right Share button if you like?

Now about sharing. I can share something with folks from Google Reader, yes, but they can only read it from Google Plus. Someone said that’s like retweeting something on Twitter from your client, like say Tweetdeck, but those who follow you being able to see your RTs only on twitter.com. How absurd is that? I want a one-stop shop where I can do all my reading instead of having it spread over a zillion other places.

Which is why one of the things I wanted to do was build a wrapper website that integrated links shared on your G+ stream with your Reader feeds. I can’t seem to wrap my mind around how exactly it would work, but that’s one thing I certainly want to do.

The ‘stream’ format sucks for reading shared links. I have this problem with Twitter too, but on Twitter, you can ‘Favorite’ tweets which contain links and then read them one by one later. In fact, I was wondering about a platform that takes links on your Twitter timeline and puts them together for easy reading, feed reader style. Google Plus however has no such feature which you can use to tuck away stuff for later. If you’re too busy, you skip over a shared link and it’s lost. I much preferred the model where your feeds would all accumulate and if it got too much to handle, you could always mark all as read. Even better when your feeds would be properly organized.
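The ‘pull the links out of a stream and queue them up, feed reader style’ idea can be sketched in a few lines. This is only a rough sketch, not a real Twitter or Google Plus client: the shape of a post (a dict with ‘author’ and ‘text’) and the function names collect_links and mark_all_read are made up for illustration, and fetching the stream itself is left out entirely.

```python
import re

# Deliberately naive URL matcher; good enough for a sketch.
URL_PATTERN = re.compile(r"https?://\S+")

def collect_links(posts):
    """Turn a stream of posts into a deduplicated reading queue.

    `posts` is a hypothetical list of dicts with 'author' and 'text' keys.
    Links keep the order in which they first appeared in the stream.
    """
    queue = {}  # url -> item; dicts preserve insertion order
    for post in posts:
        for url in URL_PATTERN.findall(post["text"]):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            item = queue.setdefault(
                url, {"url": url, "shared_by": [], "read": False}
            )
            if post["author"] not in item["shared_by"]:
                item["shared_by"].append(post["author"])
    return list(queue.values())

def mark_all_read(items):
    """The escape hatch for when the queue gets too much to handle."""
    for item in items:
        item["read"] = True
```

Nothing gets lost when you skip past it in the stream: the link just sits in the queue until you read it or mark everything as read.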

And then Google Plus does a bad job of displaying shared links. It shows a small preview, but that’s more often than not insufficient. Buzz was better in this respect… at least your images could be expanded, and posts could be expanded so that you could read them right there. Ha, one positive of this, however, would be that people would get a lot more hits on their websites. And it is not immediately apparent how inconvenient this sort of visual format is, because people don’t share much on Google Plus yet, and they don’t yet use it as a primary reader, or use it so extensively that it gets on their nerves.

And finally about the thing that has had the largest impact. Sharing.

Previously, in 2007, when Reader didn’t yet have sharing, we’d all come across nice links we’d want to share with our friends, and then either ping them on IM with it, or mail them the link. Needless to say, it was irksome. For both us and our friends. But somehow, when you shared it on Reader, the intrusiveness of sending links went away. It was just there, and if you liked it, you said so on the comments or by resharing it or referencing it in conversation. It stopped feeling like you were shoving it down someone’s throat, or someone shoved it down yours.

Sharing was also a nice way to filter content. For example, I loved reading Mental Floss’s feeds, but couldn’t stand the feed-puke that were feeds like TechCrunch and Reddit, whereas it was the other way for some of my friends. So we just followed each other, and I read the TechCrunch and Reddit content they deemed good enough to share, while I shared the interesting tidbits from Mental Floss.

Google Reader, I remember feeling, was a nice incubator for observing social network dynamics and introducing social features. It was my first first-hand exposure to recommender systems, before I moved to the USA and could actually shop on Amazon or watch movies on Netflix. It was interesting seeing how the recommendations incorporated stuff from your GTalk chats, your searches, stuff you ‘liked’… I remember freaking out about how after chatting often with a friend in LA my recommended feeds included a lot of LA-related blogs. And there was a search engine based treasure hunt at my undergrad college, and a friend and I remember saying “Oh man, googling stuff for this contest is so going to affect our Reader recommendations”.

It was also where I was recommended tons of blogs on ML and NLP and IR, due to which I went to grad school where I did, and did my thesis in what I did.

Also fun was the ‘Share as a note in reader’ bookmarklet. That way, I could share stuff from anywhere on the Internet with people who I knew would appreciate it.

Now it seems as if the Plus team wants to go and prove right that ex-Amazon Googler who said Google can’t do platforms well. Instead of providing services which can be used in a variety of ways to provide ‘just right’ experiences for a variety of people, Plus is trying to do it right all by itself. And failing miserably at that. The reason for Twitter’s success is the sheer variety of ways you can tweet – from your browser, from your smartphone, from your not-so-smart phone using Snaptu, from your dumb phone via text, your tablet, your desktop…. and I just don’t see that happening with Plus yet.

Maybe I wouldn’t be so mad if all the folks I share with on Reader were on Plus, but actually, hardly anyone is. And I don’t check my Plus feed on a regular basis either. I wouldn’t mind going on Plus to just read what everyone’s sharing, but the user experience is so bad I wouldn’t want that.

Google should have learnt this when it integrated Reader with Buzz: a lot of people found that irksome and simply silenced others’ Reader shares from their Buzz feed. The Reader format doesn’t go well with the stream format.

There’s so much quite obviously broken with the product that you wonder if the folks who design and code this up actually use it as extensively as you do. Dogfooding is super-important in products like Google’s where there are a wide variety of users and user surveys can’t capture every single aspect.

But given that doing this to Google Reader seems just like when they cancelled Arrested Development, you begin to think they are probably aware of everything, and just don’t care about you the user and your needs anymore.

PS: Can anyone help me get the Google Plus Python API up and running on Google App Engine? I want to play with it, see what it does, and am not able to get it up and running.

PPS: Does a Greasemonkey script to make G+ more presentable sound like a good idea?

PPPS: Check out the folks at HiveMined. They are building a replacement for Google Reader 🙂

The Google Earth API

I’ve been using it for the past couple of months, for visualizations.

Here, go on and read the documentation. It’s rather well-written.

The short of it: You can access the API using Javascript. But the fun doesn’t begin until you’ve begun with KML.
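To give a flavour of the KML side, here is a small sketch that writes a single placemark using nothing but Python’s standard library. The function name placemark_kml and the sample coordinates are my own choices for illustration; the only things taken from KML itself are the 2.2 namespace, the Document/Placemark/Point elements, and the longitude,latitude,altitude ordering of coordinates.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def _tag(name):
    """Qualify an element name with the KML namespace."""
    return f"{{{KML_NS}}}{name}"

def placemark_kml(name, lon, lat, alt=0.0):
    """Return a KML document (as a string) containing one named point.

    Note the order: KML coordinates are longitude,latitude,altitude.
    """
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    placemark = ET.SubElement(document, _tag("Placemark"))
    ET.SubElement(placemark, _tag("name")).text = name
    point = ET.SubElement(placemark, _tag("Point"))
    ET.SubElement(point, _tag("coordinates")).text = f"{lon},{lat},{alt}"
    return ET.tostring(kml, encoding="unicode")

if __name__ == "__main__":
    # Illustrative coordinates only.
    print(placemark_kml("Bangalore", 77.59, 12.97))
```

Save the output as a .kml file and it should open as a pin in anything that reads KML.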

More coming up. I’ve been working on this quite a bit.

A new idea for Captchas.

Folks from Google have come up with a new sort of captcha. I find it a brilliant idea – Image orientation! Read all about it here[pdf].

Here’s the short version – Making captchas more complicated makes it harder for even humans to decipher them… I’ve faced that issue many times. So pick on a task that’s easy for humans, but hard for machines. Image orientation is one such task that’s AI-hard.

But not always… some images are easy for both machines and people… some are hard for machines as well as people… and some are easy for people and hard for machines. So part of the task is to detect such images, while discarding the other sorts.

Easy for machines is easy to pick. Hard for machines, too. But what about those that are hard for people? Here’s where Google makes use of its large number of users. They give a second captcha along with one that’s proven to work. If there’s a large amount of variance in the way users orient the image, it’s deemed hard for people. And they also correct their own default orientations of images this way… sometimes images are wrongly oriented because of camera angles or various other reasons.
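The ‘large amount of variance means hard for people’ step is easy to sketch, with one wrinkle: angles wrap around, so 359° and 1° should count as agreeing. The sketch below uses circular variance for that. To be clear, this is my own illustration and not what the paper does; the function names and the 0.1 threshold are assumptions picked for the example.

```python
import math

def _resultant(angles_deg):
    """Sum the unit vectors pointing along each submitted angle."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return sin_sum, cos_sum

def circular_variance(angles_deg):
    """Return a value in [0, 1]; 0 means every user chose the same angle."""
    if not angles_deg:
        raise ValueError("need at least one angle")
    sin_sum, cos_sum = _resultant(angles_deg)
    return 1.0 - math.hypot(sin_sum, cos_sum) / len(angles_deg)

def consensus_orientation(angles_deg):
    """Circular mean of the submitted angles, in degrees within [0, 360)."""
    sin_sum, cos_sum = _resultant(angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360

def triage_image(angles_deg, max_variance=0.1):
    """Decide whether a candidate image is usable as a captcha.

    `max_variance` is a made-up threshold. Above it, users disagree too
    much and the image is deemed hard for people; otherwise we keep it and
    take the users' consensus as the corrected upright orientation.
    """
    if circular_variance(angles_deg) > max_variance:
        return ("discard", None)
    return ("keep", consensus_orientation(angles_deg))
```

So answers clustered around upright, like 358, 2, 0, 1 and 359 degrees, keep the image, while answers scattered evenly over 0, 90, 180 and 270 get it thrown out.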

Brilliant this is. I’m thankful. The captchas were getting so crazy I began to doubt if I was human.

Bacn for Gmail

Guess all of us are used to deleting those news alerts, Orkut/Facebook alerts, stuff from mailing lists… in order to keep a clean and relevant mailbox.

I often wondered why Gmail didn’t provide an option for ‘Friendly Spam’ – stuff that you don’t want marked spam totally, but stuff you don’t want cluttering your inbox.

Apparently, there’s a term to describe such mail – Bacn.

So… what do we do with bacn? How do we deal with it? We need a holding mechanism for it… a folder or something, whose contents auto-delete after you’ve seen them once? What else can we do with bacn?
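Here is a toy version of that holding folder, just to make the idea concrete. It has nothing to do with how Gmail actually works: the BacnFolder class, its method names, and the notion of spotting bacn by a list of known senders are all assumptions made up for this sketch.

```python
class BacnFolder:
    """Holds 'friendly spam' and forgets each message once it has been seen."""

    def __init__(self, bacn_senders):
        # Hypothetical rule: mail from these addresses counts as bacn.
        self.bacn_senders = {sender.lower() for sender in bacn_senders}
        self._messages = {}
        self._next_id = 0

    def is_bacn(self, sender):
        return sender.lower() in self.bacn_senders

    def hold(self, sender, subject):
        """Store a message out of the inbox and return its id."""
        message_id = self._next_id
        self._next_id += 1
        self._messages[message_id] = (sender, subject)
        return message_id

    def view(self, message_id):
        """Return the message and delete it: seen once, then gone."""
        return self._messages.pop(message_id)

    def __len__(self):
        return len(self._messages)
```

Viewing a message a second time raises a KeyError, which is the whole point: once you’ve seen it, it’s gone.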
