Sunday, June 10, 2012

What Is Re-Captcha and Why Do We Have to Type It?

Luis von Ahn, developer and founder of the dreaded Captcha, realized that about 200 million Captchas were being typed every day at about 10 seconds per Captcha. He decided that was a lot of time being wasted (more than 500 thousand hours every day), so he developed Re-Captcha. What does Re-Captcha do? Not only are you proving you are human and supporting Internet security, but you are also digitizing books! I never knew, did you? I saw a story on this on Public Access TV and was so intrigued that I looked it up on YouTube. The story I found is from 2009, so the numbers I have are probably much lower than they would be 3 years later.
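
For anyone who wants to check the wasted-time figure, here is the arithmetic as a tiny Python sketch. The variable names are my own, and the inputs are just the rough 2009 numbers quoted above, so treat the result as a ballpark.

```python
# Rough 2009 figures quoted in the story (approximations, not exact counts).
captchas_per_day = 200_000_000
seconds_per_captcha = 10

# Total human time spent typing Captchas each day.
total_seconds = captchas_per_day * seconds_per_captcha
hours_per_day = total_seconds / 3600  # 3600 seconds in an hour

print(f"{hours_per_day:,.0f} hours per day")  # prints "555,556 hours per day"
```

So "500 thousand hours" is, if anything, rounded down a little.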

So here is what happens: they take a book written before computers existed and scan it, then an OCR (optical character recognition) program tries to decide what the text in the scan says and digitize it. But the OCR program is not perfect, which is where Re-Captcha comes in. They take the words the computer cannot recognize (the unrecognized words are highlighted, like spell check does, in the example to the left). They pair a word they already know with a word the OCR program could not recognize, distort them some more, and then get humans to decipher the two words. Each pair is shown to a select number of people, and if an agreed number of them type the unrecognized word the same way, they know what it should be and it is corrected!

They give the words to web sites for free; currently over 120 thousand web sites, including the White House's, use Re-Captcha! They are digitizing over 35 million words a day. Wow! The books come from the Internet Archive, which is digitizing books that were written before 1923 (these books are out of copyright). They were able to transcribe The New York Times archives (1851 to 1980-something) in 4 months in 2009. Another startling statistic is that over 400 million people worldwide are working to digitize books for free, lmao. I still don't like Captchas for my site, since I don't need that level of security, but I don't think I will resent having to type them nearly as much in the future!
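
To make the pairing and agreement idea described above a little more concrete, here is a minimal Python sketch of how it could work. This is only my guess at the logic, not Re-Captcha's actual code: the names `UnknownWord`, `submit_answer`, and `AGREEMENT_THRESHOLD` are made up, and the threshold of 3 is an assumption, since the story only says "an agreed number" of people.

```python
from collections import Counter

# Hypothetical value; the story only says "an agreed number" of people.
AGREEMENT_THRESHOLD = 3


class UnknownWord:
    """A scanned word the OCR program could not recognize."""

    def __init__(self, image_id):
        self.image_id = image_id
        self.votes = Counter()   # what people typed, and how many times
        self.resolved = None     # filled in once enough people agree


def normalize(text):
    # Ignore stray spaces and capitalization when comparing answers.
    return text.strip().lower()


def submit_answer(control_answer, typed_control, unknown, typed_unknown):
    """Return True if the person passes the Captcha.

    The known (control) word proves the typist is human. Only then is
    their reading of the unknown word counted as a vote.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the known word, so the guess is not trusted

    if unknown.resolved is None:
        guess = normalize(typed_unknown)
        unknown.votes[guess] += 1
        if unknown.votes[guess] >= AGREEMENT_THRESHOLD:
            unknown.resolved = guess  # enough people agree: word is digitized
    return True


# Four visitors see the same unknown word paired with the known word "upon".
word = UnknownWord("scan-0001")
for typed in ["morning", "Morning ", "mourning", "morning"]:
    submit_answer("upon", "upon", word, typed)

print(word.resolved)  # prints "morning"
```

The point of the known word is that the site can only grade you on it; your answer for the other word is accepted on trust and just counted as one vote.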

Because the words are randomly paired, here are a couple of funny pairings that have happened:
On John Edwards' web site:

On The Embassy of the Kingdom of God web site: