Friday, October 5, 2012

Sneaky CAPTCHA images contribute to the digitization effort



Recognize this image? Of course you do.



These goofy CAPTCHA images serve as gatekeepers on many websites, such as Facebook, Amazon, and Ticketmaster. Part of their function is to prove that the user is a human rather than a spamming computer, but thanks to the work of Luis von Ahn, a computer scientist at Carnegie Mellon, the CAPTCHA is also contributing to the digitization of books and periodicals. Somehow I hadn't heard about this additional function, although I've filled out perhaps hundreds of these little boxes in the past several years... I blame my dissertator tunnel vision for this.

Despite advances in Optical Character Recognition (OCR), computers are not yet able to match the human mind's amazing ability to recognize symbols such as text, even when they are inconsistent, distorted, or poorly reproduced. Von Ahn has developed a version of the CAPTCHA program, called reCAPTCHA, in which the user is asked to type in two words instead of one. One of these words serves to confirm that the user is human, but the other is an image of a word scanned from a book or periodical, and our response helps to translate that image into text. Several users are given the same image, and if they consistently interpret it as the same word, it is considered successfully converted to text.
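
For the programmers in the audience, here is a minimal Python sketch of that agreement step. It is only my guess at the general idea, not the actual reCAPTCHA code: the function names, the minimum of three votes, and the 75% agreement cutoff are all assumptions made up for illustration.

```python
from collections import Counter


def normalize(word):
    """Ignore case and stray whitespace when comparing typed words."""
    return word.strip().lower()


def passed_control(known_answer, typed_control):
    """The control word has a known answer; matching it suggests a human."""
    return normalize(known_answer) == normalize(typed_control)


def consensus(transcriptions, min_votes=3, min_agreement=0.75):
    """Return the agreed-upon word for a scanned image, or None.

    min_votes and min_agreement are hypothetical thresholds chosen
    for this sketch, not values taken from reCAPTCHA itself.
    """
    votes = Counter(normalize(t) for t in transcriptions if t.strip())
    total = sum(votes.values())
    if total < min_votes:
        return None  # not enough people have seen this image yet
    word, count = votes.most_common(1)[0]
    if count / total >= min_agreement:
        return word  # enough users agree, so treat the word as digitized
    return None  # users disagree, so keep showing the image


# Only responses from users who passed the control word are counted.
responses = [
    ("overlooks", "morning"),
    ("Overlooks", "Morning "),
    ("overlocks", "mourning"),  # failed the control word, so ignored
    ("overlooks", "morning"),
]
trusted = [unknown for control, unknown in responses
           if passed_control("overlooks", control)]
result = consensus(trusted)
```

In this made-up example, three of the four users type the control word correctly and all three read the scanned word as "morning", so `result` is `"morning"`. With fewer than three trusted answers, or with too much disagreement, the function returns `None` and the image would simply be shown to more people.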

Here's an article from the NPR website with more information about this program: http://www.npr.org/templates/story/story.php?storyId=93605988

and also an article in Science:
http://www.sciencemag.org/content/321/5895/1465.full?sid=e16c1bda-edda-462d-9198-baa2096672f9

3 comments:

  1. Neat stuff! If only they would speed up so I can OCR handwritten letters... But wait, even my typewritten sources come out as gobbledygook! Oh well.

  2. This is cool, Scott! I didn't realize that the second image had that function.

    Megan ~ I'm surprised that your typewritten letters are not working. Have you considered scanning them and uploading them into Adobe Acrobat Pro (if you have it) and then using Adobe's recognize text function? I've done that with several 19th-century typewritten letters to good effect.

    As a side note, many of my docs from the Wisconsin Historical Society are copies of typewritten letters on carbon paper with very characteristic blue ink that were produced through the hectograph process. I should write a blog post about that :)

  3. Megan, I know that you're really wishing for an OCR that recognizes handwriting, but the program I use for converting type is ABBYY FineReader. Since part of what I'm doing is textual analysis, I convert an average of about one book or article a day, and I'm generally pleased with it. It really sucks up the processing power of a computer while it's running, though, so I do OCR conversion on my laptop while I'm doing other things on my desktop.
