As Life Goes Digital

Technology, Cricket, Deals, Immigration etc …


Google indexes text in images with OCRopus

In simple terms, this is what Google is doing: say you are a photographer and you take a picture built around some creative concept you have in mind. Google does the rest of the job, scanning that image for clues of any embedded text, such as road signs or other human-readable text, and using what it finds for indexing.

Until now, Google indexed only PDF documents saved as text. With this technology, a PDF document saved as an image can also be indexed, with the help of the open-source OCRopus software.

OCRopus development is sponsored by Google, and it is a free document analysis and OCR system released under the Apache License. OCRopus uses the HP-developed Tesseract for character recognition but has its own layout analysis system that is optimized for accuracy. It is intended for high-throughput, high-volume document conversion efforts. After image preprocessing and layout analysis, it chops up the scanned document before passing it to Tesseract for line-by-line or character-by-character recognition. Currently, OCRopus uses Tesseract as its only character recognition plugin, but others are expected to be added in the future.
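To make the "chop up the page, then recognize line by line" idea concrete, here is a toy Python sketch. It is not OCRopus or Tesseract code. It assumes the page is already binarized into a grid of 0s and 1s, finds text lines as runs of rows that contain ink, and hands each line to a pluggable recognizer. The names `split_into_lines`, `recognize_page` and `count_ink` are made up for illustration, and real layout analysis is far more involved than this.

```python
from typing import Callable, List, Tuple

# Assumption: the page is already binarized, 1 = ink, 0 = background.
Image = List[List[int]]


def split_into_lines(page: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, one per run of rows containing ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins here
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(page)))   # line runs to the bottom edge
    return lines


def recognize_page(page: Image,
                   recognize_line: Callable[[Image], str]) -> List[str]:
    """Cut the page into line images and pass each to the recognizer plugin."""
    return [recognize_line(page[top:bottom])
            for top, bottom in split_into_lines(page)]


def count_ink(line_image: Image) -> str:
    """Hypothetical stand-in for a real recognizer: reports ink pixels."""
    return str(sum(sum(row) for row in line_image))


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

print(split_into_lines(page))           # [(1, 3), (4, 5)]
print(recognize_page(page, count_ink))  # ['5', '2']
```

Because the recognizer is just a function passed in, swapping in a different engine would not touch the line-splitting code, which is the same idea as OCRopus treating character recognition as a plugin.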

The benefits of this technology are many. Imagine those dusty government agencies with years of paper data that could be scanned and indexed by Google.

If you want to try it hands-on, you can download the OCRopus source distribution by following the Downloads link and then follow the installation instructions linked to from the left menu.

Sources:

> code.google.com/p/ocropus/
> wikipedia/OCRopus
> 2007 article of arstechnica.com/hands-on-with-googles-ocropus-open-source-scanning-software
> sites.google.com/site/ocropus/



Posted By: Kalyan | Date: November 2, 2008 | Categories: Uncategorized
