Monday, February 28, 2011

Optical Character Recognition (OCR) in 34 languages

This is a cross-post from the Google Docs Blog.


Last June, we introduced the ability to upload documents into Google Docs using Optical Character Recognition (OCR). OCR analyzes images and PDF files, typically produced by a scanner or a mobile phone camera, extracts the text and some of the formatting, and lets you edit the result in Google Docs.


We’ve gotten a lot of feedback on this feature, and one of the biggest requests was to add support for additional languages. Today, we’re happy to announce that we’ve added support for 29 additional character sets, including those used in most European languages, Russian, Chinese Simplified and some other Asian languages. See the upload page for the full list.



How does it work? When you upload images or PDF files to Google Docs, tell us what language your documents are in.

Hit upload, and we’ll use this information to look for the right characters in your file. As usual, you’ll get the best results with sharp, high-resolution images or PDF files. This update also improves OCR quality for the languages we already supported (English, French, Italian, German and Spanish). We’ve also improved the way we import formatting from your documents, and now do a better job of preserving font and alignment information.
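
To see why a language hint helps, consider a glyph that looks the same in several alphabets. The Python sketch below is purely illustrative and is not how Google Docs OCR is implemented: the `SCRIPT_BY_LANGUAGE` table, the `pick_character` helper and the candidate scores are all hypothetical. It only shows the general idea of restricting candidate characters to the script of the chosen language.

```python
import unicodedata

# Hypothetical table: language hint -> prefix of the Unicode character
# names used by that language's script.
SCRIPT_BY_LANGUAGE = {
    "en": "LATIN",
    "fr": "LATIN",
    "ru": "CYRILLIC",
    "el": "GREEK",
}


def in_script(ch, script):
    """Return True if the character's Unicode name starts with the script name."""
    return unicodedata.name(ch, "").startswith(script)


def pick_character(candidates, language):
    """Pick the best-scoring candidate that belongs to the hinted language's script.

    `candidates` maps each possible character to a made-up recognizer score.
    If no candidate matches the script, fall back to the best overall score.
    """
    script = SCRIPT_BY_LANGUAGE[language]
    allowed = {ch: s for ch, s in candidates.items() if in_script(ch, script)}
    pool = allowed or candidates
    return max(pool, key=pool.get)


# One scanned glyph that could be Latin H, Cyrillic EN or Greek ETA.
candidates = {"H": 0.40, "\u041d": 0.38, "\u0397": 0.22}

for lang in ("en", "ru", "el"):
    ch = pick_character(candidates, lang)
    print(lang, unicodedata.name(ch))
```

Without the hint, the highest raw score (the Latin letter) would always win. With the hint, the same glyph resolves to the Cyrillic or Greek letter when the document is in Russian or Greek.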

We’ll keep adding languages and, at the same time, continue to improve speed and accuracy for the existing ones. In the meantime, we hope you take advantage of this new way to import your data into Google Docs.


Posted by Jaron Schaeffer, Software Engineer