Technical Translation from English into Russian in Computer and Telecommunication Industries

Recognition of PDF documents

Recognition (OCR, optical character recognition) is needed to convert text elements that a PDF document stores as graphics into true text. Adobe Acrobat has a built-in recognition function (the Recognize Text Using OCR command on the Document menu), but it is intended only for scanned documents or for PDF pages that consist entirely of graphics.

To recognize "graphical" text elements in a PDF document, the OCR program ABBYY FineReader 8.0 is recommended. This program not only recognizes the PDF document but also extracts the true text from it, so you can get a "proper" editable document, with the design preserved, from a mixed graphic-text file. When processing PDF files, FineReader determines whether text is embedded, examines the integrity of the text layer, and decides whether to extract the text or apply OCR. It examines each block individually and selects the most appropriate method for that block. FineReader also recreates internal links and hyperlinks within a PDF file. For example, if the table of contents in the PDF file links to document pages, these internal links will be reconstructed in the Microsoft Word document.
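The per-block decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FineReader's actual algorithm: the Block structure, the printable-character integrity check, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One region of a PDF page (hypothetical structure for illustration)."""
    embedded_text: str  # text found in the PDF text layer, "" if none
    is_image: bool      # True if the region is stored as a picture


def text_layer_is_sound(text: str, min_printable: float = 0.9) -> bool:
    """Crude integrity check (an assumption): most characters must be printable."""
    if not text.strip():
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= min_printable


def choose_method(block: Block) -> str:
    """Return "extract" or "ocr" for a single block."""
    if not block.is_image and text_layer_is_sound(block.embedded_text):
        return "extract"
    return "ocr"


blocks = [
    Block("Chapter 1. Introduction", is_image=False),  # sound text layer
    Block("", is_image=True),                          # picture, no text layer
    Block("\x00\x01\x02\x03", is_image=False),         # damaged text layer
]
methods = [choose_method(b) for b in blocks]
print(methods)
```

Each block is judged on its own, so a single page can mix extracted text with recognized text.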

Previous versions of FineReader could open encrypted PDF files, but this option is no longer supported as of version 8. However, the ABBYY FineReader package includes the ABBYY Screenshot Reader utility, which takes a screenshot of an encrypted PDF opened in another application and transfers it directly to FineReader for recognition.

OCR software lets you convert an Adobe PDF file into one of the supported file formats. Depending on how well the software copes with your source document, it may recognize all of the text, some of the text, or no text at all. When the software cannot recognize an area of the original, its fallback strategy is to represent that area as an image instead of text. In the general case, therefore, the result is a mixture of text and images. Because text recognition is not perfect, the final text may contain mistakes. Modern OCR programs recognize characters with a high success rate, but errors still occur. You will find out how well the recognition worked only after you have saved the document in the desired format.
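The fallback strategy can be pictured with a small Python sketch. It assumes, purely for illustration, that each area comes with a recognized string and a confidence score between 0 and 1; the function name and the 0.6 threshold are hypothetical and do not describe any particular OCR product.

```python
def assemble_output(regions, min_confidence=0.6):
    """Keep an area as text if it was recognized well enough, else as an image.

    regions: list of (recognized_text, confidence) pairs (assumed input format).
    """
    result = []
    for text, confidence in regions:
        if text and confidence >= min_confidence:
            result.append(("text", text))
        else:
            # Fallback: the area is kept as a picture instead of text.
            result.append(("image", None))
    return result


regions = [
    ("Recognition of PDF documents", 0.98),  # recognized cleanly
    ("", 0.0),                               # nothing recognized
    ("R3c0gn1t10n", 0.35),                   # recognized, but unreliable
]
output = assemble_output(regions)

# Share of areas that ended up as editable text.
text_share = sum(kind == "text" for kind, _ in output) / len(output)
print(output, text_share)
```

Here only one of the three areas survives as text; the other two would appear as images in the saved document.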

Check the file produced by the recognition process, comparing it with the original to see how well the recognition worked. Some cleanup will probably be necessary. For example, if you exported the file to Word, some parts of the document may be represented as images, and you will need to retype them. The text in Microsoft Word may also be full of strange paragraph formatting and spurious changes of font, style and size.
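If a trusted reference text is available for a sample passage (for example, one you have retyped by hand), part of this check can be automated. The sketch below uses Python's standard difflib module; the helper names and sample strings are illustrative assumptions.

```python
import difflib


def similarity(original: str, recognized: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, original, recognized).ratio()


def changed_words(original: str, recognized: str):
    """List (original_words, recognized_words) pairs that differ."""
    a, b = original.split(), recognized.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "Adobe Acrobat has a built-in recognition function"
recognized = "Adobe Acrobat has a bui1t-in recognition functlon"

diffs = changed_words(original, recognized)
print(round(similarity(original, recognized), 3))
for before, after in diffs:
    print(before, "->", after)
```

The list of differing words points to the places that need manual correction. Formatting problems, such as stray font changes, still have to be checked by eye.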
