67.12 PDF OCR

20191010

A pdf document may simply be a container for an image of a text document rather than containing the text of the document itself. A typical example is a document that has been scanned and saved as a pdf: what is actually saved within the pdf is an image of each page.

The ocrmypdf command uses optical character recognition (OCR) to extract the text from the image encapsulated within an image pdf, and then adds that text to the document as an invisible text layer. This is useful when comparing pdf documents as in Section 67.4.

ocrmypdf doc.pdf doc_ocr.pdf
evince doc_ocr.pdf
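
Not every pdf needs OCR, and ocrmypdf may decline to process pages that already carry a text layer unless told how to handle them. The following is a minimal sketch, not a definitive recipe, of running OCR only when no text can be extracted. It assumes pdftotext (from poppler-utils) is installed alongside ocrmypdf; the function names has_text and ocr_if_needed are hypothetical helpers introduced here for illustration.

```shell
# Sketch: only run OCR when a pdf has no extractable text.
# Assumes pdftotext (poppler-utils) and ocrmypdf are installed.
# has_text and ocr_if_needed are illustrative names, not part of either tool.

# Succeed (exit 0) if the pdf already contains some non-whitespace text.
has_text() {
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}

# OCR the input into the output only when needed; otherwise just copy it.
ocr_if_needed() {
    if has_text "$1"; then
        cp "$1" "$2"
    else
        ocrmypdf "$1" "$2"
    fi
}

# Usage: ocr_if_needed doc.pdf doc_ocr.pdf
```

The check treats a pdf as an image pdf only when its text layer is empty across all pages, so a document that mixes scanned and text pages would be copied unchanged under this sketch.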

The pdf should now be text searchable, and diffpdf can be used to compare pdfs as in Section 67.4. Comparing two almost identical documents that have each been processed with ocrmypdf will typically highlight more differences than actually exist, simply because of variations in how the original documents were scanned.
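
As a rough complement to diffpdf, the text layers of two OCRed pdfs can be compared word by word, which sidesteps differences that come only from line breaks and spacing. This is a sketch under stated assumptions: pdftotext (from poppler-utils) is assumed to be installed, and pdf_words and pdf_text_diff are hypothetical helper names. Genuine OCR misreadings will still show up as differences.

```shell
# Sketch: compare the text layers of two OCRed pdfs, ignoring whitespace layout.
# Assumes pdftotext (poppler-utils) is installed.
# pdf_words and pdf_text_diff are illustrative names, not existing commands.

# Print the words of a pdf's text layer, one per line.
pdf_words() {
    pdftotext "$1" - | tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Show word-level differences between two pdfs.
# Returns diff's exit status: 0 if identical, 1 if they differ.
pdf_text_diff() {
    a=$(mktemp)
    b=$(mktemp)
    pdf_words "$1" > "$a"
    pdf_words "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Usage: pdf_text_diff doc1_ocr.pdf doc2_ocr.pdf
```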

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0