GNU/Linux Desktop Survival Guide
by Graham Williams




OCR of PDF

20191010 A pdf document may simply be a container for an image of a text document rather than containing the text of the document itself. A typical example is a document that has been scanned and saved as a pdf: what is actually saved within the pdf is an image.

The ocrmypdf command uses optical character recognition to extract the text from the image encapsulated within an image pdf, and then adds an invisible text layer to the document.

$ ocrmypdf doc.pdf doc_ocr.pdf
$ evince doc_ocr.pdf
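
Before running OCR it can be useful to check whether a pdf already carries a text layer. The following is a minimal sketch, not part of ocrmypdf itself: it assumes pdftotext (from poppler-utils) is installed, and the function names pdf_has_text and ocr_if_needed are hypothetical helpers introduced here for illustration.

```shell
# Succeed if the pdf already has an extractable text layer.
# Assumption: pdftotext from poppler-utils is available on the PATH.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Run ocrmypdf only when no text layer is found (hypothetical helper).
# The output name is derived from the input: doc.pdf -> doc_ocr.pdf
ocr_if_needed() {
    if pdf_has_text "$1"; then
        echo "skip: $1 already contains text"
    else
        ocrmypdf "$1" "${1%.pdf}_ocr.pdf"
    fi
}
```

With these functions defined in the current shell:

$ ocr_if_needed doc.pdf

For documents where only some pages already contain text, ocrmypdf provides a --skip-text option that leaves those pages as they are.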

The pdf should now be text searchable, and diffpdf can be used to compare pdfs as in Section 66.3. Comparing two almost identical documents that have each been processed with ocrmypdf will typically highlight more differences than actually exist, since small variations in how the original documents were scanned can change the text that is recognised.
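
Some of the spurious differences come from layout rather than content. As a rough sketch of one way to reduce them, the text layers can be extracted and compared word by word so that line breaks and spacing are ignored. This is an alternative to diffpdf, not how diffpdf works. It assumes pdftotext (from poppler-utils) is installed, and pdf_words and compare_pdf_text are hypothetical names introduced here.

```shell
# Print the text layer of a pdf one word per line, collapsing
# all runs of whitespace (spaces, newlines, form feeds).
# Assumption: pdftotext from poppler-utils is available on the PATH.
pdf_words() {
    pdftotext "$1" - | tr -s '[:space:]' '\n'
}

# Compare the words of two pdfs (hypothetical helper).
# Returns the exit status of diff: 0 when the word lists match.
compare_pdf_text() {
    a=$(mktemp)
    b=$(mktemp)
    pdf_words "$1" > "$a"
    pdf_words "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

With these functions defined in the current shell:

$ compare_pdf_text doc1_ocr.pdf doc2_ocr.pdf

Genuine recognition errors, such as a character misread in one scan but not the other, will still show up as differences.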


Copyright © Togaware Pty Ltd
Graham Williams is the developer of open source software including rattle and wajig.
He is the author of Data Mining with Rattle and Essentials of Data Science.