5.5 File Conversions
Converting from one file format to another is a complex yet common task. Here we collect together a variety of conversion tasks and the tools to perform them.
We can often use pandoc for much of the hard work. Supported file types include docx, md, org, pdf, rst, and tex, though some, like pdf, are supported only for output.
pandoc mydoc.tex -o mydoc.md # LaTeX to Markdown
pandoc mydoc.md -o mydoc.docx # Markdown to Microsoft Word
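To apply the same conversion to a whole directory, the single-file command can be wrapped in a loop. The following is a minimal sketch; the function name md_dir_to_docx is a hypothetical helper, not part of pandoc, and it assumes the files use the .md extension.

```shell
# Convert every Markdown file in a directory to Microsoft Word.
# md_dir_to_docx is a hypothetical helper name, not a pandoc command.
md_dir_to_docx() {
    dir=${1:-.}                          # default to the current directory
    for f in "$dir"/*.md; do
        [ -e "$f" ] || continue          # no matches: skip the unexpanded pattern
        pandoc "$f" -o "${f%.md}.docx"   # mydoc.md -> mydoc.docx
    done
}
```

Calling md_dir_to_docx with no argument processes the current directory; each output file is written next to its source.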
To convert Microsoft and LibreOffice documents to pdf, as also covered in Section 97.5, we can use libreoffice itself, invoked in headless mode from the command line:
libreoffice --headless --convert-to pdf input.docx # Microsoft Word to PDF
libreoffice --headless --convert-to pdf input.xlsx # Microsoft Excel to PDF
To convert many files without restarting libreoffice each time, we can install and use unoserver.
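As a rough sketch of how this might look: unoserver provides a long-running listener (unoserver) and a client (unoconvert) that takes an input file and an output file. The helper name docs_to_pdf is hypothetical, and the sketch assumes the listener has already been started with its default settings.

```shell
# Start the listener once (assumes unoserver is installed), e.g.:
#   unoserver &
# then send each file to it through the unoconvert client.
# docs_to_pdf is a hypothetical helper name.
docs_to_pdf() {
    for f in "$@"; do
        unoconvert "$f" "${f%.*}.pdf"    # input.docx -> input.pdf
    done
}
```

For example, docs_to_pdf input.docx input.xlsx writes input.pdf for each file in turn, reusing the one running libreoffice instance.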
Jupyter notebook conversions are provided by jupyter-nbconvert:
jupyter-nbconvert --to markdown doc.ipynb --stdout > doc.md
jupyter-nbconvert --to python doc.ipynb --stdout > doc.py
jupyter-nbconvert --to script doc.ipynb --stdout > doc.R # Notebook with an R kernel to R script
To extract structured information as Markdown or JSON from document formats such as PDF, DOCX, and images, for RAG/QA applications, see docling.
Another type of conversion arises because MS/Windows uses a different line-ending convention for text files (CR LF) than GNU/Linux and Unix (LF). Originally dos2unix was used for this task, which is now accomplished by flip, which converts the file in place.
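The following is a minimal sketch of such an in-place conversion. It assumes flip's -u option (convert to Unix line endings; -m converts to MS-DOS) and, where flip is not installed, swaps in tr as a stand-in. The function name to_unix is a hypothetical helper.

```shell
# Convert a text file to Unix (LF) line endings in place.
# to_unix is a hypothetical helper name.
to_unix() {
    if command -v flip >/dev/null 2>&1; then
        flip -u "$1"                     # assumed: -u converts to Unix format
    else
        # Fallback, not flip: delete every carriage return, including
        # any stray ones that are not part of a CR LF pair.
        tmp=$(mktemp) &&
        tr -d '\r' < "$1" > "$tmp" &&
        cat "$tmp" > "$1"                # rewrite the original, keeping its permissions
        rm -f "$tmp"
    fi
}
```

Running to_unix notes.txt rewrites notes.txt itself, so keep a copy if the original line endings matter.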
