Working with binary file types like the Microsoft Word XML Format Document (`docx`), the OpenDocument Text format (`odt`) and the Portable Document Format (`pdf`) in combination with git has its difficulties. Out of the box, git only provides diffing for plain text formats. For binary files it only reports that they differ, without showing what changed.
With a simple configuration change and some open source, cross-platform tools, git can be adapted to diff those formats as well.
Installing the tools
First, one needs tools that can convert the binary files to plain text formats. For most formats, like `docx` and `odt`, the open source tool Pandoc [1] will do the trick. It can even export those files to Markdown or (my personal choice) reStructuredText [2]. A markup language like reStructuredText makes it possible to make a detailed comparison between structured documents, for instance when a heading level has changed.
For PDF, there's the open source tool `pdftotext`, which is part of the Poppler [3] utils package and available for (almost) all operating systems. It converts a PDF file to plain text.
There's a tiny catch with `pdftotext`: by default it writes its output to a file instead of stdout. Git, however, calls a conversion program with the file name as its only argument and expects the converted text on stdout.
This can be fixed by creating a tiny wrapper named `pdftostdout` around `pdftotext`, which executes the program with the correct parameters. A dash as the last parameter instructs `pdftotext` to use stdout:
printf '#!/bin/sh\npdftotext "$1" -\n' > /usr/bin/pdftostdout
chmod +x /usr/bin/pdftostdout
Of course this wrapper can be stored anywhere, as long as it is executable and can be found by git (that is, it lives in a directory on the PATH).
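As an illustration of storing the wrapper elsewhere, the following is a minimal sketch that installs it into a per-user directory, so no root privileges are needed. The `BIN_DIR` variable and its default of `$HOME/.local/bin` are assumptions made for this example; any directory on the PATH should work.

```shell
#!/bin/sh
# BIN_DIR is an assumption for this sketch; pick any directory on your PATH.
BIN_DIR="${BIN_DIR:-$HOME/.local/bin}"
mkdir -p "$BIN_DIR"

# The quoted heredoc delimiter ('EOF') keeps "$1" literal, so it is
# expanded when the wrapper runs, not while the wrapper is being written.
cat > "$BIN_DIR/pdftostdout" <<'EOF'
#!/bin/sh
# Convert the PDF given as first argument to plain text on stdout.
exec pdftotext "$1" -
EOF

# Make the wrapper executable so git can run it.
chmod +x "$BIN_DIR/pdftostdout"
```

Quoting `"$1"` inside the wrapper also keeps file names with spaces intact.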
Add new text conversion handlers to git
After installing both programs and the wrapper, git needs to be told how to convert the binary file types to text format. This can be accomplished by modifying the global git configuration. Note the quotes: the conversion command including its options has to be passed to git as a single value.
git config --global diff.docx.textconv "pandoc --to=rst"
git config --global diff.odt.textconv "pandoc --to=rst"
git config --global diff.pdf.textconv pdftostdout
This creates new diff handlers for each of the file types.
Note
The parameter `--to=rst` tells pandoc to output reStructuredText. This makes comparing document hierarchies easier than with plain text output.
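To see how the handlers end up being stored, the same settings can be written to a scratch file first. This is only a sketch: the `cfg` variable is a name chosen for this example, and `--file` is used instead of `--global` so that the real configuration stays untouched.

```shell
# Scratch config file, so the real global configuration is not modified
cfg="$(mktemp)"

# Quoted: the command and its option are stored together as one value
git config --file "$cfg" diff.docx.textconv "pandoc --to=rst"
git config --file "$cfg" diff.odt.textconv "pandoc --to=rst"
git config --file "$cfg" diff.pdf.textconv pdftostdout

# List every registered textconv handler together with its command
git config --file "$cfg" --get-regexp '^diff\..*\.textconv$'
```

Replacing `--file "$cfg"` with `--global` in the last command lists the handlers of the real global configuration in the same way.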
Instruct git to apply the correct handlers per file type
Finally, git needs to know which conversion handler to use for which file type. That can be accomplished by modifying the global gitattributes [4] file.
The gitattributes file defines attributes per path, or per file. That means that you can specify handlers per file _type_, which will automatically convert the binary format to text format, using the correct tool.
The gitattributes file can be specified locally (per git repository), per system, or globally. Globally is usually the preferred choice, as this means configuring once per user and using it everywhere, with each repository. By default, the global gitattributes file can be found under `$HOME/.config/git/attributes`.
Note
As the global and system gitattributes files have the lowest precedence, they can easily be overridden per repository. This can be done by creating a `.gitattributes` file in the root of that repository.
The following snippet adds the correct conversion handlers per file type to the global gitattributes file, creating its directory first in case it does not exist yet:
mkdir -p ~/.config/git
echo "*.docx diff=docx" >> ~/.config/git/attributes
echo "*.odt diff=odt" >> ~/.config/git/attributes
echo "*.pdf diff=pdf" >> ~/.config/git/attributes
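Which handler git picks for a given path can be checked with `git check-attr`. The sketch below uses a scratch repository with its own `.gitattributes` file, so it does not depend on the global file. The file names `report.docx`, `manual.pdf` and `notes.txt` are hypothetical and do not need to exist.

```shell
# Scratch repository, so no existing repository is touched
repo="$(mktemp -d)"
git init -q "$repo"

# A repository-level .gitattributes, which takes precedence over the global file
printf '%s\n' '*.docx diff=docx' '*.pdf diff=pdf' > "$repo/.gitattributes"

# Ask git which diff handler applies to each path
git -C "$repo" check-attr diff -- report.docx manual.pdf notes.txt
```

Each output line has the form `path: diff: handler`. A path without a matching rule is reported as `unspecified`, which means git falls back to its default behaviour for that file.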
And that's all there is to it. Now `git diff` will show all changes in plain text format for the binary file types `docx`, `odt` and `pdf`.
Any binary format can be diffed with git, as long as there's a tool which converts the binary format to plain text. One just needs to add the conversion handlers and attributes in the same way.
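As a sketch of that same recipe for another format, the following applies it to gzip-compressed text files, using `gzip -dc` as the converter. The handler name `gz`, the file name `notes.gz` and the demo commit identity are made up for this example. The handler is registered for one scratch repository only, not globally.

```shell
# Scratch repository for the demonstration
repo="$(mktemp -d)"
cd "$repo"
git init -q .

# Step 1: register the converter (repository-local, hence no --global)
git config diff.gz.textconv "gzip -dc"

# Step 2: map the file type to the handler
echo "*.gz diff=gz" > .gitattributes

# Commit a first version, then change the compressed file
printf 'hello\n' | gzip > notes.gz
git add .gitattributes notes.gz
git -c user.name=demo -c user.email=demo@example.com commit -q -m "first version"
printf 'hello world\n' | gzip > notes.gz

# The diff now shows the decompressed text lines that changed
git diff -- notes.gz
```

Without the two configuration steps, the last command would only report that the binary files differ.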
[1] https://pandoc.org/
[2] http://docutils.sourceforge.net/rst.html
[3] https://poppler.freedesktop.org/
[4] https://git-scm.com/docs/gitattributes