Import Wikipedia page history to git

I’ve written a small tool which downloads the history of a Wikipedia article, converts it, and imports it into a new git repository. The main motivation behind writing it was being able to perform a per-line blame of the article’s history. I had tried levitation, but that tool seemed to be oriented towards large imports (or it might just be buggy): when I gave it the history of just one article, it attempted to create huge binary files and ran longer than my patience would allow. Also, I wanted the tool to take care of the downloading and importing part, so I could be one command away from a git repository of any Wikipedia article.
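As a rough illustration of the "convert and import" step, here is a minimal Python sketch (not the tool's actual code, which is written in D). It assumes the import goes through `git fast-import`, which this post doesn't confirm. The `Revision` type, the `fast_import_stream` function, the `article.txt` file name and the placeholder email address are all made up for the example.

```python
import calendar
import time
from typing import Iterable, NamedTuple


class Revision(NamedTuple):
    """One article revision (hypothetical shape, for illustration only)."""
    timestamp: str  # UTC, formatted like "2013-08-01T12:00:00Z"
    user: str
    comment: str
    text: str


def _data(payload: str) -> bytes:
    """Encode a string as a fast-import `data` command (byte-counted form)."""
    raw = payload.encode("utf-8")
    return b"data %d\n%s\n" % (len(raw), raw)


def fast_import_stream(
    revisions: Iterable[Revision],
    path: str = "article.txt",        # assumed file name inside the repository
    ref: str = "refs/heads/master",   # assumed branch
) -> bytes:
    """Turn revisions (oldest first) into a `git fast-import` stream.

    Each revision becomes one commit that rewrites a single file. Later
    commits on the same ref are chained onto the previous one by fast-import.
    """
    out = []
    for rev in revisions:
        epoch = calendar.timegm(time.strptime(rev.timestamp, "%Y-%m-%dT%H:%M:%SZ"))
        # fast-import does not allow '<', '>' or newlines in the committer name.
        name = rev.user.replace("<", "").replace(">", "").replace("\n", " ") or "unknown"
        out.append(b"commit %s\n" % ref.encode("utf-8"))
        out.append(
            b"committer %s <wiki@example.invalid> %d +0000\n"
            % (name.encode("utf-8"), epoch)
        )
        out.append(_data(rev.comment or "(no edit summary)"))
        out.append(b"M 100644 inline %s\n" % path.encode("utf-8"))
        out.append(_data(rev.text))
    return b"".join(out)
```

Piping the resulting bytes into `git fast-import` inside a freshly initialised repository should produce one commit per revision. Once the imported branch is checked out, `git blame article.txt` gives the per-line blame.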

The tool can be made faster (all the XML parsing and string handling adds overhead), but right now it’s fast enough for me. One thing that can be optimized is making it not load the entire input XML into memory: it’s possible to do the conversion by “streaming” the XML, handling one revision at a time. Another limitation is that it’s currently hard-wired to the English Wikipedia.
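To show what "streaming" the XML could look like, here is a small Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. It is not how the tool currently works. It assumes the input follows the usual MediaWiki export layout (`<revision>` elements containing `<id>`, `<timestamp>`, `<contributor>`, `<comment>` and `<text>`), and `iter_revisions` is a name invented for the example.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def _text(elem) -> str:
    """Text of an element, or '' if the element is missing or empty."""
    return (elem.text or "") if elem is not None else ""


def iter_revisions(source):
    """Yield one dict per <revision>, without building the whole tree first.

    `source` is a file name or a binary file object holding the export XML.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if _local(elem.tag) != "revision":
            continue
        fields = {_local(child.tag): child for child in elem}
        user = ""
        contributor = fields.get("contributor")
        if contributor is not None:
            # Registered users have <username>, anonymous edits have <ip>.
            for child in contributor:
                if _local(child.tag) in ("username", "ip"):
                    user = child.text or ""
        yield {
            "id": _text(fields.get("id")),
            "timestamp": _text(fields.get("timestamp")),
            "user": user,
            "comment": _text(fields.get("comment")),
            "text": _text(fields.get("text")),
        }
        # Free the revision's text and children once the consumer is done
        # with it. The emptied element itself stays attached to its parent.
        elem.clear()
```

Since each revision is released as soon as it has been consumed, peak memory should be closer to the size of the largest single revision than to the size of the whole history.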

Requires curl and (obviously) git. You’ll need a D2 compiler to compile the code.

August 2013 update: Updated to D2. Now creates the directory automatically. Added --keep-history switch.

Source, Windows binary.

3 thoughts on “Import Wikipedia page history to git”

  1. Johan Sundström

    Thank you for this handy tool!

    I was a bit surprised it didn’t create a directory for the article but started turning the present directory into the repository (which first caused it to abort with errors, before I realized it got conflicts from a pre-existing $PWD/.git), but that sorted, it works pretty well.
