.

Import Wikipedia page history to git

I’ve written a small tool which downloads the history of a Wikipedia article, converts it and imports it into a new git repository. The main motivation behind writing it is being able to perform a per-line blame of the article’s history. I had tried levitation, but that tool seemed to be oriented towards large imports (or it might just be buggy), as it attempted to create huge binary files and ran longer than my patience would allow when I gave it the history of just one article. Also, I wanted the tool to take care of the downloading and importing part – so I could be one command away from a git repository of any WP article.

The tool can be made faster (all the XML and string management stuff adds an overhead), but right now it’s fast enough for me. One thing that can be optimized is making it not load the entire input XML into memory – it’s possible to do the conversion by “streaming” the XML. Another current limitation is that it’s currently hard-wired to the English Wikipedia.

Requires curl and (obviously) git. You’ll need a D1 D2 compiler to compile the code.

August 2013 update: Updated to D2. Now creates the directory automatically. Added --keep-history switch.

Source, Windows binary.

Comments