Import Wikipedia page history to git
by CyberShadow on Jun.16, 2010, under Code
I’ve written a small tool which downloads the history of a Wikipedia article, converts it and imports it into a new git repository. The main motivation behind writing it is being able to perform a per-line blame of the article’s history. I had tried levitation, but that tool seemed to be oriented towards large imports (or it might just be buggy), as it attempted to create huge binary files and ran longer than my patience would allow when I gave it the history of just one article. Also, I wanted the tool to take care of the downloading and importing part – so I could be one command away from a git repository of any WP article.
The tool could be made faster (all the XML parsing and string handling adds overhead), but right now it’s fast enough for me. One possible optimization is to avoid loading the entire input XML into memory: the conversion can be done by “streaming” the XML, one revision at a time. Another limitation is that it’s hard-wired to the English Wikipedia.
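To illustrate the streaming idea, here is a minimal sketch in Python (not the D code the tool is actually written in). It assumes the input is a MediaWiki XML export with revisions listed oldest first, and that the output is fed to git fast-import. The function names, the article.txt file name and the placeholder e-mail address are all made up for this example.

```python
import calendar
import io
import time
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, which differs between export schema versions."""
    return tag.rsplit("}", 1)[-1]


def data(payload):
    """Format a fast-import 'data' block; the length is counted in bytes."""
    return b"data %d\n%s\n" % (len(payload), payload)


def child_text(elem, name):
    """Return the text of the first direct child called `name`, or ''."""
    for child in elem:
        if local(child.tag) == name:
            return child.text or ""
    return ""


def stream_revisions(xml_file, out, filename=b"article.txt"):
    """Write one fast-import commit per <revision>, then drop its contents."""
    count = 0
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        author = "unknown"
        for child in elem:
            if local(child.tag) == "contributor":
                # Registered users have <username>, anonymous edits have <ip>.
                author = (child_text(child, "username")
                          or child_text(child, "ip") or "unknown")
        for ch in "<>\n":
            # These characters are not allowed in a fast-import committer name.
            author = author.replace(ch, "")
        stamp = calendar.timegm(time.strptime(
            child_text(elem, "timestamp"), "%Y-%m-%dT%H:%M:%SZ"))
        out.write(b"commit refs/heads/master\n")
        # Placeholder e-mail address: an assumption for this sketch.
        out.write(b"committer %s <nobody@example.invalid> %d +0000\n"
                  % (author.encode("utf-8"), stamp))
        out.write(data(child_text(elem, "comment").encode("utf-8")))
        out.write(b"M 644 inline %s\n" % filename)
        out.write(data(child_text(elem, "text").encode("utf-8")))
        elem.clear()  # free the revision text; only an empty element remains
        count += 1
    return count
```

Calling stream_revisions(sys.stdin.buffer, sys.stdout.buffer) and piping the result into git fast-import inside an empty repository should produce one commit per revision. Only the revision currently being processed is held in full; the emptied revision elements stay attached to their parent, so memory use still grows slowly with the number of revisions.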
Requires curl and (obviously) git. You’ll need a D1 compiler to compile the code.
Get it here: http://github.com/CyberShadow/wp2git
December 27th, 2012 on 12:00 pm
Thank you for this handy tool!
I was a bit surprised that it didn’t create a directory for the article but instead turned the current directory into the repository. At first this caused it to abort with errors, until I realized the conflicts came from a pre-existing $PWD/.git. With that sorted out, it works pretty well.
March 18th, 2013 on 9:19 pm
Awesome, exactly what I was looking for!