29 Dec 2010

Convert Word documents (with pictures) to Mediawiki

Update (2011-06-14)! Magnus has done a cleaner 'python-only' implementation of this tool, found here: https://github.com/mhagander/word2mediawiki/

This post will explain how to automatically convert MS Word files (with images) to Mediawiki pages. Any filetype OpenOffice supports can be converted.

Short explanation: We use OpenOffice to convert the Word files to wiki syntax, but some voodoo is needed to fetch and upload any images included in the Word file (the "voodoo" is shown in yellow in the flowchart below):


More detailed explanation: The Perl script word2mediawiki.pl takes a Word file as input. After some rudimentary checks, it calls the Python script DocumentConverter.py, which calls OpenOffice to do the actual conversion. This is done twice: we convert to both a .wiki and an .xml file. The .wiki file does not contain any images (it only has empty [[Image:]] wiki-tags where the images are supposed to be), whereas the .xml file does include them, base64 encoded. So we parse the .xml file, fetch all base64 images, decode them and save them as ordinary image files. We then rewrite the .wiki file so that each empty [[Image:]] wiki-tag points to the image file just decoded. Finally, using pywikipediabot, we upload the original Word file (for reference) and all images, and create a wiki page from the .wiki file. See the example below.
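
The image "voodoo" can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from word2mediawiki.pl (which is Perl). It assumes the base64 payloads sit in elements whose local name is binary-data (as in flat OpenDocument XML), that images are numbered in document order, and that they are saved with a .jpg extension; the function names extract_images and fill_image_tags are made up for this example.

```python
import base64
import re
import xml.etree.ElementTree as ET


def extract_images(xml_text, basename, tag="binary-data"):
    """Return a list of (filename, bytes) for every base64 payload found.

    `tag` is the assumed local name of the element holding the image data.
    """
    root = ET.fromstring(xml_text)
    images = []
    for elem in root.iter():
        # Compare the local name only, so the XML namespace does not matter.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == tag and elem.text and elem.text.strip():
            # Drop line breaks and padding whitespace before decoding.
            payload = "".join(elem.text.split())
            data = base64.b64decode(payload)
            filename = "%s_%d.jpg" % (basename, len(images) + 1)
            images.append((filename, data))
    return images


def fill_image_tags(wiki_text, filenames):
    """Replace each empty [[Image:]] tag, in order, with the next filename.

    Tags left over when the filenames run out are kept unchanged.
    """
    names = iter(filenames)

    def repl(match):
        name = next(names, None)
        return match.group(0) if name is None else "[[Image:%s]]" % name

    return re.sub(r"\[\[Image:\]\]", repl, wiki_text)
```

Each (filename, bytes) pair would then be written to disk in binary mode before the upload step, and the rewritten wiki text saved back to the .wiki file.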

Prerequisites and installation:

A) Linux - but may work on other platforms as well (not tested)
B) Install Perl and Python
C) Install the Python-UNO bridge. This enables Python to talk to the OpenOffice API (and do the conversion)
 
  # apt-get install python-uno

D) Install OpenOffice. We run OpenOffice "headless", so X is not required.

E) Install the OpenOffice "Sun Wiki Publisher" extension. This adds support for .wiki and .xml export.

  # unopkg add sun-wiki-publisher.oxt

F) Create a word2mediawiki directory. Download the word2mediawiki.pl script, and the PyODConverter script. Note! I've modified the PyODConverter script to support .wiki and .xml. You can download the modified version below:

  $ mkdir word2mediawiki
  $ cd word2mediawiki
  $ wget http://www.larsstrand.no/code/word2mediawiki/word2mediawiki.pl

G) Install pywikipediabot:

  $ svn co http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia

H) Configure pywikipediabot. Use the testbox2_family.py file as a template for your Mediawiki installation. The file should be self-explanatory:

  $ cd pywikipedia/families 

Add username and password:

  $ cd ..
  $ cat user-config.py
  # -*- coding: utf-8  -*-
  family = 'testbox2'
  mylang = 'en'
  usernames['testbox2']['en'] = u'Wiki-USERNAME'
  password_file = "user-password"
  $ cat user-password
  ("Wiki-USERNAME", "Wiki-PASSWORD")

I) Test pywikipediabot: 

  $ python ./login.py -force -all
  unicode test: triggers problem #3081100
  Logging in to testbox2:en as Wiki-USERNAME via API.
  Should be logged in now

Great! We're ready for our first test.


Convert:

We pick a Word file and run:

  $ ./word2mediawiki.pl ../Testfile.doc
  #############################
  ## Converting /export/home/lks/tmp/Testfile.doc to .wiki and .xml using soffice..
  #############################
  Converting image: ../converted/Testfile_1.jpg
  Converting image: ../converted/Testfile_2.jpg
  Rewrote wiki page with new Image tag: [[Image:Testfile_1.jpg]]
  Rewrote wiki page with new Image tag: [[Image:Testfile_2.jpg]]
  #############################
  ## Conversion complete: ../converted/Testfile.wiki
  ... >>>> Skipping a bunch of output here.

  ## Uploading the wiki page
  ## Exec: python ./pywikipediabot/pagefromfile.py -start:XZXZ42 -end:YZYZ42 -safe -file:../converted/Testfile.wiki
  Reading '../converted/Testfile.wiki'...
  >>> Testfile <<<
  Logging in to testbox2:en as Wiki-USERNAME via API.
  Should be logged in now
  Sleeping for 8.3 seconds, 2010-12-29 18:03:38
  Creating page [[Testfile]] via API
  End of file.
  ## Conversion and upload complete
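
The upload log above shows pagefromfile.py being called with -start:XZXZ42 and -end:YZYZ42, which implies that the .wiki file is wrapped in those two markers. Here is a minimal sketch of how such a file could be prepared. It assumes that pagefromfile.py takes the page title from the first bold text between the markers (its default behaviour as far as I know), and that the title is simply the Word filename without its extension. The helper names page_title and wrap_for_pagefromfile are made up for this illustration and are not taken from word2mediawiki.pl.

```python
import os


def page_title(doc_path):
    """Derive the wiki page title from the Word filename (extension dropped)."""
    return os.path.splitext(os.path.basename(doc_path))[0]


def wrap_for_pagefromfile(title, wiki_text, start="XZXZ42", end="YZYZ42"):
    """Wrap wiki text in start/end markers, with the title in bold on top.

    The marker strings match the -start: and -end: options in the log above.
    """
    body = wiki_text.rstrip("\n")
    return "%s\n'''%s'''\n%s\n%s\n" % (start, title, body, end)
```

With "../Testfile.doc" as input, page_title gives "Testfile", which matches the ">>> Testfile <<<" line in the output.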

Note 1: You might get a warning when the pywikipediabot tries to upload the images/.doc file or create the wiki-page. This can happen if the same image/.doc file has already been uploaded. If a wiki-page with the same name already exists, the bot issues a warning and aborts.

Note 2: When OpenOffice converts the .doc file, it might spew out a bunch of warnings and/or error messages. These can be ignored. OpenOffice complains a lot.

Note 3: If the script exits with a complaint about "Can't connect to soffice on port", just re-run the script. OpenOffice can be a little slow to start. (It will fork into the background the first time.)

Note 4: The conversion is nowhere near perfect, and you might want to look over the wiki-page to ensure correct formatting.

Note 5: The filename of the Word file is used as the name of the wiki page. Example: "Testfile.doc" results in "mediawiki/index.php/Testfile"
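
Regarding Note 3: instead of re-running the script by hand, one could wait until soffice actually accepts connections before starting the conversion. The sketch below is not part of word2mediawiki.pl; it is plain Python that polls a TCP port until something answers. The function name wait_for_port is made up, and the host and port to pass in are assumptions that must match however your soffice instance was started.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.time() + timeout
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=interval)
        except OSError:
            # Nothing is listening yet: give up if time is out, else retry.
            if time.time() >= deadline:
                return False
            time.sleep(interval)
        else:
            sock.close()
            return True
```

A wrapper would call something like wait_for_port("localhost", 8100) after launching soffice and only continue if it returns True (8100 is just a placeholder port here).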

Example: