29 Dec 2010

Convert Word documents (with pictures) to Mediawiki

Update (2011-06-14)! Magnus has done a cleaner 'python-only' implementation of this tool, found here: https://github.com/mhagander/word2mediawiki/

This post will explain how to automatically convert MS Word files (with images) to Mediawiki pages. Any filetype OpenOffice supports can be converted.

Short explanation: We use OpenOffice to convert the Word files to wiki syntax, but some voodoo is needed to fetch and upload any images included in the Word file (the "voodoo" is shown in yellow in the flowchart below):


More detailed explanation: The Perl script word2mediawiki.pl takes a Word file as input. After some rudimentary checks, it calls the Python script DocumentConverter.py, which calls OpenOffice to do the actual conversion. This is done twice: we convert to both a .wiki and an .xml file. The .wiki file does not contain any images (it only has empty [[Image:]] wiki tags where the images are supposed to be), so we also convert to .xml, which does include them. In the .xml file the images are base64 encoded, so we parse it, fetch all base64 images, decode them and save them as ordinary image files. We then rewrite the .wiki file so that each empty [[Image:]] wiki tag points to the image file just decoded. Finally, we use pywikipediabot to upload the original Word file (for reference) and all images, and to create a wiki page based on the .wiki file. See example below.
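
The image step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in word2mediawiki.pl: the function names are made up, and the `tag_suffix` default is an assumption about how the exported .xml marks its base64 payloads, so check it against a real export before relying on it. The numbering scheme (Testfile_1.jpg, Testfile_2.jpg) follows the output shown further down.

```python
import base64
import re
import xml.etree.ElementTree as ET

EMPTY_IMAGE_TAG = "[[Image:]]"


def extract_images(xml_text, basename, tag_suffix="binary-data", ext="jpg"):
    """Return a list of (filename, raw_bytes) for each base64 payload found.

    ASSUMPTION: any element whose tag name ends with tag_suffix holds one
    base64-encoded image. The real export format may differ.
    """
    images = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag.endswith(tag_suffix) and element.text:
            # Remove line breaks and spaces inside the payload before decoding.
            payload = "".join(element.text.split())
            name = "%s_%d.%s" % (basename, len(images) + 1, ext)
            images.append((name, base64.b64decode(payload)))
    return images


def fill_image_tags(wiki_text, filenames):
    """Replace each empty [[Image:]] tag, in order, with the next filename."""
    remaining = list(filenames)

    def next_tag(match):
        # If there are more tags than images, leave the extra tags untouched.
        if not remaining:
            return match.group(0)
        return "[[Image:%s]]" % remaining.pop(0)

    return re.sub(re.escape(EMPTY_IMAGE_TAG), next_tag, wiki_text)
```

Because the substitution walks over every match, several images on the same line are all rewritten. Writing the decoded bytes to disk is left out here; opening each file in binary mode and writing `raw_bytes` is enough.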

Prerequisites and install:

A) Linux - but may work on other platforms as well (not tested)
B) Install Perl and Python
C) Install the Python-UNO bridge. This enables Python to talk to the OpenOffice API (and do the conversion)
 
  # apt-get install python-uno

D) Install OpenOffice. We run OpenOffice "headless", so X is not required.

E) Install the OpenOffice "Sun Wiki Publisher" extension. This adds support for .wiki and .xml export.

  # unopkg add sun-wiki-publisher.oxt

F) Create a word2mediawiki directory. Download the word2mediawiki.pl script, and the PyODConverter script. Note! I've modified the PyODConverter script to support .wiki and .xml. You can download the modified version below:

  $ mkdir word2mediawiki
  $ cd word2mediawiki
  $ wget http://www.larsstrand.no/code/word2mediawiki/word2mediawiki.pl

G) Install pywikipediabot:

  $ svn co http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia

H) Configure pywikipediabot. Use the testbox2_family.py file as a template for your Mediawiki installation. The file should be self-explanatory:

  $ cd pywikipedia/families 

Add username and password:

  $ cd ..
  $ cat user-config.py
  # -*- coding: utf-8  -*-
  family = 'testbox2'
  mylang = 'en'
  usernames['testbox2']['en'] = u'Wiki-USERNAME'
  password_file = "user-password"
  $ cat user-password
  ("Wiki-USERNAME", "Wiki-PASSWORD")

I) Test pywikipediabot: 

  $ python ./login.py -force -all
  unicode test: triggers problem #3081100
  Logging in to testbox2:en as Wiki-USERNAME via API.
  Should be logged in now

Great! We're ready for our first test.
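
If login.py instead dies with a SyntaxError or IndentationError pointing at user-config.py, one possible cause (an assumption, not something verified here) is leading whitespace that came along when the listing above was copied. The sketch below only checks whether the file compiles and strips indentation shared by all lines. The function names are hypothetical, and since it runs under whatever Python you start it with, it checks layout only, not whether pywikipediabot accepts the settings.

```python
import textwrap


def check_config(source, filename="user-config.py"):
    """Return None if source compiles, else a short error description."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return "%s at line %s" % (type(err).__name__, err.lineno)
    return None


def dedent_config(source):
    """Strip indentation shared by all lines, e.g. picked up by copy-paste."""
    return textwrap.dedent(source)


if __name__ == "__main__":
    with open("user-config.py") as handle:
        text = handle.read()
    problem = check_config(text)
    if problem:
        print("user-config.py does not compile: " + problem)
        if check_config(dedent_config(text)) is None:
            print("Removing the common leading whitespace would fix it.")
    else:
        print("user-config.py compiles fine.")
```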


Convert:

We find a Word file and execute:

  $ ./word2mediawiki.pl ../Testfile.doc
  #############################
  ## Converting /export/home/lks/tmp/Testfile.doc to .wiki and .xml using soffice..
  #############################
  Converting image: ../converted/Testfile_1.jpg
  Converting image: ../converted/Testfile_2.jpg
  Rewrote wiki page with new Image tag: [[Image:Testfile_1.jpg]]
  Rewrote wiki page with new Image tag: [[Image:Testfile_2.jpg]]
  #############################
  ## Conversion complete: ../converted/Testfile.wiki
  ... >>>> Skipping a bunch of output here.

  ## Uploading the wiki page
  ## Exec: python ./pywikipediabot/pagefromfile.py -start:XZXZ42 -end:YZYZ42 -safe -file:../converted/Testfile.wiki
  Reading '../converted/Testfile.wiki'...
  >>> Testfile <<<
  Logging in to testbox2:en as Wiki-USERNAME via API.
  Should be logged in now
  Sleeping for 8.3 seconds, 2010-12-29 18:03:38
  Creating page [[Testfile]] via API
  End of file.
  ## Conversion and upload complete
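
For the curious, the upload step in the log above could be reproduced roughly like this. It is a sketch under assumptions: the marker strings and the command-line options are copied from the "Exec:" line in the output, but the helper names are invented, and wrapping the title in ''' ... ''' relies on what I understand to be pagefromfile.py's default title markers. The page title is derived from the Word file name.

```python
import os

START_MARKER = "XZXZ42"  # value of the -start: option in the log above
END_MARKER = "YZYZ42"    # value of the -end: option in the log above


def page_title(word_path):
    """Derive the wiki page title from the Word file name."""
    return os.path.splitext(os.path.basename(word_path))[0]


def wrap_page(title, wiki_text):
    """Wrap wiki_text between the start and end markers.

    ASSUMPTION: pagefromfile.py reads the page title from the first
    '''bold''' text between the markers (its default title markers).
    """
    return "%s\n'''%s'''\n%s\n%s\n" % (
        START_MARKER, title, wiki_text.strip(), END_MARKER)


def upload_command(wiki_path, bot_dir="./pywikipediabot"):
    """Build the argument list matching the Exec: line in the log."""
    return [
        "python",
        os.path.join(bot_dir, "pagefromfile.py"),
        "-start:" + START_MARKER,
        "-end:" + END_MARKER,
        "-safe",
        "-file:" + wiki_path,
    ]
```

The list returned by `upload_command` can be handed to `subprocess.call`; passing a list rather than a single string avoids quoting trouble with file names that contain spaces.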

Note 1: You might get a warning when pywikipediabot tries to upload the images/.doc file or create the wiki page. This can happen if the same image/.doc file has already been uploaded. If a wiki page with the same name already exists, the bot issues a warning and aborts.

Note 2: When OpenOffice converts the .doc file, it might spew out a bunch of warnings and/or error messages. These can be ignored. OpenOffice complains a lot.

Note 3: If the script exits with a complaint about "Can't connect to soffice on port", just re-run the script. OpenOffice can be a little slow to start. (It will fork into the background the first time.)

Note 4: The conversion is nowhere near perfect, and you might want to look over the wiki-page to ensure correct formatting.

Note 5: The filename of the Word file is used as the name of the wiki page. Example: "Testfile.doc" results in "mediawiki/index.php/Testfile"
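
Regarding Note 3: instead of re-running by hand, a wrapper could wait until OpenOffice accepts connections before starting the conversion. The sketch below is one way to do that and is not part of word2mediawiki.pl. The function name is hypothetical, and port 8100 is an assumption (as far as I know it is the default in PyODConverter); use whatever port your DocumentConverter.py is set up for.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8100, attempts=10, delay=1.0):
    """Return True once host:port accepts a TCP connection, else False.

    Tries up to `attempts` times, sleeping `delay` seconds between tries.
    """
    for attempt in range(attempts):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=delay):
                return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    if wait_for_port():
        print("soffice is accepting connections")
    else:
        print("soffice did not come up in time")
```

A successful connection only shows that something is listening on the port, not that it is OpenOffice, so treat it as a cheap readiness check rather than a guarantee.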

Example:

6 comments:

Anonymous said...

Nice work - thanks.

One suggested change: If a line has multiple images, then only the first is rewritten. This can be fixed by changing

if ($line =~ /\[\[Image\:\]\]/) {

to

while ($line =~ /\[\[Image\:\]\]/) {

Lars said...

Thanks! Fixed.

Evan said...

Very interesting, unfortunately can't make it work:
------------------------------------
[root@wikifone pywikipedia]# pwd
/home/ebah/word2mediawiki/pywikipedia
[root@wikifone pywikipedia]# python ./login.py -force -all
Traceback (most recent call last):
File "./login.py", line 58, in ?
import re, os, query
File "/home/ebah/word2mediawiki/pywikipedia/query.py", line 28, in ?
import wikipedia, time
File "/home/ebah/word2mediawiki/pywikipedia/wikipedia.py", line 143, in ?
from pywikibot import *
File "/home/ebah/word2mediawiki/pywikipedia/pywikibot/__init__.py", line 15, in ?
from exceptions import *
File "/home/ebah/word2mediawiki/pywikipedia/pywikibot/exceptions.py", line 13, in ?
import config
File "/home/ebah/word2mediawiki/pywikipedia/config.py", line 551, in ?
execfile(_filename)
File "/home/ebah/word2mediawiki/pywikipedia/user-config.py", line 2
family = 'wikifone'
^
SyntaxError: invalid syntax
------------------------------------

Running under CentOS 5.
I installed all the python package, dev, lib etc...

Any suggestion?

Thanks a lot,
Evan

Anonymous said...

Hey Lars,

Can you help me with this, please?
root@puffin:~/word2mediawiki/pywikipedia# python ./login.py -force -all
Traceback (most recent call last):
File "./login.py", line 58, in
import re, os, query
File "/root/word2mediawiki/pywikipedia/query.py", line 28, in
import wikipedia, time
File "/root/word2mediawiki/pywikipedia/wikipedia.py", line 143, in
from pywikibot import *
File "/root/word2mediawiki/pywikipedia/pywikibot/__init__.py", line 15, in
from exceptions import *
File "/root/word2mediawiki/pywikipedia/pywikibot/exceptions.py", line 13, in
import config
File "/root/word2mediawiki/pywikipedia/config.py", line 551, in
execfile(_filename)
File "/root/word2mediawiki/pywikipedia/user-config.py", line 2
family = 'ventura'
^
IndentationError: unexpected indent
root@puffin:~/word2mediawiki/pywikipedia#

server is ubuntu ....

Thanks and regards

Anonymous said...

Only two different script engines? I think this is far too easy!

Olivier said...

I was expecting this had a way to put base64 images in wiki markup along with the rest, but this is a complete html imported as a page without mediawiki markup.

You can achieve the same result using the LiveDocx service from the Zend Framework (also possible without Zend). The latter is PHP, which is the language MediaWiki is written in. This also does not require you to install any mods on apache. The Zend Framework can just be uploaded without (but can be done to embed it) any .ini modifications.

phpDocx is another alternative, but requires you to pay for anything decent, although there is a free version.

... and while we're at it, a less secure option: the fileIndexer extension is able to index documents if you don't want to convert anything at all.

Nice as an additional option in Perl & Python, but took me a while to set up and does the same as more simple alternatives.