This post will explain how to automatically convert MS Word files (with images) to Mediawiki pages. Any filetype OpenOffice supports can be converted.
Short explanation: We use OpenOffice to convert the Word files to wiki syntax, but some voodoo is needed to fetch and upload any images included in the Word file (the "voodoo" is shown in yellow in the flowchart below):
Prerequisite and install:
A) Linux - but may work on other platforms as well (not tested)
B) Install Perl and Python
C) Install the Python-UNO bridge. This enables Python to talk to the OpenOffice API (and do the conversion).
# apt-get install python-uno
D) Install OpenOffice. We run OpenOffice "headless", so X is not required.
E) Install the OpenOffice "Sun Wiki Publisher" extension. This adds support for .wiki and .xml export.
# unopkg add sun-wiki-publisher.oxt
F) Create a word2mediawiki directory. Download the word2mediawiki.pl script, and the PyODConverter script. Note! I've modified the PyODConverter script to support .wiki and .xml. You can download the modified version below:
$ mkdir word2mediawiki
$ cd word2mediawiki
$ wget http://www.larsstrand.no/code/word2mediawiki/word2mediawiki.pl
G) Install pywikipediabot:
$ svn co http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia
H) Configure pywikipediabot. Use the testbox2_family.py file as template for your Mediawiki installation. The file should be self-explanatory:
$ cd pywikipedia/families
Add username and password:
$ cd ..
$ cat user-config.py
# -*- coding: utf-8 -*-
family = 'testbox2'
mylang = 'en'
usernames['testbox2']['en'] = u'Wiki-USERNAME'
password_file = "user-password"
$ cat user-password
("Wiki-USERNAME", "Wiki-PASSWORD")
I) Test pywikipediabot:
$ python ./login.py -force -all
unicode test: triggers problem #3081100
Logging in to testbox2:en as Wiki-USERNAME via API.
Should be logged in now
Great! We're ready for our first test.
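If login.py fails, a malformed user-password file is one thing worth ruling out. The sketch below is not part of pywikipediabot; it only checks that a line has the two-element tuple form shown in step H. The function name parse_password_line is made up for illustration, and it is written for Python 3.

```python
import ast


def parse_password_line(line):
    """Parse one line of a user-password file in the two-element form
    ("username", "password") and return it as a tuple.

    Hypothetical helper for sanity-checking only; it does not log in.
    """
    # literal_eval only accepts Python literals, so nothing is executed.
    entry = ast.literal_eval(line.strip())
    if not isinstance(entry, tuple) or len(entry) != 2:
        raise ValueError("expected a (username, password) tuple")
    if not all(isinstance(item, str) for item in entry):
        raise ValueError("username and password must both be strings")
    return entry


if __name__ == "__main__":
    username, _password = parse_password_line('("Wiki-USERNAME", "Wiki-PASSWORD")')
    print("Password entry found for", username)
```

The username in this file should match the one set in user-config.py.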
Convert:
We find a Word file and execute:
$ ./word2mediawiki.pl ../Testfile.doc
#############################
## Converting /export/home/lks/tmp/Testfile.doc to .wiki and .xml using soffice..
#############################
Converting image: ../converted/Testfile_1.jpg
Converting image: ../converted/Testfile_2.jpg
Rewrote wiki page with new Image tag: [[Image:Testfile_1.jpg]]
Rewrote wiki page with new Image tag: [[Image:Testfile_2.jpg]]
#############################
## Conversion complete: ../converted/Testfile.wiki
... >>>> Skipping a bunch of output here.
## Uploading the wiki page
## Exec: python ./pywikipediabot/pagefromfile.py -start:XZXZ42 -end:YZYZ42 -safe -file:../converted/Testfile.wiki
Reading '../converted/Testfile.wiki'...
>>> Testfile <<<
Logging in to testbox2:en as Wiki-USERNAME via API.
Should be logged in now
Sleeping for 8.3 seconds, 2010-12-29 18:03:38
Creating page [[Testfile]] via API
End of file.
## Conversion and upload complete
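The pagefromfile.py call in the log passes -start:XZXZ42 and -end:YZYZ42, so the .wiki file it reads has to wrap the page text between those two markers. The sketch below shows roughly what such a file could look like. It is not the code from word2mediawiki.pl: the function names are made up, and it assumes pagefromfile.py picks up the page title from the first text wrapped in ''' (its default, as far as I know).

```python
import os

START_MARKER = "XZXZ42"  # value of the -start: option in the log above
END_MARKER = "YZYZ42"    # value of the -end: option in the log above


def page_title_from_path(path):
    """Derive the wiki page name from the Word file name (hypothetical helper)."""
    return os.path.splitext(os.path.basename(path))[0]


def wrap_page(title, wiki_text):
    """Wrap converted wiki text in the start/end markers.

    Assumption: the title is given as the first bold text (''' ... ''')
    between the markers.
    """
    return "%s\n'''%s'''\n%s\n%s\n" % (
        START_MARKER, title, wiki_text.strip(), END_MARKER)


if __name__ == "__main__":
    title = page_title_from_path("../Testfile.doc")
    print(wrap_page(title, "Some converted text.\n[[Image:Testfile_1.jpg]]"))
```

Unusual marker strings like these are a reasonable choice because they are unlikely to occur in the converted document text itself.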
Note 1: You might get a warning when the pywikipediabot tries to upload the images/.doc file or create the wiki page. This can happen if the same image/.doc file has already been uploaded. If a wiki page with the same name already exists, the bot issues a warning and aborts.
Note 2: When OpenOffice converts the .doc file, it might spew out a bunch of warnings and/or error messages. These can be ignored. OpenOffice complains a lot.
Note 3: If the script exits with a complaint about "Can't connect to soffice on port", just re-run the script. OpenOffice can be a little slow to start. (It will fork into the background the first time.)
Note 4: The conversion is nowhere near perfect, and you might want to look over the wiki-page to ensure correct formatting.
Note 5: The filename of the Word file is used as the name of the wiki page. Example: "Testfile.doc" results in "mediawiki/index.php/Testfile"
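Regarding Note 3: instead of re-running the script by hand, you could wait until soffice actually accepts connections before starting the conversion. This is a minimal sketch and not something the script does. The function name wait_for_port is made up, and port 8100 is an assumption (it is the usual PyODConverter default); use whatever port your setup listens on.

```python
import socket
import time


def wait_for_port(host="localhost", port=8100, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds,
    or False if the timeout (in seconds) expires first.

    The default port 8100 is an assumption; adjust it to your setup.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port():
        print("soffice is accepting connections")
    else:
        print("gave up waiting for soffice")
```

A successful connection only shows that something is listening on the port, not that the conversion will succeed.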
Example: