This post will explain how to automatically convert MS Word files (with images) to Mediawiki pages. Any filetype OpenOffice supports can be converted.
Short explanation: We use OpenOffice to convert the Word files to wiki-syntax, but some voodoo is needed to fetch and upload any images included in the Word-file (the "voodoo" is depicted yellow in the flowchart below):
Prerequisite and install:
A) Linux - but may work on other platforms as well (not tested)
B) Install Perl and Python
C) Install the Python-UNO bridge. This enables Python to talk to the OpenOffice API (and do the conversion)
# apt-get install python-uno
D) Install OpenOffice. We run OpenOffice "headless", so X is not required.
E) Install the OpenOffice "Sun Wiki Publisher" extension. This adds support for .wiki and .xml export.
# unopkg add sun-wiki-publisher.oxt
F) Create a word2mediawiki directory. Download the word2mediawiki.pl script, and the PyODConverter script. Note! I've modified the PyODConverter script to support .wiki and .xml. You can download the modified version below:
$ mkdir word2mediawiki
$ cd word2mediawiki
$ wget http://www.larsstrand.no/code/word2mediawiki/word2mediawiki.pl
G) Install pywikipediabot:
$ svn co http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia
H) Configure pywikipediabot. Use the testbox2_family.py file as template for your Mediawiki installation. The file should be self-explanatory:
$ cd pywikipedia/families
Add username and password:
$ cd ..
$ cat user-config.py
# -*- coding: utf-8 -*-
family = 'testbox2'
mylang = 'en'
usernames['testbox2']['en'] = u'Wiki-USERNAME'
password_file = "user-password"
$ cat user-password
("Wiki-USERNAME", "Wiki-PASSWORD")
I) Test pywikipediabot:
$ python ./login.py -force -all
unicode test: triggers problem #3081100
Logging in to testbox2:en as Wiki-USERNAME via API.
Should be logged in now
Great! We're ready for our first test.
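If login.py instead dies with a SyntaxError or IndentationError that points into user-config.py, the config file itself does not parse as Python (stray leading whitespace on a line is a typical cause). Below is a minimal sketch of a syntax check you could run first. The function name check_config_syntax is my own and not part of pywikipediabot, and it assumes Python 3, so it only tells you whether the file parses, not whether the bot accepts the settings.

```python
def check_config_syntax(source, filename="user-config.py"):
    """Return None if source parses as Python, else a short error message.

    Hypothetical helper, not part of pywikipediabot. Only the syntax is
    checked; nothing in the config is executed.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return "%s, line %s: %s" % (filename, err.lineno, err.msg)
    return None


# A config like the one in this post, and the same config with a stray
# leading space on line 2.
GOOD = "# -*- coding: utf-8 -*-\nfamily = 'testbox2'\nmylang = 'en'\n"
BAD = "# -*- coding: utf-8 -*-\n family = 'testbox2'\nmylang = 'en'\n"

for sample in (GOOD, BAD):
    print(check_config_syntax(sample) or "syntax OK")
```

To check a real file, read it and pass the text in: check_config_syntax(open("user-config.py").read()).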
Convert:
Pick a Word file and run:
$ ./word2mediawiki.pl ../Testfile.doc
#############################
## Converting /export/home/lks/tmp/Testfile.doc to .wiki and .xml using soffice..
#############################
Converting image: ../converted/Testfile_1.jpg
Converting image: ../converted/Testfile_2.jpg
Rewrote wiki page with new Image tag: [[Image:Testfile_1.jpg]]
Rewrote wiki page with new Image tag: [[Image:Testfile_2.jpg]]
#############################
## Conversion complete: ../converted/Testfile.wiki
... >>>> Skipping a bunch of output here.
## Uploading the wiki page
## Exec: python ./pywikipediabot/pagefromfile.py -start:XZXZ42 -end:YZYZ42 -safe -file:../converted/Testfile.wiki
Reading '../converted/Testfile.wiki'...
>>> Testfile <<<
Logging in to testbox2:en as Wiki-USERNAME via API.
Should be logged in now
Sleeping for 8.3 seconds, 2010-12-29 18:03:38
Creating page [[Testfile]] via API
End of file.
## Conversion and upload complete
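The image-tag rewriting and the page file handed to pagefromfile.py can be sketched in Python roughly as follows. This is not the word2mediawiki.pl code (that script is Perl); it is a small illustration under a few assumptions: the exported .wiki text contains empty [[Image:]] placeholders, images are numbered Testfile_1.jpg, Testfile_2.jpg and so on as in the output above, and the page file puts the title between ''' quotes inside the XZXZ42/YZYZ42 markers. All function names are made up for this example, and the code needs Python 3.

```python
import os
import re

# Assumption: the exporter leaves one empty [[Image:]] placeholder per picture.
EMPTY_IMAGE_TAG = re.compile(r"\[\[Image:\]\]")


def page_title(doc_path):
    """Derive the wiki page title from the document's file name."""
    return os.path.splitext(os.path.basename(doc_path))[0]


def rewrite_image_tags(wiki_text, title, extension="jpg"):
    """Fill empty image tags with <title>_1.jpg, <title>_2.jpg, ...

    Returns the rewritten text and the number of tags that were filled.
    """
    count = 0

    def numbered_tag(match):
        nonlocal count
        count += 1
        return "[[Image:%s_%d.%s]]" % (title, count, extension)

    # re.sub visits every match, so several images on one line are all handled.
    return EMPTY_IMAGE_TAG.sub(numbered_tag, wiki_text), count


def wrap_for_pagefromfile(title, wiki_text, start="XZXZ42", end="YZYZ42"):
    """Wrap one page in start/end markers, title first in ''' quotes.

    The exact layout the Perl script writes is an assumption here.
    """
    return "%s\n'''%s'''\n%s\n%s\n" % (start, title, wiki_text, end)


title = page_title("../Testfile.doc")
text, replaced = rewrite_image_tags("Intro [[Image:]] and [[Image:]]", title)
print(wrap_for_pagefromfile(title, text))
```

Using a substitution callback instead of a per-line test means it does not matter how many images share a line.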
Note 1: You might get a warning when pywikipediabot tries to upload the images or the .doc file, or to create the wiki page. This can happen if the same image or .doc file has already been uploaded. If a wiki page with the same name already exists, the bot issues a warning and aborts.
Note 2: When OpenOffice converts the .doc file, it might spew out a bunch of warnings and/or error messages. These can be ignored. OpenOffice complains a lot.
Note 3: If the script exits with a complaint about "Can't connect to soffice on port", just re-run the script. OpenOffice can be a little slow to start (it forks into the background the first time).
Note 4: The conversion is nowhere near perfect, and you might want to look over the wiki-page to ensure correct formatting.
Note 5: The filename of the Word file is used as the name of the wiki page. Example: "Testfile.doc" results in "mediawiki/index.php/Testfile"
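Instead of re-running by hand when soffice is slow to start (Note 3), a wrapper could wait until the port accepts connections before starting the conversion. A minimal Python 3 sketch follows; wait_for_port is a made-up helper, and the port to pass in is whatever your converter script connects to (8100 in the usage line below is an assumption, so check your copy of the script).

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout.

    Hypothetical helper: it only checks that something is listening, not
    that the listener really is OpenOffice.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener is up; close it right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Usage would be along the lines of: if wait_for_port(8100): run the conversion, otherwise report that soffice never came up.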
6 comments:
Nice work - thanks.
One suggested change: If a line has multiple images, then only the first is rewritten. This can be fixed by changing
if ($line =~ /\[\[Image\:\]\]/) {
to
while ($line =~ /\[\[Image\:\]\]/) {
Thanks! Fixed.
Very interesting, unfortunately can't make it work:
------------------------------------
[root@wikifone pywikipedia]# pwd
/home/ebah/word2mediawiki/pywikipedia
[root@wikifone pywikipedia]# python ./login.py -force -all
Traceback (most recent call last):
File "./login.py", line 58, in ?
import re, os, query
File "/home/ebah/word2mediawiki/pywikipedia/query.py", line 28, in ?
import wikipedia, time
File "/home/ebah/word2mediawiki/pywikipedia/wikipedia.py", line 143, in ?
from pywikibot import *
File "/home/ebah/word2mediawiki/pywikipedia/pywikibot/__init__.py", line 15, in ?
from exceptions import *
File "/home/ebah/word2mediawiki/pywikipedia/pywikibot/exceptions.py", line 13, in ?
import config
File "/home/ebah/word2mediawiki/pywikipedia/config.py", line 551, in ?
execfile(_filename)
File "/home/ebah/word2mediawiki/pywikipedia/user-config.py", line 2
family = 'wikifone'
^
SyntaxError: invalid syntax
------------------------------------
Running under CentOS 5.
I installed all the Python packages (dev, lib, etc.).
Any suggestion?
Thanks a lot,
Evan
Hey Lars,
Can you help me with this, please?
root@puffin:~/word2mediawiki/pywikipedia# python ./login.py -force -all
Traceback (most recent call last):
File "./login.py", line 58, in
import re, os, query
File "/root/word2mediawiki/pywikipedia/query.py", line 28, in
import wikipedia, time
File "/root/word2mediawiki/pywikipedia/wikipedia.py", line 143, in
from pywikibot import *
File "/root/word2mediawiki/pywikipedia/pywikibot/__init__.py", line 15, in
from exceptions import *
File "/root/word2mediawiki/pywikipedia/pywikibot/exceptions.py", line 13, in
import config
File "/root/word2mediawiki/pywikipedia/config.py", line 551, in
execfile(_filename)
File "/root/word2mediawiki/pywikipedia/user-config.py", line 2
family = 'ventura'
^
IndentationError: unexpected indent
root@puffin:~/word2mediawiki/pywikipedia#
server is ubuntu ....
Thanks and regards
Only two different script engines? I think this is far too easy!
I was expecting this had a way to put base64 images in wiki markup along with the rest, but this is a complete html imported as a page without mediawiki markup.
You can achieve the same result using the LiveDocx service from the Zend Framework (also possible without Zend). The latter is PHP, which is the language MediaWiki is written in. This also does not require you to install any mods on Apache. The Zend Framework can simply be uploaded without any .ini modifications (although you can modify the .ini to embed it).
phpDocx is another alternative, but requires you to pay for anything decent, although there is a free version.
... and while we're at it, a less secure option: the fileIndexer extension is able to index documents if you don't want to convert anything at all.
Nice as an additional option in Perl & Python, but it took me a while to set up and does the same as simpler alternatives.