Many Pies


Friday, July 16, 2010

Lotus Notes to mediawiki

Notes document and media wiki page header
I'm in the middle of moving a Lotus Notes document database to a mediawiki wiki. (Or is that a media wiki?)

I did it with two scripts: a Perl script that uses the HTML::WikiConverter module, and a Python script.

The starting point is to save each Notes document as a web page (using Firefox, and saving as "complete" so you get all the images). The Perl script (reproduced at the end) then converts each web page to a text file. I did a bit of custom processing to remove some tags (font, div, center) and some attributes (width, border, valign, bgcolor).

(In order to get the WikiConverter module to work I had to fix a bug in either the module itself or the mediawiki-specific module; sorry, I can't remember which. However, a workaround was listed in the bug report, and it involved a script to rebuild the grammar in the CSS::Parse module.)

The Python script then takes the text files, now in mediawiki format, and converts them to XML file(s). For testing I use one XML file per text file. For the real thing I put them all in one XML file. The Python puts the appropriate XML around the text so that the pages have titles. I haven't included the Python, as it's quite specific to what I want. However, here's a clue for you: the title is two lines after a line with the word "Subject:" in it.
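This isn't my actual script, but a minimal sketch of the idea might look like the following. It assumes the page title really is two lines after the first "Subject:" line, and that a bare `<mediawiki>` / `<page>` / `<title>` / `<revision>` / `<text>` structure is enough for an import; a real import file may need more (a namespace or version attribute, for instance). The function names `find_title` and `build_import_xml` and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_title(wiki_text):
    """Return the line two lines after the first line containing 'Subject:'."""
    lines = wiki_text.splitlines()
    for i, line in enumerate(lines):
        if "Subject:" in line and i + 2 < len(lines):
            return lines[i + 2].strip()
    return None  # no usable "Subject:" line found


def build_import_xml(pages):
    """Wrap (title, wiki_text) pairs in a minimal <mediawiki> document.

    Assumption: this is a bare-bones subset of the import format.
    """
    root = ET.Element("mediawiki")
    for title, wiki_text in pages:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = wiki_text
    return ET.tostring(root, encoding="unicode")


# Hypothetical converted text file, for illustration only
sample = "Subject:\n\nBackup procedure\n\n* First step\n"
xml_out = build_import_xml([(find_title(sample), sample)])
```

Passing one pair gives the one-file-per-page layout I use for testing; passing the whole list gives the single file for the real import.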

As well as parsing the text to find the title at the top, I also converted a document history table at the bottom of each file (part of our documents, not part of the Notes template) into a series of mediawiki "revisions", so that the record of what changed in each document, who changed it, and when wasn't lost. This is useful even though I don't have the actual revisions.
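As a rough sketch of the revisions idea (again, not my real script): suppose the history table has already been parsed into (timestamp, username, comment) rows, oldest first. That row layout, the function name `add_revisions`, and the sample names and dates below are all hypothetical. Each row becomes a `<revision>` element, and since the old versions of the text don't exist, every revision carries the same current text.

```python
import xml.etree.ElementTree as ET


def add_revisions(page, history, wiki_text):
    """Append one <revision> per history row to a <page> element.

    history: list of (iso_timestamp, username, comment) tuples, oldest first.
    The same wiki_text is used for every revision, because only the
    current version of the document is available.
    """
    for timestamp, username, comment in history:
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "timestamp").text = timestamp
        contributor = ET.SubElement(revision, "contributor")
        ET.SubElement(contributor, "username").text = username
        ET.SubElement(revision, "comment").text = comment
        ET.SubElement(revision, "text").text = wiki_text
    return page


# Hypothetical page and history rows, for illustration only
page = ET.Element("page")
ET.SubElement(page, "title").text = "Backup procedure"
history = [
    ("2008-03-01T00:00:00Z", "Alice", "First version"),
    ("2009-11-20T00:00:00Z", "Bob", "Updated tape rotation"),
]
add_revisions(page, history, "* First step")
```

The page history in the wiki then shows who changed what and when, even though each entry displays the same text.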

Each page does need a bit of attention, because this three-stage conversion isn't perfect. For example, successive bullet points come out with blank lines between them, which is fine until you have indented bullets, which then don't render properly in mediawiki.
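The bullet problem is the sort of thing that could probably be cleaned up automatically. Here is a small sketch, assuming bullet lines always start with `*` at the beginning of the line; `tighten_bullets` is a made-up name, and I haven't run this over the real pages.

```python
def tighten_bullets(wiki_text):
    """Drop blank lines that sit directly between two bullet lines."""
    lines = wiki_text.split("\n")
    result = []
    for i, line in enumerate(lines):
        if line.strip() == "":
            # Last line kept so far (blank lines between bullets are never kept)
            prev_line = result[-1] if result else ""
            # Look ahead to the next non-blank line, if there is one
            next_line = next((l for l in lines[i + 1:] if l.strip()), "")
            if prev_line.startswith("*") and next_line.startswith("*"):
                continue  # blank line between two bullets: skip it
        result.append(line)
    return "\n".join(result)


cleaned = tighten_bullets("* a\n\n** b\n\n* c\n\nText")
```

Blank lines between a bullet and ordinary text are left alone, so paragraphs after a list still start on their own line.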

With hindsight, I wish I'd put a "to do" category on each page at import time, which I could then remove once I'd tidied the page up, so that I could see what remains to be done. As it is, I add a category once I've tidied a page, but eventually every page will have that category, which will make it meaningless. Removing it would mean editing every page, unless there's some global change plugin I'm not aware of.
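Had I thought of it in time, tagging each page during the conversion would have been a one-line job. A sketch, where the category name "Needs tidying" and the function name `add_category` are just examples:

```python
def add_category(wiki_text, category="Needs tidying"):
    """Append a mediawiki category link to the end of the page text."""
    return wiki_text.rstrip("\n") + "\n\n[[Category:" + category + "]]\n"


tagged = add_category("Some page text\n")
```

The category page then lists everything still waiting for attention, and it empties out as pages get tidied.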

# Convert saved pages from Notes Documents to media wiki format
use HTML::WikiConverter;

sub DropTag($$) {
    # Replace every element with the given tag name by its own content
    my ($page, $tag) = @_;
    my @tags = $page->look_down("_tag", $tag); # All elements with this tag name
    foreach my $element (@tags) {
        $element->replace_with_content();
    }
}

sub DropAttr($$) {
    my ($page, $attr) = @_;
    my @attrs = $page->look_down($attr, qr/.*/); # Tags with appropriate attribute set to anything
    foreach my $element (@attrs) {
        $element->attr($attr, undef);
    }
}

sub ExtraProcessing ($) {
    # Does various extra things that we need:
    my ($page) = @_;

    DropTag($page, "font");
    DropTag($page, "div");
    DropTag($page, "center");

    DropAttr($page, "width");
    DropAttr($page, "border");
    DropAttr($page, "valign");
    DropAttr($page, "bgcolor");
}


my $in_dir  = "saved html files";
my $out_dir = "output text files";
my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
opendir(my $dir, $in_dir) or die("Could not open directory $in_dir: $!\n");
my @files = readdir($dir);
closedir($dir);
foreach my $path (@files) {
    if ($path =~ m/\.html?$/i) {
        print $path."\n";
        # Write the converted text under the same file name in the output directory
        open(my $out, ">", "$out_dir/$path") or die("Could not open file for output: $!\n");
        print $out $wc->html2wiki( file => "$in_dir/$path", strip_tags => [ '~comment', 'head', 'script', 'style' ], preprocess => \&ExtraProcessing );
        close($out);
    }
}
