Many Pies

Showing posts with label perl. Show all posts

Friday, July 16, 2010

Lotus Notes to mediawiki

Notes document and media wiki page header
I'm in the middle of moving a Lotus Notes document database to a mediawiki wiki. (Or is that a media wiki?)

I did it with two scripts: a Perl script that uses the HTML::WikiConverter module, and a Python script.

The starting point is to save each Notes document as a web page (using Firefox, and saving as "complete" so you get all the images). The Perl script (reproduced at the end) then converts each web page to a text file. I did a bit of custom processing to remove some tags (font, div, center) and some attributes (width, border, valign, bgcolor).

(In order to get the WikiConverter module to work I had to fix a bug in either the module itself or the MediaWiki-specific dialect module. Sorry, I can't remember which. However, a workaround was listed in the bug report, and it involved a script to rebuild the grammar in the CSS::Parse module.)

The Python script then takes the text files, now in mediawiki format, and converts them to XML file(s). For testing I use one XML file per text file. For the real thing I put them all in one XML file. The Python puts the appropriate XML around the text so that the pages have titles. I haven't included the Python, as it's quite specific to what I want. However, here's a clue for you: the title is two lines after a line with the word "Subject:" in it.
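
For anyone wanting to try the same thing, here is a minimal sketch of that step. It is not the script used here. The function names are made up for illustration, and the bare `<mediawiki>`/`<page>`/`<revision>`/`<text>` layout is an assumption based on the MediaWiki XML export format; a real import may also want the schema and version attributes on the root element.

```python
from xml.sax.saxutils import escape


def find_title(lines):
    """Return the line two lines after the first line containing 'Subject:'."""
    for i, line in enumerate(lines):
        if "Subject:" in line and i + 2 < len(lines):
            return lines[i + 2].strip()
    return None


def page_xml(title, wikitext):
    """Wrap one page of wikitext in <page> markup (assumed minimal element set)."""
    return (
        "  <page>\n"
        f"    <title>{escape(title)}</title>\n"
        "    <revision>\n"
        f'      <text xml:space="preserve">{escape(wikitext)}</text>\n'
        "    </revision>\n"
        "  </page>\n"
    )


def build_import(texts):
    """Combine several converted text files into one XML document."""
    pages = []
    for text in texts:
        title = find_title(text.splitlines())
        if title is None:
            continue  # no "Subject:" marker found; deal with this file by hand
        pages.append(page_xml(title, text))
    return "<mediawiki>\n" + "".join(pages) + "</mediawiki>\n"
```

Passing a single text gives the one-file-per-page layout for testing; passing all of them gives the single file for the real import.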

As well as parsing the text to find the title at the top, I also converted a document history table at the bottom of each file (part of our documents, not part of the Notes template) into a series of mediawiki "revisions", so that the record of what was changed in each document, by whom, and when wasn't lost. This is useful even though I don't have the actual revisions.
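
A rough sketch of how history rows could become revisions, assuming the table has already been parsed into (timestamp, author, description) tuples. The table layout itself isn't shown here, so that parsing is left out. The function names are hypothetical, and the `<timestamp>`, `<contributor>` and `<comment>` elements are taken from the MediaWiki XML export format. Since the old text isn't available, every revision carries the same current text.

```python
from xml.sax.saxutils import escape


def revision_xml(timestamp, author, comment, wikitext):
    """Build one <revision> element; timestamp is expected in ISO 8601 form."""
    return (
        "    <revision>\n"
        f"      <timestamp>{escape(timestamp)}</timestamp>\n"
        f"      <contributor><username>{escape(author)}</username></contributor>\n"
        f"      <comment>{escape(comment)}</comment>\n"
        f'      <text xml:space="preserve">{escape(wikitext)}</text>\n'
        "    </revision>\n"
    )


def page_with_history(title, wikitext, history):
    """history: list of (timestamp, author, description) tuples, oldest first."""
    revisions = "".join(
        revision_xml(when, who, what, wikitext) for when, who, what in history
    )
    return f"  <page>\n    <title>{escape(title)}</title>\n{revisions}  </page>\n"
```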

Each page does need a bit of attention, because this three-stage conversion isn't perfect. For example, successive bullet points have blank lines between them, which is fine until you have indented bullets, at which point they don't render properly in mediawiki.
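
The bullet problem could probably be tidied automatically. Here is a small sketch (not part of the conversion described above) that drops blank lines sitting between two list items. It assumes list items start with one or more `*` or `#` characters followed by a space, and the function name is made up.

```python
import re

# Assumed list-item shape: "* item", "** nested item", "# numbered item"
LIST_ITEM = re.compile(r"^[*#]+\s")


def tighten_lists(wikitext):
    """Remove blank lines that separate two list items; keep all others."""
    lines = wikitext.split("\n")
    out = []
    for i, line in enumerate(lines):
        if line.strip() == "":
            # Nearest non-blank neighbours on either side of this blank line
            prev = next((l for l in reversed(out) if l.strip()), "")
            nxt = next((l for l in lines[i + 1:] if l.strip()), "")
            if LIST_ITEM.match(prev) and LIST_ITEM.match(nxt):
                continue
        out.append(line)
    return "\n".join(out)
```

Blank lines between a list and an ordinary paragraph are left alone, so paragraph breaks survive.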

With hindsight, I wish I'd put a category onto each page at import time, which I could then remove once I'd tidied the page up, to see what remains to be done. As it is, I add a category once I've tidied a page, but eventually every page will have that category, which will make it meaningless. Removing it would mean editing every page, unless there's some global change plugin I'm not aware of.
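
Tagging at import time would only take a few lines in the conversion step. A sketch, with a made-up function name and a made-up category name ("Needs tidying"); the `[[Category:...]]` link syntax is standard mediawiki.

```python
def add_category(wikitext, category="Needs tidying"):
    """Append a category link to the page text unless it is already there."""
    tag = f"[[Category:{category}]]"
    if tag in wikitext:
        return wikitext
    return wikitext.rstrip("\n") + "\n\n" + tag + "\n"
```

Deleting the tag by hand while tidying a page then shrinks the category listing to exactly the pages still waiting.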

# Convert saved pages from Notes documents to MediaWiki format
use strict;
use warnings;
use HTML::WikiConverter;

# Replace every element with the given tag name by its own content
sub DropTag($$) {
    my ($page, $tag) = @_;
    my @tags = $page->look_down("_tag", $tag);
    foreach my $element (@tags) {
        $element->replace_with_content();
    }
}

# Remove the given attribute from every element that has it
sub DropAttr($$) {
    my ($page, $attr) = @_;
    my @attrs = $page->look_down($attr, qr/.*/); # Tags with appropriate attribute set to anything
    foreach my $element (@attrs) {
        $element->attr($attr, undef);
    }
}

sub ExtraProcessing($) {
    # Does various extra things that we need:
    my ($page) = @_;

    DropTag($page, "font");
    DropTag($page, "div");
    DropTag($page, "center");

    DropAttr($page, "width");
    DropAttr($page, "border");
    DropAttr($page, "valign");
    DropAttr($page, "bgcolor");
}

my $in_dir  = "saved html files";
my $out_dir = "output text files";

my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
opendir(my $dir, $in_dir) or die("Could not open directory $in_dir\n");
my @files = readdir($dir);
closedir($dir);
foreach my $path (@files) {
    if ($path =~ m/\.htm/) {
        print $path."\n";
        # Each output file keeps the name of the input file it came from
        open(my $out, ">", "$out_dir/$path") or die("Could not open file for output\n");
        print {$out} $wc->html2wiki( file => "$in_dir/$path", strip_tags => [ '~comment', 'head', 'script', 'style' ], preprocess => \&ExtraProcessing );
        close($out);
    }
}

Friday, October 12, 2007

Perl programmer wanted

I'm working on a project which is a website containing material for those translating the Old Testament. We are looking for a volunteer (i.e. you don't get any money for it) to help us out next year (2008).

We would like someone who has Perl experience. The sort of time commitment we would like is a week a month, say, from January, for up to a year. Alternatively a three month block of time, if someone were between jobs for example, would be great.

You can contact me at Paul underscore Morriss at wycliffe dot org.

Friday, August 31, 2007

Perl/catalyst blog

Despite it being a new blog, Jamie has got the name perldev.blogspot.com, and in it he describes some woes with Catalyst installation.