Sunday, November 1, 2009

C++ Version of nmerge

One problem with the current design of nmerge is that it is written in Java. The command-line tool is a thin C wrapper around that, and if you want to process larger files you can't pass in arguments to increase the available memory, so it simply fails on large files. If you want to run it on servers that don't have, or won't allow, Java (true of many commercial hosting sites), you're also out of luck. Since the Digital Variants people, and probably a large number of humanities projects, will have these problems too, I have decided to convert it into pure C++. This should be relatively easy, and the benefits are:

  1. memory usage will be limited only by what is available on the machine, not by the amount allocated to the Java Virtual Machine (JVM)
  2. nmerge-c++ will be callable from PHP or another scripting language without requiring installation of a JVM.
  3. nmerge will optionally be able to write to a database instead of directly to disk. This is usually the only way you can save changes on a commercial hosting site.
  4. The C++ version will use far less memory than the Java version and should be a bit faster.

Overall, these changes will facilitate the building of a practical web application or plugin, which can be added to existing sites. Initially, my intention is to produce a Joomla! plugin that other people can use.

Some changes that will be possible in this revision include:

  1. Grouped transpositions. By assessing individual transposition candidates as a group it will be possible to detect larger transpositions that contain small corrections.
  2. Proper multi-tasking of the merging process in C++ will hopefully speed up the algorithm considerably.
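
To make the first point a little more concrete, here is a minimal C++ sketch of what grouping might look like. It is not nmerge's actual code: the `Candidate` and `Group` records, the `groupCandidates` function and the `maxGap` threshold are all hypothetical names invented for illustration. The assumption is that two candidates belong to the same larger transposition if the second follows the first in both versions, with only a few unmatched characters (the small corrections) in between.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: `length` characters found at offset `src` in one
// version and at offset `dst` in another.
struct Candidate {
    std::size_t src;
    std::size_t dst;
    std::size_t length;
};

// Hypothetical result: one larger transposition covering several candidates.
// The two ranges may differ in size because the corrections between the
// candidates need not be the same length in both versions.
struct Group {
    std::size_t srcBegin, srcEnd;  // half-open range in the first version
    std::size_t dstBegin, dstEnd;  // half-open range in the second version
    std::size_t members;           // number of candidates absorbed
};

// Join candidates that follow one another in both versions with at most
// `maxGap` unmatched characters between them.
std::vector<Group> groupCandidates(std::vector<Candidate> cands,
                                   std::size_t maxGap) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.src < b.src;
              });
    std::vector<Group> groups;
    for (const Candidate& c : cands) {
        assert(c.length > 0);  // an empty candidate makes no sense here
        if (!groups.empty()) {
            Group& g = groups.back();
            bool srcFollows = c.src >= g.srcEnd && c.src - g.srcEnd <= maxGap;
            bool dstFollows = c.dst >= g.dstEnd && c.dst - g.dstEnd <= maxGap;
            if (srcFollows && dstFollows) {
                // Extend the current group over the gap and the candidate.
                g.srcEnd = c.src + c.length;
                g.dstEnd = c.dst + c.length;
                ++g.members;
                continue;
            }
        }
        // Otherwise this candidate starts a new group of its own.
        groups.push_back(Group{c.src, c.src + c.length,
                               c.dst, c.dst + c.length, 1});
    }
    return groups;
}
```

With a `maxGap` of 5, two candidates separated by a three-character correction on one side and a two-character correction on the other would be reported as a single transposition, while a distant third candidate would stay on its own. A real implementation would also have to decide how to score a group against its individual members, which this sketch leaves out.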

That's the plan. I thought I'd let you know where I'm taking this: the aim is to turn it into a generally usable tool.

There is at least one drawback, of course. C++ is cumbersome to write code in, compared to the relative heaven of Java. It's like painting a room with a brush instead of a roller.

Friday, October 23, 2009

Whoops!

A favourite quotation from Edsger Dijkstra is that 'testing shows the presence, not the absence of bugs'. This is very true. In Australian homes too you can squash a few cockroaches and think you've got them all, but how do you know there isn't a whole colony hiding in the skirting boards? I'm guilty of putting in a '!' when I shouldn't have. My only explanation is that I was jetlagged in Montreal and fed up with preparing my presentation. For some reason I put in that 'not', which prevented nmerge from finding any left-side transpositions at all. All I can say is: 'Whoops!'

I'll fix it in the next hour or so and upload the new version as 1.0.2, and update Alpha too. The transposition algorithm is not perfect - I never said it was, if you read the Balisage paper, particularly at the end - but it is workable. One thing you should keep in mind is that this is a unique program in its field. Several people have written merging programs for humanistic texts, and a couple have even included transpositions (MEDITE, JNDiff), but only between two texts at a time. I merge N texts into one digital representation.

One thing I'd like to do soon is make it find transpositions in groups (a flaw that Peter Robinson rightly pointed out). And it could be even faster, if I can work out how to parallelise the algorithm. That's why I 'built' this fancy i7 computer.

The good thing about computing variants automatically rather than manually is that it is not final. Any improvements in the algorithm are immediately visible. Whereas making systematic changes to a manually coded set of texts with complex variants is not trivial.

Tuesday, September 1, 2009

New Versions of nmerge and Alpha posted

The difference here is that nmerge now includes the full source code, released under the GPL v3, and also contains a single example text that I can give away under the same license. It is the first scene of Shakespeare's King Lear. I have tried to make it as true to the source texts as I can but it's a lot of work getting markup to look like a manuscript. I never realised before how much the tags interfere with that. It's very annoying. Anyway, let me know if there are any mistakes. Or any ideas on how Alpha can be improved. I'm sure there are lots.

Of course it is full of markup hacks, mainly lines split over speeches, but I couldn't fix that without introducing another layer for each MS. I'd prefer using some technology other than markup for the content, but there isn't one yet. Oh well!

Here's the link.