Sunday, November 1, 2009

C++ Version of nmerge

One problem with the current design of nmerge is that it is written in Java. The command-line tool is a thin C wrapper around the Java code, and it gives you no way to pass in arguments that increase the memory available to Java, so it simply fails on large files. And if you want to run it on servers that don't have, or won't allow, Java (true of many commercial hosting sites), you're out of luck. Since the Digital Variants people, and probably a large number of other humanities projects, will have these problems too, I have decided to convert it into pure C++. This should be relatively easy, and the benefits are:

  1. memory usage will be limited only by what is available on the machine, not by what is allocated to the Java Virtual Machine (JVM)
  2. nmerge-c++ will be callable from PHP or another scripting language without requiring installation of a JVM.
  3. nmerge can optionally write to a database instead of directly to disk. This is usually the only way you can save changes on a commercial hosting site.
  4. The C++ version will use far less memory than the Java version and should be a bit faster.

Overall, these changes will facilitate the building of a practical web application or plugin, which can be added to existing sites. Initially, my intention is to produce a Joomla! plugin that other people can use.

Some changes that will be possible in this revision include:

  1. Grouped transpositions. By assessing individual transposition candidates as a group, it will be possible to detect larger transpositions that contain small corrections.
  2. Proper multi-tasking of the merging process in C++ will hopefully speed up the algorithm considerably.
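
To give a rough idea of what grouping transposition candidates might look like, here is a minimal C++ sketch. It is not nmerge's actual algorithm or data structure: the Span record, the groupCandidates function and the maxGap tolerance are all names invented for illustration. The assumption is that each candidate records where a run of text sits in one version and where it reappears in another, and that two candidates belong together when they follow one another in both versions with only a few unmatched characters (the small corrections) in between.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record, not nmerge's real data structure: text occupying
// [src, src + srcLen) in one version reappears at [dst, dst + dstLen) in another.
struct Span {
    std::size_t src, srcLen;
    std::size_t dst, dstLen;
};

// Fold candidates that follow one another in both versions, separated by at
// most maxGap unmatched characters on each side, into one larger span.
std::vector<Span> groupCandidates(std::vector<Span> cands, std::size_t maxGap)
{
    // Process candidates in source order so that neighbours are adjacent.
    std::sort(cands.begin(), cands.end(),
              [](const Span& a, const Span& b) { return a.src < b.src; });

    std::vector<Span> groups;
    for (const Span& c : cands) {
        assert(c.srcLen > 0 && c.dstLen > 0);  // empty candidates make no sense
        if (!groups.empty()) {
            Span& g = groups.back();
            const std::size_t srcEnd = g.src + g.srcLen;
            const std::size_t dstEnd = g.dst + g.dstLen;
            const bool follows = c.src >= srcEnd && c.src - srcEnd <= maxGap
                              && c.dst >= dstEnd && c.dst - dstEnd <= maxGap;
            if (follows) {
                // Extend the group to cover the gap and the new candidate.
                g.srcLen = c.src + c.srcLen - g.src;
                g.dstLen = c.dst + c.dstLen - g.dst;
                continue;
            }
        }
        groups.push_back(c);  // start a new group
    }
    return groups;
}
```

With maxGap set to 3, a candidate of length 10 at source offset 100 and another of length 20 at source offset 112, reappearing at offsets 500 and 511 in the other version, would be folded into a single larger span, while an unrelated candidate far away would stay on its own. A real implementation would also have to decide how to score competing groups, which this sketch ignores.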

That's the plan. I thought I'd let you know where I'm taking this: the aim is to turn nmerge into a generally usable tool.

There is at least one drawback, of course. C++ is cumbersome to write code in, compared to the relative heaven of Java. It's like painting a room with a brush instead of a roller.
