I have just completed the move from one NoSQL database (CouchDB) to another (MongoDB). The speed increase is exhilarating, though I haven't yet measured by how much. The comparisons are not cached; they are computed on the fly, and the speed is the fruit of good design and careful implementation. As I have said all along, conventional approaches based on XML are too inefficient and inadequate. This design, based on multi-version documents and hosted on a Jetty/Java service with a C core, is faster than any other such service I know of, and more flexible. There is also the new C version of nmerge to add, which should increase capacity and speed further. Try out the test interface for yourself.
Sunday, February 24, 2013
MVDs now support any text encoding. I've used ICU, the IBM-supplied library, for conversions to and from Unicode. It's very good. So texts may now be added to an MVD from any textual source: all you do is specify the encoding. Internally, merging is now done on 16-bit UTF-16 code units rather than 8-bit UTF-8 bytes. I don't know of any rival text-comparison program that can do this. This is a big improvement over the previous version of nmerge.
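To make the idea concrete, here is a minimal sketch of "specify the encoding, get 16-bit units out". It is not nmerge's code: nmerge does this in C with ICU, whereas this sketch uses Python's built-in codecs as a stand-in, and the function name to_utf16_units is my own invention for illustration.

```python
import struct

def to_utf16_units(raw, encoding):
    """Decode bytes in the named source encoding and return UTF-16 code units.

    Hypothetical helper: Python's codecs stand in for the ICU converters.
    """
    text = raw.decode(encoding)          # source encoding -> Unicode text
    data = text.encode("utf-16-le")      # Unicode text -> UTF-16, little-endian
    # Unpack the byte string as a sequence of unsigned 16-bit integers.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# The same word supplied in two different source encodings
latin1_units = to_utf16_units("caf\u00e9".encode("latin-1"), "latin-1")
utf8_units = to_utf16_units("caf\u00e9".encode("utf-8"), "utf-8")
```

Whatever the source encoding, both calls yield the same list of 16-bit units, so the merging step only ever has to deal with one internal representation.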
Tuesday, February 5, 2013
Extending multi-version documents (MVDs) to properly support languages like Chinese and Bengali, whose characters do not fit into 8 bits, turns out to be easier than I thought. Currently the nmerge tool, which produces MVDs, works only with 8-bit bytes internally, so that an individual character may be split over several bytes, as in UTF-8 encoding. Things get complicated whenever a difference is detected between parts of a character. Making everything 16-bit will facilitate the comparison of texts in any living language and avoid such complications. The exceptions are dead languages like Phoenician or Lydian, whose scripts lie outside the 16-bit range, and even then UTF-16 can encode them, using a pair of 16-bit units for each character. I don't have any Chinese examples, but my friends in India have provided me with some interesting Bengali texts, which I'll be using for testing.
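The complication is easy to demonstrate. The sketch below is only an illustration, not nmerge's algorithm: it compares two Bengali letters first as UTF-8 bytes and then as UTF-16 units, using a simple common-prefix count as a stand-in for real difference detection. The helper names are hypothetical.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def common_prefix_len(a, b):
    """Count how many leading elements two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

ka, kha = "\u0995", "\u0996"   # two different Bengali letters

# As UTF-8 the letters are E0 A6 95 and E0 A6 96: the first two bytes match,
# so a byte-level comparison finds a difference in the middle of a character.
byte_prefix = common_prefix_len(ka.encode("utf-8"), kha.encode("utf-8"))

# As UTF-16 each letter is a single unit, so they differ as whole characters.
unit_prefix = common_prefix_len(utf16_units(ka), utf16_units(kha))

# A Phoenician letter (U+10900) still needs a pair of 16-bit units.
phoenician_alf = utf16_units("\U00010900")
```

Here byte_prefix comes out as 2 and unit_prefix as 0, while phoenician_alf holds the two units 0xD802 and 0xDD00. So for Bengali the 16-bit comparison never cuts a character in half, and only scripts like Phoenician would still be split.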