Sunday, March 11, 2012

A Better Way to do Transpositions

One of the weaknesses of nmerge is its handling of block-size transpositions. It does the small ones OK, but large transposed blocks pose a problem because they are rarely contiguous. They tend to break up into short strings of literal similarity, punctuated by small differences in spelling. For example, if you transposed a paragraph in Shakespeare between two editions, differences in spelling would make it hard for the program to see that all the small similarities add up to an entire transposed block. Every idea I had to get around this limitation threatened to make nmerge much slower. Until now.

At the moment, when you add a version to the variant graph (as shown above) it aligns it with the graph directly opposite. It does this recursively, by merging sections of identical text and then gradually making the leftover sub-graphs and sub-sections of the new version smaller and smaller until all the new text is merged into the main graph. So in the drawing above the left-over sub-graphs are "The quick red/brown" and "lazy dog." and the new version fragments are "The lazy grey" and "quick dog." Transpositions are looked for to the left and right of the opposite graph-section and replace the direct alignment if a longer match is found. This can only find short contiguous sections of transposed text, and if they are far enough away nmerge simply ignores them. The longest match between "The lazy grey" and its opposite sub-graph is "The", but there is a longer match "lazy" with the other sub-graph. NMerge might miss this because it is too far away, relatively speaking.

The new algorithm is much simpler. Rather than align a section of new text with its directly opposite sub-graph and then look to either side for transpositions, it aligns it with all the remaining subgraphs equally. So if there is transposition of an entire block – and we found this quite often in the Tagore poems – nmerge will simply choose the best subgraph to align with for each new section of the text. The problem now is to stop it making trivial transpositions like "the" between the start and end of the work. Some kind of weighting based on previous alignments between blocks as well as distance and length might be the way to go.

No comments: