Sunday, October 26, 2008

Transpositions

Transpositions are undeniably part of real-world texts, and must be included in any practical solution to the technical problem of how to represent overlapping structures in digital texts. The MVD or variant graph model described on this website includes transpositions, but until now there has been no way to calculate them automatically.

Transpositions can't be supported in any markup scheme. In principle, any section of the document can be transposed, not only small bits of text. This means that a feature like the ‘copyof’ attribute (Smith, 1999), if used to record transpositions, would have to allow any element to be contained in any other element – thus destroying the whole idea of a document schema or DTD. Also, the transposed sections might not even contain well-formed markup, i.e. they might contain unmatched start- or end-tags. So this approach doesn’t work.

The method I have been using until now is the one suggested by Lopresti and Tomkins (1997): do all the alignments, variants, deletions and insertions first, and then consider pairs of insertions/deletions as candidates for transposition. The problem is that, while this works well for two versions, it leads to ‘contamination’ between multiple versions: you end up with the same word being recorded as a variant of itself in another version. What is left over after the completion of alignment is a lot of noise that does not form a suitable basis for the calculation of transpositions.
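As an illustration only (not Lopresti and Tomkins’ own code), here is a toy sketch in Python of the ‘align first, pair up the leftovers’ idea, using the standard difflib module for the initial two-version alignment:

    from difflib import SequenceMatcher

    def transposition_candidates(a, b):
        """Align two token lists, then pair deleted runs with inserted
        runs of identical text as candidate transpositions."""
        deletions, insertions = [], []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
            if tag in ('delete', 'replace'):
                deletions.append((i1, a[i1:i2]))
            if tag in ('insert', 'replace'):
                insertions.append((j1, b[j1:j2]))
        # any deleted run that reappears as an insertion is a candidate
        return [(i, j, d) for i, d in deletions
                for j, ins in insertions if d == ins]

    a = 'the cat sat on the mat'.split()
    b = 'on the mat the cat sat'.split()
    print(transposition_candidates(a, b))  # -> [(3, 0, ['on', 'the', 'mat'])]

With many versions, though, these pairings start to interfere with one another, which is exactly the contamination problem described above.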

The French Connection

The approach adopted by Bourdaillet (2007) in his thesis is the exact opposite: do the transpositions (and alignments) first; what is left over can then be treated as candidate variants, insertions and deletions. His method is still tied to just two versions, but it contains some useful ideas that can be applied to aligning N versions too.

I suspect he avoided trying to merge N versions into one document partly because he didn't have a data structure to record the result, and partly because, until now, this has been considered too hard a problem. However, I believe that it is solvable, by combining his general approach with the MVD document structure. I will try to describe how it works by illustrating a simple example step by step.

The Case for Automatic Detection of Transpositions

You might think that automatically detecting transpositions would be a bad idea: if the author transposed some text in a holograph, this should be clear from the manuscript, and the editor can simply encode it as such. But what about the case where there are several versions of the same work – perhaps redrafts of a single text, or independent versions created by copying? Spotting such transpositions visually is very hard work. In these cases calculating transpositions is a good idea, and saves the editor a lot of trouble. Even in the holograph case, calculating transpositions can still be useful: so long as the computer gets it right we don't have to encode anything manually (which saves work); only when it gets it wrong do we have to intervene.

Outline of the Proposed Method

Imagine you have already aligned N versions into an MVD, and now you want to add the (N+1)th version to the structure. (This is the standard inductive formulation: if we can prove that this step works, the method will work for any number of versions.) The Longest Common Substring or LCS is the longest section of text shared by the MVD and the new version, in which successive characters are all the same. The basic algorithm uses this property to merge the entire graph. In essence the algorithm is simply the following (a toy sketch in Python follows the two steps):

  1. Merge the variant graph and the new version where the LCS occurs.
  2. Call the algorithm recursively on the two unaligned sections before and after the LCS.
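
Here is the promised toy sketch. It cheats by treating the ‘graph’ as a single base string rather than a real variant graph, so it illustrates the divide-and-conquer idea only; find_lcs and align are made-up names for this post, not the actual MVD code:

    def find_lcs(a, b):
        """Longest common substring of a and b by dynamic programming
        (quadratic; the real method would use suffix trees for speed)."""
        best = (0, 0, 0)                    # (length, end in a, end in b)
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best[0]:
                        best = (cur[j], i, j)
            prev = cur
        n, i, j = best
        return a[i - n:i], (i - n, i), (j - n, j)

    def align(base, new, offset=0, out=None):
        """Step 1: align on the LCS; step 2: recurse on the unaligned
        sections before and after it."""
        if out is None:
            out = []
        lcs, (a1, a2), (b1, b2) = find_lcs(base, new)
        if not lcs:                         # nothing shared: leave unaligned
            if new:
                out.append(('unaligned', offset, new))
            return out
        align(base[:a1], new[:b1], offset, out)
        out.append(('aligned', offset + b1, lcs))
        align(base[a2:], new[b2:], offset + b2, out)
        return out

The unaligned leftovers are where the variants, insertions, deletions and (as we shall see) the transpositions come from.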

The Challenge

As an example consider the three versions:

1. The quick brown fox jumps over the lazy dog.
2. The quick white rabbit jumps over the lazy dog.
3. The quick brown ferret leaps over the lazy dog.

Imagine we have already built an MVD out of these three. Now we want to merge it with:

4. The white quick rabbit jumps over the dog.

There is a small transposition here: version 4 has ‘white quick’ instead of ‘quick white’ in version 2. Let’s see if the algorithm can detect it.
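Before following the algorithm through the graph step by step, we can run the toy sketch from above on the two versions most directly involved, with version 2 standing in for the whole graph (it is the version the LCS will come from):

    base = 'The quick white rabbit jumps over the lazy dog.'  # version 2
    new = 'The white quick rabbit jumps over the dog.'        # version 4
    for kind, pos, text in align(base, new):
        print(kind, pos, repr(text))
    # aligned   0  'The'
    # unaligned 3  ' white'    <- the transposition candidate
    # aligned   9  ' quick'
    # aligned  15  ' rabbit jumps over the '
    # aligned  38  'dog.'

Text that occurs only in the old versions, like ‘lazy’, simply stays in the graph and will end up as a deletion in version D; the toy sketch only reports the text of the new version.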

Following the Algorithm as it Works

We can add the fourth version to the graph very easily, by creating one big arc with the text of the version and attaching it to the start and end of the graph, like this:

This is already a valid variant graph, but it is full of redundancy. Every word of the new version can also be found in the ‘old’ graph. Apart from wasting storage, this redundancy fails to inform us of the relationship between the various parts of the new version and the rest of the document. The algorithm will correct this problem by gradually removing all of the copies.
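To fix ideas, here is one toy way of encoding this stage, with each arc written as a (from-node, to-node, versions, text) tuple. The node numbers and the coarse three-way split are illustrative only; a real MVD would also factor out shared words like ‘brown’ and ‘jumps’:

    # versions A, B and C already merged (coarsely) into one graph
    arcs = [
        (0, 1, {'A', 'B', 'C'}, 'The quick '),
        (1, 2, {'A'}, 'brown fox jumps'),
        (1, 2, {'B'}, 'white rabbit jumps'),
        (1, 2, {'C'}, 'brown ferret leaps'),
        (2, 3, {'A', 'B', 'C'}, ' over the lazy dog.'),
    ]
    # version D enters as one big arc from the start node to the end node
    arcs.append((0, 3, {'D'}, 'The white quick rabbit jumps over the dog.'))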

The longest common substring (or LCS) between the first three versions and the fourth one is ‘rabbit jumps over the’ from version 2, even though bits of that string are shared by other versions. We align version 4 with the LCS, leaving two unaligned bits, one at each end: ‘The white quick’ and ‘dog.’:

Note that ‘rabbit jumps over the’ has now acquired version D. We then call the same routine (this is the recursion) on the two leftover bits, but align each with the corresponding part of the MVD that precedes or follows the LCS in version 2.

We now have two subgraphs containing arcs that definitely precede or follow the LCS. All the other arcs that are effectively parallel to the LCS are left in place but are not considered further. Now we have to try to align Arc 1 ‘The white quick’ with Graphs 1 and 2, and likewise Arc 2 ‘dog.’ with both. Because we now have more than one subgraph we will have to consider transpositions. However, when we compare the LCS between Arc 2 and Graph 1 (nothing) with that between Arc 2 and Graph 2 (‘dog.’), it is clear that no transposition is possible here. Instead, the best alignment is the direct one of ‘dog.’ between versions ABC and D:
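In terms of the toy sketch this decision is just a comparison of LCS lengths, flattening each subgraph to its version-2 text for simplicity:

    lcs1, _, _ = find_lcs('The quick white', 'dog.')  # Arc 2 vs Graph 1
    lcs2, _, _ = find_lcs('lazy dog.', 'dog.')        # Arc 2 vs Graph 2
    # lcs1 is empty and lcs2 is 'dog.', so the direct, untransposed
    # alignment of 'dog.' wins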

We have no more D-text to align here, so the process stops on the right. The empty D-arc becomes a deletion in version D. On the other hand, there is still work to do on the left, in Graph 1. Here comparison between Graph 1 and Arc 1 suggests either ‘white’ or ‘quick’ as the LCS. We will choose ‘quick’ because it is more central:
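The toy find_lcs happens to make the same choice, though only because of the order in which it scans for maximal matches, not because it measures centrality:

    lcs, _, _ = find_lcs('The quick white', 'The white quick')
    print(repr(lcs))  # ' quick': both matches are maximal; the scan keeps this one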

This is still not quite right. The instance of ‘white’ on the left is clearly a transposition of the ‘white’ on the right. Again the merging of the LCS leads to two arcs and two graphs:

Calculation of the LCS between Arc 1 and Graph 2 yields ‘white’, shown here in bold. We then merge the LCS into the graph as before, except that this is a transposition: we must leave the text where it is and merely point to the target of the transposition. This leaves only one copy of ‘white’ in the graph, plus another copy, shown in grey, that points to it.
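In the toy arc encoding the result might look as follows; the field names are illustrative, not the actual MVD format:

    parent = (5, 6, {'B'}, 'white')      # the surviving copy of 'white'
    child = {
        'versions': {'D'},
        'text': 'white',                 # stays where it occurs in version D
        'transposed_from': parent,       # points at the copy it repeats
    }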

We are still not finished. ‘The’ appears twice. So we need one further LCS calculation between the ‘The’ D-arc and the ‘The’ ABC-arc:

Now that the two ‘The’s are merged, all that remains is to introduce an empty ABC-arc to indicate that ‘white’ only appears in that position in version D.

Taking Stock

We have been recursing into smaller and smaller portions of the graph. That does not mean that these portions or subgraphs are in any way detached from the rest of the graph. The other parts were simply omitted for clarity. Overall the graph now looks like this:

The text of version D has been fully assimilated. It has been aligned with ALL the versions in the graph, not just with the one most similar to the version we were adding. (Aligning only against the most similar version is what biologists do in their ‘progressive’ alignment technique, and they don't even attempt transpositions.) The result is a much better alignment with a lot less redundancy. To add further versions to the MVD we simply repeat the above steps, with the new variant graph as our starting point.

Time Complexity

I believe this routine may eventually be O(N log N), that is, as fast as the famous ‘quicksort’ routine of Hoare from 1961, which it resembles. At the moment it is somewhat slower, because my current calculation of the LCS takes O(N²) time in the worst case. The LCS between two strings can be calculated in O(N) time according to Gusfield, but that requires the construction of two suffix trees using Ukkonen's 1995 algorithm. I have implemented that for the text of the new version, but I can't generate a suffix tree for the variant graph, because it is too difficult and may not be possible in O(N) time. So to calculate the LCS I just traverse the graph, looking for runs of characters that match in the new version, which has been converted into a suffix tree using Ukkonen's algorithm. Overall I think this is O(N²). However, even as it now stands, the algorithm is still very fast, because its expected performance is usually much better than the worst case.

References

D. Lopresti and A. Tomkins, ‘Block edit models for approximate string matching’, Theoretical Computer Science, 181 (1997), 159–179.
J. Bourdaillet, Alignement textuel monolingue avec recherche de déplacements: algorithmique pour la critique génétique, PhD thesis, Université Paris 6 Pierre et Marie Curie, 2007.
C. A. R. Hoare, ‘Algorithm 63: Partition’ and ‘Algorithm 64: Quicksort’, Communications of the ACM, 4.7 (1961), 321–322.
D. Smith, ‘Textual Variation and Version Control in the TEI’, Computers and the Humanities, 33.1 (1999), 103–112.
E. Ukkonen, ‘On-line Construction of Suffix Trees’, Algorithmica, 14 (1995), 249–260.
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge: Cambridge University Press, 1997.
