The final version of my MVD paper has now appeared online. This hyperlink is permanent and can be used in citations. The paper reference is: Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online, International Journal of Human-Computer Studies, 67.6, 497-514.
I have also now submitted my thesis. The final title was 'Multiple Versions and Overlap in Digital Text'. Here's the abstract:
This thesis is unusual in that it tries to solve a problem that exists between two widely separated disciplines: the humanities (and to some extent also linguistics) on the one hand and information science on the other.
Chapter 1 explains why it is essential to strike a balance between study of the solution and problem domains.
Chapter 2 surveys the various models of cultural heritage text, from the remote past, through the coming of the digital era, to the present. It establishes why current models are outdated and need to be revised, and what significance such a revision would have.
Chapter 3 examines the history of markup in an attempt to trace how inadequacies of representation arose. It then examines two major problems in digital texts in cultural heritage and linguistics: overlapping hierarchies and textual variation. It assesses previously proposed solutions to both problems and explains why they are all inadequate. It argues that the overlapping hierarchies problem is a subset of the textual variation problem, and explains why markup cannot be the solution to either.
Chapter 4 develops a new data model for representing cultural heritage and linguistics texts, called a 'variant graph', which separates the natural overlapping structures from the content. It develops a simplified list-form of the graph that scales well as the number of versions increases. It also describes the main operations that need to be performed on the graph and explores their algorithmic complexities.
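To give a rough feel for the list form, here is a minimal sketch in Python. It is not the thesis implementation. It assumes that each entry in the list pairs a set of version identifiers with the fragment of text those versions share, so that one version is read by concatenating, in order, the fragments that belong to it. The names Pair and read_version, and the three sample sentences, are invented for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Pair:
    """One list entry: a text fragment plus the versions that contain it (assumed layout)."""
    versions: FrozenSet[int]
    text: str


def read_version(pairs: List[Pair], version: int) -> str:
    """Rebuild one version by joining, in list order, every fragment it belongs to."""
    return "".join(p.text for p in pairs if version in p.versions)


# Hypothetical three-version text. Shared fragments are stored once.
PAIRS = [
    Pair(frozenset({1, 2, 3}), "The "),
    Pair(frozenset({1, 2}), "quick "),
    Pair(frozenset({3}), "slow "),
    Pair(frozenset({1}), "brown "),
    Pair(frozenset({2, 3}), "red "),
    Pair(frozenset({1, 2, 3}), "fox"),
]

if __name__ == "__main__":
    for v in (1, 2, 3):
        print(v, read_version(PAIRS, v))
```

In this sketch reading a version is a single pass over the list, and text common to several versions is stored only once, which is the kind of property that lets a list representation stay compact as versions are added.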
Chapter 5 draws on research in bioinformatics and text processing to develop a greedy algorithm that aligns n versions with non-overlapping block transpositions in O(MN) time in the worst case, where M is the size of the graph and N is the length of the new version being added or updated. It shows how this algorithm can be applied to texts in corpus linguistics and the humanities, and tests an implementation of the algorithm on a variety of real-world texts.