The MEDITE Project

MEDITE is a text comparison tool that grew out of a collaboration between literary scholars and AI researchers.

More precisely, the EDITE project, funded by the CNRS information society program, brought together the Institut des Textes et Manuscrits Modernes (ITEM) and the ACASA team of the LIP6 laboratory. MEDITE, i.e. the EDITE Machine, was originally built to meet the needs of textual genetic criticism. At its core it is simply an efficient monolingual aligner, but it has since proved useful in many other applications (scholarly publishing, machine translation, computerized epistemology, etc.).

Textual genetic criticism is a discipline that studies the drafts authors produce during the writing process. MEDITE's first aim was to align two linearized transcriptions of such drafts (two texts) in order to expose the invariants and the differences between them. We found that, for texts with many repetitions, existing aligners (version comparison tools) failed to produce correct alignments. This is due to masking phenomena, which appear when the pairing of two text blocks masks, and therefore prevents, the pairing of other identical blocks. MEDITE addresses this problem.
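To make the ambiguity behind masking concrete, here is a toy sketch in Python. It is only an illustration, not MEDITE's own procedure: the helper names `occurrences` and `candidate_pairings` and the sample sentences are hypothetical. It lists every way a repeated block in one text could be paired with an identical block in the other; an aligner that commits to one of these pairings too early may rule out the others.

```python
def occurrences(text, block):
    """Start offsets of every (possibly overlapping) occurrence of block in text."""
    starts = []
    pos = text.find(block)
    while pos != -1:
        starts.append(pos)
        pos = text.find(block, pos + 1)
    return starts


def candidate_pairings(a, b, block):
    """All (offset_in_a, offset_in_b) pairs at which block could be aligned."""
    return [(i, j) for i in occurrences(a, block) for j in occurrences(b, block)]


# Hypothetical sample texts in which the block "to be" is repeated.
text_a = "to be or not to be"
text_b = "not to be, to be sure"
pairings = candidate_pairings(text_a, text_b, "to be")
print(pairings)  # [(0, 4), (0, 11), (13, 4), (13, 11)]
```

With two occurrences on each side there are already four candidate pairings for a single block, and the number grows quickly with the amount of repetition.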

MEDITE is built on an original sequence alignment algorithm, based on the conceptual framework of edit distance with moves. It detects deleted, inserted, replaced, moved and invariant character blocks, and aligns the last three block types pairwise. The first step of the algorithm detects maximal exact matches (MEMs): identical stretches shared by the two texts that cannot be extended to the left or to the right without ceasing to be identical. MEMs are either invariant or moved blocks, and are identified by searching the space of possible alignments with an A* procedure that minimizes the size of the moved blocks. The whole process is then applied recursively between each pair of aligned invariant blocks, in order to detect smaller blocks and to avoid masking phenomena. Finally, since deleted, inserted and replaced blocks are non-repeated blocks, they are deduced from the alignment.
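The notion of a maximal exact match can be sketched in a few lines of Python. This is a naive quadratic dynamic-programming version written for illustration only; it is an assumption that it differs from MEDITE's actual implementation, and the function name `maximal_exact_matches` and the `min_len` threshold are hypothetical. It covers only the first step (finding MEMs), not the A* search or the recursion.

```python
def maximal_exact_matches(a, b, min_len=3):
    """Return sorted (start_in_a, start_in_b, length) for each maximal exact match.

    A match is maximal when it cannot be extended to the left or to the
    right without the two stretches ceasing to be identical.
    """
    matches = []
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # The run length counts the whole diagonal, so the match
                # is left-maximal by construction.
                cur[j] = prev[j - 1] + 1
                # Right-maximal: a text ends here or the next characters differ.
                right_max = i == len(a) or j == len(b) or a[i] != b[j]
                if right_max and cur[j] >= min_len:
                    matches.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return sorted(matches)


# Hypothetical sample: the two halves of the sentence are swapped.
mems = maximal_exact_matches("the cat sat", "sat the cat")
print(mems)  # [(0, 4, 7), (5, 1, 3), (8, 0, 3)]
```

Note that the short match `"at "` (offset 5 in the first text) overlaps the longer match `"the cat"`. Choosing among such overlapping and repeated candidates is the part that a later search step has to resolve.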

To visualize the results, the two texts are presented side by side in a two-panel GUI. Deletions, insertions and replacements are each highlighted in a specific color. Moves are underlined, which makes it possible to see, for instance, moves inside insertions. Invariants stay black on white. Pairwise aligned blocks are linked together, and a simple mouse click brings them side by side.

MEDITE has been compared with other version comparison tools, the best known being the one built into Microsoft Word. Only MEDITE was able to align difficult texts correctly and to overcome masking phenomena. Moreover, the visualization interfaces of the other tools are often poorly suited to the task, which makes their results harder to interpret.

Because our algorithm is sequence based, it is language independent and can process any language without specific resources. For instance, it can process Arabic texts. Furthermore, it is character based, so it detects intra-word modifications, which is very useful for inflected languages. Move detection is a kind of knowledge discovery, as it exposes block correlations between the texts.
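The benefit of working on characters rather than words can be illustrated with Python's standard `difflib` module. This is only a stand-in used for the comparison: `difflib` is not MEDITE's algorithm and does not detect moves, and the helper name `changes` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def changes(old, new):
    """Return (tag, old_piece, new_piece) for every non-matching opcode."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old, new = "she walks home", "she walked home"

# Character-level comparison isolates the modified ending of the word.
char_level = changes(old, new)
print(char_level)  # [('replace', 's', 'ed')]

# Word-level comparison can only report that the whole word was replaced.
word_level = changes(old.split(), new.split())
print(word_level)  # [('replace', ['walks'], ['walked'])]
```

The character-level result shows that the stem `walk` is unchanged and only the ending differs, which is the kind of intra-word modification a word-based comparison hides.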

MEDITE is now used by philologists working in textual genetic criticism and by epistemologists working on the history of ideas. It enables them to study longer texts and to discover, more systematically, the transformations between an author's drafts. They can then build a diachronic corpus of an author's oeuvre. We now plan to embed MEDITE in digital library platforms.

References

  1. Fenoglio I., Ganascia J.-G.: "Le logiciel MEDITE: approche comparative de documents de genèse", in L'édition du manuscrit - De l'archive de création au scriptorium électronique, Aurèle Crasson, Academia A|B Bruylant, col. Au coeur des textes, n°10, pp. 209-228, (2008).

  2. Fenoglio I., Ganascia J.-G.: "MEDITE: un logiciel pour l'approche comparative de documents de genèse", Revue Genesis, pp. 166-168, 2007 (in French)

  3. Ganascia J.-G., Bourdaillet J.: "Alignements unilingues avec MEDITE", Actes des Huitièmes Journées Internationales d’Analyse Statistique des Données Textuelles, 2006. (in French)

  4. Ganascia J.-G., Fenoglio I., Lebrave J.-L.: "Manuscrits, genèse et documents numérisés. EDITE : une étude informatisée du travail de l’écrivain", revue Document numérique, special issue on « temps et document », 2005 (in French)

  5. Ganascia J.-G.: "On the Supposed Neo-Structuralism of Hypertext", Diogenes N°196, September 2002, Issue 4, Blackwell Publishing Ltd.

  6. Ganascia J.-G.: "EDITE-MEDITE, un passage des versions aux variantes", actes du XIVième congrès International de Linguistique et de Philologie Romanes, August 2004, Aberystwyth, Wales, United Kingdom, Max Niemeyer Verlag, September 2007

  7. Bourdaillet J., Ganascia J.-G.: "Alignements monolingues avec déplacements", 14e Conférence sur le Traitement Automatique des Langues Naturelles. (in French)

  8. Bourdaillet J., Ganascia J.-G., Fenoglio I.: "Machine Assisted Study of Writers' Rewriting Processes", 4th International Workshop on Natural Language Processing and Cognitive Science (NLPCS), Madeira, Portugal

  9. Bourdaillet J., Ganascia J.-G.: "Practical block sequence alignment with moves", LATA 2007, International Conference on Language and Automata Theory and Applications, 30 March – April 2007.

  10. Bourdaillet J., Ganascia J.-G.: "Alignment of Noisy Unstructured Text Data", IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, January 8, 2007