A Corpus-based Approach to Philological Issues (Boschetti)

  • Author: Federico Boschetti
  • PhD in Cognitive and Brain Sciences, University of Trento
  • Completed: 2010


The aim of this work is to apply techniques developed in corpus linguistics to a collection of ancient Greek texts, taking into account not only the canonical text established by modern editors, but also the variant readings recorded in the critical apparatus or in repertories of conjectures. The dissertation is divided into three connected parts: construction, mapping and analysis of the corpus. The first part is devoted to corpus construction and focuses on techniques for improving OCR accuracy on classical critical editions. This task is challenging because critical editions are multilingual, the set of characters to be recognized is large, and the quality of the paper used in past centuries is variable. Three OCR engines are applied to the same texts, and a Bayesian classifier, combined with a dedicated spell-checker, selects the most probable output. The improvement in accuracy is shown to be significant, exceeding 3% in the best cases. The second part is devoted to aligning the contents extracted from critical apparatuses and repertories of conjectures with the reference text. A parser has been developed to classify the chunks of information (verse number, Greek word sequences, textual operation, scholar who suggested the conjecture). The alignment algorithms used to find the precise position of each conjecture in its context are illustrated in detail. The third part is devoted to the study of the semantic spaces of ancient Greek. The chapter focuses on the specific features of the corpus, which is morphologically complex, literary (both poetry and prose) and diachronic (from the 8th century B.C. to the 15th century A.D.). Word senses in documents belonging to different genres are explored, and diachronic changes of meaning are observed. Finally, a couple of meaningful conjectures extracted in the first part are analysed, and their most interesting reciprocal relations in the semantic space are evaluated.
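
The idea of combining three OCR outputs with a Bayesian classifier can be illustrated with a minimal sketch. It is not the method of the dissertation, whose features and training data are not described in this abstract. It assumes that the engine outputs have already been aligned character by character, that a small hand-corrected sample is available for training, and that a naive Bayes model with add-one smoothing is adequate. The names `train` and `most_probable_char` are hypothetical, and the spell-checking step is omitted.

```python
import math
from collections import Counter, defaultdict


def train(aligned_samples, n_engines, alpha=1.0):
    """Estimate priors and per-engine confusion counts.

    aligned_samples: iterable of (true_char, (out_1, ..., out_n)),
    i.e. hand-corrected characters paired with what each engine read.
    """
    prior = Counter()
    confusion = [defaultdict(Counter) for _ in range(n_engines)]
    alphabet = set()
    for truth, outputs in aligned_samples:
        prior[truth] += 1
        alphabet.add(truth)
        for engine, seen in enumerate(outputs):
            confusion[engine][truth][seen] += 1
            alphabet.add(seen)
    return prior, confusion, sorted(alphabet), alpha


def most_probable_char(outputs, model):
    """Return the character maximising P(char) * prod_e P(output_e | char)."""
    prior, confusion, alphabet, alpha = model
    total = sum(prior.values())
    size = len(alphabet)
    best, best_score = None, -math.inf
    for candidate in alphabet:
        # Smoothed log prior of the candidate character.
        score = math.log((prior[candidate] + alpha) / (total + alpha * size))
        for engine, seen in enumerate(outputs):
            counts = confusion[engine][candidate]
            # Smoothed log likelihood that this engine reads `seen`
            # when the true character is `candidate`.
            score += math.log(
                (counts[seen] + alpha) / (sum(counts.values()) + alpha * size)
            )
        if score > best_score:
            best, best_score = candidate, score
    return best


# Toy training data (invented for illustration): engine 0 reads "o" as "c"
# half of the time, while engines 1 and 2 are reliable on both characters.
samples = (
    [("o", ("c", "o", "o"))] * 3
    + [("o", ("o", "o", "o"))] * 3
    + [("c", ("c", "c", "c"))] * 6
)
model = train(samples, n_engines=3)
choice = most_probable_char(("c", "o", "o"), model)
```

Unlike a simple majority vote, a model of this kind weights each engine by how often it has confused a given pair of characters, so an engine that is unreliable on one glyph can still be trusted on others.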

