Deucalion and Pie lemmatizers

From The Digital Classicist Wiki
Revision as of 17:39, 4 June 2019 by ThibaultClerice (Reorganization proposal)



  • Enrique Manjavacas
  • Mike Kestemont
  • Thibault Clérice


Pie is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of accuracy. It can be trained jointly on morphology, POS and lemmatization tasks.
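As an illustration of what tab-separated training data can look like, here is a minimal sketch using only the Python standard library. The column names (token, lemma, pos) and the tag values are assumptions made for this example; the header actually expected depends on the Pie configuration and on the corpus used, so check the Pie documentation before preparing real data.

```python
import csv
import io

# Hypothetical sample: column names and tag values are illustrative only,
# not the layout required by Pie or used in the LASLA data.
SAMPLE_TSV = (
    "token\tlemma\tpos\n"
    "arma\tarma\tNOUN\n"
    "virumque\tuir\tNOUN\n"
    "cano\tcano\tVERB\n"
)


def read_annotations(text):
    """Parse tab-separated annotations into a list of dicts, one per token."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


rows = read_annotations(SAMPLE_TSV)

# Every column other than the token itself is treated as an annotation task.
tasks = [name for name in rows[0] if name != "token"]

for row in rows:
    print(row["token"], {task: row[task] for task in tasks})
```

Each row carries one token and one value per task, which is the shape of data that joint training on lemma, POS and morphology relies on.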


Deucalion is:

  • a model for the lemmatizer Pie (.tar file on GitHub)
  • a web application that can be easily deployed to run a lemmatization service. It runs on Python 3 and Flask
  • a Docker image that makes running it even simpler

In terms of statistics, the model was trained on a corpus of around 1.3 million tokens (June 2019). The accuracies are described in the information folder of the Docker image; the main figures are:

  • Lemmatization : 97.52 %
  • Part-of-Speech : 96.55 %
  • Morphology
    • Voice : 99.18 %
    • Mood : 98.36 %
    • Degree : 98.30 %
    • Number : 97.88 %
    • Person : 99.18 %
    • Tense : 98.75 %
    • Tense : 93.74 %
    • Gender : 97.27 % (Note that not all words are annotated for gender in the LASLA data; nouns in particular are not)
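To make figures like these easier to read, here is a minimal sketch of a per-task accuracy computation. It assumes plain token-level accuracy, that is, the share of tokens whose predicted label equals the gold label. The page does not state how the Deucalion scores were computed, so this is an illustration of the general measure and not the project's evaluation script. The sample labels are invented.

```python
def accuracy(gold, predicted):
    """Return token-level accuracy as a percentage.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Invented example: one lemma out of four is wrong (case mismatch).
gold_lemmas = ["arma", "uir", "cano", "Troia"]
pred_lemmas = ["arma", "uir", "cano", "troia"]

score = accuracy(gold_lemmas, pred_lemmas)
print(f"Lemmatization : {score:.2f} %")
```

The same function can be applied to each annotation column in turn (lemma, POS, each morphological feature) to obtain one score per task.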

A version is hosted at the École des Chartes.


  • D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
  • D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
  • D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
  • D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo.
  • Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo.