Difference between revisions of "Deucalion and Pie lemmatizers"

From The Digital Classicist Wiki
m (cat)
 
Line 2: Line 2:
  
 
* Pie: https://github.com/emanjavacas/pie
 
* Deucalion (with LASLA data): https://github.com/chartes/deucalion-model-lasla
+
* Latin Model: https://github.com/PonteIneptique/latin-lasla-models
 +
* Pie-Extended: https://github.com/hipster-philology/nlp-pie-taggers
  
 
== Author ==
 
Line 14: Line 15:
 
'''Pie''' is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained on data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS and lemmatization tasks.
  
=== Deucalion ===
+
=== Pie Extended ===
  
Deucalion is:
+
Pie-Extended is an extension built on top of Pie to ease its use as a tagger: it handles model downloads, tokenization, and pre- and post-processing. It requires Python > 3.6 and only basic knowledge of installing Python libraries and using a command-line interface.
  
* a model for the lemmatizer Pie ([https://github.com/chartes/deucalion-model-lasla/blob/master/lemma.split-morph.tar .tar file on github])
+
=== Deucalion (now Flask Pie) ===
* a web application that can be easily deployed to run a lemmatization service. It runs on Python 3 and Flask
 
* a [https://hub.docker.com/r/ponteineptique/deucalion-model-lasla Docker image] that makes running it even simpler
 
  
In terms of statistics, the model was trained on around 1.3 million tokens (June 2019). The accuracies are described in the [https://github.com/chartes/deucalion-model-lasla/tree/master/information information] folder of the image, but we can note the following:
+
Flask-Pie (previously known as Deucalion) provides adapters to serve Pie models over HTTP.
 
 
* Lemmatization: 97.52 %
* Part-Of-Speech: 96.55 %
* Morphology
** Voice: 99.18 %
** Mood: 98.36 %
** Degree: 98.30 %
** Number: 97.88 %
** Person: 99.18 %
** Tense: 98.75 %
** Tense: 93.74 %
** Gender: 97.27 % (Note that not all words are annotated for gender in the LASLA data, notably not the nouns)
 
 
 
A version is hosted at [https://dev.chartes.psl.eu/deucalion/models/lasla/ the École des Chartes]
 
  
 
== Bibliography ==
 

Latest revision as of 11:40, 22 September 2020

Available

Author

  • Enrique Manjavacas
  • Mike Kestemont
  • Thibault Clérice

Description

Pie is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained on data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS and lemmatization tasks.
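To make the TSV training format more concrete, here is a minimal sketch of how such annotated data could be read with Python's standard library. The column names (token, lemma, POS, morph) and the tag values in the sample are assumptions chosen for illustration; the actual layout depends on the corpus and on the training configuration.

```python
import csv
import io

# Hypothetical sample: column names and tag values are assumptions,
# not the layout of any particular corpus.
SAMPLE_TSV = (
    "token\tlemma\tPOS\tmorph\n"
    "arma\tarma\tNOUN\tCase=Acc|Numb=Pl\n"
    "uirum\tuir\tNOUN\tCase=Acc|Numb=Sg\n"
    "cano\tcano\tVERB\tMood=Ind|Tense=Pres\n"
)


def read_annotations(handle):
    """Yield one dict per annotated token from a tab-separated file."""
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


rows = list(read_annotations(io.StringIO(SAMPLE_TSV)))
lemmas = [row["lemma"] for row in rows]
print(lemmas)
```

Each row carries the annotations for one token, which is what allows a single file to feed the lemmatization, POS and morphology tasks at the same time.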

Pie Extended

Pie-Extended is an extension built on top of Pie to ease its use as a tagger: it handles model downloads, tokenization, and pre- and post-processing. It requires Python > 3.6 and only basic knowledge of installing Python libraries and using a command-line interface.

Deucalion (now Flask Pie)

Flask-Pie (previously known as Deucalion) provides adapters to serve Pie models over HTTP.
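As an illustration of what querying such an HTTP service might look like from the client side, here is a minimal sketch using only the standard library. The host, the "/api/" path, the "data" query parameter and the tab-separated response are all assumptions made for the example; check the documentation of the deployed service for its actual interface.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical endpoint: host, path and parameter name are assumptions.
BASE_URL = "http://localhost:5000/api/"


def build_request_url(text, base_url=BASE_URL):
    """Build a GET URL that passes the text to tag as a query parameter."""
    return base_url + "?" + urlencode({"data": text})


def parse_tagger_response(body):
    """Parse a tab-separated response body into a list of dicts."""
    reader = csv.DictReader(io.StringIO(body), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


url = build_request_url("arma uirumque cano")

# Hypothetical response body, used here instead of a real network call.
mock_body = "token\tlemma\narma\tarma\n"
annotations = parse_tagger_response(mock_body)
print(url)
print(annotations)
```

Against a running server, the URL could be fetched with urllib.request.urlopen and the decoded body passed to parse_tagger_response.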

Bibliography

  • D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
  • D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
  • D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
  • D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847