Deucalion and Pie lemmatizers: Difference between revisions

From The Digital Classicist Wiki
(Deucalion and Pie page creation)
 
(Add link for Deucalion as on online service)
 
(8 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== Pie ==
== Available ==


[https://github.com/emanjavacas/pie Pie] is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS, and lemmatization tasks.
* Pie: https://github.com/emanjavacas/pie
* Latin Model: https://github.com/PonteIneptique/latin-lasla-models
* Pie-Extended: https://github.com/hipster-philology/nlp-pie-taggers
* Deucalion, a web interface for Flask-Pie: https://dh.chartes.psl.eu/deucalion/ (Ancient Greek and Latin, as well as Old French, Modern French, Early Modern French, and Middle Dutch)


== Deucalion ==
== Author ==


[https://github.com/chartes/deucalion-model-lasla Deucalion (with LASLA data)] is:
* Enrique Manjavacas
* Mike Kestemont
* Thibault Clérice


* a model for the lemmatizer Pie ([https://github.com/chartes/deucalion-model-lasla/blob/master/lemma.split-morph.tar .tar file on github])
== Description ==
* a web application that can be easily deployed to run a lemmatization service. It runs on Python 3 and Flask
* a [https://hub.docker.com/r/ponteineptique/deucalion-model-lasla Docker image] that makes running it even simpler


In terms of statistics, the model was trained on around 1.3 million tokens (June 2019). The accuracies are described in the [https://github.com/chartes/deucalion-model-lasla/tree/master/information information] folder of the image, but the following figures are worth noting:
'''Pie''' is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS, and lemmatization tasks.


* Lemmatization: 97.52 %
=== Pie Extended ===
* Part-Of-Speech: 96.55 %
 
* Morphology
Pie-Extended is an extension built on top of Pie to ease its use as a tagger: it handles model downloads, tokenization, and pre- and post-processing. It requires Python > 3.6 and basic familiarity with installing Python libraries and using a command-line interface.
** Voice : 99.18 %
 
** Mood : 98.36 %
=== Deucalion (now Flask Pie) ===
** Degree : 98.30 %
 
** Number : 97.88 %
Flask-Pie (previously known as Deucalion) provides adapters to serve Pie models over HTTP.
** Person : 99.18 %
** Tense : 98.75 %
** Tense : 93.74 %
** Gender : 97.27 % (note that not all words are annotated for gender in the LASLA data; nouns in particular are not)
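
The page does not say how these figures were computed. As a rough illustration only, the sketch below assumes plain token-level accuracy (the share of tokens whose predicted label equals the gold label). The function name and the sample lemmas are hypothetical and are not taken from the Deucalion or Pie code.

```python
def accuracy(gold, predicted):
    """Return token-level accuracy as a percentage.

    Assumption: one gold label and one predicted label per token,
    aligned by position. This is not the project's own evaluation code.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Hypothetical lemma sequences for four tokens; one prediction is wrong.
gold_lemmas = ["arma", "uir", "cano", "Troia"]
predicted_lemmas = ["arma", "uir", "cano", "Troiae"]

lemma_accuracy = accuracy(gold_lemmas, predicted_lemmas)
print(f"Lemmatization: {lemma_accuracy:.2f} %")
```

The same function could be applied per task (lemma, POS, or a single morphological category) to obtain one figure per line of a list like the one above.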


A version is hosted at [https://dev.chartes.psl.eu/deucalion/models/lasla/ the École des Chartes]
== Bibliography ==


* D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
Line 35: Line 32:
* Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537  
* Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847
[[category:lemmatisation]]
[[category:tools]]
[[category:linguistics]]

Latest revision as of 16:23, 28 May 2023

Available

Author

  • Enrique Manjavacas
  • Mike Kestemont
  • Thibault Clérice

Description

Pie is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it appears to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS, and lemmatization tasks.
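
To give an idea of what annotated data in TSV format can look like, here is a minimal sketch that reads a small tab-separated sample using only the Python standard library. The column names (token, lemma, pos), the tag values, and the sample rows are assumptions made for illustration; the Pie documentation remains the reference for the exact layout it expects.

```python
import csv
import io

# Hypothetical sample: one token per line, tab-separated, with a header row.
# Column names and POS tags are illustrative assumptions, not Pie's schema.
SAMPLE_TSV = (
    "token\tlemma\tpos\n"
    "arma\tarma\tNOUN\n"
    "uirum\tuir\tNOUN\n"
    "cano\tcano\tVERB\n"
)


def read_annotated_tsv(text):
    """Parse tab-separated annotations into a list of dicts, one per token."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dict(row) for row in reader]


rows = read_annotated_tsv(SAMPLE_TSV)
for row in rows:
    print(row["token"], "->", row["lemma"], f"({row['pos']})")
```

Keeping each task in its own column is what makes it possible to train on lemma, POS, and morphology together from a single file.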

Pie Extended

Pie-Extended is an extension built on top of Pie to ease its use as a tagger: it handles model downloads, tokenization, and pre- and post-processing. It requires Python > 3.6 and basic familiarity with installing Python libraries and using a command-line interface.

Deucalion (now Flask Pie)

Flask-Pie (previously known as Deucalion) provides adapters to serve Pie models over HTTP.

Bibliography

  • D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
  • D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
  • D. Longrée & C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner, éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
  • D. Longrée & Poudat C., « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847