Deucalion and Pie lemmatizers

From The Digital Classicist Wiki
Revision as of 16:42, 4 June 2019

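Available

  • Pie: https://github.com/emanjavacas/pie
  • Deucalion (with LASLA data): https://github.com/chartes/deucalion-model-lasla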

Author

  • Enrique Manjavacas
  • Mike Kestemont
  • Thibault Clérice

Description

Pie is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format, and it can be trained jointly on morphology, POS and lemmatization tasks. As of 2019, it seems to be one of the state-of-the-art lemmatizers in terms of results.
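To make the TSV training format more concrete, here is a minimal sketch of reading such a file with the Python standard library. It does not use Pie itself. The column names (token, lemma, pos, morph) and the tag values are illustrative assumptions, not the actual Pie or LASLA conventions; the real column layout is set by the training configuration.

```python
import csv
import io

# Hypothetical sample data: one token per line, one column per task.
# Column names and tag values are assumptions made for illustration only.
SAMPLE_TSV = (
    "token\tlemma\tpos\tmorph\n"
    "arma\tarma\tNOUN\tCase=Acc|Numb=Pl\n"
    "uirum\tuir\tNOUN\tCase=Acc|Numb=Sg\n"
    "cano\tcano\tVERB\tMood=Ind|Tense=Pres\n"
)


def read_rows(text):
    """Parse tab-separated annotations into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def task_columns(rows, token_column="token"):
    """Return every column except the token column, i.e. the tasks to learn jointly."""
    return [name for name in rows[0] if name != token_column]


rows = read_rows(SAMPLE_TSV)
tasks = task_columns(rows)
for row in rows:
    print(row["token"], {task: row[task] for task in tasks})
```

Each non-token column corresponds to one task, which is how a single file can serve joint training on lemmas, POS and morphology.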

Deucalion

Deucalion is:

  • a model for the lemmatizer Pie (.tar file on GitHub: https://github.com/chartes/deucalion-model-lasla/blob/master/lemma.split-morph.tar)
  • a web application that can be easily deployed to run a lemmatization service. It runs on Python 3 and Flask
  • a Docker image that makes running the service even simpler

In terms of statistics, the model was trained on a corpus of around 1.3 million tokens (June 2019). The full accuracy figures are given in the information folder of the image; the main ones are:

  • Lemmatization : 97.52 %
  • Part-Of-Speech: 96.55 %
  • Morphology
    • Voice : 99.18 %
    • Mood : 98.36 %
    • Degree : 98.30 %
    • Number : 97.88 %
    • Person : 99.18 %
    • Tense : 98.75 %
    • Tense : 93.74 %
    • Gender : 97.27 % (note that not all words are annotated for gender in the LASLA data; nouns in particular are not)
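The figures above are per-task accuracies. As a rough sketch of how such a score can be computed, the function below counts, for each task, the share of tokens whose predicted label matches the gold label. This is plain token-level accuracy; whether the reported figures were computed exactly this way is an assumption, and the function name and sample data are hypothetical.

```python
from collections import Counter


def per_task_accuracy(gold, predicted):
    """Return {task: fraction of tokens whose predicted label equals the gold label}.

    Both arguments are lists of dicts (one dict per token) mapping task name to label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    correct = Counter()
    total = Counter()
    for gold_token, predicted_token in zip(gold, predicted):
        for task, label in gold_token.items():
            total[task] += 1
            if predicted_token.get(task) == label:
                correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical toy data: four tokens, two tasks, one lemma error.
gold = [
    {"lemma": "arma", "pos": "NOUN"},
    {"lemma": "uir", "pos": "NOUN"},
    {"lemma": "que", "pos": "CONJ"},
    {"lemma": "cano", "pos": "VERB"},
]
predicted = [
    {"lemma": "arma", "pos": "NOUN"},
    {"lemma": "uirus", "pos": "NOUN"},
    {"lemma": "que", "pos": "CONJ"},
    {"lemma": "cano", "pos": "VERB"},
]

scores = per_task_accuracy(gold, predicted)
for task, accuracy in scores.items():
    print(f"{task}: {accuracy:.2%}")
```

Tokens that carry no gold annotation for a task (as with gender on nouns in the LASLA data) would have to be left out of that task's dictionaries so they do not enter the count.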

A version is hosted at the École des Chartes: https://dev.chartes.psl.eu/deucalion/models/lasla/

Bibliography

  • D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
  • D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
  • D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
  • D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847