The Digital Classicist Wiki - User contributions [en-gb]

Pyrrha

2020-09-22T11:48:28Z

ThibaultClerice:

==Available==

* https://github.com/hipster-philology/pyrrha
* https://dh.chartes.psl.eu/pyrrha

==Authors==

* Julien Pilla
* Thibault Clérice

==Description==

Pyrrha is a web application built to speed up and secure the post-correction or annotation of morphological and lemmatization data. It features:

* Corpus creation and sharing: you can work with multiple partners, with a fine-grained history which allows for review
* Control Lists, which are sets of allowed values for each known task (POS, lemma and morphology)
* Serial Edition: the software looks for similar tokens when an edit is performed, in case the same error appears more than once in the corpus. The end user has the final choice.
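
The control-list and serial-edition ideas can be pictured with a small Python sketch. This is not Pyrrha's actual code: the <code>Token</code> class, the <code>ALLOWED_POS</code> set, the tag names and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical control list: the only POS values annotators may use.
ALLOWED_POS = {"NOM", "VER", "ADJ"}


@dataclass
class Token:
    form: str
    lemma: str
    pos: str


def is_allowed(value, control_list):
    """Check a proposed value against a control list."""
    return value in control_list


def similar_tokens(corpus, index):
    """Indices of other tokens with the same form and annotation as corpus[index]."""
    target = corpus[index]
    return [i for i, token in enumerate(corpus) if i != index and token == target]


def apply_correction(corpus, indices, lemma, pos):
    """Apply one correction to every index the user has confirmed."""
    if not is_allowed(pos, ALLOWED_POS):
        raise ValueError(f"POS {pos!r} is not in the control list")
    for i in indices:
        corpus[i].lemma = lemma
        corpus[i].pos = pos


corpus = [
    Token("canis", "cano", "VER"),
    Token("latrat", "latro", "VER"),
    Token("canis", "cano", "VER"),
]

# Look for identical tokens before editing token 0, then let the user decide.
suggestions = similar_tokens(corpus, 0)
apply_correction(corpus, [0] + suggestions, "canis", "NOM")
```

In this sketch every suggestion is accepted; in an interactive tool the user would confirm or reject each one.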

The application is free and open-source, and is continuously evolving. Only one instance is known at the moment, hosted by the École Nationale des Chartes: https://dh.chartes.psl.eu/pyrrha

[[category:tools]]
[[category:lemmatisation]]
[[category:morphology]]

Deucalion and Pie lemmatizers

2020-09-22T11:40:13Z

ThibaultClerice:

== Available ==

* Pie: https://github.com/emanjavacas/pie
* Latin Model: https://github.com/PonteIneptique/latin-lasla-models
* Pie-Extended: https://github.com/hipster-philology/nlp-pie-taggers

== Authors ==

* Enrique Manjavacas
* Mike Kestemont
* Thibault Clérice

== Description ==

'''Pie''' is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it seems to be one of the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS and lemmatization tasks.
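
As an illustration of what tab-separated annotation data can look like, here is a minimal Python sketch that reads such a file with the standard library. The column names (<code>token</code>, <code>lemma</code>, <code>pos</code>) and the tag values are assumptions for the example; check the Pie documentation for the layout it actually expects.

```python
import csv
import io

# Hypothetical sample: one token per line, columns separated by tabs.
SAMPLE_TSV = (
    "token\tlemma\tpos\n"
    "arma\tarma\tNOM\n"
    "uirum\tuir\tNOM\n"
    "cano\tcano\tVER\n"
)


def read_annotations(handle):
    """Read a TSV file with a header row into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


rows = read_annotations(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["token"], "->", row["lemma"], row["pos"])
```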

=== Pie Extended ===

Pie-Extended is an extension built on top of Pie to ease its use as a tagger: it handles the downloading of models, tokenization and pre-/post-processing. It requires Python > 3.6 and just enough knowledge to install Python libraries and use a command-line interface.

=== Deucalion (now Flask Pie) ===

Flask-Pie (previously known as Deucalion) provides adapters to serve Pie models over HTTP.

== Bibliography ==

* D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
* D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
* D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
* D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
* Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
* Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847

[[category:lemmatisation]]
[[category:tools]]
[[category:programming]]

Greek and Latin texts in digital form

2020-09-22T11:33:16Z

ThibaultClerice: /* Literary texts */ Quick edit to add a resource (digilibLT): the text is directly drawn from their website, as I felt it states quite well what it is.

== Where can I find collections of Greek and Latin texts? ==

There are a few aspects to this question:

# searchable databases of Greek and Latin texts which one can query to find instances of words in context, statistical and linguistic examples, etc.
# collections of Greek and Latin texts available for downloading and/or copy and pasting into articles, handouts, etc.
# Greek and Latin texts with translations, useful for translation and contrastive linguistic studies.

The list on this page includes some major corpora of ancient Greek and Latin texts.

== Literary texts ==

=== TLG ===

The [[TLG]] is a huge collection of encoded ancient and mediaeval Greek texts. There is a (larger and more updated) online version and an older CD-Rom version:

# By far the best way to use the TLG is to buy a license for the ''TLG Online'', but an institutional license is expensive and not all departments will be willing to pay for one. (See http://www.tlg.uci.edu/lic.html for information.) A personal license is more affordable, but cannot be shared, mounted on a department machine, etc. If you have the site license, you can use these from any fixed IP machines (i.e. on-campus, e.g. in your office, a computer lab, etc.) that you have registered with the TLG. I think the way this is calculated is that the more machines you register, the more the license costs.
# Departments that still have the old ''CD-Rom #E'' (last updated in 2000) find that this is cheaper, but it is not as good: older texts, less coverage (mostly confined to the "classical" age), no updates. You also have to acquire third-party software (though this is not necessarily expensive). See also [[Search_the_TLG_and_PHI_databases]].

=== Perseus ===

[[Perseus]] have a fair collection of canonical Greek and Latin texts, limited in number, but very richly enhanced by sophisticated tools and search engines, parallel original and translated versions, dictionary and search tools, statistics, morphological parsing, mythological encyclopaedia, etc.

=== PHI 5 ===

The [[Packard Humanities Institute]] (PHI) Latin Library, now in version 5.3, is a CD-Rom with full Latin texts and Bible versions up to the second century AD. This is probably the standard research tool, as it is readily available in libraries and departments. Like the TLG, it needs search software to make it work, such as [http://www.musaios.com/ Musaios] or [http://www.dur.ac.uk/p.j.heslin/Software/Diogenes/ Diogenes] (the latter is free of charge and open source). There is now an online version of [[PHI Latin Texts]].

=== IntraText ===

An Italian project, [[IntraText Digital Library]] (http://www.intratext.com/LAT/), has a quite extensive collection of freely accessible, searchable Latin texts (ancient, medieval and newer), linked to its concordances and enhanced with basic text-analytical data. It is simpler and more static than Perseus, but also faster to load (at least from Europe), and it is somewhat more sophisticated than the Latin Library, whose texts it often re-uses.

=== Bibliotheca Teubneriana Latina (BTL) ===

The [[Bibliotheca Teubneriana Latina Online]] is the electronic version of the ''Bibliotheca scriptorum Romanorum Teubneriana''. [http://www.brepols.net/Pages/Search.aspx?subject=543 Versions 1 to 4] were on CD. The current BTL Online database provides electronic access (by subscription) to all editions of Latin texts published in the Bibliotheca Teubneriana (without preface or critical apparatus), ranging from antiquity and late antiquity to medieval and neo-Latin texts, for a total of approximately 13 million word forms.

=== Library of Latin Texts (LLT) ===

The [[Library of Latin Texts]] started in 1991 as the CETEDOC Library of Christian Latin texts (CLCLT), on CD-ROM and then DVD-ROM. In 2002 the name was changed to LLT as it included classical and post-classical Latin texts. In 2009 it split into two different online resources. LLT-A is the direct continuation of the previous project, including digital editions of Latin texts mostly taken from Teubner editions and published with an accurate philological revision by [http://www.corpuschristianorum.org/centres/turnhout.html CLTLO] (formerly CETEDOC) under the direction of Paul Tombeur. LLT-B is meant as a faster-growing supplement to LLT-A. The texts of LLT-B are drawn directly from printed scholarly editions, while much of the revision work is dropped and precedence is given to large, homogeneous corpora of texts. The two collections do not differ in time scope (spanning from Classical antiquity to Neo-Latin texts until 1965, including decrees from the Vatican II Council), but in publication practices and philological standards. The quantity of texts and their overall philological quality are outstanding. Access to the collections, however, is by paid subscription through the [http://www.brepolis.net/ Brepolis platform] (more information on the collections is in their [http://www.brepolis.net/BRP_Info_En.html?show=info Database information page]).

=== Digital Library of Late Antique Latin Texts (digilibLT) ===

The [https://digiliblt.uniupo.it/ Digital Library of Late-Antique Latin Texts — DigilibLT —] publishes secular prose texts written in Latin in late antiquity (from the second to the seventh century AD). The texts are annotated according to the XML-TEI standards and are offered free of charge to the public for reading and research. Since its creation in 2010, DigilibLT has achieved significant goals: more than 300 texts have already been uploaded, a number that is constantly increasing. Texts are searchable, browsable and downloadable (requires a free account).

== Inscriptions and papyri ==

=== PHI 7 ===

The Packard Humanities Institute also has a CD-Rom (7.0) of Greek inscriptions and documentary papyri: this is in the same format as the TLG CD-Rom, and needs the same third-party software to search. However, the Greek inscriptions are also freely available online at http://epigraphy.packhum.org/inscriptions/.

=== Papyri.info ===

The documentary papyri (also on the PHI CD-Rom) can be searched freely online at [[Papyri.info]], with TEI EpiDoc XML available for download on GitHub at https://github.com/papyri/idp.data/.
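
Once downloaded, such XML files can be read with ordinary XML tools. The following Python sketch extracts the plain text of the edition division from a TEI document. The sample document is invented for the example and much simpler than a real EpiDoc file, and the helper name <code>edition_text</code> is an assumption.

```python
import xml.etree.ElementTree as ET

# TEI namespace, needed to address elements in a TEI document.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, heavily simplified sample; real files carry a full TEI header.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="grc">
      <ab><lb n="1"/>χαίρειν <lb n="2"/>ἔρρωσο</ab>
    </div>
  </body></text>
</TEI>"""


def edition_text(xml_string):
    """Return the text of the first <div type="edition">, whitespace-normalised."""
    root = ET.fromstring(xml_string)
    div = root.find(".//tei:div[@type='edition']", TEI_NS)
    if div is None:
        return ""
    # itertext() walks all text nodes, including text that follows <lb/> elements.
    return " ".join("".join(div.itertext()).split())


print(edition_text(SAMPLE))
```

This sketch flattens everything to a string and ignores editorial markup such as supplied or abbreviated text, which a real workflow would need to handle.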

=== EDH for Latin inscriptions ===

For Latin inscriptions the [http://www.uni-heidelberg.de/institute/sonst/adw/edh/ Epigraphische Datenbank Heidelberg] is probably the largest searchable corpus, although there are others, some connected to the [[EAGLE]] project, others not.

=== Epigraph CD-Rom ===

[[Epigraph]] - a CD database of the Roman inscriptions of Vol VI of the Corpus Inscriptionum Latinarum. It is fully searchable, allowing searches on inscription numbers, text strings, cognomina, Greek text, numerals, Claudian letters, ligatures, reversed letters, short letters and tall letters.

=== Mycenaean documents ===

[https://www2.hf.uio.no/damos/ DAMOS - Database of Mycenaean at Oslo] provides a searchable corpus of all the published Mycenaean texts in transcription.

=== Other epigraphical and papyrological collections ===

In the foreseeable future [[Papyri.info]] should constitute a hub to most of the digital collections of papyri available. For more collections, see the pages listed in the [[:Category:Epigraphy]] and [[:Category:Papyrology]] sections of this wiki.

== Other collections ==

=== Some specialist collections ===

# [[Aristoteles Latinus]] (ALD), which is an electronic version of the printed series containing the complete corpus of the medieval translations of the works of Aristotle;
# [[Archive of Celtic-Latin Literature]] (ACLL), "A full-text database of the corpus of Latin literature produced in Celtic-speaking Europe from the period 400-1200 A.D." ([http://www.brepols.net/publishers/pdf/Brepolis_ACLL_EN.pdf 2010 Brepolis Flyer], PDF file). Access to both resources is granted (by paying subscription) by the online platform [http://apps.brepolis.net/BrepolisPortal/default.aspx Brepolis] ([http://www.brepolis.net/BRP_Info_En.html?show=info more information] on the collections).
# [[Biblioteca Iuris Antiqui]], a CD of all the main Latin juridical texts. It includes editions of the texts and a bibliography on Roman law. Also useful is its thesaurus of over 8000 terms relating to ancient law.

=== Downloadable (not searchable) texts ===

# The [[Latin Library]] (http://www.thelatinlibrary.com/) has a simple to find and easy to download comprehensive collection of Latin texts. These are all texts collected from the public domain, have no critical apparatus or other indications of editions etc and so are not intended for research but nevertheless are convenient and available. This is made clear if you read the notes at the bottom of the home page.
# [http://www.perseus.tufts.edu/ Perseus] (see above) have a considerable range of both Greek and Latin texts, some with multiple editions. When downloading texts, remember to switch off all the hyperlinks (go to 'Configure display' / Word Study Links and select no), otherwise they will be downloaded as well. Translations are also available, although sometimes in antiquated and stilted English. See also the copyright notice linked at the top of each page, which says these materials are "provided for the personal use of students, scholars, and the public" but are copyrighted and not in the Public Domain.
# [[Bibliotheca Augustana]] (http://www.hs-augsburg.de/~harsch/augustana.html) by Prof. em. Ulrich Harsch is an extensive collection of Greek, Latin (also Medieval and Neo-Latin), and other texts for reading (individually or with students). In the Greek and Latin section editor's notes on periods and authors are in Greek [http://www.hs-augsburg.de/~harsch/augustana.html#gr] and Latin [http://www.hs-augsburg.de/~harsch/augustana.html#la] respectively, which adds didactic value. Harsch's design of [http://www.hs-augsburg.de/~harsch/Chronologia/Lspost01/Persius/per_satu.html Persius' Satyres' page] is especially attractive, as it mimics a papyrus scroll.
# [http://papyri.info/ The Duke Databank of Documentary Papyri] (DDbDP) makes all data and version history available for download on GitHub: https://github.com/papyri/idp.data

=== Texts with translations ===

# [[Romulus Bulgaricus]] (http://romulusbg.net/?page=library) is an interesting collection of texts insofar as it contrasts, side by side, classical Latin texts and their Bulgarian translations. Although not finished yet (many texts are to be added), and with little searching and interlinking capability, it presents a provocative starting point for translation studies research and teaching.
# [https://github.com/papyri/idp.data/tree/master/HGV_trans_EpiDoc A subset of DDbDP texts] have English and/or German translations available.
# [[Attic Inscriptions Online]] offer English translations of Greek inscriptions from Athens and Attica.

=== Thesauri ===

# [[Thesaurus Linguae Latinae]] - the third edition is now out. For the Bryn Mawr Classical Review on this, see: http://ccat.sas.upenn.edu/bmcr/2006/2006-02-19.html (blogged on the Stoa).

==See also==
* [[Digital Critical Editions of Texts in Greek and Latin]]
* [[Classical texts on Google Book Search]]
* The [http://wiki.digitalclassicist.org/Category:Projects projects page] includes other digital editions of Greek and Latin texts

[[category:FAQ]]
[[category:OSCE]]
[[category:Papyrology]]
[[category:Epigraphy]]
[[category:Opensource]]
[[category:corpora]]

User:ThibaultClerice

2019-06-04T16:56:36Z

ThibaultClerice: Created page with "== Bio == I am the head of the MA "Digital Technologies Applied to History" (Technologies Numériques Appliquées à l’Histoire) at the École Nationale des Chartes (Paris,..."

== Bio ==

I am the head of the MA "Digital Technologies Applied to History" (Technologies Numériques Appliquées à l’Histoire) at the École Nationale des Chartes (Paris, France). I am a classicist who served as an engineer both at the Centre for eResearch (King's College London, UK) and the Humboldt Chair for Digital Humanities (Leipzig, Germany), where I developed the data backbone of the future Perseus 5 (under the CapiTainS.org project). My main interests lie in data and software sustainability and Latin data mining.

== Contact ==

Thibault Clérice
École Nationale des Chartes
65, rue de Richelieu
75002 Paris

* Email: thibault.clerice ==at== chartes.psl.eu
* Twitter: https://twitter.com/ponteineptique

CapiTainS

2019-06-04T16:50:47Z

ThibaultClerice: /* Available */ Formatting

== Available ==

* http://capitains.org/
* https://github.com/capitains

== Authors ==

* Bridget Almas
* Thibault Clérice
* Matt Munson

== Description ==

CapiTainS is an informal open-source organization which aims to provide a suite of tools and guidelines for citable text API standards.

It provides XML TEI guidelines for encoding texts that can then be consumed or served by different tools:

* [https://github.com/Capitains/HookTest Capitains HookTest], a tool that checks the compliance of a corpus with the guidelines for TEI
** [https://capitains-validator.herokuapp.com/ WebApp Version]
* [https://github.com/Capitains/flask-capitains-nemo Capitains Nemo], an application that can help quickly put together a website based on Capitains Guidelines
** [https://github.com/Capitains/tutorial-nemo Tutorial]
* [https://github.com/Capitains/Nautilus Capitains Nautilus], an application that provides different textual APIs (like CTS and DTS) for corpora following Capitains Guidelines
* [https://github.com/Capitains/MyCapytain Capitains MyCapytain], the shared library used for parsing corpora. It can be used in corpus analysis pipelines to parse local repositories or to consume APIs, and it can also be used for building web applications.
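
CTS, one of the APIs mentioned above, identifies texts and passages with URNs of the form <code>urn:cts:namespace:textgroup.work.version:passage</code>. The Python sketch below splits such a URN into its parts. It deliberately does not use MyCapytain (whose own classes should be preferred in real code); the <code>CtsUrn</code> tuple and <code>parse_cts_urn</code> function are names invented for this example, and the sample URN is only illustrative.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy and passage reference."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"Not a CTS URN: {urn!r}")
    namespace = parts[2]
    # The work component is dot-separated: textgroup.work.version
    # (a fourth, exemplar-level part is ignored in this sketch).
    hierarchy = parts[3].split(".")
    hierarchy += [None] * (3 - len(hierarchy))
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(namespace, hierarchy[0], hierarchy[1], hierarchy[2], passage)


urn = parse_cts_urn("urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1")
print(urn.textgroup, urn.work, urn.version, urn.passage)
```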

== Bibliography ==

* Almas & Clérice, Continuous Integration and Unit Testing of Digital Editions, Digital Humanities Quarterly, Volume 11.4, Link
* Clérice, Les outils CapiTainS, l’édition numérique et l’exploitation des textes, Médiévales, Volume 73, p. 114 Editor Pre-print

[[category:projects]] [[category:tools]]

Deucalion and Pie lemmatizers

2019-06-04T16:39:02Z

ThibaultClerice: Reorganization proposal

== Available ==

* [https://github.com/emanjavacas/pie Pie]
* [https://github.com/chartes/deucalion-model-lasla Deucalion (with LASLA data)]

== Authors ==

* Enrique Manjavacas
* Mike Kestemont
* Thibault Clérice

== Description ==

'''Pie''' is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it seems to be one of the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS and lemmatization tasks.

=== Deucalion ===

Deucalion is:

* a model for the lemmatizer Pie ([https://github.com/chartes/deucalion-model-lasla/blob/master/lemma.split-morph.tar .tar file on github])
* a web application that can be easily deployed for running a lemmatization service. It runs on Python 3 and Flask
* a [https://hub.docker.com/r/ponteineptique/deucalion-model-lasla Docker image] that makes running it even simpler

In terms of statistics, the model was trained on around 1.3 million tokens (June 2019). The accuracy scores are described in the [https://github.com/chartes/deucalion-model-lasla/tree/master/information information] folder of the image, but we can note the following accuracies:

* Lemmatization: 97.52 %
* Part-Of-Speech: 96.55 %
* Morphology
** Voice: 99.18 %
** Mood: 98.36 %
** Degree: 98.30 %
** Number: 97.88 %
** Person: 99.18 %
** Tense: 98.75 %
** Tense: 93.74 %
** Gender: 97.27 % (Note that not all words were annotated for gender in the LASLA data, specifically not the nouns)
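
For readers unfamiliar with such figures, per-task accuracy is commonly computed as the share of tokens whose predicted value matches the reference annotation. The sketch below assumes that definition; the project's own evaluation may differ (for instance in how unannotated tokens are handled), and the sample data are invented.

```python
def accuracy(gold, predicted):
    """Percentage of positions where the prediction equals the gold value."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Invented example: four tokens, one lemma predicted wrongly.
gold_lemmas = ["arma", "uir", "cano", "qui"]
predicted_lemmas = ["arma", "uir", "canis", "qui"]
score = accuracy(gold_lemmas, predicted_lemmas)
print(f"Lemmatization: {score:.2f} %")
```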

A version is hosted at [https://dev.chartes.psl.eu/deucalion/models/lasla/ the École des Chartes]

== Bibliography ==

* D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
* D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
* D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
* D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
* Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
* Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847

[[category:lemmatisation]] [[category:tools]]

Deucalion and Pie lemmatizers

2019-06-04T16:31:16Z

ThibaultClerice: Added categories

== Pie ==

[https://github.com/emanjavacas/pie Pie] is a language-independent lemmatizer implemented in Python and built for "variation-rich languages", which include Latin. It is a deep learning tool that can be trained and retrained with data in TSV format. As of 2019, it seems to be one of the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS and lemmatization tasks.

== Deucalion ==

[https://github.com/chartes/deucalion-model-lasla Deucalion (with LASLA data)] is:

* a model for the lemmatizer Pie ([https://github.com/chartes/deucalion-model-lasla/blob/master/lemma.split-morph.tar .tar file on github])
* a web application that can be easily deployed for running a lemmatization service. It runs on Python 3 and Flask
* a [https://hub.docker.com/r/ponteineptique/deucalion-model-lasla Docker image] that makes running it even simpler

In terms of statistics, the model was trained on around 1.3 million tokens (June 2019). The accuracy scores are described in the [https://github.com/chartes/deucalion-model-lasla/tree/master/information information] folder of the image, but we can note the following accuracies:

* Lemmatization: 97.52 %
* Part-Of-Speech: 96.55 %
* Morphology
** Voice: 99.18 %
** Mood: 98.36 %
** Degree: 98.30 %
** Number: 97.88 %
** Person: 99.18 %
** Tense: 98.75 %
** Tense: 93.74 %
** Gender: 97.27 % (Note that not all words were annotated for gender in the LASLA data, specifically not the nouns)

A version is hosted at [https://dev.chartes.psl.eu/deucalion/models/lasla/ the École des Chartes].

== Bibliography ==

* D. Longrée, C. Philippart de Foy & G. Purnelle. « Structures phrastiques et analyse automatique des données morphosyntaxiques : le projet LatSynt », in S. Bolasco, I. Chiari & L. Giuliano (eds), Statistical Analysis of Textual Data, Proceedings of 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, Rome, LED, pp. 433-442.
* D. Longrée & C. Poudat, « New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 683-694.
* D. Longrée, C. Philippart de Foy & G. Purnelle, « Subordinate clause boundaries and word order in Latin: the contribution of the L.A.S.L.A. syntactic parser project LatSynt », in P. Anreiter & M. Kienpointner (éd.), Proceedings of the 15th International Colloquium on Latin Linguistics, (Innsbrucker Beiträge zur Sprachwissenschaft), Innsbruck, 2010, pp. 673-681.
* D. Longrée & C. Poudat, « Variations langagières et annotation morphosyntaxique du latin classique », TAL, 50 – n° 2/2009, Special issue on "Natural Language Processing and Ancient Languages", pp. 129-148.
* Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
* Thibault Clérice. (2019, February 1). chartes/deucalion-model-lasla: LASLA Latin Lemmatizer - Alpha (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.2554847

[[category:lemmatization]] [[category:tools]]

Morphological parsing or lemmatising Greek and Latin

2019-06-04T16:30:44Z

ThibaultClerice: /* Tools */ Added Deucalion, Pie, Pyrrha

==Lemmatisation and morphological analysis==

See: [http://en.wikipedia.org/wiki/Lemmatisation Wikipedia page on lemmatisation]

Typically, when implementing a search engine for a digital corpus, one wants to enable discovery not only of occurrences of the exact word forms in the query but also of other inflections of the search terms. For example, if you search Google for "digital classicism", your results will include [[Digital Classicist]]: even though "classicist" is not the exact word "classicism", you may be interested in the result. The same applies even more to highly inflected languages such as Greek and Latin (this is, after all, how people are taught to use dictionaries: you have to know, or predict, the lemma of a word to be able to look up its meaning and other information on it).

The lemma dictionaries typically connect many occurrences of inflected word forms to their lemma form, and act as a mediator between a query (or the one who asks it) and a database, a corpus, or a text collection.

For Greek and Latin, the foremost freely available lemma dictionaries are included in the [[Morpheus]] source as XML files.

A related problem is that of parsing an inflected form, that is, of performing a morphological analysis of that word: for example, saying that 'hominis' is the genitive singular of the lemma 'homo, -inis'. This can aid in lemmatisation because several lemmata can often produce the same inflected form, meaning that looking up the inflected form in a lemma dictionary will yield multiple candidate lemmata. This is why lemmatisation software and online services typically also provide a morphological analysis of the inflected form, so they act both as lemmatisers and parsers.
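
The ambiguity described here can be illustrated with a toy lemma dictionary in Python. The entries below are hand-written for the example and are not taken from Morpheus; the names <code>LEMMA_DICT</code>, <code>analyse</code> and <code>lemmas</code> are assumptions.

```python
# Toy lemma dictionary: inflected form -> list of (lemma, analysis) pairs.
LEMMA_DICT = {
    "hominis": [("homo", "noun, genitive singular")],
    "canis": [
        ("canis", "noun, nominative singular"),
        ("canis", "noun, genitive singular"),
        ("cano", "verb, 2nd person singular present indicative active"),
    ],
}


def analyse(form):
    """Return every (lemma, analysis) pair recorded for an inflected form."""
    return LEMMA_DICT.get(form.lower(), [])


def lemmas(form):
    """Return the distinct candidate lemmata, in dictionary order."""
    seen = []
    for lemma, _analysis in analyse(form):
        if lemma not in seen:
            seen.append(lemma)
    return seen


print(lemmas("hominis"))
print(lemmas("canis"))
```

A form such as 'canis' yields two candidate lemmata, which is exactly the case where the accompanying morphological analyses become useful.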

Disambiguating to the correct lemma form is a difficult problem, and parsing words in context to their correct part of speech can aid in this immensely. One approach is to use software such as [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger] trained to your language with a [http://en.wikipedia.org/wiki/Treebank Treebank] (such as the [http://perseusdl.github.io/treebank_data/ Perseus Treebanks]).
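
How a part-of-speech tag narrows down the candidates can be sketched as follows. The tag is assumed to come from an external tagger; this is not TreeTagger's interface, and the tag names and the <code>CANDIDATES</code> table are invented for the example.

```python
# Candidate lemmata per form, each with the part of speech it implies.
CANDIDATES = {
    "canis": [("canis", "NOUN"), ("cano", "VERB")],
}


def disambiguate(form, pos_tag):
    """Keep only the lemmata compatible with the tag assigned in context."""
    candidates = CANDIDATES.get(form, [])
    matching = [lemma for lemma, pos in candidates if pos == pos_tag]
    # If the tag rules out every candidate, fall back to the full list
    # rather than returning nothing.
    return matching or [lemma for lemma, _pos in candidates]


print(disambiguate("canis", "VERB"))
print(disambiguate("canis", "NOUN"))
```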

The [http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html Archimedes Project Morphology Service] also provides an XML-RPC web interface --- a script which forwards queries to the Morpheus lemmatiser/parser. Such a script can be included in pages of other text collections, enabling lemmatizing searches via a "third-party" service.

==Stemming==

Another approach often used for expanding search results is [http://en.wikipedia.org/wiki/Stemming stemming], which typically tries to use an algorithmic approach to normalize inflected words and "chop off" the inflections to produce a "stem" word. An example for Latin is the [http://snowball.tartarus.org/otherapps/schinke/intro.html Schinke Latin Stemmer]. The search engine Egothor also has [http://www.egothor.org/book/bk01ch01s06.html a trainable stemmer component].
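
The "chopping off" idea can be shown with a deliberately naive suffix stripper in Python. This is not the Schinke algorithm: the suffix list is a small assumption covering a few nominal endings only, and the minimum stem length is arbitrary.

```python
# A few Latin nominal endings (illustrative, far from complete).
SUFFIXES = [
    "ibus", "orum", "arum", "ius",
    "ae", "am", "as", "em", "es", "is", "os", "um", "us",
    "a", "e", "i", "o", "u",
]


def stem(word, min_stem=2):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower().replace("j", "i").replace("v", "u")
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for form in ["dominus", "dominorum", "rosa", "rosarum", "et"]:
    print(form, "->", stem(form))
```

Because it knows nothing about the lexicon, such a stemmer will happily conflate unrelated words that share a stem, which is the usual trade-off against dictionary-based lemmatisation.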

==Orthographic Variation==

Another difficulty in searching a corpus can be orthographic (spelling) variation in the text. For example, Latin has no standard orthography, which for diplomatic transcriptions (where the spelling has not been normalized by the editor, but remains as it is in the text) can mean that the same word may appear spelled differently throughout the corpus. [[XTF]] has [http://xtf.cdlib.org/documentation/under-the-hood/#Spelling a good introduction] to how they have approached the problem of spelling correction in their search engine (mainly from the perspective of users "mistyping" their query, but the problem is the same).
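
One simple way to cope with such variation is to index and query on a normalised key instead of the raw spelling. The Python sketch below is not XTF's method; the rules (lower-casing, removing diacritics, merging i/j and u/v) are assumptions that suit some Latin corpora and not others.

```python
import unicodedata


def normalise(form):
    """Map spelling variants of a Latin word to one search key."""
    # Decompose accented characters, then drop the combining marks (e.g. macrons).
    decomposed = unicodedata.normalize("NFD", form.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Merge the i/j and u/v pairs, which editors treat inconsistently.
    return stripped.replace("j", "i").replace("v", "u")


for variant in ["Iuvenis", "juuenis", "iuvenis"]:
    print(variant, "->", normalise(variant))
```

Both the indexed text and the user's query would be passed through the same function, so that any of the variants retrieves the others.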

== Curated Lexico-morphological Data ==

Numerous services and tools provide, for any word in a given ancient text, the possible lexico-morphological combinations (e.g., τῶν has six possible analyses: lexeme ὁ or ὅς, each in masculine, feminine, or neuter forms). Such data is useful in many contexts, especially pedagogical ones, but it will include many forms that are incorrect in context. Some scholarly research questions require well-curated data sets, where alternative lexico-morphological forms are eliminated, weighted, or qualified. Curating a lexico-morphological dataset can be time-consuming (due in part to interpretive difficulties) but enormously profitable, since such data can be queried in sophisticated ways (e.g., in this corpus, how much more frequent are first-person aorists than third-person indicatives?). Further, such curated data can help refine other sets of lexico-morphological data, by priming an algorithm with the likelihood of forms.
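The kind of query mentioned above can be sketched as follows, assuming that each token in a curated dataset carries exactly one analysis. The records and field names are hypothetical and do not follow the schema of any of the published datasets.

```python
# Hypothetical curated records: one disambiguated analysis per token.
RECORDS = [
    {"form": "ἔλυσα", "lemma": "λύω", "person": 1, "tense": "aorist", "mood": "indicative"},
    {"form": "λύει", "lemma": "λύω", "person": 3, "tense": "present", "mood": "indicative"},
    {"form": "ἔλυσε", "lemma": "λύω", "person": 3, "tense": "aorist", "mood": "indicative"},
    {"form": "ἐλύσαμεν", "lemma": "λύω", "person": 1, "tense": "aorist", "mood": "indicative"},
    {"form": "λύουσι", "lemma": "λύω", "person": 3, "tense": "present", "mood": "indicative"},
]


def count_matching(records, **features):
    """Count the records whose fields equal every given feature value."""
    return sum(
        1
        for record in records
        if all(record.get(key) == value for key, value in features.items())
    )


first_person_aorists = count_matching(RECORDS, person=1, tense="aorist")
third_person_indicatives = count_matching(RECORDS, person=3, mood="indicative")
print(first_person_aorists, third_person_indicatives)
```

Such counts are only meaningful on curated data: if every token still carried all of its possible analyses, the same query would count forms that are wrong in context.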

Listed here are published datasets of lexico-morphological data for ancient texts.

=== Coptic ===

* New Testament: The [[Coptic SCRIPTORIUM]] is in the process of curating lexico-morphological data for the New Testament, data as yet unpublished.

=== Greek ===

* The collection of the EPIDOC-compliant texts of the Open Greek and Latin Project [https://github.com/OpenGreekAndLatin] and PerseusDL [https://github.com/PerseusDL/canonical-greekLit] has been automatically analyzed morphologically and lemmatized [https://github.com/gcelano/LemmatizedAncientGreekXML].
* Classical corpora: Perseus Ancient Greek Dependency Treebank [https://github.com/PerseusDL/treebank_data/tree/master/v2.0/Greek version 2.0]. Data is semi-automatically annotated. See also [https://perseusdl.github.io/treebank_data/ Ancient Greek and Latin Dependency Treebank].
* New Testament: Morphological tagging of the SBL Greek New Testament [https://github.com/morphgnt/sblgnt (plain text UTF-8)] [https://github.com/Arithmeticus/TAN-bible/tree/master/TAN-LM (TAN-LM XML format)]
* Septuagint: CCAT tagging of Rahlfs's edition of the Septuagint [http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/ (source CCAT files, UTF-8; text in Betacode)] [https://unbound.biola.edu/index.cfm?method=downloads.showDownloadMain (UTF-8, Biola Unbound Bible; derivative from the CCAT files, with Betacode converted to Unicode)]. NB: the CCAT files segment off verbal prefixes in the lexeme field, e.g., A)/GW E)K in Gen. 1.24. The Biola-converted data has fused these elements together, ἐκἄγω, without reconciling them into the correct form (ἐξάγω).

=== Latin ===

* Classical texts: Perseus Treebank Data [https://github.com/PerseusDL/treebank_data/tree/master/v2.0/Latin version 2.0] XML data, without the cover annotation for its Greek counterpart. See also [https://perseusdl.github.io/treebank_data/ Ancient Greek and Latin Dependency Treebank].

=== Syriac ===

* New Testament: [https://sedra.bethmardutho.org/about/sedra Beth Mardutho]. Sedra version 3 available for download at [http://syrcom.cua.edu/Projects/Complete.html CUA]. A version 4 is under development as of April 2016.

== Tools ==

* [http://outils.biblissima.fr/collatinus/ Collatinus]: lemmatisation and morphological analysis tool for Latin, developed by Yves Ouvrard (source code and packages available for Windows, Mac OS and Debian GNU/Linux). [http://outils.biblissima.fr/collatinus-web Collatinus-web] is the web version of this software.
* [http://outils.biblissima.fr/eulexis Eulexis]: lemmatisation tool for ancient Greek
* [http://www.ilc.cnr.it/lemlat/lemlat/index.html LemLat Latin Wordform Lemmatizer] (Istituto di Linguistica Computazionale "Antonio Zampolli" - Consiglio Nazionale delle Ricerche - Area della Ricerca di Pisa)
* Tufts Morphology service (using Morpheus for Latin): see [http://sites.tufts.edu/perseusupdates/2012/11/01/morphology-service-beta/ Morphology Service Beta] and [https://wikihub.berkeley.edu/display/pbamboo/Morphological+Analysis+Service+Contract+Description+-+v1.1.1 Morphological Analysis Service Contract Description - v1.1.1], [https://github.com/perseids-project/perseids_docs/wiki/Morphology-Service-Setup Morphology Service Setup] and [https://github.com/alpheios-project/arethusa/wiki/Adding-a-new-Morphology-Service-to-Arethusa Tufts Morphology Service/Arethusa integration]
* The [[Archimedes Project Morphology Service]] provides simple Python or Perl scripts to query Morpheus with Latin or Greek word forms
* The Classical Languages ToolKit (CLTK) has a [http://docs.cltk.org/en/latest/latin.html#lemmatization Latin lemmatizer] written in Python. One can install the CLTK via pip or from source on github: https://github.com/cltk/cltk
* [http://inlustre.net/latinowl/ LatinOWL]: app for iPhone and iPad using data from the Perseus Latin Word Tool
* [https://wiki.digitalclassicist.org/Deucalion_and_Pie_lemmatizers Deucalion and Pie]: a deep learning tool that reaches high scores on morphology, POS and lemmatization.
* [https://github.com/hipster-philology/pyrrha Pyrrha]: a post-correction interface for lemmatization

==See also==
* Longrée, Dominique and Poudat, Céline. "New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA". In Anreiter, Peter; Kienpointner, Manfred (Eds.), Proceedings of the 15th International Colloquium on Latin Linguistics (2010). (The proceedings are available here: [http://www.uibk.ac.at/sprachen-literaturen/sprawi/pdf/referategeordnet.pdf].)
* [[Morpheus]]
* [[Stopwords for Greek and Latin]]
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1006&L=DIGITALCLASSICIST&F=&S=&P=59 Discussion (2010) of morphological analysis on Digital Classicist mailing list]
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1601&L=DIGITALCLASSICIST&F=&S=&P=23677#TOP A more recent (2016) discussion on the same topic on the same mailing list]
* [http://perseus.uchicago.edu/about.html About Perseus under PhiloLogic]
* [http://morphadorner.northwestern.edu/ MorphAdorner] "provides methods for adorning text with standard spellings, parts of speech and lemmata" (but has primarily been used for English language texts).

[[category:FAQ]]
[[category:Tools]]
[[category:morphology]]
[[category:Lemmatisation]]
[[category:Syntactic analysis]]
[[category:Linguistics]]

Deucalion and Pie lemmatizers

2019-06-04T16:27:31Z

ThibaultClerice: Deucalion and Pie page creation

Collatinus

2019-06-04T16:08:29Z

ThibaultClerice: /* Available */

== Available ==

* Download: http://outils.biblissima.fr/collatinus/
* Use Online: http://outils.biblissima.fr/collatinus-web/

* Source code:
** https://github.com/biblissima/collatinus-src : sources of Collatinus software
** https://github.com/biblissima/collatinus-data : linguistic data used by Collatinus software
** https://github.com/biblissima/collatinus-web-daemon : server daemon for Collatinus-web
** https://github.com/ponteineptique/collatinus-python : a port to python3 of the morphological part of Collatinus
** http://github.com/ponteineptique/pycollatinus : a port of the lemmatization analysis to Python 3

== Authors ==

* Yves Ouvrard
* Philippe Verkerk

== Description ==

'''Collatinus''' is a free, open-source application for the lemmatization and morphological analysis of Latin texts, available in both online and stand-alone versions (the latter available for Mac, Windows, and Linux platforms).

The following extended description is translated from the French text on the project website (accessed 2016-01-12):

<blockquote>Collatinus is both a lemmatizer and a morphological analyzer for Latin texts: given a declined or conjugated form, it can find which word to look up in the dictionary in order to obtain its translation into another language, its various meanings, and all the other information a dictionary usually provides.

In practice, it is chiefly useful to the Latin teacher, who can thus very quickly, starting from a text not found in the textbook, hand out to students a previously unseen text together with its lexical aid. Students often use it to read Latin more easily while their lexical and morphological knowledge is still insufficient.

Main features: lemmatization of Latin words or of an entire Latin text, translation of lemmata using the Latin dictionaries built into the application, and display of quantities (long or short duration of syllables) and of inflections (declension or conjugation).
</blockquote>

[[Category:Lemmatisation]]
[[Category:Morphology]]
[[Category:Tools]]
[[category:Linguistics]]

Collatinus

2018-05-25T07:43:16Z

ThibaultClerice: /* Available */ Added a link to the Python Port

== Available ==

* Download: http://outils.biblissima.fr/collatinus/
* Use Online: http://collatinus.fltr.ucl.ac.be/ (moved permanently to http://outils.biblissima.fr/collatinus-web/)
* Source code:
** https://github.com/biblissima/collatinus-src : sources of Collatinus software
** https://github.com/biblissima/collatinus-data : linguistic data used by Collatinus software
** https://github.com/biblissima/collatinus-web-daemon : server daemon for Collatinus-web
** https://github.com/ponteineptique/collatinus-python : a port to python3 of the morphological part of Collatinus

== Author ==

* Yves Ouvrard (with the assistance of Philippe Verkerk)

== Description ==

'''Collatinus''' is a free, open-source application for the lemmatization and morphological analysis of Latin texts, available in both online and stand-alone versions (the latter available for Mac, Windows, and Linux platforms).

The following extended description is translated from the French text on the project website:

: Collatinus is both a lemmatizer and a morphological analyzer for Latin texts: given a declined or conjugated form, it can find which word to look up in the dictionary in order to obtain its translation into another language, its various meanings, and all the other information a dictionary usually provides.

: In practice, it is chiefly useful to the Latin teacher, who can thus very quickly, starting from a text not found in the textbook, hand out to students a previously unseen text together with its lexical aid. Students often use it to read Latin more easily while their lexical and morphological knowledge is still insufficient.

: Main features: lemmatization of Latin words or of an entire Latin text, translation of lemmata using the Latin dictionaries built into the application, and display of quantities (long or short duration of syllables) and of inflections (declension or conjugation).

[[Category:Lemmatisation]]
[[Category:Morphology]]
[[Category:Tools]]
[[category:Linguistics]]