Morphological parsing or lemmatising Greek and Latin: Difference between revisions

From The Digital Classicist Wiki
(→‎Tools: Added Deucalion, Pie, Pyrrha)
(27 intermediate revisions by 7 users not shown)
==Lemmatisation and morphological analysis==


See: [http://en.wikipedia.org/wiki/Lemmatisation Wikipedia page on lemmatisation]
For Greek and Latin, the foremost freely available lemma dictionaries are included in the [[Morpheus]] source as XML files.  


A related problem is that of parsing an inflected form, that is, of performing a morphological analysis of the word: for example, stating that 'hominis' is the genitive singular of the lemma 'homo, -inis'. This can aid lemmatisation because several lemmata often share an inflected form, so that looking up that form in a lemma dictionary yields more than one candidate lemma. This is why lemmatisation software and online services typically also provide a morphological analysis of the inflected form, acting as both lemmatisers and parsers.


Disambiguating to the correct lemma is a difficult problem, and tagging each word in context with its correct part of speech helps considerably. One approach is to use software such as [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger] trained on a [http://en.wikipedia.org/wiki/Treebank Treebank] for your language (such as the [http://perseusdl.github.io/treebank_data/ Perseus Treebanks]).

The [http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html Archimedes Project Morphology Service] also provides an XML-RPC web interface: a script that forwards queries to the Morpheus lemmatiser/parser. Such a script can be included in the pages of other text collections, enabling lemmatising searches via a "third-party" service.


==Stemming==


Another difficulty in searching a corpus is orthographic (spelling) variation in the text. Latin, for example, has no standard orthography, so in diplomatic transcriptions (where the editor has not normalized the spelling but reproduces it as it stands in the source) the same word may appear in different spellings throughout the corpus. [[XTF]] has [http://xtf.cdlib.org/documentation/under-the-hood/#Spelling a good introduction] to how it approaches spelling correction in its search engine (mainly from the perspective of users mistyping their query, but the problem is the same).
== Curated Lexico-morphological Data ==
Numerous services and tools provide, for any word in a given ancient text, the possible lexico-morphological combinations (e.g., τῶν has six possible analyses: the lexeme ὁ or ὅς, each in the masculine, feminine, or neuter). Such data is useful in many contexts, especially pedagogical ones, but it will include many forms that are incorrect for the context. Some scholarly research questions require well-curated datasets, in which alternative lexico-morphological forms are eliminated, weighted, or qualified. Curating a lexico-morphological dataset can be time-consuming (due in part to interpretive difficulties) but enormously profitable, since such data can be queried in sophisticated ways (e.g., in this corpus, how much more frequent are first-person aorists than third-person indicatives?). Further, such curated data can help refine other sets of lexico-morphological data by priming an algorithm with the likelihood of forms.
Listed here are published datasets of lexico-morphological data for ancient texts.
=== Coptic ===
* New Testament: The [[Coptic SCRIPTORIUM]] is in the process of curating lexico-morphological data for the New Testament, data as yet unpublished.
=== Greek ===
* The collection of EpiDoc-compliant texts of the Open Greek and Latin Project [https://github.com/OpenGreekAndLatin] and PerseusDL [https://github.com/PerseusDL/canonical-greekLit] has been automatically analyzed morphologically and lemmatized [https://github.com/gcelano/LemmatizedAncientGreekXML].
* Classical corpora: Perseus Ancient Greek Dependency Treebank [https://github.com/PerseusDL/treebank_data/tree/master/v2.0/Greek version 2.0]. Data is semi-automatically annotated. See also [https://perseusdl.github.io/treebank_data/ Ancient Greek and Latin Dependency Treebank].
* New Testament: Morphological tagging of the SBL Greek New Testament [https://github.com/morphgnt/sblgnt (plain text UTF-8)] [https://github.com/Arithmeticus/TAN-bible/tree/master/TAN-LM (TAN-LM XML format)]
* Septuagint: CCAT tagging of Rahlfs's edition of the Septuagint [http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/ (source CCAT files, UTF-8; text in Betacode)] [https://unbound.biola.edu/index.cfm?method=downloads.showDownloadMain (UTF-8, Biola Unbound Bible; derivative from the CCAT files, with Betacode converted to Unicode)]. NB, the CCAT opted to segment off verbal prefixes in the lexeme field, e.g., A)/GW E)K in Gen. 1.24. The Biola-converted data has fused these elements together, ἐκἄγω, without reconciliation (ἐξάγω).
=== Latin ===
* Classical texts: Perseus Treebank Data [https://github.com/PerseusDL/treebank_data/tree/master/v2.0/Latin version 2.0] XML data, without the cover annotation for its Greek counterpart. See also [https://perseusdl.github.io/treebank_data/ Ancient Greek and Latin Dependency Treebank].
=== Syriac ===
* New Testament: [https://sedra.bethmardutho.org/about/sedra Beth Mardutho]. Sedra version 3 available for download at [http://syrcom.cua.edu/Projects/Complete.html CUA]. A version 4 is under development as of April 2016.
== Tools ==
* [http://outils.biblissima.fr/collatinus/ Collatinus]: lemmatisation and morphological analysis tool for Latin (available source code and packages for Windows, Mac OS and Debian GNU/Linux, developed by Yves Ouvrard). [http://outils.biblissima.fr/collatinus-web Collatinus-web] is the web version of this software
* [http://outils.biblissima.fr/eulexis Eulexis]: lemmatisation tool for ancient Greek
* [http://www.ilc.cnr.it/lemlat/lemlat/index.html LemLat Latin Wordform Lemmatizer] (Istituto di Linguistica Computazionale "Antonio Zampolli" - Consiglio Nazionale delle Ricerche - Area della Ricerca di Pisa)
* Tufts Morphology service (using Morpheus for Latin): see [http://sites.tufts.edu/perseusupdates/2012/11/01/morphology-service-beta/ Morphology Service Beta] and [https://wikihub.berkeley.edu/display/pbamboo/Morphological+Analysis+Service+Contract+Description+-+v1.1.1 Morphological Analysis Service Contract Description - v1.1.1], [https://github.com/perseids-project/perseids_docs/wiki/Morphology-Service-Setup Morphology Service Setup] and [https://github.com/alpheios-project/arethusa/wiki/Adding-a-new-Morphology-Service-to-Arethusa Tufts Morphology Service/Arethusa integration]
* The [[Archimedes Project Morphology Service]] provides easy Python or Perl scripts to query Morpheus with Latin or Greek word forms
* The Classical Languages ToolKit (CLTK) has a [http://docs.cltk.org/en/latest/latin.html#lemmatization Latin lemmatizer] written in Python. One can install the CLTK via pip or from source on github: https://github.com/cltk/cltk
* [http://inlustre.net/latinowl/ LatinOWL]: app for iPhone and iPad using data from the Perseus Latin Word Tool
* [https://wiki.digitalclassicist.org/Deucalion_and_Pie_lemmatizers Deucalion and Pie]: a deep learning tool that reaches high scores on morphology, POS tagging and lemmatization
* [https://github.com/hipster-philology/pyrrha Pyrrha]: a post-correction interface for lemmatization


==See also==
* Longrée, Dominique and Poudat, Céline. "New Ways of Lemmatizing and Tagging Classical and post-Classical Latin: the LATLEM project of the LASLA". in Anreiter, Peter; Kienpointner, Manfred (Eds.) Proceedings of the 15th International Colloquium on Latin Linguistics (2010). (The proceedings are available here: [http://www.uibk.ac.at/sprachen-literaturen/sprawi/pdf/referategeordnet.pdf].)
* [[Morpheus]]
* [[Stopwords for Greek and Latin]]
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1006&L=DIGITALCLASSICIST&F=&S=&P=59 Discussion (2010) of morphological analysis on Digital Classicist mailing list]
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1601&L=DIGITALCLASSICIST&F=&S=&P=23677#TOP A more recent (2016) discussion on the same topic on the same mailing list]
* [http://perseus.uchicago.edu/about.html About Perseus under PhiloLogic]
* [http://morphadorner.northwestern.edu/ MorphAdorner] "provides methods for adorning text with standard spellings, parts of speech and lemmata" (but has primarily been used for English language texts).
===Solutions for online parsing===
* [http://www.ilc.cnr.it/lemlat/lemlat/index.html LemLat Latin Wordform Lemmatizer] (Istituto di Linguistica Computazionale "Antonio Zampolli" - Consiglio Nazionale delle Ricerche - Area della Ricerca di Pisa)
* [http://www.stanthonypaduainstitute.org/xlateany.htm Latin Parse Help]
* [http://www.agfl.cs.ru.nl/lat/try.html LATINA parser of classical Latin]
* [https://wiki.projectbamboo.org/display/BTECH/Morphological+Analysis+Service+Contract+Description Tufts/Bamboo Morphology Service API]


[[category:FAQ]]
[[category:Tools]]
[[category:morphology]]
[[category:Lemmatisation]]
[[category:Syntactic analysis]]
[[category:Linguistics]]

Revision as of 16:30, 4 June 2019

Lemmatisation and morphological analysis

See: Wikipedia page on lemmatisation

Typically, when implementing a search engine for a digital corpus, one wants to enable discovery not only of the exact word forms in the query but also of other inflections of the search terms. For example, if you search Google for "digital classicism", your results will include Digital Classicist: although "classicist" is not the exact word "classicism", you may well be interested in the result. This applies even more to highly inflected languages such as Greek and Latin (it is, after all, how people are taught to use dictionaries: you have to know, or predict, the lemma of a word to be able to look up its meaning and other information about it).

The lemma dictionaries typically connect many occurrences of inflected word forms to their lemma form, and act as a mediator between a query (or the one who asks it) and a database, a corpus, or a text collection.
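As a rough illustration of how a lemma dictionary can mediate between a query and a corpus, the following Python sketch expands a search term to every form that shares a lemma with it. The dictionary contents and the function names (`build_lemma_index`, `expand_query`) are invented for this example and do not reflect the format of the Morpheus data.

```python
from collections import defaultdict

# Hypothetical toy lemma dictionary: inflected form -> set of lemmata.
FORM_TO_LEMMAS = {
    "homo": {"homo"},
    "hominis": {"homo"},
    "hominem": {"homo"},
    "homines": {"homo"},
}


def build_lemma_index(form_to_lemmas):
    """Invert the dictionary: lemma -> set of inflected forms."""
    index = defaultdict(set)
    for form, lemmas in form_to_lemmas.items():
        for lemma in lemmas:
            index[lemma].add(form)
    return index


def expand_query(term, form_to_lemmas, lemma_index):
    """Return the term plus every known form sharing a lemma with it."""
    forms = {term}
    for lemma in form_to_lemmas.get(term, ()):
        forms |= lemma_index[lemma]
    return forms


LEMMA_INDEX = build_lemma_index(FORM_TO_LEMMAS)
expanded = expand_query("hominis", FORM_TO_LEMMAS, LEMMA_INDEX)
```

A search front end could then match any of the forms in `expanded` against the corpus; a term missing from the dictionary is simply searched as typed.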

For Greek and Latin, the foremost freely available lemma dictionaries are included in the Morpheus source as XML files.

A related problem is that of parsing an inflected form, that is, of performing a morphological analysis of the word: for example, stating that 'hominis' is the genitive singular of the lemma 'homo, -inis'. This can aid lemmatisation because several lemmata often share an inflected form, so that looking up that form in a lemma dictionary yields more than one candidate lemma. This is why lemmatisation software and online services typically also provide a morphological analysis of the inflected form, acting as both lemmatisers and parsers.

Disambiguating to the correct lemma is a difficult problem, and tagging each word in context with its correct part of speech helps considerably. One approach is to use software such as TreeTagger trained on a Treebank for your language (such as the Perseus Treebanks).
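The ambiguity described above, and the way a part-of-speech tag narrows it, can be sketched in Python as follows. The data structure, the tag names and the two sample entries are assumptions made for this illustration; in practice the part of speech would come from a tagger such as TreeTagger, which is not called here.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    features: str


# Hypothetical analyses: 'amor' can be a noun or a passive form of 'amo'.
ANALYSES = {
    "hominis": [Analysis("homo", "noun", "genitive singular")],
    "amor": [
        Analysis("amor", "noun", "nominative singular"),
        Analysis("amo", "verb", "1st singular present passive indicative"),
    ],
}


def analyse(form):
    """Return every known analysis of an inflected form (possibly none)."""
    return ANALYSES.get(form.lower(), [])


def lemmas_for(form, pos=None):
    """Candidate lemmata, optionally filtered by a part-of-speech tag."""
    return sorted(
        {a.lemma for a in analyse(form) if pos is None or a.pos == pos}
    )
```

Without context, `lemmas_for("amor")` returns both candidates; once a tagger has decided that the token is a noun, `lemmas_for("amor", pos="noun")` leaves a single lemma.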

The Archimedes Project Morphology Service also provides an XML-RPC web interface: a script that forwards queries to the Morpheus lemmatiser/parser. Such a script can be included in the pages of other text collections, enabling lemmatising searches via a "third-party" service.

Stemming

Another approach often used for expanding search results is stemming, which typically tries to use an algorithmic approach to normalize inflected words and "chop off" the inflections to produce a "stem" word. An example for Latin is the Schinke Latin Stemmer. The search engine Egothor also has a trainable stemmer component.
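To make the idea of "chopping off" inflections concrete, here is a deliberately naive longest-suffix stripper in Python. It is not the Schinke algorithm and not Egothor's stemmer; the suffix list and the `min_stem` threshold are arbitrary choices for this sketch.

```python
# Illustrative suffix list only; a real stemmer uses a fuller, ordered rule set.
SUFFIXES = (
    "ibus", "orum", "arum", "ius", "um", "us", "is", "em",
    "es", "ae", "am", "as", "os", "a", "e", "i", "o",
)


def toy_stem(word, min_stem=2):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

With this list, 'hominis', 'hominem' and 'hominibus' all reduce to the same stem, but 'homo' does not, which shows why stemming is cheaper than lemmatisation but less reliable for a language like Latin.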

Orthographic Variation

Another difficulty in searching a corpus is orthographic (spelling) variation in the text. Latin, for example, has no standard orthography, so in diplomatic transcriptions (where the editor has not normalized the spelling but reproduces it as it stands in the source) the same word may appear in different spellings throughout the corpus. XTF has a good introduction to how it approaches spelling correction in its search engine (mainly from the perspective of users mistyping their query, but the problem is the same).
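One simple way to cope with some of this variation is to index and query on a normalised key rather than on the transcribed spelling. This is a different technique from the spelling correction that XTF describes. The Python sketch below assumes that folding case, diacritics, u/v and i/j is acceptable for the corpus in question; the function name and the choice of folds are illustrative only.

```python
import unicodedata

# Fold the two most common Latin letter alternations onto one spelling.
FOLD = str.maketrans({"j": "i", "v": "u"})


def normalise_latin(word):
    """Lower-case, strip combining diacritics, and fold j->i and v->u."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.translate(FOLD)
```

Both 'Iuvenis' and 'juuenis' then map to the same key, so a query in either spelling finds occurrences of the other, while the diplomatic spelling itself stays untouched in the text.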

Curated Lexico-morphological Data

Numerous services and tools provide, for any word in a given ancient text, the possible lexico-morphological combinations (e.g., τῶν has six possible analyses: the lexeme ὁ or ὅς, each in the masculine, feminine, or neuter). Such data is useful in many contexts, especially pedagogical ones, but it will include many forms that are incorrect for the context. Some scholarly research questions require well-curated datasets, in which alternative lexico-morphological forms are eliminated, weighted, or qualified. Curating a lexico-morphological dataset can be time-consuming (due in part to interpretive difficulties) but enormously profitable, since such data can be queried in sophisticated ways (e.g., in this corpus, how much more frequent are first-person aorists than third-person indicatives?). Further, such curated data can help refine other sets of lexico-morphological data by priming an algorithm with the likelihood of forms.
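A query of the kind mentioned above becomes straightforward once each token carries exactly one curated analysis. The Python sketch below uses a handful of invented tokens and hypothetical field names (`person`, `tense`, `mood`); real datasets such as those listed here each have their own tag sets and formats.

```python
# Hypothetical curated tokens: exactly one analysis has been kept per token.
TOKENS = [
    {"form": "ἔλυσα", "lemma": "λύω", "person": "1",
     "tense": "aorist", "mood": "indicative"},
    {"form": "λύει", "lemma": "λύω", "person": "3",
     "tense": "present", "mood": "indicative"},
    {"form": "ἔλυσε", "lemma": "λύω", "person": "3",
     "tense": "aorist", "mood": "indicative"},
    {"form": "εἶπον", "lemma": "λέγω", "person": "1",
     "tense": "aorist", "mood": "indicative"},
]


def count_matching(tokens, **criteria):
    """Count tokens whose tags equal every given criterion."""
    return sum(
        1 for t in tokens
        if all(t.get(key) == value for key, value in criteria.items())
    )


first_person_aorists = count_matching(TOKENS, person="1", tense="aorist")
third_person_indicatives = count_matching(TOKENS, person="3", mood="indicative")
```

The ratio of the two counts answers the sample question for this toy corpus. Note that εἶπον could also be a third-person plural; the count is only meaningful because curation has already settled on one reading.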

Listed here are published datasets of lexico-morphological data for ancient texts.

Coptic

  • New Testament: The Coptic SCRIPTORIUM is in the process of curating lexico-morphological data for the New Testament, data as yet unpublished.

Greek

  • The collection of EpiDoc-compliant texts of the Open Greek and Latin Project [1] and PerseusDL [2] has been automatically analyzed morphologically and lemmatized [3].
  • Classical corpora: Perseus Ancient Greek Dependency Treebank version 2.0. Data is semi-automatically annotated. See also Ancient Greek and Latin Dependency Treebank.
  • New Testament: Morphological tagging of the SBL Greek New Testament (plain text UTF-8) (TAN-LM XML format)
  • Septuagint: CCAT tagging of Rahlfs's edition of the Septuagint (source CCAT files, UTF-8; text in Betacode) (UTF-8, Biola Unbound Bible; derivative from the CCAT files, with Betacode converted to Unicode). NB, the CCAT opted to segment off verbal prefixes in the lexeme field, e.g., A)/GW E)K in Gen. 1.24. The Biola-converted data has fused these elements together, ἐκἄγω, without reconciliation (ἐξάγω).

Latin

Syriac

  • New Testament: Beth Mardutho. Sedra version 3 available for download at CUA. A version 4 is under development as of April 2016.

Tools

See also