Difference between revisions of "Morphological parsing or lemmatising Greek and Latin"

From The Digital Classicist Wiki
==Lemmatisation==

See: [http://en.wikipedia.org/wiki/Lemmatisation Wikipedia page on lemmatisation]

Typically, when implementing a search engine for a digital corpus, one wants searches to find not only the exact word forms given in the query but also other inflections of the search terms. For example, if you search Google for "digital classicism", your results will include [[Digital Classicist]]: even though "classicist" is not the exact word "classicism", you may be interested in the result.

The foremost freely available lemma dictionaries for Greek and Latin are included in the [[Morpheus]] source as XML files. With these you can map many inflected word forms to their lemma. The [http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html Archimedes Project Morphology Service] also provides an XML-RPC web interface to the Morpheus dictionaries.
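
The dictionary lookup described above can be sketched roughly as follows. This is only an illustration: the tiny table is hypothetical data typed in by hand, not the Morpheus XML format, and the names <code>FORM_TO_LEMMAS</code>, <code>build_lemma_index</code> and <code>expand_query</code> are invented for the example.

```python
from collections import defaultdict

# Toy form -> lemma table (hypothetical data, not taken from Morpheus).
FORM_TO_LEMMAS = {
    "puella": ["puella"],
    "puellae": ["puella"],
    "puellam": ["puella"],
    "amo": ["amo"],
    "amat": ["amo"],
    "amant": ["amo"],
}


def build_lemma_index(form_to_lemmas):
    """Invert the table so each lemma lists all of its known forms."""
    index = defaultdict(set)
    for form, lemmas in form_to_lemmas.items():
        for lemma in lemmas:
            index[lemma].add(form)
    return index


def expand_query(term, form_to_lemmas, lemma_index):
    """Return every known form that shares a lemma with the search term."""
    expanded = {term}  # always keep the literal term
    for lemma in form_to_lemmas.get(term, []):
        expanded |= lemma_index[lemma]
    return sorted(expanded)


LEMMA_INDEX = build_lemma_index(FORM_TO_LEMMAS)
print(expand_query("amat", FORM_TO_LEMMAS, LEMMA_INDEX))
```

A search for "amat" would then also match "amo" and "amant". A term missing from the table is searched literally, so the expansion never loses results compared with an exact-match search.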

==Parsing==

A related problem is that of parsing a text to mark up its syntactic structure. This can aid lemmatisation because different lemmata often share the same inflected form, so looking up that form in a lemma dictionary yields several candidate lemmata. Disambiguating to the correct lemma is a difficult problem, and tagging each word in context with its correct part of speech helps considerably. One approach is to use software such as [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger], trained for your language on a [http://en.wikipedia.org/wiki/Treebank Treebank] (such as the [http://nlp.perseus.tufts.edu/syntax/treebank/ Perseus Treebanks]).
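
As a minimal sketch of how a part-of-speech tag narrows the candidates, assume a tagger has already labelled the word in context. The lexicon below is hypothetical toy data, and the tag names <code>NOUN</code> and <code>VERB</code> are placeholders, not the tagset of TreeTagger or of the Perseus Treebanks.

```python
# Toy lexicon: form -> list of (lemma, part of speech). Hypothetical data.
ANALYSES = {
    "canis": [("canis", "NOUN"), ("cano", "VERB")],
    "amor": [("amor", "NOUN"), ("amo", "VERB")],
    "puellae": [("puella", "NOUN")],
}


def candidate_lemmas(form, pos=None):
    """Return candidate lemmas for a form, optionally filtered by POS tag."""
    analyses = ANALYSES.get(form, [])
    if pos is not None:
        # Keep only the analyses whose tag matches the tagger's decision.
        analyses = [(lemma, tag) for lemma, tag in analyses if tag == pos]
    return [lemma for lemma, _ in analyses]


print(candidate_lemmas("canis"))          # ambiguous without context
print(candidate_lemmas("canis", "VERB"))  # narrowed by the assumed tag
```

Without a tag, "canis" stays ambiguous between the noun ''canis'' and the verb ''cano''; with a tag, only the matching lemma remains. Forms that are ambiguous within a single part of speech would still need further morphological or syntactic information.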

==Stemming==

Another approach often used for expanding search results is [http://en.wikipedia.org/wiki/Stemming stemming], which applies algorithmic rules to "chop off" inflectional endings and reduce each inflected word to a "stem". An example for Latin is the [http://snowball.tartarus.org/otherapps/schinke/intro.html Schinke Latin Stemmer]. The search engine Egothor also has [http://www.egothor.org/book/bk01ch01s06.html a trainable stemmer component].
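
To illustrate the general idea of suffix stripping, here is a deliberately naive sketch. It is not the Schinke algorithm: the short list of endings, the minimum stem length and the name <code>naive_stem</code> are all assumptions made for the example.

```python
# Illustrative endings only; a real stemmer uses a much fuller rule set.
SUFFIXES = sorted(
    ["ibus", "arum", "orum", "ae", "am", "as", "is", "os", "um", "us",
     "a", "e", "i", "o"],
    key=len,
    reverse=True,  # try the longest endings first
)
MIN_STEM_LENGTH = 2  # assumed threshold to avoid stripping very short words


def naive_stem(word):
    """Strip the longest matching ending, keeping a minimal stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for form in ["puellae", "puellarum", "dominus", "dominorum", "et"]:
    print(form, "->", naive_stem(form))
```

Both the indexed text and the query would be stemmed, so that "puellae" and "puellarum" meet at the same stem. A stemmer of this kind needs no dictionary, but it strips any word that happens to end in a listed ending, whether or not that ending is really an inflection.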

==Orthographic Variation==

Another difficulty in searching a corpus is orthographic (spelling) variation in the text. For example, Latin has no standard orthography, so in diplomatic transcriptions (where the editor has not normalized the spelling but reproduces it as it stands in the source) the same word may appear in different spellings throughout the corpus. [[XTF]] has [https://sourceforge.net/apps/trac/xtf/wiki/underHood_Spelling a good introduction] to its approach to spelling correction in its search engine (mainly from the perspective of users "mistyping" their query, but the problem is the same).
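
One simple way to reduce such variation, different from the spelling-correction approach XTF describes, is to fold spelling variants to a single form before indexing and searching. The sketch below assumes only two folding rules (j to i, v to u) plus lower-casing; the rule set and the name <code>normalize</code> are illustrative choices, and a real corpus would probably need more rules.

```python
# Fold letter pairs that vary between editions (assumed, minimal rule set).
FOLD = str.maketrans({"j": "i", "v": "u"})


def normalize(word):
    """Lower-case a word and fold j/i and v/u to one spelling."""
    return word.lower().translate(FOLD)


for spelling in ["Juvenis", "iuvenis", "IVVENIS"]:
    print(spelling, "->", normalize(spelling))
```

If the same function is applied to the indexed text and to the query, the three spellings above all match each other.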
  
 
==See also==

* [[Morpheus]]
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1006&L=DIGITALCLASSICIST&F=&S=&P=59 Discussion of morphological analysis on Digital Classicist mailing list]
* [http://perseus.uchicago.edu/about.html About Perseus under PhiloLogic]
* [http://morphadorner.northwestern.edu/ MorphAdorner] "provides methods for adorning text with standard spellings, parts of speech and lemmata" (but has primarily been used for English language texts).

===Solutions for online parsing===

* [http://www.ilc.cnr.it/lemlat/lemlat/index.html LemLat Latin Wordform Lemmatizer] (Istituto di Linguistica Computazionale "Antonio Zampolli" - Consiglio Nazionale delle Ricerche - Area della Ricerca di Pisa)
* [http://www.stanthonypaduainstitute.org/xlateany.htm Latin Parse Help]

Revision as of 17:01, 7 June 2010
