Difference between revisions of "Stopwords for Greek and Latin"

From The Digital Classicist Wiki
(mention ongoing discussion towards update)
(Update after comparing, testing and designing lists for Greek and Latin)
Line 1: Line 1:
== Status quaestionis ==
== Status quaestionis ==


''This page needs to be updated: see [https://github.com/aurelberra/stopwords/blob/master/elements_for_discussion.md ongoing discussion].''
''Stopwords'' (or ''stop words'') are "words which are filtered out before or after processing of natural language data" ([https://en.wikipedia.org/wiki/Stop_words Wikipedia]), because they are "very common" words and "generally uninteresting to search for" ([http://xtf.cdlib.org/documentation/under-the-hood/index.html#StopWords XTF Definition]).


'''stop word''', n. A very common word that is generally uninteresting to search for (a [http://xtf.cdlib.org/documentation/under-the-hood/index.html#StopWords XTF Definition]).
An important concept in text mining, information retrieval and natural language processing, they are fundamentally relative: the decision that a given lexical element carries no information and should be filtered out as background noise depends on a specific corpus and a specific purpose.


If you are not a linguist with a special interest in words like Latin "cum" or Greek "kai", or if you have a large collection of Greek or Latin texts and want to make searches in this collection more efficient, or if you have to prepare an index to such a collection (probably based on [[Concording_Greek_and_Latin_texts|automatic concordances]]), it is useful to have a list of stop words handy.
If you are not a linguist with a special interest in words like Latin "cum" or Greek "kai", if you have a large collection of Greek or Latin texts and want to make searches in this collection more efficient, or if you have to prepare an index to such a collection (probably based on [[Concording_Greek_and_Latin_texts|automatic concordances]]), it is useful to have a list of stopwords handy.


Of course, such "uninteresting" words will not be excluded from your search results (thanks to so-called "bigramming", cf. [http://xtf.cdlib.org/documentation/under-the-hood/index.html#StopWords XTF Definition]). You can also have both, offering users of your collections searches with and without stop word filtering (as is done in [http://perseus.uchicago.edu/index.html Perseus under PhiloLogic]).
Of course, such "uninteresting" words will not be excluded from your search results (thanks to so-called "bigramming", cf. the [http://xtf.cdlib.org/documentation/under-the-hood/index.html#StopWords XTF Definition]). You can also have both, offering users of your collections searches with and without stopword filtering (as is done in [http://perseus.uchicago.edu/index.html Perseus under PhiloLogic]).


However, at the moment there are no stop word lists freely available for Greek or Latin; it seems that people compile them when they need them (and if they have the time), thereby doing the same all over again, instead of possibly improving on what others already did.
Most of the time, researchers compile stoplists when they need them (and if they have the time), instead of possibly improving on what others already did. This is why stopword lists openly available for Greek or Latin can be useful.


The stop words (apparently) used by [[Perseus]] are (see reading/src/perseus/language/analyzers/greek/GreekAnalyzer.java and reading/src/perseus/language/analyzers/latin/LatinAnalyzer.java in the [http://sourceforge.net/projects/perseus-hopper/ source]):
The stopwords currently used by the [[Perseus Digital Library]] are (see `GreekAnalyzer.java` and `LatinAnalyzer.java` in the [http://sourceforge.net/projects/perseus-hopper/ source]):
 
* Greek: μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός (original Beta Code: mh/, e(autou=, a)/n, a)ll', a)lla/, a)/llos, a)po/, a)/ra, au)to/s, d', de/, dh/, dia/, dai/, dai/s, e)/ti, e)gw/, e)k, e)mo/s, e)n, e)pi/, ei), ei)mi/, ei)/mi, ei)s, ga/r, ge, ga^, h(, h)/, kai/, kata/, me/n, meta/, mh/, o(, o(/de, o(/s, o(/stis, o(/ti, ou(/tws, ou(=tos, ou)/te, ou)=n, ou)dei/s, oi(, ou), ou)de/, ou)k, peri/, pro/s, su/, su/n, ta/, te, th/n, th=s, th=|, ti, ti/, tis, ti/s, to/, toi/, toiou=tos, to/n, tou/s, tou=, tw=n, tw=|, u(mo/s, u(pe/r, u(po/, w(s, w)=, w(/ste, e)a/n, para/, so/s)
* Caveat: if you use this list, you'll want to add τοῖς and ταῖς, and possibly remove the very infrequent δαίς and ὑμός. (See other problems below.)


* Greek (Beta Code): mh/, e(autou=, a)/n, a)ll', a)lla/, a)/llos, a)po/, a)/ra, au)to/s, d', de/, dh/, dia/, dai/, dai/s, e)/ti, e)gw/, e)k, e)mo/s, e)n, e)pi/, ei), ei)mi/, ei)/mi, ei)s, ga/r, ge, ga^, h(, h)/, kai/, kata/, me/n, meta/, mh/, o(, o(/de, o(/s, o(/stis, o(/ti, ou(/tws, ou(=tos, ou)/te, ou)=n, ou)dei/s, oi(, ou), ou)de/, ou)k, peri/, pro/s, su/, su/n, ta/, te, th/n, th=s, th=|, ti, ti/, tis, ti/s, to/, toi/, toiou=tos, to/n, tou/s, tou=, tw=n, tw=|, u(mo/s, u(pe/r, u(po/, w(s, w)=, w(/ste, e)a/n, para/, so/s
* Greek (converted to Unicode): μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός [you'll probably want to add τοῖς and ταῖς]
* Latin: ab, ac, ad, adhic, aliqui, aliquis, an, ante, apud, at, atque, aut, autem, cum, cur, de, deinde, dum, ego, enim, ergo, es, est, et, etiam, etsi, ex, fio, haud, hic, iam, idem, igitur, ille, in, infra, inter, interim, ipse, is, ita, magis, modo, mox, nam, ne, nec, necque, neque, nisi, non, nos, o, ob, per, possum, post, pro, quae, quam, quare, qui, quia, quicumque, quidem, quilibet, quis, quisnam, quisquam, quisque, quisquis, quo, quoniam, sed, si, sic, sive, sub, sui, sum, super, suus, tam, tamen, trans, tu, tum, ubi, uel, uero
* Latin: ab, ac, ad, adhic, aliqui, aliquis, an, ante, apud, at, atque, aut, autem, cum, cur, de, deinde, dum, ego, enim, ergo, es, est, et, etiam, etsi, ex, fio, haud, hic, iam, idem, igitur, ille, in, infra, inter, interim, ipse, is, ita, magis, modo, mox, nam, ne, nec, necque, neque, nisi, non, nos, o, ob, per, possum, post, pro, quae, quam, quare, qui, quia, quicumque, quidem, quilibet, quis, quisnam, quisquam, quisque, quisquis, quo, quoniam, sed, si, sic, sive, sub, sui, sum, super, suus, tam, tamen, trans, tu, tum, ubi, uel, uero
* Caveat: if you use this list, you'll want to correct "adhic" to "adhuc". (See other problems below.)
The statistical criteria used in selecting the words are not explicit. These lists were designed for a search engine, which also normalises some features of the corpus and of the user input. Accordingly, they cannot simply be re-used. Depending on your purpose and tools, especially whether lemmatisation is available or not, you will have to take into account problems like the following:
* In Greek: alternative breathings and accents, dialectal forms, final and lunate ''sigma'', forms of ''beta'', emphatic iota, iota subscript or adscript, crasis, elisions, one-letter words, and numerals, as well as the normalisation of Unicode precomposed forms.
* In Latin: u/v and i/j variants, abbreviations of common ''praenomina'', one-letter words, and numerals.
To determine which stopwords you need, you should analyse your corpus with the tool or programming language of your choice.


For Greek (when no lemmatisation is available) you may sometimes need a list including the various possible breathings and accents. Here is an extended version of the above list, also featuring both forms of ''sigma'' and of the apostrophe as encountered in digital sources:
One approach may be to run a Lucene index on your corpus with no stopwords first, then use [http://www.getopt.org/luke/ Luke] to get the top ''n'' terms for your corpus and filter that result depending on what kind of stopword behavior you want.


* Greek (extended list): ἄλλος, ἄλλοσ, ἄν, ἂν, ἄρα, ἀλλ, ἀλλ', ἀλλ’, ἀλλά, ἀλλὰ, ἀπό, ἀπὸ, αὐτός, αὐτόσ, αὐτὸς, αὐτὸσ, δ, δ', δ’, δαί, δαὶ, δαίς, δαίσ, δαὶς, δαὶσ, δέ, δὲ, δή, δὴ, διά, διὰ, ἑαυτοῦ, ἔτι, ἐάν, ἐὰν, ἐγώ, ἐγὼ, ἐκ, ἐμός, ἐμόσ, ἐμὸς, ἐμὸσ, ἐν, ἐπί, ἐπὶ, εἰ, εἴμι, εἰμί, εἰς, εἰσ, γάρ, γὰρ, γᾶ, γε, ἡ, ἤ, ἢ, καί, καὶ, κατά, κατὰ, μέν, μὲν, μετά, μετὰ, μή, μὴ, ὁ, ὅδε, ὅς, ὅσ, ὃς, ὃσ, ὅστις, ὅστισ, ὅτι, οἱ, οὕτως, οὕτωσ, οὗτος, οὗτοσ, οὐ, οὔτε, οὖν, οὐδέ, οὐδὲ, οὐδείς, οὐδείσ, οὐδεὶς, οὐδεὶσ, οὐκ, οὔκ, οὐχ, παρά, παρὰ, περί, περὶ, πρός, πρόσ, πρὸς, πρὸσ, σός, σόσ, σὸς, σὸσ, σύ, σὺ, σύν, σὺν, τά, τὰ, τάσ, τάς, τὰσ, τὰς, ταῖς, ταῖσ, τε, τήν, τὴν, τῆς, τῆσ, τῇ, τι, τί, τὶ, τίς, τίσ, τις, τισ, τό, τὸ, τόν, τὸν, τοί, τοὶ, τοιοῦτος, τοιοῦτοσ, τοῖς, τοῖσ, τούς, τούσ, τοὺς, τοὺσ, τοῦ, τῶν, τῷ, ὑμός, ὑμὸς, ὑμόσ, ὑμὸσ, ὑπέρ, ὑπό, ὑπὸ, ὥσ, ὥστε, ὡς, ὡσ, ὦ
To learn more about the problems and possibilities, please refer to projects offering alternative lists or methods:


Word frequencies could be distributed differently in your corpus. One approach may be to run a Lucene index on your corpus with no stop words first, then use [http://www.getopt.org/luke/ Luke] to get the top ''n'' terms for your corpus and filter that result depending on what kind of stop word behavior you want.
* [https://github.com/aurelberra/stopwords Ancient Greek and Latin stopwords for textual analysis] provides static stoplists primarily designed for use on the [http://voyant-tools.org/ Voyant Tools] platform, but also documents their creation, which involved comparing existing lists and basing new proposals on a statistical analysis of the most frequent words in TLG E and PHI 5 (see [https://github.com/aurelberra/stopwords/blob/master/rationale.md rationale and history] and detailed [https://github.com/aurelberra/stopwords/blob/master/revision_notes.md revision notes]).
* The [http://cltk.org/ Classical Language Toolkit] was using slightly modified versions of the Perseus lists, but is in the process of implementing dynamic stoplists in its command-line tools.


== Bibliography ==
== Bibliography ==

Revision as of 10:20, 26 January 2018

Status quaestionis

Stopwords (or stop words) are "words which are filtered out before or after processing of natural language data" (Wikipedia), because they are "very common" words and "generally uninteresting to search for" (XTF Definition).

An important concept in text mining, information retrieval and natural language processing, they are fundamentally relative: the decision that a given lexical element carries no information and should be filtered out as background noise depends on a specific corpus and a specific purpose.

If you are not a linguist with a special interest in words like Latin "cum" or Greek "kai", if you have a large collection of Greek or Latin texts and want to make searches in this collection more efficient, or if you have to prepare an index to such a collection (probably based on automatic concordances), it is useful to have a list of stopwords handy.

Of course, such "uninteresting" words will not be excluded from your search results (thanks to so-called "bigramming", cf. the XTF Definition). You can also have both, offering users of your collections searches with and without stopword filtering (as is done in Perseus under PhiloLogic).

Most of the time, researchers compile stoplists when they need them (and if they have the time), instead of possibly improving on what others already did. This is why stopword lists openly available for Greek or Latin can be useful.

The stopwords currently used by the Perseus Digital Library are (see `GreekAnalyzer.java` and `LatinAnalyzer.java` in the source):

  • Greek: μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός (original Beta Code: mh/, e(autou=, a)/n, a)ll', a)lla/, a)/llos, a)po/, a)/ra, au)to/s, d', de/, dh/, dia/, dai/, dai/s, e)/ti, e)gw/, e)k, e)mo/s, e)n, e)pi/, ei), ei)mi/, ei)/mi, ei)s, ga/r, ge, ga^, h(, h)/, kai/, kata/, me/n, meta/, mh/, o(, o(/de, o(/s, o(/stis, o(/ti, ou(/tws, ou(=tos, ou)/te, ou)=n, ou)dei/s, oi(, ou), ou)de/, ou)k, peri/, pro/s, su/, su/n, ta/, te, th/n, th=s, th=|, ti, ti/, tis, ti/s, to/, toi/, toiou=tos, to/n, tou/s, tou=, tw=n, tw=|, u(mo/s, u(pe/r, u(po/, w(s, w)=, w(/ste, e)a/n, para/, so/s)
  • Caveat: if you use this list, you'll want to add τοῖς and ταῖς, and possibly remove the very infrequent δαίς and ὑμός. (See other problems below.)
  • Latin: ab, ac, ad, adhic, aliqui, aliquis, an, ante, apud, at, atque, aut, autem, cum, cur, de, deinde, dum, ego, enim, ergo, es, est, et, etiam, etsi, ex, fio, haud, hic, iam, idem, igitur, ille, in, infra, inter, interim, ipse, is, ita, magis, modo, mox, nam, ne, nec, necque, neque, nisi, non, nos, o, ob, per, possum, post, pro, quae, quam, quare, qui, quia, quicumque, quidem, quilibet, quis, quisnam, quisquam, quisque, quisquis, quo, quoniam, sed, si, sic, sive, sub, sui, sum, super, suus, tam, tamen, trans, tu, tum, ubi, uel, uero
  • Caveat: if you use this list, you'll want to correct "adhic" to "adhuc". (See other problems below.)
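
The two caveats above can be applied programmatically rather than by hand. The following is a minimal Python sketch, not part of the Perseus code: the function names and the short excerpt lists are illustrative assumptions, and the excerpts stand in for the full lists quoted above. Entries are brought to Unicode NFC first so that visually identical Greek strings compare equal.

```python
import unicodedata

# Excerpts only (assumption): a few entries standing in for the full lists above.
greek_excerpt = ["καί", "δέ", "δαίς", "ὑμός", "τῷ"]
latin_excerpt = ["ab", "ac", "ad", "adhic", "aliqui"]


def nfc(word):
    """Return the word in Unicode NFC so that equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", word)


def patch_greek(stopwords):
    """Drop the two very rare entries and add the missing article forms."""
    rare = {nfc("δαίς"), nfc("ὑμός")}
    patched = [nfc(w) for w in stopwords if nfc(w) not in rare]
    for extra in (nfc("τοῖς"), nfc("ταῖς")):
        if extra not in patched:
            patched.append(extra)
    return patched


def patch_latin(stopwords):
    """Correct the misspelling 'adhic' to 'adhuc', leaving other entries as they are."""
    return ["adhuc" if w == "adhic" else w for w in stopwords]


print(patch_greek(greek_excerpt))
print(patch_latin(latin_excerpt))
```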

The statistical criteria used in selecting the words are not explicit. These lists were designed for a search engine, which also normalises some features of the corpus and of the user input. Accordingly, they cannot simply be re-used. Depending on your purpose and tools, especially whether lemmatisation is available or not, you will have to take into account problems like the following:

  • In Greek: alternative breathings and accents, dialectal forms, final and lunate sigma, forms of beta, emphatic iota, iota subscript or adscript, crasis, elisions, one-letter words, and numerals, as well as the normalisation of Unicode precomposed forms.
  • In Latin: u/v and i/j variants, abbreviations of common praenomina, one-letter words, and numerals.
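
Several of the problems listed above can be reduced by normalising both the stoplist and the corpus tokens in the same way before comparing them. The sketch below is one possible approach in Python, under stated assumptions: the function names are hypothetical, and stripping all diacritics (including iota subscript) is a deliberately aggressive choice that may not suit every purpose. It does not address dialectal forms, crasis, abbreviations or numerals.

```python
import unicodedata


def normalise_greek(token):
    """Lower-case a Greek token, strip diacritics, and unify sigma and apostrophe forms."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Remove combining marks: accents, breathings, diaeresis, iota subscript.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat final sigma and lunate sigma as medial sigma.
    stripped = stripped.replace("ς", "σ").replace("ϲ", "σ")
    # Drop both apostrophe forms used to mark elision.
    stripped = stripped.replace("'", "").replace("’", "")
    return unicodedata.normalize("NFC", stripped)


def normalise_latin(token):
    """Lower-case a Latin token and collapse v to u and j to i."""
    return token.lower().replace("v", "u").replace("j", "i")


# The stoplist must go through the same normalisation as the tokens,
# otherwise an entry such as "sive" would never match a normalised "siue".
latin_stoplist = {normalise_latin(w) for w in ["sive", "uel", "iam"]}
print(normalise_latin("Vel") in latin_stoplist)
print(normalise_greek("Καὶ") == normalise_greek("καί"))
```

With this kind of normalisation, a short base list can cover the accent and sigma variants that would otherwise have to be enumerated one by one.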

To determine which stopwords you need, you should analyse your corpus with the tool or programming language of your choice.

One approach may be to run a Lucene index on your corpus with no stopwords first, then use Luke to get the top n terms for your corpus and filter that result depending on what kind of stopword behavior you want.
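
If Lucene and Luke are not at hand, a comparable frequency list can be produced with a few lines of Python. This is a swapped-in technique, not the Lucene workflow itself: the function name, the simple word-character tokenisation and the sample sentence are all assumptions made for illustration. Text is converted to NFC first so that accented Greek letters stay inside their tokens; apostrophes marking elision are treated as separators.

```python
import re
import unicodedata
from collections import Counter


def top_terms(text, n=100):
    """Return the n most frequent lower-cased word tokens with their counts."""
    normalised = unicodedata.normalize("NFC", text.lower())
    # \w+ matches runs of letters and digits; punctuation and apostrophes split tokens.
    tokens = re.findall(r"\w+", normalised)
    return Counter(tokens).most_common(n)


# Invented sample text, for illustration only.
sample = "arma uirumque cano et arma et uiros et cetera"
print(top_terms(sample, n=2))
```

The resulting candidates still need to be reviewed by hand, since a frequent word is not necessarily an uninformative one for a given purpose.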

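To learn more about the problems and possibilities, please refer to projects offering alternative lists or methods:

  • Ancient Greek and Latin stopwords for textual analysis provides static stoplists primarily designed for use on the Voyant Tools platform, but also documents their creation, which involved comparing existing lists and basing new proposals on a statistical analysis of the most frequent words in TLG E and PHI 5 (see rationale and history and detailed revision notes).
  • The Classical Language Toolkit was using slightly modified versions of the Perseus lists, but is in the process of implementing dynamic stoplists in its command-line tools.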

Bibliography

The tag LatinWordStopList on bibsonomy provides a working bibliography of bookmarks and publications on word frequency in Latin.