Stopwords for Greek and Latin

From The Digital Classicist Wiki
Revision as of 20:39, 29 June 2017 by AurelienBerra

Status quaestionis

stop word, n. A very common word that is generally uninteresting to search for (an XTF Definition).

A list of stop words is useful to have handy if you are not a linguist with a special interest in words like Latin "cum" or Greek "kai", if you have a large collection of Greek or Latin texts and want to make searches in it more efficient, or if you have to prepare an index to such a collection (probably based on automatic concordances).

Of course, such "uninteresting" words will not be excluded from your search results (thanks to so-called "bigramming", cf. XTF Definition). You can also offer users of your collections both options: searches with stop words filtered out and searches without the filter (as is done in Perseus under PhiloLogic).

However, at the moment there are no stop word lists freely available for Greek or Latin; it seems that people compile them when they need them (and if they have the time), repeating the same work instead of improving on what others have already done.

The stop words (apparently) used by Perseus are (see reading/src/perseus/language/analyzers/greek/ and reading/src/perseus/language/analyzers/latin/ in the source):

  • Greek (Beta Code): mh/, e(autou=, a)/n, a)ll', a)lla/, a)/llos, a)po/, a)/ra, au)to/s, d', de/, dh/, dia/, dai/, dai/s, e)/ti, e)gw/, e)k, e)mo/s, e)n, e)pi/, ei), ei)mi/, ei)/mi, ei)s, ga/r, ge, ga^, h(, h)/, kai/, kata/, me/n, meta/, mh/, o(, o(/de, o(/s, o(/stis, o(/ti, ou(/tws, ou(=tos, ou)/te, ou)=n, ou)dei/s, oi(, ou), ou)de/, ou)k, peri/, pro/s, su/, su/n, ta/, te, th/n, th=s, th=|, ti, ti/, tis, ti/s, to/, toi/, toiou=tos, to/n, tou/s, tou=, tw=n, tw=|, u(mo/s, u(pe/r, u(po/, w(s, w)=, w(/ste, e)a/n, para/, so/s
  • Greek (converted to Unicode): μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός [you'll probably want to add τοῖς and ταῖς]
  • Latin: ab, ac, ad, adhic, aliqui, aliquis, an, ante, apud, at, atque, aut, autem, cum, cur, de, deinde, dum, ego, enim, ergo, es, est, et, etiam, etsi, ex, fio, haud, hic, iam, idem, igitur, ille, in, infra, inter, interim, ipse, is, ita, magis, modo, mox, nam, ne, nec, necque, neque, nisi, non, nos, o, ob, per, possum, post, pro, quae, quam, quare, qui, quia, quicumque, quidem, quilibet, quis, quisnam, quisquam, quisque, quisquis, quo, quoniam, sed, si, sic, sive, sub, sui, sum, super, suus, tam, tamen, trans, tu, tum, ubi, uel, uero
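As an illustration of how such a list is typically applied, here is a minimal Python sketch that filters stop words out of a tokenised line. The names `LATIN_STOPWORDS`, `tokenize` and `filter_stopwords` are hypothetical, the set is only a short excerpt of the Latin list above, and the tokenizer is deliberately naive; none of this reflects how Perseus itself is implemented.

```python
import re

# Short excerpt of the Latin list above (assumption: a real application
# would load the complete list, e.g. from a file).
LATIN_STOPWORDS = {
    "ab", "ac", "ad", "atque", "aut", "cum", "de", "et", "ex",
    "in", "non", "per", "qui", "sed", "si",
}

def tokenize(text):
    """Lowercase the text and return its runs of letters (naive tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def filter_stopwords(tokens, stopwords=LATIN_STOPWORDS):
    """Return the tokens that are not in the stop word set, in order."""
    return [t for t in tokens if t not in stopwords]

if __name__ == "__main__":
    line = "Arma virumque cano, Troiae qui primus ab oris"
    print(filter_stopwords(tokenize(line)))
```

Note that the Latin list spells "uel" and "uero" with u, so a text that prints v would probably need its spelling normalised before matching. Keeping the unfiltered tokens alongside the filtered ones is one way to offer both kinds of search.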

For Greek (when no lemmatisation is available) you may sometimes need a list including the various possible breathings and accents. Here is an extended version of the above list, also featuring both forms of sigma and of the apostrophe as encountered in digital sources:

  • Greek (extended list): ἄλλος, ἄλλοσ, ἄν, ἂν, ἄρα, ἀλλ, ἀλλ', ἀλλ’, ἀλλά, ἀλλὰ, ἀπό, ἀπὸ, αὐτός, αὐτόσ, αὐτὸς, αὐτὸσ, δ, δ', δ’, δαί, δαὶ, δαίς, δαίσ, δαὶς, δαὶσ, δέ, δὲ, δή, δὴ, διά, διὰ, ἑαυτοῦ, ἔτι, ἐάν, ἐὰν, ἐγώ, ἐγὼ, ἐκ, ἐμός, ἐμόσ, ἐμὸς, ἐμὸσ, ἐν, ἐπί, ἐπὶ, εἰ, εἴμι, εἰμί, εἰς, εἰσ, γάρ, γὰρ, γᾶ, γε, ἡ, ἤ, ἢ, καί, καὶ, κατά, κατὰ, μέν, μὲν, μετά, μετὰ, μή, μὴ, ὁ, ὅδε, ὅς, ὅσ, ὃς, ὃσ, ὅστις, ὅστισ, ὅτι, οἱ, οὕτως, οὕτωσ, οὗτος, οὗτοσ, οὐ, οὔτε, οὖν, οὐδέ, οὐδὲ, οὐδείς, οὐδείσ, οὐδεὶς, οὐδεὶσ, οὐκ, οὔκ, οὐχ, παρά, παρὰ, περί, περὶ, πρός, πρόσ, πρὸς, πρὸσ, σός, σόσ, σὸς, σὸσ, σύ, σὺ, σύν, σὺν, τά, τὰ, τάσ, τάς, τὰσ, τὰς, ταῖς, ταῖσ, τε, τήν, τὴν, τῆς, τῆσ, τῇ, τι, τί, τὶ, τίς, τίσ, τις, τισ, τό, τὸ, τόν, τὸν, τοί, τοὶ, τοιοῦτος, τοιοῦτοσ, τοῖς, τοῖσ, τούς, τούσ, τοὺς, τοὺσ, τοῦ, τῶν, τῷ, ὑμός, ὑμὸς, ὑμόσ, ὑμὸσ, ὑπέρ, ὑπό, ὑπὸ, ὥσ, ὥστε, ὡς, ὡσ, ὦ
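An alternative to listing every variant by hand is to normalise both the list and the incoming tokens before comparing them. The following Python sketch is one possible approach, not the method used to build the extended list: it folds grave accents to acute, final sigma to medial sigma, and some apostrophe look-alikes to the ASCII apostrophe. The names `normalize_greek`, `BASE_STOPWORDS` and `is_stopword` are hypothetical, and the base list is a small excerpt.

```python
import unicodedata

# Apostrophe look-alikes found in digital texts, mapped to the ASCII apostrophe
# (assumption: these three cover the sources at hand; others may occur).
APOSTROPHES = str.maketrans({"\u2019": "'", "\u02bc": "'", "\u1fbd": "'"})

def normalize_greek(token):
    """Fold grave to acute, final sigma to medial sigma, and unify apostrophes."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", token)
    # Combining grave (U+0300) becomes combining acute (U+0301).
    decomposed = decomposed.replace("\u0300", "\u0301")
    # Final sigma (U+03C2) becomes medial sigma (U+03C3).
    decomposed = decomposed.replace("\u03c2", "\u03c3")
    decomposed = decomposed.translate(APOSTROPHES)
    # Recompose into precomposed characters where possible.
    return unicodedata.normalize("NFC", decomposed)

# Small excerpt of the basic list; only one form per word is needed.
BASE_STOPWORDS = ["καί", "δέ", "πρός", "ἀλλ'", "τις", "τίς"]
STOPWORDS = {normalize_greek(w) for w in BASE_STOPWORDS}

def is_stopword(token):
    """Check a token against the normalised stop word set."""
    return normalize_greek(token) in STOPWORDS

if __name__ == "__main__":
    for form in ["καὶ", "πρὸσ", "ἀλλ’", "λόγος"]:
        print(form, is_stopword(form))
```

Accents are folded rather than stripped, so pairs such as τις and τίς remain distinct. Elided forms written without any apostrophe (ἀλλ, δ) are not covered by this normalisation and would still have to be listed explicitly.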

Word frequencies may be distributed differently in your own corpus. One approach is to build a Lucene index of your corpus with no stop words first, then use Luke to get the top n terms and filter that result depending on what kind of stop word behavior you want.
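If Lucene and Luke are not at hand, the same idea (count every term, then inspect the most frequent ones) can be sketched in a few lines of Python. This is a stand-in for the Lucene/Luke workflow, not a reproduction of it; the function name `top_terms` and the two-line toy corpus are assumptions made for illustration.

```python
import re
from collections import Counter

def top_terms(texts, n=10):
    """Return the n most frequent lowercase word forms across the texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenisation: runs of letters, lowercased.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts.most_common(n)

if __name__ == "__main__":
    # Toy corpus for demonstration only; replace with your own documents.
    corpus = ["et in Arcadia ego", "veni et vidi et vici"]
    for term, freq in top_terms(corpus, n=3):
        print(term, freq)
```

The resulting candidates still need manual review: a frequent term is not necessarily an uninteresting one, and for Greek the word forms would probably have to be normalised before counting.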


The tag LatinWordStopList on BibSonomy provides a working bibliography of bookmarks and publications on word frequency in Latin.