languageDetector

Language detector: Simple explanation and Download.


Guesses the language of any text by a simple algorithm:


  1. Find the most common Unicode category among the 20 most common characters of the text.

  2. If this category is a letter category (one starting with "L": Ll = lowercase letter, Lu = uppercase letter, Lo = other letter), continue. If not (i.e. the text consists mostly of other characters such as punctuation), give up.

  3. Check, among the 20 most common characters, whether the first word of the Unicode name gives a unique language name:

    1. ("Lo","CJK","zh"),
    2. ("Lo","HANGUL","ko"),
    3. ("Lo","HIRAGANA","ja"),
    4. ("Lo","KATAKANA","ja"),
    5. ("Ll","GREEK","el"),
    6. ("Lu","GREEK","el"),
    7. ("Lo","GUJARATI","gu"),
    8. ("Lo","GEORGIAN","ka"),
    9. ("Lo","BENGALI","bn"),
    10. ("Lo","TAMIL","ta"),
    11. ("Lo","THAI","th"),
    12. ("Lo","THAANA","dv"),
    13. ("Lo","DEVANAGARI","ngram"),
    14. ("Ll","CYRILLIC","ngram"),
    15. ("Lu","CYRILLIC","ngram"),
    16. ("Lo","ARABIC","ngram"),
    17. ("Lo","ARABIC","ngram"),
    18. ("Ll","LATIN","ngram"),
    19. ("Lu","LATIN","ngram")
  4. If yes, we're done. If not (the entry says "ngram"), we have to choose by n-gram:

  5. Compare the frequency of the 500 most common trigrams of the given text with the (precomputed) trigrams of the languages for which we have a small corpus:

    1. Afrikaans (af)
    2. Arabic (ar)
    3. Azerbaijani (az)
    4. Bengali (bn)
    5. Breton (br)
    6. Bulgarian (bg)
    7. Cantonese (Roman script) (cdo)
    8. Catalan (ca)
    9. Chinese (zh)
    10. Czech (cs)
    11. Danish (da)
    12. Dutch (nl)
    13. English (en)
    14. Esperanto (eo)
    15. Estonian (et)
    16. Farsi (fa)
    17. Finnish (fi)
    18. French (fr)
    19. Frisian (fy)
    20. Galician (gl)
    21. German (de)
    22. Greek (el)
    23. Hebrew (he)
    24. Hindi (hi)
    25. Hungarian (hu)
    26. Indonesian (id)
    27. Irish Gaelic (ga)
    28. Italian (it)
    29. Japanese (ja)
    30. Korean (ko)
    31. Malagasy (mg)
    32. Malay (ms)
    33. Marathi (mr)
    34. Norwegian (no)
    35. Norwegian (Nynorsk) (nn)
    36. Pashto (ps)
    37. Polish (pl)
    38. Portuguese (pt)
    39. Romanian (ro)
    40. Russian (ru)
    41. Slovak (sk)
    42. Slovenian (sl)
    43. Spanish (es)
    44. Swedish (sv)
    45. Turkish (tr)
    46. Ukrainian (uk)
    47. Uzbek (uz)
    48. Vietnamese (vi)
    49. Volapük (vo)
    50. plus Gujarati (gu), Georgian (ka), Tamil (ta), Thai (th) and Dhivehi (dv), mentioned before, which makes more than 50 languages!

  6. Show the language with the lowest distance to the given text.
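
The steps above can be sketched in a few lines of Python. This is only an illustration, not the code shipped in the zip file: the function names, the shortened script table, the toy profiles and in particular the distance measure (summed absolute difference of relative trigram frequencies) are assumptions, since the page does not say which distance the real script uses. Taking the most frequent script among the top characters is also just one reading of step 3.

```python
import unicodedata
from collections import Counter

# Excerpt of the (category, first word of Unicode name, result) table above.
SCRIPT_TABLE = {
    ("Lo", "CJK"): "zh",
    ("Lo", "HANGUL"): "ko",
    ("Lo", "HIRAGANA"): "ja",
    ("Lo", "KATAKANA"): "ja",
    ("Ll", "GREEK"): "el",
    ("Lu", "GREEK"): "el",
    ("Lo", "THAI"): "th",
    ("Ll", "CYRILLIC"): "ngram",
    ("Lu", "CYRILLIC"): "ngram",
    ("Ll", "LATIN"): "ngram",
    ("Lu", "LATIN"): "ngram",
}


def guess_by_script(text, top=20):
    """Steps 1-4: return a language code, "ngram", or None (give up)."""
    common = [ch for ch, _ in Counter(text).most_common(top)]
    if not common:
        return None
    categories = Counter(unicodedata.category(ch) for ch in common)
    main_category = categories.most_common(1)[0][0]
    if not main_category.startswith("L"):
        return None  # mostly punctuation, digits, spaces...
    # Assumption: use the most frequent script among the top characters.
    scripts = Counter(
        unicodedata.name(ch, "").split(" ")[0]
        for ch in common
        if unicodedata.category(ch) == main_category
    )
    main_script = scripts.most_common(1)[0][0]
    return SCRIPT_TABLE.get((main_category, main_script))


def trigram_profile(text, size=500):
    """Relative frequency of the `size` most common character trigrams."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1
    return {gram: n / total for gram, n in counts.most_common(size)}


def distance(profile, reference):
    """Assumed metric: summed absolute difference of trigram frequencies."""
    return sum(abs(freq - reference.get(gram, 0.0))
               for gram, freq in profile.items())


def guess_language(text, profiles):
    """Steps 1-6: script lookup first, trigram comparison if needed."""
    route = guess_by_script(text)
    if route != "ngram":
        return route
    own = trigram_profile(text)
    return min(profiles, key=lambda lang: distance(own, profiles[lang]))


# Toy stand-ins for the precomputed per-language trigram profiles.
TOY_PROFILES = {
    "en": trigram_profile(
        "the cat and the dog are in the house and the garden with the children"),
    "fr": trigram_profile(
        "le chat et le chien sont dans la maison et le jardin avec les enfants"),
}

print(guess_language("안녕하세요 반갑습니다", TOY_PROFILES))           # ko
print(guess_language("the children are in the garden", TOY_PROFILES))  # en
print(guess_language("les enfants sont dans le jardin", TOY_PROFILES)) # fr
```

With real corpora the profiles would be computed once and stored, so that only the profile of the input text has to be built at detection time.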


TODO: distinguish simplified Chinese, traditional Chinese and Cantonese

TODO: add more languages (very easy, a web page is enough - send me one!)




The page also computes the result from the Textcat implementation by Thomas Mangin and from the Compact Language Detector 2.



The code is under the GPL and is used in Gromoteur for getting "clean" corpora. Written by Kim Gerdes.

Download:

Zip file with language resources. Usage: python languageDetectorScript.py -r -f example.es.txt
