Guesses the language of any text by a simple algorithm:
Find the most common Unicode category among the 20 most common characters of the text.
If this category is a letter (category starting with "L": Ll = letter, lowercase; Lu = letter, uppercase; Lo = letter, other), continue; if not (i.e. mostly other characters such as punctuation), give up.
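A minimal sketch of this first check, using only the standard library. The function names are hypothetical, and it assumes each of the 20 most common characters counts once toward its category (the description does not say whether the count is weighted by frequency):

```python
import unicodedata
from collections import Counter

def dominant_category(text, top=20):
    """Most common Unicode category among the `top` most frequent characters."""
    # Assumption: each distinct character contributes one vote to its category.
    common = [ch for ch, _ in Counter(text).most_common(top)]
    categories = Counter(unicodedata.category(ch) for ch in common)
    return categories.most_common(1)[0][0] if categories else None

def looks_like_letters(text):
    """True if the dominant category is a letter category (Ll, Lu, Lo, ...)."""
    category = dominant_category(text)
    return category is not None and category.startswith("L")
```

With this sketch, ordinary prose passes the check, while a string made mostly of digits or punctuation is rejected before any further work is done.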
Check whether the first word of the Unicode names of the 20 most common characters identifies a single language.
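The script check can be sketched as follows. The names `SCRIPT_TO_LANGUAGE`, `scripts_of` and `language_from_script` are hypothetical, and the mapping is a small illustrative subset rather than the tool's real table:

```python
import unicodedata
from collections import Counter

# Illustrative subset only: scripts that are used by a single language here.
SCRIPT_TO_LANGUAGE = {
    "GUJARATI": "gu",
    "GEORGIAN": "ka",
    "TAMIL": "ta",
    "THAI": "th",
    "THAANA": "dv",  # Dhivehi is written in the Thaana script
}

def scripts_of(text, top=20):
    """First word of the Unicode name of each letter among the top characters."""
    scripts = set()
    for ch, _ in Counter(text).most_common(top):
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def language_from_script(text):
    """Return a language code if the script alone decides, else None."""
    scripts = scripts_of(text)
    if len(scripts) == 1:
        return SCRIPT_TO_LANGUAGE.get(next(iter(scripts)))
    return None
```

A Latin-script text yields `None` in this sketch, because the script LATIN is shared by many languages and cannot decide on its own.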
If yes, we're done; if not, we have to choose by n-gram:
Compare the frequencies of the 500 most common trigrams of the given text with the (precomputed) trigrams of the languages for which we have a small corpus:
+ Gujarati (gu), Georgian (ka), Tamil (ta), Thai (th), Dhivehi (dv), recognized by their script as mentioned before =>
more than 50 languages!
Show the language with the lowest distance to the given text.
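The trigram comparison might look like the sketch below. The description does not say which distance is used, so this assumes the common "out-of-place" rank distance; the function names and the `profiles` dictionary (language code to ranked trigram list) are hypothetical:

```python
from collections import Counter

def top_trigrams(text, limit=500):
    """Trigrams of the text, most frequent first, at most `limit` of them."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [trigram for trigram, _ in counts.most_common(limit)]

def rank_distance(sample, profile, limit=500):
    """Assumed metric: sum of rank differences, with a fixed penalty for misses."""
    rank = {trigram: i for i, trigram in enumerate(profile)}
    return sum(
        abs(i - rank[trigram]) if trigram in rank else limit
        for i, trigram in enumerate(sample)
    )

def closest_language(text, profiles):
    """Language code whose precomputed profile is nearest to the text."""
    sample = top_trigrams(text)
    return min(profiles, key=lambda lang: rank_distance(sample, profiles[lang]))
```

Here `profiles` stands in for the precomputed per-language trigram lists; in practice each profile would be built once from a corpus with `top_trigrams` and stored.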
TODO: distinguish between simplified Chinese, traditional Chinese and Cantonese
TODO: add more languages (very easy, a web page is enough - send me one!)