
When specifying the lexical search type for term matching, the collation to use must also be specified, along with a default collation for the language in which the terms to be matched are represented.

Examples based on MySQL collation behavior:

"AAO" matches "ÅÄÖ" in utf8_general_ci and utf8_unicode_ci (and utf8_german2_ci) but not in the utf8_swedish_ci collation.

"Aåa" matches "aåa" in utf8_general_ci and utf8_swedish_ci but not in the utf8_bin collation (i.e. case-insensitive vs. case-sensitive; some searches need case sensitivity).

Similar behavior can be implemented with, for example, java.text.Collator in Java, or in MongoDB by passing a collation option to collection.find() or by calling cursor.collation().
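The MySQL examples above can be approximated with java.text.Collator. The following is a minimal sketch, not a definitive implementation: the class name CollationSketch and the helper matches() are hypothetical, and mapping the *_ci collations onto Collator strengths is an assumption (PRIMARY ignores accents and case, SECONDARY ignores case only, TERTIARY distinguishes both). Non-ASCII letters are written as Unicode escapes so the source does not depend on file encoding. The result for the Swedish locale depends on the locale data shipped with the JDK, so no expected output is given for it.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class for illustration only.
final class CollationSketch {

    // Returns true when the two terms compare as equal under the collator
    // for the given locale at the given strength.
    static boolean matches(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "AAO";
        // A with ring, A with diaeresis, O with diaeresis
        String accented = "\u00C5\u00C4\u00D6";
        // "Aaa" and "aaa" where the middle letter is a with ring
        String upperFirst = "A\u00E5a";
        String lowerFirst = "a\u00E5a";

        // Accent- and case-insensitive comparison under English rules.
        System.out.println(matches(plain, accented, Locale.ENGLISH, Collator.PRIMARY));

        // Same comparison under Swedish rules, where these letters are
        // expected to be distinct if Swedish collation data is available.
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println(matches(plain, accented, swedish, Collator.PRIMARY));

        // Case-insensitive (SECONDARY) versus case-sensitive (TERTIARY).
        System.out.println(matches(upperFirst, lowerFirst, Locale.ENGLISH, Collator.SECONDARY));
        System.out.println(matches(upperFirst, lowerFirst, Locale.ENGLISH, Collator.TERTIARY));
    }
}
```

Passing the locale and strength explicitly, rather than relying on Collator.getInstance() with the platform default, mirrors the point above that both the collation and the language need to be stated for a search.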


1 Comment

  1. There is a Unicode Common Locale Data Repository (CLDR, http://cldr.unicode.org/), which seems to be what developers like Google, Apple, Microsoft, etc. use for their collation data. Collation is part of the locale data for any given language. Locales are identified by their ISO 639-1 language codes and a POSIX-style identification of language variants.

    Another source would be ISO/IEC 15897, the standard for POSIX locales (a.k.a. C locales). These are identified by language and country separated by an underscore, e.g. sv_SE. (http://pubs.opengroup.org/onlinepubs/7908799/xbd/locale.html)

    BCP 47/RFC 3066 specifies tags for identifying languages, used e.g. in XML, HTML, RDF etc. (https://www.w3.org/2005/05/font-size-test/em-test.html), but does not refer to collation. IANA also lists additional language codes (https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry), including Klingon.