
Date & Time

20:00 to 22:00 UTC Wednesday 25th March 2020

Location

Zoom meeting: https://snomed.zoom.us/j/471420169

Goals

  • To finalize syntax for term searching in ECL

Agenda and Meeting Notes

Description | Owner | Notes

Welcome and agenda


Concrete values (Linda Bird)

ON HOLD: SCG, ECL, STS, ETL - ready for publication, but on hold until after the MAG meeting in April, which will confirm the requirement for a Boolean datatype.

Expression Constraint Language (Linda Bird)

WIP ECL Specification

  • ADDED TO DRAFT SYNTAX - Child or self (<<!) and Parent or self (>>!)

    • New examples to be added
  • TERM SEARCH FILTERS - Syntax currently being drafted
    • Examples
      • < 404684003 |Clinical finding (finding)| {{ term = "heart att"}}
      • < 404684003 |Clinical finding (finding)| {{ term != "heart att"}} – A concept for which there exists a description that does not match – e.g. find all descendants of |Fracture| that have a description which does not contain the word "fracture"
      • < 404684003 |Clinical finding (finding)| MINUS * {{ term = "heart att"}} – A concept which does not have any descriptions matching the term
      • < 404684003 |Clinical finding (finding)| {{ term = match: "heart att" }} – match means word-prefix, any order (words separated by whitespace); words in the substrate are ....; search term delimiters are any mws
      • < 404684003 |Clinical finding (finding)| {{ term = wild: "heart* *ack" }}
      • < 404684003 |Clinical finding (finding)| {{ term = ("heart" "att") }}
      • < 404684003 |Clinical finding (finding)| {{ term != ("heart" "att") }} – matches concepts with a description that doesn't match "heart" or "att"
      • < 404684003 |Clinical finding (finding)| {{ TERM = (MATCH:"heart" WILD:"*ack") }}
      • < 404684003 |Clinical finding (finding)| {{ term = "myo", term = wild:"*ack" }} — Exists one term that matches both "myo" and "*ack"
      • < 404684003 |Clinical finding (finding)| {{ term = "myo" }} {{ term = wild:"*ack" }} -– Exists one term that matches "myo", and exists a term that matches "*ack" (filters may match on either same term, or different terms)
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = se }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId = 900000000000013009 |synonym| }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId = (900000000000013009 |synonym| 900000000000003001 |fully specified name|)}}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId != 900000000000550004 |Definition|}}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type = syn }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type != def }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type = (syn fsn) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type != (syn fsn) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "cardio", dialectId = 900000000000508004 |GB English| }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialectId = ( 999001261000000100 |National Health Service realm language reference set (clinical part)|
        999000691000001104 |National Health Service realm language reference set (pharmacy part)| ) }}

      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-gb }}
      • < 404684003 |Clinical finding (finding)| {{ dialect != en-gb }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = ( en-nhs-clinical en-nhs-pharmacy ) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical, acceptabilityId = 900000000000548007 |Preferred| }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical, acceptability = prefer }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical, acceptability != prefer }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical, acceptability = (prefer accept) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical, acceptability = (prefer accept) }}
      • < 404684003 |Clinical finding| MINUS * {{ dialect = en-nhs-clinical}}
      • < 73211009 |diabetes|  MINUS * {{ dialect = en-nz-patient }}
      • < 73211009 |diabetes|  MINUS < 73211009 |diabetes|   {{ dialect = en-nz-patient }}
      • < 73211009 |diabetes|  {{ term = "type" }}  MINUS < 73211009 |diabetes|   {{ dialect = en-nz-patient }}
      • (< 404684003 |Clinical finding|:363698007|Finding site| = 80891009 |Heart structure|)  {{ term = "card" }}  MINUS < (404684003 |Clinical finding|:363698007|Finding site| = 80891009 |Heart structure|)   {{ dialect = en-nz-patient }}
      • < 73211009 |Diabetes|  {{ term = "type" }}  OR < 49601007 |Disorder of cardiovascular system (disorder)|  {{ dialect = en-nz-patient }}
    • Questions
      • Wild Term Filter - Is the search term everything inside the quotation marks? (note: Match term is tokenized, but wild search is not)
        • In a Wildcard search, is the search term everything inside the quotation marks (including leading and trailing spaces)? I assume so, but since this differs from the Word-prefix-any-order/Match search, I wanted to check. Note: I think we agreed at last week's meeting to tokenize the words within a WPAO/match search, but to leave the search term for the Wildcard search intact, as a single search string. Does everyone still agree?
      • Would it be best to allow each type of filter only once within a filterConstraint? This would allow the acceptabilityFilter to be made dependent on the dialectFilter. E.g.
        • NOT ALLOWED: * {{ term = "card", term = "*itis*" dialect = en-nhs-clinical, dialect = en-gb, acceptability = prefer, acceptability = accept }}
        • ALLOWED: * {{ term = "card", dialect = en-nhs-clinical (accept prefer), dialect = en-gb (prefer) }}
        • ALLOWED: * {{ term = "card", dialect = en-nhs-clinical, dialect != en-nhs-clinical (accept), dialect = en-gb (900000000000548007 |Preferred| 545) }}
        • NOT ALLOWED: * {{ term = "card", acceptability = prefer }}
        • ALLOWED: * {{ term = "card", dialect = en-nhs-clinical, acceptability = prefer }}
        • Discussion:
          • We're currently allowing each type of filter to appear more than once within a filterConstraint (in any order). Does it make sense for the same type of filter to be repeated (I've included some examples of 2 dialect filters below, for discussion)? If so, then I can't find a way in the ABNF of restricting the acceptabilityFilter such that it can only be used when a dialect is specified (as discussed last week). If we were to limit each type of filter to appearing only once, then I think this may also impose a required order on the filter types (e.g. term, then type, then language, then dialect, then acceptability), which is not really user-friendly. Here are some examples:
            • * {{ term = "card", dialect = en-nhs-clinical, dialect = en-gb, acceptability = prefer }}
              • This includes 2 dialect filters and 1 acceptability filter. If the comma means "and", then this probably means: "Concepts which have a description that contains a word starting with "card", that is in both the en-nhs-clinical LRS and the en-gb LRS, and that has an acceptability of |Preferred| in [both or at least one?] of those LRSs."
            • * {{ term = "card", dialect = en-nhs-clinical, dialect = en-gb, acceptability = prefer, acceptability = accept }}
              • This includes 2 dialect filters and 2 acceptability filters. Would this return any concepts, and what would this mean? Would it, for example, return concepts that have a matching description that is in both the en-nhs-clinical LRS and the en-gb LRS, where that description is |preferred| in one LRS and |acceptable| in the other LRS? ... or is this just too confusing to be useful? Thoughts? Use cases?
            • * {{ term = "card", acceptability = prefer }} 
              • -- Note: This has an acceptability without a dialect. We discussed that this should be illegal ... we just need to work out how to constrain this in the ABNF.
      • Deferred for now - Should we consider introducing the 'version' filter into ECL when we add search terms? We had previously discussed this. A possible example could be:
      • Concept Filters:
        • Deferred for now.
      • Case/accent folding + Unicode collation - What advice should we be giving in the specification?
        • Daniel - "PRO" folding (see Unicode reference that database providers refer to in their search engines)
        • Folding should happen before matching
        • UCA - Unicode Collation Algorithm
        • CLDR - Common Locale Data Repository http://cldr.unicode.org
        • å → a
        • Index using the Swedish/English index engine
        • Refer to Ed's questions and references - 2020-02-26 - SLPG Meeting
        • In particular - https://www.w3.org/TR/charmod-norm/#performNorm
        • Question - * {{ term = match (noFold):"" }}
      • Tokenizing the substrate - What advice should we be giving in the specification?
    • Next steps
      • Answer above questions, and update brief syntax accordingly
      • Test updated brief syntax parser
      • Update long syntax and informative comments to match
      • Test updated long syntax parser
      • Add examples to specification
      • Clarify execution semantics for consistency
      • Document execution semantics in specification
      • SLPG review / Community review
      • Any required updates
      • Publish with PDF
    • DONE - Send recommendation to MAG to consider the following
      1. Dialect Alias Refset
        • Alternative 1 - Annotation Refset
          • Dialect_Alias refset : alias + languageRefset-conceptId - e.g. "en-GB", 900000000000508004
          • Example row
            • referencedComponentId = 999001261000000100
            • dialectAlias = nhs-clinical
        • Alternative 2 - Add alias as a synonym to the language refset concept
          • Create a simple type refset that refers to the preferred alias for each language refset
        2. Constructing a Language Refset from other Language Refsets
        • Allowing an intensional definition for a language refset
        • Includes order/precedence of language refsets being combined
  • Potential use cases - note: some of these will be out of scope for the simple ECL filters
    • Find concepts with a term which matches "car" that is preferred in one language refset and not acceptable in another
    • Find the concepts that ..... have a PT = X in language refset = Y
    • Find the concepts that ..... have a Syn = X in language refset = Y
    • Find the concepts that ... have one matching description in one language, and another matching description in another language
    • Find the concepts that have a matching description that is in language refset X and not in language refset Y
    • Find the concepts that .... have a matching description that is either preferred in one language refset and/or acceptable in another language refset
    • Returning the set of concepts, for which there exists a description that matches the filter
    • Intensionally define a reference set for chronic disease. The starting point was ECL with modelling; this misses concepts that are not modelled using the pattern you would expect, so term searching is important in building out that reference set.
    • Authors quality assuring names of concepts
    • Checking translations, retranslating. Queries for a concept that has one word in Swedish, another word in English
    • AU use case would have at most 3 or 4 words in match
    • Consistency of implementation in different terminology services
    • Authoring use cases currently supported by description templates
    • A set of the "*ectomy"s and "*itis"s
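As a non-normative illustration of the two term-matching behaviours discussed above (the default word-prefix-any-order "match" and the untokenized "wild"), here is a minimal Python sketch, assuming whitespace tokenization and lower-case folding; function names are invented for illustration:

```python
import re

def match_filter(term, search):
    """Sketch of the default 'match' (word-prefix-any-order) semantics:
    the search string is tokenized on whitespace, and every token must be
    a prefix of some word in the description term, in any order."""
    words = term.lower().split()
    return all(any(w.startswith(t) for w in words)
               for t in search.lower().split())

def wild_filter(term, pattern):
    """Sketch of the 'wild' semantics: the search string is not tokenized;
    '*' matches any run of characters and the whole term must match."""
    regex = "^" + ".*".join(re.escape(p) for p in pattern.lower().split("*")) + "$"
    return re.match(regex, term.lower()) is not None

# {{ term = "heart att" }} -- word prefixes, any order
assert match_filter("Heart attack", "heart att")
assert match_filter("Attack involving heart", "heart att")
assert not match_filter("Heart failure", "heart att")

# {{ term = wild:"heart* *ack" }} -- one untokenized pattern
assert wild_filter("heart attack", "heart* *ack")
assert not wild_filter("attack of heart", "heart* *ack")
```

Note how the same pair of words succeeds in either order under "match" but is order-sensitive under "wild", which is the key difference raised in the questions above.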
Querying Refset Attributes (Linda Bird)

Proposed syntax to support querying and return of alternative refset attributes (To be included in the SNOMED Query Language)

  • Example use cases
    • Execution of maps from international substance concepts to AMT substance concepts
    • Find the anatomical parts of a given anatomy structure concept (in the |Anatomy structure and part association reference set|)
    • Find potential replacement concepts for an inactive concept in a record
    • Find the order of a given concept in an Ordered component reference set
    • Find a concept with a given order in an Ordered component reference set
  • Potential syntax to consider (brainstorming ideas)
    • SELECT ??
      • SELECT 123 |referenced component|, 456 |target component|
        FROM 799 |Anatomy structure and part association refset|
        WHERE 123 |referenced component| = (< 888 |Upper abdomen structure| {{ term = "*heart*" }} )
      • SELECT id, moduleId
        FROM concept
        WHERE id IN (< |Clinical finding|)
        AND definitionStatus = |primitive|
      • SELECT id, moduleId
        FROM concept, ECL("< |Clinical finding|") CF
        WHERE concept.id = CF.sctid
        AND definitionStatus = |primitive|
      • SELECT ??? |id|, ??? |moduleId|
        FROM concept ( < |Clinical finding| {{ term = "*heart*" }} {{ definitionStatus = |primitive| }} )
      • Question - Can we assume some table joins - e.g. Concept.id = Description.conceptId etc ??
      • Examples
        • Try to recast relationships table as a Refset table → + graph-based extension
        • Find primitive concepts in a hierarchy
    • ROW ... ?
      • ROWOF (|Anatomy structure and part association refset|) ? (|referenced component| , |target component|)
        • same as: ^ |Anatomy structure and part association refset|
      • ROWOF (|Anatomy structure and part association refset|) . |referenced component|
        • same as: ^ |Anatomy structure and part association refset|
      • ROWOF (|Anatomy structure and part association refset|) {{ |referenced component| = << |Upper abdomen structure|}} ? |targetComponentId|
      • ROWOF (< 900000000000496009|Simple map type reference set| {{ term = "*My hospital*"}}) {{ 449608002|Referenced component| = 80581009 |Upper abdomen structure|}} ? 900000000000505001 |Map target|
        • (ROW (< 900000000000496009|Simple map type reference set| {{ term = "*My hospital*"}}) : 449608002|Referenced component| = 80581009 |Upper abdomen structure| ).900000000000505001 |Map target|
    • # ... ?
      • # |Anatomy structure and part association refset| ? |referenced component|
      • # (|Anatomy structure and part association refset| {{ |referenced component| = << |Upper abdomen structure| }} ) ? |targetComponentId|
    • ? notation + Filter refinement
      • |Anatomy structure and part association refset| ? |targetComponentId|
      • |Anatomy structure and part association refset| ? |referencedComponent| (Same as ^ |Anatomy structure and part association refset|)
      • ( |Anatomy structure and part association refset| {{ |referencedComponent| = << |Upper abdomen structure| }} ) ? |targetComponentId|
      • ( |Anatomy structure and part association refset| {{ |targetComponentId| = << |Upper abdomen structure| }} ) ? |referencedComponent|
      • ( |My ordered component refset| : |Referenced component| = |Upper abdomen structure| ) ? |priority order|
      • ? |My ordered component refset| {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • ? |My ordered component refset| . |referenced component|
        • equivalent to ^ |My ordered component refset|
      • ? (<|My ordered component refset|) {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • ? (<|My ordered component refset| {{ term = "*map"}} ) {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • REFSETROWS (<|My ordered component refset| {{ term = "*map"}} ) {{ |Referenced component| = |Upper abdomen structure| }} SELECT |priority order|
    • Specify value to be returned
      • ? 449608002 |Referenced component|?
        734139008 |Anatomy structure and part association refset|
      • ^ 734139008 |Anatomy structure and part association refset| (Same as previous)
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset|
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset| :
        449608002 |ReferencedComponent| = << |Upper abdomen structure|
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset|
        {{ 449608002 |referencedComponent| = << |Upper abdomen structure| }}
      • (? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset| :
        449608002 |ReferencedComponent| = (<< |Upper abdomen structure|) : |Finding site| = *)
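Whichever surface syntax is chosen, the underlying semantics in each of the brainstormed ideas above is the same: treat the refset as a table, filter its rows, and project one column. A minimal Python sketch of that semantics (all identifiers are invented placeholders, not real component ids):

```python
# Hypothetical rows of an association refset, one dict per refset member
# (attribute names and values are invented placeholders).
refset_rows = [
    {"referencedComponentId": "upper_abdomen", "targetComponentId": "abdominal_wall"},
    {"referencedComponentId": "hand_structure", "targetComponentId": "palm_structure"},
]

def rowof(rows, where, select):
    """ROWOF-style query: keep rows whose attributes satisfy 'where',
    then project the 'select' column."""
    return [row[select] for row in rows
            if all(row[k] == v for k, v in where.items())]

# ROWOF (refset) {{ |referenced component| = X }} ? |targetComponentId|
assert rowof(refset_rows,
             {"referencedComponentId": "upper_abdomen"},
             "targetComponentId") == ["abdominal_wall"]

# Projecting |referenced component| with no filter behaves like '^ refset'
assert rowof(refset_rows, {}, "referencedComponentId") == ["upper_abdomen", "hand_structure"]
```

The SELECT, ROWOF, # and ? notations above all reduce to some combination of these two operations (row filtering and column projection).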
Returning Attributes (Michael Lawley)

Proposal (by Michael) for discussion

  • Currently ECL expressions can match (return) concepts that are either the source or the target of a relationship triple (the target is accessed via the 'reverse' notation or 'dot' notation), but not the relationship type (i.e. the attribute name) itself.

For example, I can write: 

<< 404684003|Clinical finding| : 363698007|Finding site| = <<66019005|Limb structure| 

<< 404684003|Clinical finding| . 363698007|Finding site| 

But I can't get all the attribute names that are used by << 404684003|Clinical finding| 

    • Perhaps something like:
      • ? R.type ? (<< 404684003 |Clinical finding|)
    • This could be extended to, for example, return different values - e.g.
      • ? |Simple map refset|.|maptarget| ? (^|Simple map refset| AND < |Fracture|)
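The proposal amounts to projecting the type field of matching relationship rows rather than the source or destination. A sketch under assumed in-memory triples (ids replaced by invented readable names):

```python
# Invented relationship triples: (sourceId, typeId, destinationId).
relationships = [
    ("fracture_of_femur", "finding_site", "bone_structure_of_femur"),
    ("fracture_of_femur", "associated_morphology", "fracture"),
    ("myocardial_infarction", "finding_site", "heart_structure"),
]

def attribute_names(concepts):
    """Return the set of relationship types (attribute names) used by any
    concept in the given set -- the '? R.type ?' idea above."""
    return {type_id for source, type_id, _ in relationships if source in concepts}

assert attribute_names({"fracture_of_femur"}) == {"finding_site", "associated_morphology"}
assert attribute_names({"myocardial_infarction"}) == {"finding_site"}
```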
Reverse Member Of (Michael Lawley)

Proposal for discussion

What refsets is a given concept (e.g. 421235005 |Structure of femur|) a member of?

  • Possible new notation for this:
    • ^ . 421235005 |Structure of femur|
    • ? X ? 421235005 |Structure of femur| = ^ X
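Whatever notation is adopted, the operation itself is a simple reverse lookup over refset membership. A sketch with invented identifiers:

```python
# Invented refset membership: refset id -> member concept ids.
members = {
    "simple_map_refset": {"structure_of_femur", "heart_structure"},
    "anatomy_part_refset": {"structure_of_femur"},
}

def refsets_containing(concept):
    """Reverse member-of: every refset X such that concept is in ^ X."""
    return {refset for refset, ids in members.items() if concept in ids}

assert refsets_containing("structure_of_femur") == {"simple_map_refset", "anatomy_part_refset"}
assert refsets_containing("heart_structure") == {"simple_map_refset"}
```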

Expression Templates

  • ON HOLD, AWAITING IMPLEMENTATION FEEDBACK FROM INTERNAL TECH TEAM
  • WIP version - https://confluence.ihtsdotools.org/display/WIPSTS/Template+Syntax+Specification
      • Added a 'default' constraint to each replacement slot - e.g. default (72673000 |Bone structure (body structure)|)
      • Enabling 'slot references' to be used within the value constraint of a replacement slot - e.g. [[ +id (<< 123037004 |Body structure| MINUS << $findingSite2) @findingSite1]]
      • Allowing repeating role groups to be referenced using an array - e.g. $rolegroup[1] or $rolegroup[!=SELF]
      • Allow reference to 'SELF' in role group arrays
      • Adding 'sameValue' and 'allOrNone' constraints to information slots - e.g. sameValue ($site), allOrNone ($occurrence)
      • See changes in red here: 5.1. Normative Specification

Examples:

[[+id]]: [[1..*] @my_group sameValue(morphology)] { |Finding site| = [[ +id (<<123037004 |Body structure (body structure)| MINUS << $site[! SELF ] ) @site ]] , |Associated morphology| = [[ +id @my_morphology ]] }

  • Implementation feedback on draft updates to Expression Template Language syntax
    • Use cases from the Quality Improvement Project:
      • Multiple instances of the same role group, with some attributes the same and others different. Eg same morphology, potentially different finding sites.

Note that the QI Project is coming from a radically different use case. Instead of filling template slots, we're looking at existing content and asking "exactly how does this concept fail to comply with this template?"

For discussion:

 [[0..1]] { [[0..1]]   246075003 |Causative agent|  = [[+id (<   410607006 |Organism| ) @Organism]] }

Is it correct to say that either one of the cardinality blocks is redundant? What are the implications of 1..1 on either side? This is less obvious for the self-grouped case.

Road Forward for SI

  1. Generate the parser from the ABNF and implement in the Template Service
  2. User Interface to (a) allow users to specify a template at runtime, and (b) provide a tabular (auto-completion) lookup → STL
  3. Template Service to allow multiple templates to be specified for an alignment check (aligns to none-off)
  4. Output must clearly indicate exactly what feature of the concept caused the misalignment, and what condition was not met.

Additional note: the QI project is no longer working in subhierarchies. Every 'set' of concepts is selected via ECL. In fact, most reports should now move to this way of working, since a subhierarchy is the trivial case. For a given template, we additionally specify the "domain" to which it should be applied via ECL. This is much more specific than using the focus concept, which is usually the PPP, e.g. Disease.

FYI Michael Chu

Description Templates (Kai Kewley)
  • ON HOLD
  • Previous discussion (in Malaysia)
      • Overview of current use
      • Review of General rules for generating descriptions
        • Removing tags, words
        • Conditional removal of words
        • Automatic case significance
        • Generating PTs from target PTs
        • Reordering terms
      • Mechanism for sharing general rules - inheritance? include?
      • Description Templates for translation
      • Status of planned specification
Query Language
- Summary from previous meetings




FUTURE WORK

Examples: version and dialect

Notes

    • Allow nested where, version, language
    • Scope of variables is inner query
Confirm next meeting date/time

Next meeting is scheduled for Wednesday 22nd April 2020 at 20:00 UTC.




16 Comments

  1. Following up on our homework: UCA/CLDR/Case/accent folding + Unicode collation - What advice should we be giving in the specification?

    I have personally found trying to answer this torture!

    Ideally we want to try to get predictable (per-locale) search behaviour. This could then be neatly summed up in a sentence in the guidance, something like this:

    “The search specification assumes that descriptions are indexed for search using the default UCA, or UCA tailored for a specific language or locale according to CLDR. The selected locale can be specified using the ‘language=[ISO 639-1 code]’ filter. Descriptions indexed this way are compared with unmodified search tokens.”

    However, it looks as though 'default UCA' doesn't ignore case (but, bafflingly, how case is handled is predominantly specified using a parameter called 'strength'!). The UCA specification states that "…Language-sensitive searching and matching are closely related to collation…", but this also indicates that they are not the same. The required collation strength for case-insensitive searching is 'secondary', whilst the default for collation is 'tertiary'. This may be explained here and/or here, and is probably buried somewhere deep in here, but to me is actually most clearly described by the kind people who maintain the mongoDB documentation.

    If we therefore need to add something about case insensitivity to the assumption statement above (and possibly even make case sensitivity configurable in our filters), could we just say "The search specification assumes that descriptions are indexed for search using case-insensitive default UCA…"?

    From a practical point of view this is tempting (commercial product configurations seem to use the "_CI" notation when setting collation, e.g. ">>mysqld --character-set-server=utf8 --collation-server=utf8_unicode_ci"). However, if we are going to reference UCA then it's worth noting that the Unicode materials don't seem to use the phrase 'case insensitive'. Instead they talk in terms of secondary or tertiary 'strength' (as does the configuration page of mongoDB).

    On balance I suspect that if we make case sensitivity configurable then we should name the filter ‘case=’ with values of ‘case sensitive’ and ‘case insensitive’ (implicit default). The alternative is to name the filter ‘strength’ with values of ‘secondary’ and ‘tertiary’ and so on. Whilst the latter looks more principled I suspect it’s just confusing.

    I’ll stop there,  but will just add for info that the W3C reference we looked at last time was coming at this from a different direction. Their concern relates to string matching as it applies to the syntactic content of web pages etc. Consequently their recommendation is for a normalization step that changes nothing  - to avoid changes in element names/markup. Other content (what that paper calls natural language content) may well benefit from extensive normalisation - closer to case insensitive UCA transformation.

    Ed

    1. Thanks Ed, this research was really valuable input

      1. Thanks Michael - appreciated. Ed

    2. The documentation for this stuff in Lucene (https://lucene.apache.org/core/8_5_1/analyzers-icu/index.html) has now led me to chapter 3 of the Unicode spec https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf that talks about "Default Caseless Matching" with a variety of rules (eg D144, D145, ...) that might be a helpful reference for some.


      1. Thanks Michael

        Again, the Lucene reference, rather like the MongoDB one, is refreshing in its simplicity when compared to the all-or-nothing complexity of the Unicode materials! After a bit of reading I'm not sure I'm any nearer an easy way of saying 'so long as everyone does this... they should get the same language-specific search behaviour'; maybe the Unicode conformance chapter provides a useful shorthand - I'm just not sure I can translate it into something easy to understand without losing an important aspect.

        Meanwhile, I did find out that I can adopt an Emoji, and have just spent a playful half-hour with the international, Swedish and Danish SI browsers and a few search term variants:

        angstrom
        Ångström
        ÅNGSTRÖM
        ångström
        Ångstrøm
        ångstrøm
        ÅNGSTRØM
        angstrøm
        Ång
        tår
        deja
        déjà
        DÉJÀ
        sjögren
        sjøgren

        I'm going to have to assume they behave as intended in each language - not what I expected comparing it with the asymmetric search tables, but I don't think I understand them either! Linda - did you get a chance to ask about how the SI browser handles language-specific indexing?

        Kind regards

        Ed

        1. Thanks again Ed!

          Re my homework .... I'm working on it. Will hopefully have something to report at the meeting later.

        2. Hi Ed Cheetham

          The International Browser uses the Snowstorm API which uses Elasticsearch. We fold diacritic characters into their simpler form to allow matching with or without the diacritics. For example in English some terms have é, for example "Déjà vu" but in English we expect to get a match when using the same characters with or without diacritics, for example using "Dejà" or "deja".

          However, as I am sure you are aware, the expectation of character folding is language dependent...

          In Danish a term with the character "ø" should not be found when searching using "o". For example the concept "Ångstrøm" should not be found when searching using the term "Ångstrom".

          Conversely, in Swedish the character "ø" is not considered an additional letter of the alphabet, so the concept "Brønsted-Lowrys syra" should be found when searching using the term "Bronsted".

          This all works as expected using the Snowstorm API. We have implemented language specific character folding for each of the extensions we host. More languages can be added via configuration (see Snowstorm configuration).

          Implementation notes: terms are indexed twice in Elasticsearch, in their raw form and with language specific character folding. When querying for matches we fold the search term using each configured strategy with a constraint to only match that folding against that language.
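          That per-language folding strategy can be sketched in a few lines of Python (illustrative only: the characters-not-folded sets below are example values, not the actual Snowstorm configuration):

```python
import unicodedata

# Example characters-not-folded sets, per language code (invented values
# modelled on the charactersNotFolded idea described above).
NOT_FOLDED = {"sv": set("åäöÅÄÖ"), "da": set("æøåÆØÅ"), "en": set()}

def fold(text, language):
    """Strip combining marks from every character except those the
    language treats as distinct letters, then lower-case the result."""
    out = []
    for ch in text:
        if ch in NOT_FOLDED.get(language, set()):
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out).lower()

# English folds diacritics, so "Déjà" matches a search for "deja"
assert fold("Déjà vu", "en") == "deja vu"
# Swedish keeps å and ö distinct, so "Ångström" stays "ångström"
assert fold("Ångström", "sv") == "ångström"
```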

          I created the configuration for this behaviour by asking members. I was not able to find an official source for this information.

          I hope that helps!

          Kai


          1. Thanks Kai

            Yes, it does help. The challenge we're facing is to try and get the search filter elements of ECL to behave as predictably across implementations as its graph-based elements. Maybe this is an unrealistic goal, but it seems a little early to admit defeat. You've clearly given a lot of thought to how to optimise search (generally steps to increase sensitivity) whilst respecting the features of individual languages (generally in the direction of specificity). We also need to figure out the search requirements for QA, which I believe to have a greater emphasis on specificity.

            I see that Elasticsearch uses Lucene, and Michael's Lucene reference above gives a really nice distillation of the text normalisation functions that can be configured. As with my recent Unicode documentation odyssey however, the really tricky bit is tying up the 'standards' way of describing what can be configured with each application-specific way they are managed. There is no shortage of 'official' material (the Unicode consortium certainly like words) but I am struggling to turn this into a suitably terse form to describe a (language-specific) 'default' indexing and then systematic mechanisms for varying from this default.

            The search.language.charactersNotFolded.{LanguageCode}={Characters} settings in Snowstorm, for example, make a lot of sense, but are the sets of characters identified in any way 'standard'?

            As I say above, I was hoping that the significance of accented characters in search terms would be something akin to Unicode's explanation of asymmetric search, but instead I see that whilst å is explicitly NotFolded in Swedish, the token ång returns both en and sv descriptions that begin with a simple/unaccented 'ang' - I was expecting it only to return terms that began with ång. Hopefully others on the call tonight will be able to enlighten me!

            Thanks again

            Ed

            1. Ed Cheetham the reason you see some descriptions starting 'ang' when searching using 'ång' is that the API has returned a mixture of Swedish descriptions starting 'ång' and also some English descriptions starting 'ang' ... because folding the 'å' character is acceptable in English.

              If you filter the search results by the Swedish language (controls on the left) you will see only matches starting 'ång'.

              Standards, yes... I was astounded when after spending many hours looking for a standard in this area which covered the most common international languages I found nothing. Crowd sourcing some good configuration by asking tech savvy SNOMED members seemed like a good alternative (smile) 

              1. Thanks Kai - yes, I think we figured the 'ång' thing out on the call!

                Makes sense now, but 'Filter the search results by the Swedish language' is more stringent than what I assumed it meant. I assumed it would just leave behind all the Swedish terms, but it's really 'filter the results according to the Swedish language matching rules'. Consequently 1729 'Swedish' term matches are reduced to 234 'really Swedish' matches. 

                Ed

          2. Thanks Kai, that is helpful.

            On a slightly separate note, I noticed that the UK module is configured there as well:

            codesystem.config.SNOMEDCT-UK=UK|999000031000000106

            but that it appears to be out of date.


            Michael Lawley - thanks for being vigilant; if you have more accurate information I'm all ears.

              1. I'm not exactly sure when it happened (sometime last year?), but there was a change in the module structure for UK and 83821000000107 is now the module that aggregates the Clinical and Drug extensions.


  2. I'll second that, Michael. Thank you very much Ed - really appreciate your research on this!

    1. Thanks Linda! Ed