
Date & Time

20:00 to 21:00 UTC Wednesday 15th July 2020

Location

Zoom meeting link (password: 764978)

Goals

  • To finalize draft collation/folding wording
  • To agree on options for supporting case sensitivity/regex
  • To develop examples to illustrate new term searching functionality

Attendees 

Agenda and Meeting Notes

Description
Owner
Notes

Welcome and agenda

NOTE: Next meeting to be held on Wednesday 29th July

Concrete Values (Linda Bird)

Specifications

  • SCG v2.4 (with booleans) has been published
  • ECL v1.4 (with booleans, childOrSelfOf and parentOrSelfOf) has been published
  • STS v1.1 and ETL v1.1 (with booleans) to be published (after MAG meeting this week)
  • MRCM (with updated rangeConstraint) - 5.3 MRCM Attribute Range Reference Set
Expression Constraint Language (Linda Bird)
  • Recent Updates to WIP

  • This Week's Questions

      • Confirm updated wording re collation recommendations.
      • Should we support an option to allow case sensitive searches? If so, should this be supported by (a) an additional parameter, (b) regex searching?
  • On Hold

    • Can/should we register ECL as a MIME type? – Waiting for volunteer time to complete registration form

  • To Do - Child or self (<<!) and Parent or self (>>!)
    • New examples to be added
  • TERM SEARCH FILTERS - Syntax currently being drafted
    • Examples
      • < 404684003 |Clinical finding (finding)| {{ term = "heart att"}}
      • < 404684003 |Clinical finding (finding)| {{ term != "heart att"}} – A concept for which there exists a description that does not match – E.g. Find all the descendants of |Fracture| that have a description that doesn't contain the word |Fracture|
      • < 404684003 |Clinical finding (finding)| MINUS * {{ term = "heart att"}} – A concept which does not have any descriptions matching the term
      • < 404684003 |Clinical finding (finding)| {{ term = match: "heart att" }} – match performs word-prefix matching (words separated by white space), in any order; Words in substrate are ....; Search term delimiters are any mws
      • < 404684003 |Clinical finding (finding)| {{ term = wild: "heart* *ack" }}
      • < 404684003 |Clinical finding (finding)| {{ term = ("heart" "att") }}
      • < 404684003 |Clinical finding (finding)| {{ term != ("heart" "att") }} – matches concepts with a description that doesn't match "heart" or "att"
      • < 404684003 |Clinical finding (finding)| {{ TERM = (MATCH:"heart" WILD:"*ack") }}
      • < 404684003 |Clinical finding (finding)| {{ term = "myo", term = wild:"*ack" }} — Exists one term that matches both "myo" and "*ack"
      • < 404684003 |Clinical finding (finding)| {{ term = "myo" }} {{ term = wild:"*ack" }} -– Exists one term that matches "myo", and exists a term that matches "*ack" (filters may match on either same term, or different terms)
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = se }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId = 900000000000013009 |synonym| }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId = (900000000000013009 |synonym| 900000000000003001 |fully specified name|)}}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, typeId != 900000000000550004 |Definition|}}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type = syn }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type != def }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type = (syn fsn) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "hjärta", language = SE, type != (syn fsn) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "cardio", dialectId = 900000000000508004 |GB English| }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialectId = ( 999001261000000100 |National Health Service realm language reference set (clinical part)|
        999000691000001104 |National Health Service realm language reference set (pharmacy part)| ) }}

      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-gb }}
      • < 404684003 |Clinical finding (finding)| {{ dialect != en-gb }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = ( en-nhs-clinical en-nhs-pharmacy ) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical (900000000000548007 |Preferred|) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical (prefer) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical (accept) }}
      • < 404684003 |Clinical finding (finding)| {{ term = "card", dialect = en-nhs-clinical (prefer accept), dialect = en-gb (prefer) }}
      • < 404684003 |Clinical finding| MINUS * {{ dialect = en-nhs-clinical}}
      • < 73211009 |diabetes|  MINUS * {{ dialect = en-nz-patient }}
      • < 73211009 |diabetes|  MINUS < 73211009 |diabetes|   {{ dialect = en-nz-patient }}
      • < 73211009 |diabetes|  {{ term = "type" }}  MINUS < 73211009 |diabetes|   {{ dialect = en-nz-patient }}
      • (< 404684003 |Clinical finding|:363698007|Finding site| = 80891009 |Heart structure|)  {{ term = "card" }}  MINUS < (404684003 |Clinical finding|:363698007|Finding site| = 80891009 |Heart structure|)   {{ dialect = en-nz-patient }}
      • < 73211009 |Diabetes|  {{ term = "type" }}  OR < 49601007 |Disorder of cardiovascular system (disorder)|  {{ dialect = en-nz-patient }}
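As a rough illustration of the intended difference between the match and wild term filters in the examples above, here is a minimal Python sketch. The tokenization rules are my reading of the draft notes (whitespace-separated word prefixes for match; untokenized whole-string pattern for wild), not normative:

```python
import fnmatch

def match_filter(search, term):
    """Sketch of 'match:': each whitespace-separated search word must be
    a prefix of some word in the term, in any order (case-insensitive)."""
    term_words = term.casefold().split()
    return all(any(word.startswith(s) for word in term_words)
               for s in search.casefold().split())

def wild_filter(pattern, term):
    """Sketch of 'wild:': the whole quoted string is the pattern and is
    not tokenized; '*' matches any run of characters. fnmatch is only a
    stand-in here -- it also treats '?' and '[]' specially, which the
    ECL wild filter may not."""
    return fnmatch.fnmatchcase(term.casefold(), pattern.casefold())
```

Under this reading, match_filter("heart att", "Heart attack") succeeds in either word order, while wild_filter must match the entire term against the pattern.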
    • Previous Decisions
      • Wild Term Filter - Everything inside the quotation marks is the search term (including leading and trailing spaces). Note: the match term is tokenized, but the wild search term is not
      • Acceptability will be an option directly attached to a dialect filter - for example:
        • * {{ term = "card", dialect = en-nhs-clinical (accept prefer), dialect = en-gb (prefer) }}
        • * {{ term = "card", dialect = en-nhs-clinical, dialect != en-nhs-clinical (accept), dialect = en-gb (900000000000548007 |Preferred| ) }}
      • The default behaviour of a system implementing these ECL queries with term searching is to use asymmetric searching at the secondary level. This means that the search is, by default, case insensitive, with some character normalization behaviour (as determined by the value of the language). Which characters are normalized in the search string and target term index should be determined using the CLDR (Common Locale Data Repository) rules for the given language.

        • "This means that characters in the search that are unmarked will match a character in the target that is either marked or unmarked at the same level, but a character in the query that is marked will only match a character in the target that is marked in the same way. At the secondary level, an unaccented 'e' would be treated as unmarked, while the accented letters ‘é’, ‘è’ would (in English) be treated as marked. Thus a lowercase query character matches that character or the uppercase version of that character, and an unaccented query character matches that character or any accented version of that character even if strength is set to secondary." [http://www.unicode.org/reports/tr10/#Asymmetric_Search_Secondary]
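As a rough illustration of the asymmetric secondary-strength behaviour quoted above, here is a minimal Python sketch based only on Unicode decomposition. A real implementation would use CLDR-tailored UCA collation (e.g. via ICU) rather than this simplification, and would apply per-language tailoring:

```python
import unicodedata

def _decompose(ch):
    """Split one character into (base letter, accent marks), case-folded."""
    nfd = unicodedata.normalize("NFD", ch.casefold())
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    marks = tuple(c for c in nfd if unicodedata.combining(c))
    return base, marks

def asym_char_match(q, t):
    """Asymmetric match at secondary strength (sketch): an unmarked query
    character matches marked or unmarked targets; a marked query character
    matches only identically marked targets. Case is ignored (tertiary)."""
    q_base, q_marks = _decompose(q)
    t_base, t_marks = _decompose(t)
    if q_base != t_base:
        return False
    return q_marks == () or q_marks == t_marks

def asym_term_match(query, target):
    """Character-by-character asymmetric comparison of two terms."""
    return len(query) == len(target) and all(
        asym_char_match(q, t) for q, t in zip(query, target))
```

So an unaccented query "cafe" matches the target "café", but the accented query "café" does not match the unaccented target "cafe".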

    • DONE - Send recommendation to MAG to consider the following
      1. Dialect Alias Refset
        • Alternative 1 - Annotation Refset
          • Dialect_Alias refset : alias + languageRefset-conceptId - e.g. "en-GB", 900000000000508004
          • Example row
            • referencedComponentId = 999001261000000100
            • dialectAlias = nhs-clinical
        • Alternative 2 - Add alias as a synonym to the language refset concept
          • Create a simple type refset that refers to the preferred alias for each language refset
      2. Constructing a Language Refset from other Language Refsets
        • Allowing an intensional definition for a language refset
        • Includes order/precedence of language refsets being combined
  • Potential Use cases - Note: some of these will be out of scope for the simple ECL filters
    • Find concepts with a term which matches "car" that is preferred in one language refset and not acceptable in another
    • Find the concepts that ..... have a PT = X in language refset = Y
    • Find the concepts that ..... have a Syn = X in language refset = Y
    • Find the concepts that ... have one matching description in one language, and another matching description in another language
    • Find the concepts that have a matching description that is in language refset X and not in language refset Y
    • Find the concepts that .... have a matching description that is either preferred in one language refset and/or acceptable in another language refset
    • Returning the set of concepts, for which there exists a description that matches the filter
    • Intensionally define a reference set for chronic disease. The starting point was ECL with modelling; this misses concepts that are not modelled using the pattern you would expect, so term searching is important in building out that reference set.
    • Authors quality assuring names of concepts
    • Checking translations, retranslating. Queries for a concept that has one word in Swedish, another word in English
    • AU use case would have at most 3 or 4 words in match
    • Consistency of implementation in different terminology services
    • Authoring use cases currently supported by description templates
    • A set of the "*ectomy"s and "*itis"s
Querying Refset Attributes (Linda Bird)

Proposed syntax to support querying and return of alternative refset attributes (To be included in the SNOMED Query Language)

  • Example use cases
    • Execution of maps from international substance concepts to AMT substance concepts
    • Find the anatomical parts of a given anatomy structure concept (in the |Anatomy structure and part association reference set|)
    • Find potential replacement concepts for an inactive concept in record
    • Find the order of a given concept in an Ordered component reference set
    • Find a concept with a given order in an Ordered component reference set
  • Potential syntax to consider (brainstorming ideas)
    • SELECT ??
      • SELECT 123 |referenced component|, 456 |target component|
        FROM 799 |Anatomy structure and part association refset|
        WHERE 123 |referenced component| = (< 888 |Upper abdomen structure| {{ term = "*heart*" }} )
      • SELECT id, moduleId
        FROM concept
        WHERE id IN (< |Clinical finding|)
        AND definitionStatus = |primitive|
      • SELECT id, moduleId
        FROM concept, ECL("< |Clinical finding|") CF
        WHERE concept.id = CF.sctid
        AND definitionStatus = |primitive|
      • SELECT ??? |id|, ??? |moduleId|
        FROM concept ( < |Clinical finding| {{ term = "*heart*" }} {{ definitionStatus = |primitive| }} )
      • Question - Can we assume some table joins - e.g. Concept.id = Description.conceptId etc ??
      • Examples
        • Try to recast relationships table as a Refset table → + graph-based extension
        • Find primitive concepts in a hierarchy
    • ROW ... ?
      • ROWOF (|Anatomy structure and part association refset|) ? (|referenced component| , |target component|)
        • same as: ^ |Anatomy structure and part association refset|
      • ROWOF (|Anatomy structure and part association refset|) . |referenced component|
        • same as: ^ |Anatomy structure and part association refset|
      • ROWOF (|Anatomy structure and part association refset|) {{ |referenced component| = << |Upper abdomen structure|}} ? |targetComponentId|
      • ROWOF (< 900000000000496009|Simple map type reference set| {{ term = "*My hospital*"}}) {{ 449608002|Referenced component| = 80581009 |Upper abdomen structure|}} ? 900000000000505001 |Map target|
        • (ROW (< 900000000000496009|Simple map type reference set| {{ term = "*My hospital*"}}) : 449608002|Referenced component| = 80581009 |Upper abdomen structure| ).900000000000505001 |Map target|
    • # ... ?
      • # |Anatomy structure and part association refset| ? |referenced component|
      • # (|Anatomy structure and part association refset| {{ |referenced component| = << |Upper abdomen structure| }}) ? |targetComponentId|
    • ? notation + Filter refinement
      • |Anatomy structure and part association refset| ? |targetComponentId|
      • |Anatomy structure and part association refset| ? |referencedComponent| (Same as ^ |Anatomy structure and part association refset|)
      • ( |Anatomy structure and part association refset| {{ |referencedComponent| = << |Upper abdomen structure| }} ) ? |targetComponentId|
      • ( |Anatomy structure and part association refset| {{ |targetComponentId| = << |Upper abdomen structure| }} ) ? |referencedComponent|
      • ( |My ordered component refset| : |Referenced component| = |Upper abdomen structure| ) ? |priority order|
      • ? |My ordered component refset| {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • ? |My ordered component refset| . |referenced component|
        • equivalent to ^ |My ordered component refset|
      • ? (<|My ordered component refset|) {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • ? (<|My ordered component refset| {{ term = "*map"}} ) {{ |Referenced component| = |Upper abdomen structure| }} . |priority order|
      • REFSETROWS (<|My ordered component refset| {{ term = "*map"}} ) {{ |Referenced component| = |Upper abdomen structure| }} SELECT |priority order|
    • Specify value to be returned
      • ? 449608002 |Referenced component|?
        734139008 |Anatomy structure and part association refset|
      • ^ 734139008 |Anatomy structure and part association refset| (Same as previous)
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset|
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset| :
        449608002 |ReferencedComponent| = << |Upper abdomen structure|
      • ? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset|
        {{ 449608002 |referencedComponent| = << |Upper abdomen structure| }}
      • (? 900000000000533001 |Association target component|?
        734139008 |Anatomy structure and part association refset| :
        449608002 |ReferencedComponent| = (<< |Upper abdomen structure|) : |Finding site| = *)
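The row-filtering and projection ideas brainstormed above (ROWOF / "?" / SELECT) can be sketched over an in-memory refset table. All rows and target ids below are hypothetical; 80581009 |Upper abdomen structure| is taken from the examples above:

```python
# Hypothetical refset rows: each row maps attribute names to component ids.
ROWS = [
    {"referencedComponentId": 80581009, "targetComponentId": 86381001},
    {"referencedComponentId": 80581009, "targetComponentId": 71854001},
    {"referencedComponentId": 12345678, "targetComponentId": 99999999},
]

def rowof(rows, where=None, select=None):
    """Filter refset rows with an optional predicate, then optionally
    project a single attribute -- i.e. ROWOF (refset) {{ filter }} ? column."""
    matched = [row for row in rows if where is None or where(row)]
    if select is None:
        return matched
    return [row[select] for row in matched]

# ROWOF (refset) {{ |Referenced component| = 80581009 }} ? |targetComponentId|
targets = rowof(ROWS,
                where=lambda r: r["referencedComponentId"] == 80581009,
                select="targetComponentId")
```

With select omitted, the result is whole rows (the ROWOF case); projecting referencedComponentId over all rows reproduces the ordinary ^ (memberOf) behaviour.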
Returning Attributes (Michael Lawley)

Proposal (by Michael) for discussion

  • Currently ECL expressions can match (return) concepts that are either the source or the target of a relationship triple (the target is accessed via the 'reverse' or 'dot' notation), but not the relationship type (i.e. the attribute name) itself.

For example, I can write: 

<< 404684003|Clinical finding| : 363698007|Finding site| = <<66019005|Limb structure| 

<< 404684003|Clinical finding| . 363698007|Finding site| 

But I can't get all the attribute names that are used by << 404684003|Clinical finding| 

    • Perhaps something like:
      • ? R.type ? (<< 404684003 |Clinical finding|)
    • This could be extended to, for example, return different values - e.g.
      • ? |Simple map refset|.|maptarget| ? (^|Simple map refset| AND < |Fracture|)
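The "? R.type ?" idea above amounts to collecting the attribute (relationship type) ids used by a set of source concepts. A minimal sketch over illustrative relationship triples (the ids below are for illustration only):

```python
# Illustrative relationship triples: (sourceId, typeId, destinationId).
RELATIONSHIPS = [
    (125605004, 363698007, 272673000),  # fracture -> |Finding site| -> ...
    (125605004, 116676008, 72704001),   # fracture -> |Associated morphology| -> ...
    (421235005, 116680003, 71341001),   # femur structure -> |Is a| -> ...
]

def attribute_types(relationships, concept_ids):
    """The set of attribute (relationship type) ids used by any concept
    in the given source set -- i.e. the '? R.type ? (<< X)' idea."""
    return {type_id for (source, type_id, _dest) in relationships
            if source in concept_ids}
```

Extending the projection to other columns (e.g. a map target) is then a matter of selecting a different element of the triple or row, as in Michael's second example.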
Reverse Member Of (Michael Lawley)

Proposal for discussion

What refsets is a given concept (e.g. 421235005 |Structure of femur|) a member of?

  • Possible new notation for this:
    • ^ . 421235005 |Structure of femur|
    • ? X ? 421235005 |Structure of femur| = ^ X
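Operationally, reverse member-of is an inversion of refset membership. A minimal sketch (membership data is hypothetical; 734139008 and 421235005 are the ids mentioned above, 123456789 is made up):

```python
# Hypothetical membership data: refsetId -> set of member concept ids.
REFSET_MEMBERS = {
    734139008: {421235005, 80581009},   # |Anatomy structure and part association refset|
    123456789: {421235005},             # made-up second refset
}

def refsets_containing(concept_id, refset_members):
    """Which refsets include the concept as a referenced component?
    This is the proposed '^ . conceptId' (reverse member-of) idea."""
    return {refset_id for refset_id, members in refset_members.items()
            if concept_id in members}
```

In practice this would be answered from the member table (referencedComponentId indexed), not by scanning materialized sets as here.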

Expression Templates

  • ON HOLD WAITING FOR IMPLEMENTATION FEEDBACK FROM INTERNAL TECH TEAM
  • WIP version - https://confluence.ihtsdotools.org/display/WIPSTS/Template+Syntax+Specification
      • Added a 'default' constraint to each replacement slot - e.g. default (72673000 |Bone structure (body structure)|)
      • Enabling 'slot references' to be used within the value constraint of a replacement slot - e.g. [[ +id (<< 123037004 |Body structure| MINUS << $findingSite2) @findingSite1]]
      • Allowing repeating role groups to be referenced using an array - e.g. $rolegroup[1] or $rolegroup[!=SELF]
      • Allow reference to 'SELF' in role group arrays
      • Adding 'sameValue' and 'allOrNone' constraints to information slots - e.g. sameValue ($site), allOrNone ($occurrence)
      • See changes in red here: 5.1. Normative Specification

Examples:

[[+id]]: [[1..*] @my_group sameValue(morphology)] { |Finding site| = [[ +id (<<123037004 |Body structure (body structure)| MINUS << $site[! SELF ] ) @site ]] , |Associated morphology| = [[ +id @my_morphology ]] }

  • Implementation feedback on draft updates to Expression Template Language syntax
    • Use cases from the Quality Improvement Project:
      • Multiple instances of the same role group, with some attributes the same and others different. Eg same morphology, potentially different finding sites.

Note that the QI Project is coming from a radically different use case. Instead of filling template slots, we're looking at existing content and asking "exactly how does this concept fail to comply with this template?"

For discussion:

 [[0..1]] { [[0..1]]   246075003 |Causative agent|  = [[+id (<   410607006 |Organism| ) @Organism]] }

Is it correct to say either one of the cardinality blocks is redundant? What are the implications of 1..1 on either side? This is less obvious for the self grouped case.

Road Forward for SI

  1. Generate the parser from the ABNF and implement in the Template Service
  2. User Interface to a) allow users to specify template at runtime b) tabular (auto-completion) lookup → STL
  3. Template Service to allow multiple templates to be specified for alignment check (aligns to none-off)
  4. Output must clearly indicate exactly what feature of concept caused misalignment, and what condition was not met.

Additional note: QI project is no longer working in subhierarchies. Every 'set' of concepts is selected via ECL. In fact most reports should now move to this way of working since a subhierarchy is the trivial case. For a given template, we additionally specify the "domain" to which it should be applied via ECL. This is much more specific than using the focus concept which is usually the PPP eg Disease.

FYI (Michael Chu)

Description Templates (Kai Kewley)
  • ON HOLD
  • Previous discussion (in Malaysia)
      • Overview of current use
      • Review of General rules for generating descriptions
        • Removing tags, words
        • Conditional removal of words
        • Automatic case significance
        • Generating PTs from target PTs
        • Reordering terms
      • Mechanism for sharing general rules - inheritance? include?
      • Description Templates for translation
      • Status of planned specification
Query Language
- Summary from previous meetings




FUTURE WORK

Examples: version and dialect

Notes

    • Allow nested where, version, language
    • Scope of variables is inner query
Confirm next meeting date/time

Next meeting is scheduled for Wednesday 29th July 2020 at 20:00 UTC.

  File Modified
ZIP Archive equiv.zip 2020-Jul-28 by Ed Cheetham



20 Comments

  1. Regarding 5 & 6 - perhaps it is time to abandon 'simple syntax' for some of these niche(?) features and instead start with 'long syntax'?


  2. Hi all

    Regarding our homework…

    I wonder if the following might be helpful amendments to the suggested text from last time. I think if we are going to use the word ‘secondary’ we should probably also include ‘strength’, and it might be additionally helpful to stretch this to ‘comparison strength’. ‘UCA tailoring’ seems worth including (as a description of the thing the CLDR rules ‘do’):

    "The default behaviour of a system implementing these ECL queries with term searching is to use language/locale-specific asymmetric searching at the secondary comparison strength level. This means that the search is, by default, case insensitive, with some character normalization behaviour (as determined by the value of the language specified). Which characters are normalized in the search string and target term index should be determined by tailoring the UCA (Unicode Collation Algorithm) using the CLDR (Common Locale Data Repository) rules for the specified language."

    The other parameter that UTS#10 pulls out as worthy of mention in relation to search is ‘alternate=shifted’. This seems to relate to a standard (but still locale-tailorable) set of ignorable characters. We might want to include the same default.

    Based on Guillermo’s comments last time, the complexities of the higher comparison strengths (quaternary and identical) and the non-standard nature of regex, I’d be inclined to keep the case sensitivity requirement (and thus any need to consider caseSignificanceId) out of initial scope.

    Kind regards

    Ed

  3. Wrong place for this, but here goes...

    Some recent discussion has drawn my attention back to the concrete domains proposal

    What I'm wondering is, will an entry such as 'dec(>0..)' in the rangeConstraint field satisfy the 'type' of that field? The RDR declares this field in 723562003 |MRCM attribute range international reference set| to be 'SNOMED CT parsable string'; whilst 'dec(>0)' or the like would qualify as valid fragments of such a string, it's not clear whether they, in isolation, would qualify as 'parsable' (in the sense of being 'parsable by a parser that is expecting a well-formed ECL or STS expression'). It looks like a workaround was offered here (5.3 MRCM Attribute Range Reference Set) allowing a concrete range constraint to function as a simple (? now sub) expression constraint, but this seems to be out of date now.

    I wonder if either the definition of a 'SNOMED CT parsable string' needs to be relaxed to include valid elements of a more 'complete' parsable string, or whether the work-around needs rewriting. At present the concrete values will only be introduced to the ECL spec as values in the eclAttribute rule, so I don't see how a parser could 'make sense' of them if they just appear on their own.

    Ed

  4. Good comment Ed. Thanks for raising this.

    I think the MRCM documentation will require some updating to ensure that the rangeConstraint field is typed appropriately. I believe the RDR will remain correct in declaring the rangeConstraint to be a parsable string - however, once the syntax of "dec(>#0..)" has been fully agreed, we will need to ensure that this is a distinct parsable unit. I fully expect to pull this syntax out of the template syntax to ensure that there is a discrete set of ABNF rules that can be used to parse this. One question I have though is: where is the best place to hold this 'data attribute range constraint' syntax? I suspect that this syntax belongs in the MRCM, to be used for the rangeConstraint when the concept model attribute is a type of 'data attribute' - and we'll just need to maintain consistency between this and the corresponding syntax in the template language.

    Any other suggestions?

    • Thanks Linda. Tricky. When you say "...I suspect that this syntax belongs in the MRCM..." do you mean 'just this bit of the syntax will be used in the MRCM'? If the freedom to represent a constraint such as "dec(>#0..)" as a distinct parsable unit leaks out into the wider ECL then (in theory) it provides an (almost certainly unhelpful) alternative mechanism for specifying corresponding numeric values for observables! A rather tortured option would be to use a prefix that sort of 'legitimises' the concrete rangeConstraint such as '* : * = dec(>#0..)'.  This would presumably be a valid constraint parsable using the unmodified ECL ('anything constrained by any attribute name and then takes a decimal value'), and could then be 'merged' with its corresponding attributeRule to produce the combined constraint.
  5. I think my concern is best illustrated with something like schvannom/schwannom because "w" is marked in Swedish but is not in English.




                                            search term:   schvannom    schwannom
                                            locale:        sv     en    sv     en
                                            "w" marked:    Y      N     Y      N
    "w" marked   Id   Description   Lang
    Y            1    schvannom     sv                     Y      Y     N      N
    N            2    schvannom     en                     Y      Y     N      N
    Y            3    schwannom     sv                     Y      Y     Y      Y
    N            4    schwannom     en                     Y      N     Y      Y

    Based on this analysis, a search for "schvannom" in the en locale will both match and not match the "string" schwannom, because it is present in the index twice; once as Swedish, and once as English.

    Of course I'm artificially including Swedish words as English descriptions, but as we see with Sjögren, this kind of thing does happen.


  6. Hi Michael,

    Thank you for articulating your thoughts on this so clearly!

    My understanding of the situation is that when you're doing a term search there is only a single 'locale' (i.e. the search 'locale') which defines the rules by which you will compare the characters in the search term with the characters in the target term. And this one search 'locale' is what defines whether the characters are marked or unmarked (and otherwise related or not). So, it doesn't make sense to say that the "w" is marked in the Search term, but not in the Target term. It's either 'marked' or 'unmarked' in the one search locale.

    So, based on this understanding, I would redraw your table like this:


    search term:                   schvannom    schwannom
    search locale:                 sv     en    sv     en
    "w" marked:                    Y      N     Y      N
    Id   Description   Lang
    1    schvannom     sv          Y      Y     N      N
    2    schvannom     en          Y      Y     N      N
    3    schwannom     sv          Y      N     Y      Y
    4    schwannom     en          Y      N     Y      Y

    ...which illustrates that the language of the search term is irrelevant, except in terms of how it influences the search locale.

    Daniel and Ed - Does this match your understanding?

    Kind regards,

    Linda.

    1. Interesting.  So regardless of the language of the description (the target term), it will always be processed according to the rules of the search locale?

      This would have the consequence that every English description containing a "w" would match the corresponding string with "v"s replacing the "w"s IF your search locale was Swedish?
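The consequence Michael describes can be sketched with a toy secondary-key comparison. The "w is a secondary variant of v in Swedish" tailoring below is illustrative only; a real implementation would use CLDR-tailored UCA collation (e.g. via ICU):

```python
# Illustrative tailoring only: treat "w" as a secondary-level variant
# of "v" in Swedish; English has no such tailoring.
SECONDARY_BASE = {
    "sv": {"w": "v"},
    "en": {},
}

def secondary_key(term, locale):
    """Reduce a term to its locale-specific base letters (case folded)."""
    base = SECONDARY_BASE[locale]
    return "".join(base.get(c, c) for c in term.casefold())

def matches(query, target, locale):
    """Symmetric secondary-strength comparison under ONE search locale,
    regardless of the target description's own language tag."""
    return secondary_key(query, locale) == secondary_key(target, locale)
```

Under a Swedish search locale, "schvannom" and "schwannom" compare equal whatever language the description is tagged with; under an English locale they do not.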


      1. Thanks for the examples. Isn't this one of the hazards of using a collation mechanism (motivated to inform sort ordering by identifying differences) to inform search (motivated by detecting similarities)? I suspect it also exposes weaknesses in the definition of languageCode (certainly for terms adopted by a highly acquisitive language like English)! The Unicode collation FAQs have this to say on 'mixed languages' (my emphasis):

        Q: How are mixed Japanese and Chinese handled?

        A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers would want them sorted in the Japanese fashion, etc. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation).

        My reading of this is that a collation-informed sorting or searching is conducted according to a selected context or locale, irrespective of the 'language' of the items in the substrate ('source') being tested. Accordingly, all terms tested in a Swedish locale will be processed as if they were Swedish words. For sorting this is probably not a big deal, but for searching and 'sameness' it could be odd. The second emphasis seems to throw us a line: SNOMED CT terms are tagged with a language, so it ought to be possible, even where the search locale is specified as Swedish, to apply languageCode-specified rules when these terms are encountered (e.g. 'en' → 'w' ≠ 'v'). This would probably work as intended most of the time, but would presumably risk false negatives where a genuinely Swedish word is used in an en-labelled term. Addressing this would require each term to be tagged with a 'desired collation' (or more likely a 'set of desired collations') which would be extra preparation work.

  7. I think there are 2 situations:

    1. When we define a query with a search filter of "language = se", I believe that we are doing 2 things - (a) setting the locale to Swedish, and (b) selecting only the Swedish descriptions for matching ... so the search locale is the same as the target language. Note: The same query could have 2 filters, each with different languages set (and therefore different locales)
    2. If, however, no "language = X" filter is used, then do we need to assume a default locale? And if so, then (a) is the locale set by the system environment? or (b) Should each language refset (referred to by "dialect = X") be assigned a default locale (e.g. in the dialect alias refset?) or should this instead be derivable from the language of the descriptions in that refset? or (c) should the search locale change depending on the language of the target-descriptions?

    Thoughts?

    Kind regards,
    Linda.

    1. Hi Linda/all

      Looking over the WIP document (I've made a couple of in-line comments) and looking at the current data I'm concerned that "....(b) selecting only the Swedish descriptions for matching..." is too stringent for some use cases. I haven't done the exact numbers but it looks as though a Swedish locale search for 'sjögren', in the Swedish data, restricted to the Swedish descriptions will not find 44212003 |antikropp mot SS-A|. This concept has a 'sjögren'-containing description - but it is an 'en' description. Most of the time the substrate language and the locale will be the same, but we might want to consider supporting exceptions.

      Also I note that there seems to be some discrepancy creeping in between our 'tables of expected behaviour' in (1) the WIP document, the character associations in the various (2) CLDR collation documents, and (3) the International browser character handling rules.  I can see why (3) is at odds, since it's only trying to limit case folding, not identify equivalence, but (1) and (2) should be in step, and they aren't. Looking at the Danish example, I can see no association between 'æ' and 'ae' in the CLDR, but this is used heavily in our WIP examples.

      I wonder if we are getting to the stage where we need to contact the nice people at unicode.org?

      Ed

      1. Hi Ed/all,

        I agree that it's very strange behaviour to not find 44212003 |Sjögren's syndrome - A antibody| with the search term "Sjögren". However, I think we'll need to raise this with the browser implementation, because I'm sure we can agree that this is not the desired behaviour.

        With respect to the examples in the WIP document - the Danish ones were drafted by Anne based on the behaviour she would expect. However, I am aware that these need to be reviewed against the CLDR collation rules. I plan to look at this before this week's meeting, and would appreciate any suggestions people have regarding the examples.

        In terms of what needs to be agreed on to progress this - Based on the assumption that each comparison between a search term and a description in the substrate must use a single set of collation rules - we should try to agree on where the default collation rules are coming from:

        1. Is the default collation based on the 'Language' property of the description in the substrate?
          • Note: This might make indexing easier, as each description only needs to be indexed once, based on the collation rules for its language
        2. Is the default collation based on the language reference set being used?
          • Note: This would require us to record the default collation for each language reference set, which could potentially refer to descriptions of multiple languages if required. One index per language refset?
        3. Is the default collation set in the ECL query? If so, then is it set by the 'language' filter (which restricts the language of the matching descriptions), the 'dialect' filter (which restricts the language refset), or do we need a separate 'collation' filter that allows this to be set in the query?
        4. Is the default collation set by the environment ..... and we just say "The local environment determines the default collation rules, and how the collation rules are altered, if required."
        5. Is the default collation a combination of these - e.g. First check if the collation is set in query, and if not then use the Description's language.
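        To make option 5 concrete, here is a minimal sketch of a layered fallback (all names are hypothetical and this is not part of any ECL specification - just the shape of the decision):

```python
from typing import Optional

# Hypothetical sketch of option 5: a layered fallback for choosing the
# default collation. All parameter names are illustrative only.
def resolve_collation(
    query_collation: Optional[str],    # option 3: set explicitly in the query
    refset_collation: Optional[str],   # option 2: recorded per language refset
    description_language: str,         # option 1: the description's languageCode
) -> str:
    """Return the first collation locale that is explicitly specified."""
    return query_collation or refset_collation or description_language
```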

        Can everyone in the SLPG please consider these options (or additional options if your recommendation is not there), and be ready to explain your preference at this week's meeting? I'd like to draw this conversation to some conclusions this week. If we're unable to agree on this, we may need to simply leave this detail to the local implementation. While that wouldn't be ideal, I think there's benefit in getting these enhancements published and getting some experience in using the term filters.

        Thanks very much for all your contributions! Looking forward to the discussions this week!

        Kind regards,
        Linda.


        1. Hi Linda.

          Just to get something on paper, my thinking is closest to option 2 ("default collation...based on the language reference set being used..."). Looking back to my post from February, I still assume that any valid (Snapshot) must include at least one Language RefSet. Language Refsets pick up a 'language' identity from their metadata concepts, and it would be a reasonable assumption/default position to expect the environment or database-level index and sort collation to have been set to correspond to whatever this language is. Specifying the collation in the ECL filter is really just reinforcing this build-time design decision and then saying 'I have designed this query that I am now sharing using 'sv' collation applied to my data (which is specified in an 'sv' refset). Unless you do the same I certainly wouldn't expect you to get the same results as me.'

          I'm increasingly sceptical about the value of description-level languageCodes for this exercise. As I say in the earlier post I don't actually think that the definition of languageCode ("...specifies the language of the description...") is sufficiently nuanced to cope with the challenge. It can usefully bunch together sets of 'terms that are used by many speakers of a given language', but whilst 'déjà vu', 'mittelschmerz' and 'mosquito' indeed fall into the category of 'words used by English-speakers' I do not believe that makes them 'English words'.

          If an 'sv' refset includes acceptable 'en' or 'da' (or 'se'!) descriptions then they just get treated as if they were 'sv' - surely that's how they would be indexed too when the database collation was specified. I had it in my mind that there was a rule/expectation that Language Refsets could only include same-language descriptions, but your bullet in point 2 above - and the fact that I can't find this constraint written down - means we have to handle this (likely) occurrence.

          Ed

  8. As ever, raised in the wrong place but I don't seem to be able to start 'discussions'!

    A couple of questions/comments regarding 'AND' in the ECL...

    (1) Was there a reason why we didn't include the plus sign ('+') as an alternative for AND to represent conjunction between multiple focus constraints? Both '+' and ',' (comma) indicate conjunction in the SCG, and we included the comma in ECL between attributes, but not the plus sign. I can't recall why we have one but not the other.

    (2) Basically a parser/ECL engine consistency question (please note, not a critique of the implementations, just an attempt to tie down the 'standard' - thanks for making the implementations available):

    The following example constraint is handled differently in different public implementations:

    << 118934005 | Disorder of head | AND
    << 118932009 | Disorder of foot |: 
    {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |}

    Run against the international 20200309 data, I get the following:

    CSIRO Ontoserver - 10 results

    SNOMED International Browser - 27 results

    SNQuery - 'syntax is incorrect'.

    All parsers give 10 results if I put brackets around the focus constraints: 

    (<< 118934005 | Disorder of head | AND
    << 118932009 | Disorder of foot |):
    {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |}

    So question: is there a requirement to wrap multiple conjunctions (when there are only conjunctions) in brackets if accompanied by a refinement (essentially another 'AND')? My instinct (and the current ECL specification) would suggest no, but some implementations seem to expect it.
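    For illustration, the bracketed reading reduces to a plain set intersection; a toy sketch with made-up concept identifiers (real results would of course come from a terminology server):

```python
# Toy concept sets standing in for the three constraints; identifiers are made up.
head_disorders = {"D1", "D2", "D3"}        # << 118934005 | Disorder of head |
foot_disorders = {"D2", "D3", "D4"}        # << 118932009 | Disorder of foot |
with_hyperkeratosis = {"D3", "D4", "D5"}   # concepts matching the refinement

# Conjunction of focus constraints followed by a refinement is just
# another intersection, which is why the bracketing shouldn't change the result.
result = head_disorders & foot_disorders & with_hyperkeratosis
```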

    Thoughts?

    Kind regards Ed




  9. Hi Ed,

    In answer to your questions:

    1. I don't recall the exact reason that we did not include '+' in ECL ... but I suspect it was for consistency between the operators - i.e. "AND", "OR" and "MINUS". By including '+' as an alternative for "AND", we may have been tempted to include '-' as an alternative for "MINUS", which is visually harder to spot in an expression constraint (particularly if the terms contain hyphens).
    2. According to the published ECL syntax, the example you gave (below) is syntactically incorrect - so only the SNQuery implementation complies with the standard.
      • << 118934005 | Disorder of head | AND 
        << 118932009 | Disorder of foot |: 
        {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |}
      While it may be insignificant how the brackets fall on this one, the constraint "<< 118932009 | Disorder of foot |: {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |}" is not a simple expression constraint. Only simple expression constraints - i.e. those with a single focus concept, optionally preceded by a constraint operator (e.g. '<') and/or a memberOf function (i.e. "^") - can avoid brackets. I will raise this bug with our Tech Team.
      Note: If the above syntax was correct, then I would expect it to return the same answer as both of the following two bracketed versions (i.e. '10' on 20200331 international):
      • (<< 118934005 | Disorder of head | AND 
        << 118932009 | Disorder of foot |): 
        {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |}
      • << 118934005 | Disorder of head | AND 
        (<< 118932009 | Disorder of foot |: 
        {116676008 | Associated morphology | = << 26996000 | Hyperkeratosis |})

    Thanks Ed!
    Kind regards,
    Linda.

    1. Thanks Linda.

      I sort of thought the same about the first ('AND'/'+') point, and if it's not causing anyone problems then can probably be left.

      The second point is a bit of a surprise to me! The ECL section on operator precedence says "...when all binary operators are either conjunction (i.e. 'and') or disjunction (i.e. 'or'), brackets are not required...". To my reading, the addition of a refinement to a set of focus conjunctions is just another 'AND' (where the right hand side is a role constraint). As your later examples show, no matter how the brackets are used the result is the same. I probably don't mind either way, but others may have a view?

      That said, personally I'd like to keep each language rooted in and near-isomorphic with the SCG upon which they are based (and upon which they operate). In this situation this would mean the retention of '+' for focus concept conjunction and the lack of need for brackets when dealing with chains of uncomplicated 'AND' logic.

      Hey ho.

      Kind regards

      Ed

      1. Thanks Ed.

        With respect to the wording "... when all binary operators are either conjunction (i.e. 'and') or disjunction (i.e. 'or'), brackets are not required..." - the phrase "binary operators" refers specifically to the operators AND, OR and MINUS (as per the ECL logical model). Refinement, in this sense, is not labelled as a 'binary operator' in the model. So while I agree that brackets aren't required to resolve ambiguity in this case, I would prefer not to revisit this unless others feel strongly.

        Similarly with '+' not being used in ECL - I would prefer not to revisit this, unless there is a strong use case to make ECL a strict 'superset' of SCG.

        Kind regards,
        Linda. 

        1. I agree with excluding '+' from ECL.

          In particular, it would create confusion because

          118934005 | Disorder of head | + 118932009 | Disorder of foot |: {116676008 | Associated morphology | = 26996000 | Hyperkeratosis |}

          is a valid SCG expression, but (based on Linda's analysis above) would not be a valid ECL expression if we allowed '+'.

          I was surprised, however, that the expression with AND is invalid wrt the published syntax, but it is clearly so:

          conjunctionExpressionConstraint = subExpressionConstraint 1*(ws conjunction ws subExpressionConstraint)
          subExpressionConstraint = [constraintOperator ws] [memberOf ws] (eclFocusConcept / "(" ws expressionConstraint ws ")") 
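
          To see why, note that the left-hand side of a refinement must itself be a subExpressionConstraint. A toy check sketching just that rule's shape (this is NOT a real ECL parser, and the regex is a deliberate simplification):

```python
import re

# Toy illustration of the ABNF rule: a subExpressionConstraint is either a
# bracketed expressionConstraint, or a single (optionally prefixed) focus concept.
def lhs_is_valid_subexpression(lhs: str) -> bool:
    lhs = lhs.strip()
    if lhs.startswith("(") and lhs.endswith(")"):
        return True  # bracketed expressionConstraint
    # optional constraint operator, optional memberOf, then one SCTID and term
    return re.fullmatch(r"(<<|<|>>|>)?\s*\^?\s*\d+\s*(\|[^|]*\|)?", lhs) is not None
```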


          1. Thanks both - that's fine.

            Ed

  10. Daniel, all

    Picking up on your suggestion about eponyms (in particular 'Ångström'), and noting that the Danish data has another spelling ('ångstrøm'), I thought it might be time to share what I was experimenting with at the weekend to try and understand this a bit better. Basically it's a reworking of my word equivalent (python3) code, but now reading characters rather than words/phrases from the accompanying charEquiv.txt. The latter is derived from the sv and da CLDR files. I know at the moment the rules are too permissive, but this can be tailored (mostly by changing the 0s and 1s in the fourth column).

    If you unzip the two files into a single folder and run 'python3 charEquiv.py' (or however your system invokes python3) then you should be prompted to input locale code, a comma, and then the search term/phrase (this loops until you enter 'zz'). For example:

    Locale,Search string: da,Ångström should give:
    ---------
    da,Ångström
    ---------
    ångström
    ångstrøm
    ångstrőm
    aangström
    aangstrøm
    aangstrőm
    ---------

    ...suggesting that everything in the longer (output) list is a valid match in the Danish locale. I know this is currently too permissive, treating all << associations in the CLDR as bidirectional. The only place I haven't allowed this is the sv v/w pair, where the 0 and 1 settings in the right hand column say 'if you find a v in the search term, try a w, but not the other way around'. The charEquiv.txt file can be edited whilst the loop is running, so for example you can experiment:

    4åda1
    4aada1

    gives:

    Locale,Search string: da,aarhus
    ---------
    da,aarhus
    ---------
    aarhus
    århus
    ---------

    whilst:

    4åda1
    4aada0

    only gives:

    Locale,Search string: da,aarhus
    ---------
    da,aarhus
    ---------
    aarhus
    ---------
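
    Roughly, the expansion step works like this (a simplified sketch; the rule table below is hypothetical and not the real charEquiv.txt format):

```python
from itertools import product

# Hypothetical directional equivalence rules for the 'da' locale: each
# character maps to itself plus its permitted alternatives.
rules_da = {
    "å": ["å", "aa"],
    "ö": ["ö", "ø", "ő"],
}

def expand(term, rules):
    """Generate all variants of `term` by substituting equivalent characters."""
    options = [rules.get(ch, [ch]) for ch in term]
    return ["".join(combo) for combo in product(*options)]
```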

    Might be useful to understand what's going on, but probably just a great way to waste time (while we think about Linda's more serious questions about default locale)!

    Ed



    equiv.zip