Because the translation files supplied have historically varied in layout, the format for each country is currently hard-coded in the script, and the format is selected at this line: https://git.ihtsdotools.org/ihtsdo/termserver-scripting/blob/develop/src/main/java/org/ihtsdo/termserver/scripting/delta/GenerateTranslation.java#L25
It's probably worth downloading the develop branch of the code and checking that the file format supplied matches the expectations in the function that breaks up the line into component elements. See https://git.ihtsdotools.org/ihtsdo/termserver-scripting/blob/develop/src/main/java/org/ihtsdo/termserver/scripting/delta/GenerateTranslation.java#L204
Ideally the file supplied by the Managed Service Customer would be tab-separated (the input format for the Java program that creates the delta archive), but conversion from an Excel spreadsheet is more usual. When the file has been converted to TSV, it's worth grepping for double quotes (grep "\"") as that shows up various issues that prevent a straightforward conversion, e.g. extra carriage returns within fields.
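The double-quote check and the repair of line breaks inside quoted fields can also be sketched in Java. This is a minimal illustration, not code from GenerateTranslation.java: the class and method names are hypothetical, and it assumes that the spreadsheet export wraps any field containing a line break in double quotes.

```java
// Hypothetical helper, not part of termserver-scripting.
final class QuotedFieldCleanerSketch {

    /** True if the line contains a double quote and so deserves a manual look. */
    static boolean containsDoubleQuote(String line) {
        return line.indexOf('"') >= 0;
    }

    /**
     * Replaces line breaks that occur inside double-quoted fields with a single space.
     * Line breaks outside quotes are row terminators and are left untouched.
     * Assumption: quotes are balanced, as produced by a typical spreadsheet export.
     */
    static String removeEmbeddedLineBreaks(String tsv) {
        StringBuilder out = new StringBuilder(tsv.length());
        boolean insideQuotes = false;
        for (int i = 0; i < tsv.length(); i++) {
            char c = tsv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is preserved.
                insideQuotes = !insideQuotes;
                out.append(c);
            } else if (insideQuotes && (c == '\r' || c == '\n')) {
                // Treat \r\n as one break so only one space is written.
                if (c == '\r' && i + 1 < tsv.length() && tsv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The quotes themselves are left in place so that the affected rows can still be found and reviewed by hand afterwards.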
The headers are ConceptID, Source term, Target term, Language, Type, CaseSignificance. Some comments about each:
|ConceptID||The SCTID. We sometimes see issues here if the spreadsheet column has not been specifically formatted as 'Text' and long SCTIDs can be expressed in scientific notation.|
|Source term||This is the term in English, which we compare with the current term in the TS to ensure there's not been a mix-up. The code is going to be made more flexible so that either the US PT or the FSN (with the semantic tag removed) can be used here.|
|Language||e.g. nl_BE. This field is not required where a single language is used, e.g. sv.|
|Type||"Preferred term" or "Accepted synonym". In the case of SE we only receive preferred terms, so this field hasn't been specified to date. The code in fact checks for the string "Preferred" and anything else is considered acceptable, so we can be quite flexible about the content of this field.|
|CaseSignificance||One of CS, ci, cI (for the last of these, either a capital i or a small L works). This field is not supplied by BE and, in their case, is calculated based on the occurrence of capital letters. Note that BE supply all their translations starting with a small letter, unlike the International Edition which would normally capitalise all terms. Because of this, any BE term which DOES start with a capital letter can be considered CS - Entire Term Case Sensitive. SE, in contrast, do state the Case Significance for each translation.|
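The field rules in the table above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in GenerateTranslation.java: the class and method names are hypothetical, the "Preferred" check is assumed to be case-sensitive, the source term comparison is assumed to be an exact match after trimming, and returning cI for a term with a capital letter after the first character is a guess at what "based on the occurrence of capital letters" means.

```java
// Hypothetical helper, not part of termserver-scripting.
final class TranslationFieldSketch {

    /** SCTIDs mangled by Excel look like 1.591E+12 rather than a run of digits. */
    static boolean looksLikeScientificNotation(String conceptId) {
        return conceptId.trim().matches("\\d+(\\.\\d+)?[eE]\\+?\\d+");
    }

    /** Anything containing "Preferred" is preferred; everything else is acceptable. */
    static boolean isPreferred(String typeField) {
        return typeField != null && typeField.contains("Preferred");
    }

    /** Uses the supplied value if present, otherwise infers one from capital letters. */
    static String caseSignificance(String supplied, String term) {
        if (supplied != null && !supplied.trim().isEmpty()) {
            String value = supplied.trim();
            // A small L is accepted in place of the capital i in "cI".
            return value.equals("cl") ? "cI" : value;
        }
        if (term.isEmpty()) {
            return "ci";
        }
        if (Character.isUpperCase(term.charAt(0))) {
            return "CS"; // BE terms normally start lower case, so a leading capital is significant
        }
        for (int i = 1; i < term.length(); i++) {
            if (Character.isUpperCase(term.charAt(i))) {
                return "cI"; // assumption: a later capital means only the first character is insensitive
            }
        }
        return "ci";
    }

    /** Removes a trailing semantic tag such as " (finding)" from an FSN. */
    static String stripSemanticTag(String fsn) {
        return fsn.replaceFirst("\\s*\\([^()]*\\)\\s*$", "");
    }

    /** Accepts the source term if it matches either the US PT or the FSN without its tag. */
    static boolean sourceTermMatches(String sourceTerm, String usPreferredTerm, String fsn) {
        String source = sourceTerm.trim();
        return source.equals(usPreferredTerm.trim()) || source.equals(stripSemanticTag(fsn).trim());
    }
}
```

Keeping each rule in its own small method makes it easier to adjust one country's convention without touching the others.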
Potential Issues with the File
Most recently working with a file supplied by the Belgian NRC, we encountered the following issues:
- UTF-8 encoding problems. It was unclear at what point in the customer's process these were introduced, but "Café au lait spots" appeared as "CafÃ© au lait spots", which prevented the check on the Source Term from passing validation.
- Multiple translations in a single field. These have to be manually edited since a selection of the 'best' term cannot be made programmatically.
- Additional spaces and carriage returns within fields. These are quite tricky to spot, and caused major issues when opening the file in OpenOffice, e.g. the same row repeated 50 times. The solution was to use the Unix command "tr" to change \r to #, then "sed" to look for double quotes followed by hashes and strip out the hash, and then "tr" again to convert the remaining # characters to \n to be more Unix friendly. Code was also added to remove double spaces from terms - see https://git.ihtsdotools.org/ihtsdo/termserver-scripting/blob/develop/src/main/java/org/ihtsdo/termserver/scripting/delta/GenerateTranslation.java#L211
- Excel encoding problems. Longer SCTIDs can appear in scientific notation if the column has not been properly formatted as text, e.g. 1.591E+12 instead of 1591000119103.
- Duplicates between Preferred and Acceptable terms. Duplicates can only happen within the same language. It is acceptable to have the same text string for two descriptions, if they're in separate languages. The processing program has been enhanced to ignore an exact duplicate, and to upgrade an acceptable term to preferred, if the 2nd term encountered has the higher acceptability.
- Source Term not matching. I think SE were not aware of the importance of this field, which is subject to a specific validation that ensures there's not been a mix-up in the terms. There was variation in whether the PT or FSN (minus the semantic tag) had been used, and we saw some keying errors and extra information added in brackets that caused a mismatch. We've agreed that we'll validate against BOTH the PT and FSN before rejecting the row in future.
- Inactive Concepts. This issue was a mix of the concept actually being inactive and it "going to be" inactive in the next release (see "wrong dependency used" issue below). In fact it would be OK and perhaps even desirable for descriptions to be added to inactive concepts when working with historical EHRs, so I will downgrade this validation to a warning.
- Multiple Preferred Terms. The code has been enhanced to deal with this situation. In the case where two preferred terms are specified for a single concept in the same language, the 2nd term encountered will be downgraded to "Acceptable" and a warning given in the processing results file.
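The handling of duplicates and multiple preferred terms described in the list above can be sketched as below. This is a simplified illustration rather than the processing program's actual code: the class, field and method names are hypothetical, duplicates are assumed to be scoped to the same concept and language, and leaving a term as acceptable when an upgrade would create a second preferred term is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of termserver-scripting.
final class AcceptabilityMergeSketch {

    static final class Row {
        final String conceptId;
        final String language;
        final String term;
        boolean preferred;

        Row(String conceptId, String language, String term, boolean preferred) {
            this.conceptId = conceptId;
            this.language = language;
            this.term = term;
            this.preferred = preferred;
        }
    }

    final List<Row> rows = new ArrayList<>();
    final List<String> warnings = new ArrayList<>();
    private final Map<String, Row> byTerm = new HashMap<>();      // concept|language|term
    private final Set<String> preferredSeen = new HashSet<>();    // concept|language

    void add(String conceptId, String language, String term, boolean preferred) {
        String conceptKey = conceptId + "|" + language;
        String termKey = conceptKey + "|" + term;
        boolean preferredTaken = preferredSeen.contains(conceptKey);

        Row existing = byTerm.get(termKey);
        if (existing != null) {
            // Same text in the same language: ignore it, unless it raises the acceptability.
            if (preferred && !existing.preferred) {
                if (preferredTaken) {
                    warnings.add("Left as Acceptable, preferred term already set: " + conceptId + " " + term);
                } else {
                    existing.preferred = true;
                    preferredSeen.add(conceptKey);
                }
            }
            return;
        }

        boolean effectivePreferred = preferred;
        if (preferred && preferredTaken) {
            // Second preferred term for this concept and language: downgrade and warn.
            effectivePreferred = false;
            warnings.add("Second preferred term downgraded to Acceptable: " + conceptId + " " + term);
        }
        Row row = new Row(conceptId, language, term, effectivePreferred);
        rows.add(row);
        byTerm.put(termKey, row);
        if (effectivePreferred) {
            preferredSeen.add(conceptKey);
        }
    }
}
```

Because the language is part of both keys, the same text string in two different languages is never treated as a duplicate.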
Potential Issues with Processing
- Wrong dependency used. This issue was discovered some time after the fact for the 3rd Swedish translation. The MS TS started timing out when doing an extract, so a snapshot from the INT TS was used as the baseline for processing the translation file instead. Unfortunately this file reflected the current position for our next INT release and so contained several concepts that had been made inactive ready for the 20180131 release but, as far as SE was concerned, were still active, since they were basing off the 20170731 release. If a snapshot cannot be extracted from the server, a file can be manually constructed by zipping up just the SNAPSHOT directory of the International Release.
- Description Ids. SCTIDs are obtained manually from CIS and assigned in the delta archive supplied. If the Prod IDs are used for the UAT import, it's quite possible for existing descriptions in UAT to be effectively deleted when their 'already used' description ID appears to switch to a new concept, because as far as Prod was concerned that ID was still available. So separate files of identifiers must be obtained for UAT and Prod imports, and a new delta archive constructed for each environment.