A major application of Natural Language Processing technologies is indexing collections of free-text transcripts or documents so that topic-specific searches may be run on them. The challenge is to return ranked matches that permit selection of texts with high sensitivity and high specificity (i.e. relevant documents are rarely overlooked and irrelevant documents are rarely returned).
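As a rough illustration of these two measures applied to a retrieval run, the sketch below computes sensitivity (the fraction of relevant documents retrieved) and specificity (the fraction of irrelevant documents correctly excluded); the document numbers are invented:

```python
# Hypothetical illustration of sensitivity and specificity for a search run.
def sensitivity_specificity(returned, relevant, collection_size):
    """Sensitivity: share of relevant documents that were retrieved.
    Specificity: share of irrelevant documents that were not retrieved."""
    returned, relevant = set(returned), set(relevant)
    true_pos = len(returned & relevant)
    false_pos = len(returned - relevant)
    true_neg = collection_size - len(relevant) - false_pos
    return true_pos / len(relevant), true_neg / (collection_size - len(relevant))

# Invented example: 100 documents, 10 relevant (1-8, 90, 91);
# the search returns documents 1-12, of which 8 are relevant.
sens, spec = sensitivity_specificity(
    set(range(1, 13)), set(range(1, 9)) | {90, 91}, 100
)
# sensitivity = 0.8 (8 of 10 relevant found); specificity ≈ 0.956 (86 of 90 excluded)
```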
Clinical searches may be performed over transcripts or documents that reside in an electronic library, within medical records, or the Internet. Examples of searches include:
Bevan Koopman's PhD thesis explores semantic and statistical approaches to search, with the intention of moving beyond the limitations of plain keyword-searching strategies for medical document retrieval. Characterizing these limitations as the 'semantic gap', Bevan identifies and addresses several issues including:
His specific aim was to determine whether graph-based features and the propagation of information over a graph can provide an inference mechanism to bridge this semantic gap. As part of this work, he assessed the contribution of using SNOMED CT data within the graphs used to drive inferences.
The specific application in the thesis was to find patients who match certain inclusion criteria for recruitment into clinical trials, based on the analysis of free-text transcripts from clinical records.
Indexing methods were applied to the TREC MedTrack corpus, a standard collection of electronic texts containing de-identified reports from multiple hospitals in the United States. It includes nine types of transcript: history and physical examinations, consultations, progress notes, discharge summaries, emergency department reports, operation reports, radiology reports, surgical pathology reports and cardiology reports. The collection as used contained around 100,000 reports grouped into around 17,000 unique 'visits'.
Graphs have a number of characteristics that align with the requirements of semantic search as inference. The edges in a graph capture interdependence between concepts, which is identified as one of the semantic gap problems. Graphs are a common feature of both ontologies and retrieval models. The propagation of information over a graph, as in the popular PageRank algorithm used by Internet search engines, provides a powerful means of identifying relevant information items (be they terms, concepts or documents). Ontologies such as SNOMED CT may also be represented as graphs.
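To make the idea of propagation concrete, here is a minimal PageRank sketch over a toy adjacency list; this is an illustration of the general technique, not the thesis implementation, and the graph is invented:

```python
# Minimal PageRank: each node repeatedly shares its score with its neighbours.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, neighbours in graph.items():
            if neighbours:
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    new_rank[m] += share
            else:
                # Dangling node: spread its score uniformly.
                for m in nodes:
                    new_rank[m] += damping * rank[n] / len(nodes)
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# "c" is linked from both "a" and "b", so it ends up ranked highest.
```

The same propagation idea applies whether the nodes are web pages, terms, concepts or documents.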
The Graph Inference model developed by Bevan Koopman specifically addresses a number of semantic gap problems. Regarding vocabulary mismatch, the Graph Inference model utilizes a concept-based representation as this helps to overcome vocabulary mismatches (i.e. missed synonymy). The Graph Inference model specifically addresses granularity mismatch by traversing parent-child (i.e. 'is a') relationships.
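The granularity-mismatch traversal can be sketched as a simple expansion of a query concept to its descendants along 'is a' edges; the concept names and hierarchy below are invented examples, not actual SNOMED CT content:

```python
# Sketch: bridging granularity mismatch by expanding a query concept to its
# children via 'is a' edges (hierarchy invented for illustration).
is_a_children = {
    "Diabetes mellitus": ["Type 1 diabetes mellitus", "Type 2 diabetes mellitus"],
    "Type 2 diabetes mellitus": ["Type 2 diabetes mellitus with nephropathy"],
}

def expand_query(concept, graph, max_depth=2):
    """Collect a concept plus its descendants down to max_depth."""
    expanded = {concept}
    if max_depth > 0:
        for child in graph.get(concept, []):
            expanded |= expand_query(child, graph, max_depth - 1)
    return expanded

expanded = expand_query("Diabetes mellitus", is_a_children)
# A query for the general concept now also matches documents that only
# mention the more specific child concepts.
```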
The semantic gap problem of 'conceptual implication' is where the presence of certain terms in a document implies the query terms. For example, an organism may imply the presence of a certain disease. Such associations are captured in SNOMED CT and thus the Graph Inference model can specifically address conceptual implication by traversing those relationships.
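Traversal of such associative relationships, with each edge attenuated by a weight for its relationship type, can be sketched as follows; the concepts, edges and weights are all invented for illustration and are not taken from SNOMED CT or the thesis:

```python
# Sketch: propagate a concept's relevance score to related concepts, with each
# hop discounted by a weight chosen per relationship type (values invented).
relationship_weight = {"is a": 0.9, "finding site": 0.5, "causative agent": 0.7}

edges = {
    "Pneumonia": [("Infectious disease", "is a"), ("Lung", "finding site")],
    "Infectious disease": [("Streptococcus pneumoniae", "causative agent")],
}

def propagate(concept, score=1.0, depth=2):
    """Spread a concept's score to neighbours, attenuated per relationship type."""
    scores = {concept: score}
    if depth > 0:
        for neighbour, rel in edges.get(concept, []):
            weighted = propagate(neighbour, score * relationship_weight[rel], depth - 1)
            for node, s in weighted.items():
                scores[node] = max(scores.get(node, 0.0), s)
    return scores

scores = propagate("Pneumonia")
# A document mentioning the organism can now be matched, at a discounted
# score, against a query for the disease.
```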
Finally, the semantic gap problem of 'inference of similarity', where the strength of association between two entities is critical, is specifically addressed by the diffusion factor, which assigns a measure of similarity to each domain knowledge-based relationship. In the case of SNOMED CT the diffusion factor was derived from SNOMED CT relationships. It was noted that some relationships contributed to search sensitivity or conversely could lead to noise (loss of specificity) for the purpose of document retrieval. A weighting was applied (empirically) to each SNOMED CT relationship type and used as part of the relationship type component of the diffusion factor. For example, relationship type weightings included:
Documents were parsed and analyzed using Lemur – a highly versatile and customizable open-source information retrieval package developed by Carnegie Mellon University and the University of Massachusetts. The graph was constructed using the open-source LEMON graph library, serialized with LEMON, and stored inside the Lemur index directory. For the MedTrack corpus, which was found to have a vocabulary size of 36,467 SNOMED CT concepts, the resulting graph was 4.4MB.
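As a rough illustration of this indexing step (not the Lemur/LEMON pipeline itself), a graph linking concepts to the documents that mention them can be built and serialized alongside an index; the visit and concept IDs below are invented:

```python
import json

# Invented example: documents already mapped to lists of concept IDs.
documents = {
    "visit-001": ["C1", "C2"],
    "visit-002": ["C2", "C3"],
}

def build_graph(docs):
    """Link each concept to the documents that mention it."""
    graph = {}
    for doc_id, concepts in docs.items():
        for concept in concepts:
            graph.setdefault(concept, []).append(doc_id)
    return graph

graph = build_graph(documents)
serialized = json.dumps(graph)  # stored alongside the index for later retrieval
```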
The findings of the thesis demonstrated that the graph-based retrieval approaches using SNOMED CT-derived data performed better than other approaches on 'hard' queries. A number of additional insights were also revealed. First, hard queries require inference and easy queries do not. Hard queries tended to be verbose and often contained multiple dependent aspects (for example, a procedure concept and a diagnosis concept); re-ranking using the Graph Inference model was effective here. Easy queries tended to have a small number of relevant documents and an unambiguous query concept; for these, inference was not required and the bag-of-concepts model was most effective. Overall, when valuable domain knowledge was provided by SNOMED CT, the Graph Inference model was effective, either by returning new relevant documents or by effectively re-ranking those already retrieved. This again highlights the dependence on the underlying domain knowledge.
Regarding the residual lack of sensitivity of all the IR strategies, Koopman suggests that an ideal ontology for information retrieval would contain not only definitional but also assertional data – for example "captopril can be used as a treatment of hypertension", "myocardial infarction [may] cause heart block" and "diabetes mellitus may lead to renal failure".