***DRAFT MANUSCRIPT: NOT FOR CITATION WITHOUT PERMISSION***


Automatic Semantic Interpretation of Anatomic Spatial Relationships in Clinical Text

Carol A. Bean Ph.D.
Thomas C. Rindflesch Ph.D.
Charles A. Sneiderman M.D. Ph.D.

National Library of Medicine, Bethesda MD


Abstract

A set of semantic interpretation rules to link the syntax and semantics of locative relationships among anatomic entities was developed and implemented in a natural language processing system. Two experiments assessed the ability of the system to identify and characterize physico-spatial relationships in coronary angiography reports. Branching relationships were by far the most common observed (75%), followed by PATH (20%) and PART/WHOLE relationships. Recall and precision scores were 0.78 and 0.67 overall, suggesting the viability of this approach in semantic processing of clinical text.


INTRODUCTION

For many medical informatics researchers, the "holy grail" is reliable automated analysis of natural language to provide meaningful semantic interpretation and access to medical information. Tools in support of this goal have been under development for the past 10 years at the National Library of Medicine (NLM), under the Unified Medical Language System (UMLS) program [1]. Specifically, the domain-specific conceptual content is provided by the Metathesaurus, a compilation of controlled medical vocabularies, while the Lexicon and associated lexical variant programs [2] support natural language processing techniques being developed in the SPECIALIST system [3]. These include a syntactic component that serves as the basis for MetaMap [4], a program for mapping free text to Metathesaurus concepts. These resources have allowed progress to be made in semantic interpretation of medical text, based on the Semantic Network [5]. This interpretation determines the relationships being asserted between Metathesaurus concepts.

The process of semantic interpretation (implemented as a Prolog program) relies on the SPECIALIST Lexicon and a stochastic tagger for resolving part-of-speech ambiguities. Underspecified syntactic analysis uses this information to provide input to the MetaMap program. So, for example, the text in (1) is given the syntactic parse in (2).

(1) The right coronary artery arises from the aorta.

(2) [[det(the), mod(right), mod(coronary), head(artery)] [ verb(arise) ] [ prep(from), det(the), head(aorta) ] ]

On the basis of this structure, MetaMap determines the Metathesaurus concepts in (3).

(3) Right coronary artery, NOS ('Body Part, Organ, or Organ Component') Aorta ('Body Part, Organ, or Organ Component')

Semantic processing then combines syntactic information, Metathesaurus concepts, and relationships from the UMLS Semantic Network in order to produce the final semantic interpretation (4).

(4) Aorta-HAS_BRANCH-Right coronary artery, NOS

Crucial to this process are rules that provide a link between aspects of syntactic structure and the relational information encoded in the Semantic Network. In the current example such a rule stipulates that the verb "arise" indicates the Semantic Network relationship BRANCH_OF. On the basis of this rule, general principles of semantic interpretation then determine the syntactic arguments of "arise" in the current sentence, namely the two noun phrases "the right coronary artery" and "the aorta." Subsequently, the program notes the Metathesaurus concepts corresponding to these noun phrases ("Right coronary artery" and "Aorta"), and the respective semantic types ('Body Part, Organ, or Organ Component' for both concepts). The program further notes that one of the semantic triples assigned to BRANCH_OF is (5):

(5) Type 1: 'Body Part, Organ, or Organ Component' Relation: BRANCH_OF Type 2: 'Body Part, Organ, or Organ Component'

The verb "arise" correlates by rule with the Semantic Network relation BRANCH_OF, and the semantic types of the Metathesaurus concepts assigned to the syntactic arguments of the verb match the semantic types of the arguments of that relation. The final semantic interpretation of (1) is therefore as given in (4), where the inverse relation BRANCH_OF has been normalized to the direct relation HAS_BRANCH.
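The interpretation steps for example (1) can be sketched in a few lines of Python. This is an illustrative sketch only, not the Prolog implementation: the names INDICATOR_RULES, SEMANTIC_TRIPLES, INVERSE_TO_DIRECT, and interpret are hypothetical, and the mapping of noun phrases to Metathesaurus concepts is assumed to have been done already.

```python
# Hypothetical sketch of rule-based interpretation for example (1).
# All names and data structures are illustrative assumptions.

BODY_PART = "Body Part, Organ, or Organ Component"

# Indicator rule: a verb signals a Semantic Network relationship.
INDICATOR_RULES = {"arise": "BRANCH_OF"}

# Semantic triples licensed by the Semantic Network, as in (5).
SEMANTIC_TRIPLES = {(BODY_PART, "BRANCH_OF", BODY_PART)}

# Inverse relations and the direct relations they normalize to.
INVERSE_TO_DIRECT = {"BRANCH_OF": "HAS_BRANCH"}


def interpret(verb, subject, obj):
    """Return (concept, relation, concept), or None if nothing applies.

    subject and obj are (concept name, semantic type) pairs.
    """
    relation = INDICATOR_RULES.get(verb)
    if relation is None:
        return None  # no indicator rule for this verb
    (subj_concept, subj_type), (obj_concept, obj_type) = subject, obj
    if (subj_type, relation, obj_type) not in SEMANTIC_TRIPLES:
        return None  # argument types do not match any licensed triple
    direct = INVERSE_TO_DIRECT.get(relation)
    if direct is not None:
        # Normalize an inverse relation by swapping its arguments.
        return (obj_concept, direct, subj_concept)
    return (subj_concept, relation, obj_concept)


result = interpret(
    "arise",
    ("Right coronary artery, NOS", BODY_PART),
    ("Aorta", BODY_PART),
)
print("-".join(result))  # Aorta-HAS_BRANCH-Right coronary artery, NOS
```

The printed string reproduces the interpretation in (4). A verb with no indicator rule, or arguments whose semantic types match no licensed triple, yields no interpretation in this sketch.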

Given the importance of anatomic representation and reasoning in medicine and the recent availability via the UMLS Metathesaurus of the anatomic terminology of the UWDA Symbolic Knowledge Base [6], the present study investigated the development and implementation of semantic interpretation rules to link the syntax and semantics of locative relationships among anatomic entities. Two preliminary experiments examined how well these could be used to identify and characterize specific types of locative relationships in clinical text. The first experiment aimed to test only those rules derived from the physical and spatial relationships in the 1998 Semantic Net and their equivalents, i.e., the set of operationally defined equivalents drawn from the UMLS documentation [7], and additional equivalents derived from an analysis of verb argument structures indicating locative relationships in medical text [8]. The second experiment augmented these with additional indicator rules derived for a set of relationships expressing the topological elements of a frame-like structure for the generalized concept PATH: a starting point, a destination, and the path between the two. The primary elements of the PATH frame structure are Source, Path, Destination, Direction, and Distance. These comprise a common set of spatial concepts and relationships in anatomic description, but an evaluation of Semantic Net coverage of spatial and physical relationships in anatomy found this class of relationships lacking [8]. For this study, the general PATH relationship was designated as 'Projects,' and the three structural components as PATH.Source: HAS-ORIGIN (inverse: ORIGIN-OF), PATH.Path: TRAVERSES (inverse: TRAVERSED-BY), and PATH.Destination: HAS-TERMINUS (inverse: TERMINUS-OF). The other two components, Direction and Distance, represent a different type of spatial relationship (geometric), which requires separate treatment and will not be addressed by the present study.
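For readers who find a data structure easier to scan, the PATH frame can be written down as a small table of slots. The sketch below is an assumption about how one might encode it, not a description of the system; the class name PathSlot and the list PATH_FRAME are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathSlot:
    """One element of the PATH frame (hypothetical encoding)."""
    name: str
    kind: str                 # "topological" or "geometric"
    relation: Optional[str]   # relationship used in this study, if any
    inverse: Optional[str]    # its inverse, if any


PATH_FRAME = [
    PathSlot("Source", "topological", "HAS-ORIGIN", "ORIGIN-OF"),
    PathSlot("Path", "topological", "TRAVERSES", "TRAVERSED-BY"),
    PathSlot("Destination", "topological", "HAS-TERMINUS", "TERMINUS-OF"),
    # Direction and Distance are geometric and are not addressed here,
    # so no relationship is assigned to them.
    PathSlot("Direction", "geometric", None, None),
    PathSlot("Distance", "geometric", None, None),
]

# Slots for which indicator rules were developed in this study.
addressed = [slot.name for slot in PATH_FRAME if slot.relation is not None]
print(addressed)  # ['Source', 'Path', 'Destination']
```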


METHOD

The general procedure for development of the rules is described here using PATH as an example. In brief, there exists a set of verbs that generally evoke the image schema satisfied by the conceptual structure PATH as described above; examples include 'extend,' 'project,' 'run,' 'course,' etc. These verbs may be modified or further specified, typically by an adverbial prepositional phrase, to focus or highlight one or more particular aspects of the PATH construct. For example, consider the general PATH verb 'extend': adding a phrase beginning with 'from' to create 'extend from' directs attention to the Source of the PATH; likewise, 'extend to' focuses on the Destination while 'extend through' highlights the Path traversed between the two. Similarly, different forms of verbs such as 'merge,' 'split,' 'join,' and 'separate' were used to construct rules to specify branching and tributary relationships. This phenomenon was exploited in a systematic fashion to develop a set of semantic rules for each type of spatial and physical relationship that could identify the associated locative relationship indicators in text. Table 1 provides a summary of the distribution of the semantic indicator rules across relationship types.
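The way a preposition narrows a general PATH verb to one aspect of the frame can be illustrated with a minimal lookup. This is a sketch under stated assumptions rather than the actual rule set: the names PATH_VERBS, PREPOSITION_TO_RELATION, and path_relation are hypothetical, verbs are assumed to be given as base forms, and the choice of the direct rather than the inverse relation for each preposition is an assumption.

```python
# Hypothetical sketch: verb plus preposition selects a PATH relationship.

# General PATH verbs named in the text, as base forms.
PATH_VERBS = {"extend", "project", "run", "course"}

# The preposition focuses one aspect of the PATH frame.
PREPOSITION_TO_RELATION = {
    "from": "HAS-ORIGIN",     # Source
    "to": "HAS-TERMINUS",     # Destination
    "through": "TRAVERSES",   # Path
}


def path_relation(verb, preposition=None):
    """Return the relationship indicated, or None for non-PATH verbs."""
    if verb not in PATH_VERBS:
        return None
    # With no focusing preposition, fall back to the general relationship.
    return PREPOSITION_TO_RELATION.get(preposition, "PROJECTS")


print(path_relation("extend", "from"))     # HAS-ORIGIN
print(path_relation("extend", "to"))       # HAS-TERMINUS
print(path_relation("extend", "through"))  # TRAVERSES
```

The same table-driven pattern would extend to the branching and tributary verbs, with one entry per verb form and relationship.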

In the experimental phase of the study, 15 reports of coronary angiography were obtained from the Johns Hopkins Cardiac Catheterization Laboratory and stripped of patient identifying information. These reports contain sections on Indication, Cardiac Catheterization Procedure, Hemodynamics, Coronary Arteriography (by major vessel), Left Ventriculography, and Impression. Seven coronary arteriography sections describing the anatomical structure and location of coronary vasculature served as the test set of documents in this study; the reports contained a total of 93 sentences, of which 68 had spatial indicators. Two of the authors (CAB, CAS) reviewed the sections to identify and record the relevant textual spatial indicators; there were a total of 114 in the test set.

All 93 sentences were processed using the syntactic parser and semantic interpreter as described above. Because the UWDA anatomy vocabulary was added in the 1998 Metathesaurus, which has not yet been incorporated into the current MetaMap version, the system uses the 1997 Metathesaurus. Full semantic interpretation therefore could not be tested here, and a separate study addressed treatment of arguments [9]. Thus, although the system was allowed to fully process each sentence to the extent possible, the actual assessment was made at the point at which the rules triggered the relationships rather than on the final interpreted conceptual representation.

The actual dataset analyzed consisted of the list of potential relationships suggested by semantic interpretation of the rules triggered in each sentence. These were compared directly with the manually marked text, using the latter as the standard, and scores were recorded for each relationship. A correct response was scored whenever the system automatically generated the correct relationship for the marked text segment. A "miss" was scored if the system did not list the correct relationship for a marked text segment. A "false drop" occurred when the system returned a relationship that was not identified for the marked text segment. Recall and precision scores were calculated for each relationship separately, for Experiment 1 and Experiment 2, and overall.
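The scoring scheme reduces to two ratios, sketched below for the has_branch row of Table 2. The function names are hypothetical. Because Table 2 reports misses and false drops as a single error count, the split between them is derived here on the assumption that every marked indicator not correctly identified is a miss. Returning 0.0 for an empty denominator is likewise an assumption.

```python
def recall(correct, misses):
    """Fraction of manually marked relationships the system found."""
    total = correct + misses
    return correct / total if total else 0.0


def precision(correct, false_drops):
    """Fraction of system-generated relationships that were correct."""
    total = correct + false_drops
    return correct / total if total else 0.0


# has_branch row of Table 2: 32 marked in text, 28 correct, 5 errors.
marked, correct, errors = 32, 28, 5
misses = marked - correct        # marked but not found: 4
false_drops = errors - misses    # remaining errors: 1

r = recall(correct, misses)
p = precision(correct, false_drops)
print(f"R={r:.3f} P={p:.3f}")  # R=0.875 P=0.966
```

These values correspond to the 0.87 and 0.97 reported for has_branch, and the derived count of one false drop agrees with the single false positive noted in the discussion.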


RESULTS AND DISCUSSION

Table 2 provides the distribution of relationship types observed across all 7 Coronary Arteriography sections, along with performance measures for semantic interpretation. Although Recall and Precision scores are provided for all relationship types, only a few were observed frequently enough to permit drawing any but the most preliminary conclusions, and even these must be viewed with caution. Still, overall performance seemed quite acceptable compared to current standards of performance for natural language processing, and results for some individual relationship types were rather encouraging.

As would be expected, the most common relationships observed in the coronary arteriography sections were branching relationships, accounting for two-thirds of those observed in Experiment 1 and half in Experiment 2; direct and inverse forms of branching relationships were about equally represented. Almost all of these were correctly identified by the semantic interpretation rules, with a combined recall of 0.90. Precision was even better for HAS-BRANCH (0.97), with only a single false positive. In contrast, incorrect BRANCH-OF relationships were triggered each time the system encountered a named branch, contributing to a relatively low precision of 0.53. Next most common in Experiment 1 were meronymic (PART-OF) relationships (18%), which also had the highest rate of missed relationships, yielding a recall of 0.53. While it is not clear just what caused this, the syntactic cues may simply be more subtle than the semantic interpretation rules can currently accommodate. It is also worth noting that most of these missed relationships referred to locations of abnormalities; again, the syntactic treatment of locative relationships between normal anatomic and pathologic entities may differ subtly from that among anatomic entities themselves.

Largely because of the prevalence of branching relationships, physical relationships accounted for the bulk (75%) of the relationships overall. Combined, the PATH relationships were the most common spatial relationship class, and were next most common overall after BRANCH-OF, lending support to the importance of these relationships and the distinctions among them. With the inclusion of rules for PATH relationships, performance degraded somewhat in Experiment 2, probably reflecting the preliminary nature of the rule development for these forms.

While the processing in this study stopped short of full implementation of semantic interpretation, the results suggest promise for this approach. Further, when the results from this study are considered along with those described in [9], it seems reasonable to expect comparable performance for a system combining both components, that is, when the complex arguments may be reliably identified and coupled with the appropriate relationships, opening the door for true semantic interpretation of medical text.


References

[1] Humphreys BL, Lindberg DAB, Schoolman HM, and Barnett GO. The Unified Medical Language System: An informatics research collaboration. Journal of the American Medical Informatics Association 1998:5(1):1-13.

[2] McCray AT, Srinivasan S and Browne AC. Lexical methods for managing variation in biomedical terminologies. In Ozbolt JG (ed.) Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care, 1994:235-239.

[3] McCray AT, Aronson AR, Browne AC, Rindflesch TC, Razi A and Srinivasan S. UMLS knowledge for biomedical language processing. Bulletin of the Medical Library Association 1993:81:184-194.

[4] Aronson AR, Rindflesch TC, and Browne AC. Exploiting a large thesaurus for information retrieval. Proceedings of RIAO 94, 1994, 197-216.

[5] Rindflesch TC and Aronson AR. Semantic processing in information retrieval. In Safran C (ed.) Proceedings of the 17th Annual SCAMC, 1993:611-615.

[6] Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: The Digital Anatomist Symbolic Knowledge Base. Journal of the American Medical Informatics Association 1998:5(1):17-40.

[7] National Library of Medicine. 1998. Unified Medical Language System Documentation.

[8] Bean CA. Formative evaluation of a frame-based model of locative relationships in human anatomy. In Masys DR (ed.) Proceedings of the 1997 AMIA Annual Fall Symposium (Formerly SCAMC), JAMIA Symposium Supplement 1997. Nashville, TN: October 26-29, 1997, pp. 625-9.

[9] Sneiderman CA, Rindflesch TC, Bean CA. Identification of anatomical terminology in medical text. Paper submitted to the 1998 AMIA Annual Fall Symposium.


TABLE 1. Distribution of semantic locative indicator rules across relationship types in the 1998 Semantic Net

Semantic Net Group           Relationship   Rules (N)   Inverse            Rules (N)
physically_related_to        part_of               13   has_part                  13
                             consists_of            6   constitutes                4
                             contains               4   contained_in               6
                             connected_to          10   -----                  -----
                             interconnects          7   interconnected_by          7
                             branch_of              6   has_branch                12
                             tributary_of          11   has_tributary              1
spatially_related_to         location_of            9   has_location              14
                             adjacent_to           12   -----                  -----
                             surrounds             18   surrounded_by              6
                             traverses             36   traversed_by              10
SUBTOTAL (Expt.1)                                                                205
Proposed PATH Relationships  traverses             41   -----                  -----
                             origin_of              1   has_origin                17
                             terminus_of           41   has_terminus               1
                             projects               8   -----                  -----
SUBTOTAL (Expt.2)                                                                109
TOTAL                                                                            314

TABLE 2. Performance of automated identification of text indicators of spatial relationships: Frequency and Recall and Precision

98 UMLS SNRs                  Text (CCR 1-7)   Correct ID   Errors (Miss & False Pos)   Recall   Precision
part_of                                   17            9                           8     0.53        1.00
contains                                   1            1                           2     1.00        0.33
interconnects                              2            0                           2     0.00        0.00
branch_of                                 30           28                          27     0.93        0.53
(has_branch)                              32           28                           5     0.87        0.97
tributary_of                               3            3                           0     1.00        1.00
location_of                                6            6                           0     1.00        1.00
surrounds                                  2            2                           0     1.00        1.00
traverses                                  2            0                           2     0.00        0.00
SUBTOTAL EXPT.1                           95           77                          46     0.81        0.73

Proposed PATH Relationships
traverses                                  7            3                           6     0.43        0.60
origin_of                                  5            4                           4     0.80        0.57
(has_origin)                               5            5                           4     1.00        0.56
(has_terminus)                             1            0                           3     0.00        0.00
projects                                   1            0                           4     0.00        0.00
SUBTOTAL EXPT.2                           19           12                          21     0.63        0.46

TOTAL                                    114           89                          67     0.78        0.67