MetaMap: Mapping Text to the UMLS® Metathesaurus®


Alan R. Aronson


March 6, 1996



The task of automatically determining the concepts referred to in text is a common one. It occurs in SPECIALIST™ as a prerequisite to improving the retrieval of relevant MEDLINE® citations based on queries formulated in English. In this case the text consists not only of user queries but also the titles and abstracts of MEDLINE citations; and the concepts to be found in the text are those of the UMLS Metathesaurus. The task becomes one of mapping text to the Metathesaurus (referred to subsequently as Meta). This report describes the development of MetaMap, a program for automatically mapping biomedical text to Meta. Section 1 contains the results of manually examining the mapping problem for a small collection of utterances. It provides the basis for defining a strategy for automatically mapping to Meta as outlined in Section 2. The automatic approach is characterized by determining how to map from the noun phrases discovered by the SPECIALIST minimal commitment parser to appropriate concepts in Meta. The results of such a mapping can be used to normalize the text so that each referenced concept is represented uniquely. Details of the MetaMap implementation are given in Section 3 through Section 7. It should be noted that the MetaMap algorithms are not specific to the biomedical domain and can be generalized to any domain with adequate knowledge sources.


1. A Preliminary Examination of the Mapping Problem


In order to determine the scope of the problem of mapping text to Meta, a set of 99 utterances (16 queries and 83 citation titles) was taken from the NLM Test Collection. The SPECIALIST minimal commitment parser was applied to the utterances producing 301 noun phrases. Each of the phrases was manually mapped to the 1992 version of Meta and classified into one of four categories based on how well it maps to Meta. Membership in a category is determined by lexical properties of the mapping as defined below; and in each case inflectional and spelling variation are ignored:

Normal partial match--The simplest type of partial match occurs when a Meta string maps part of the noun phrase without gaps in what it does map. For example, liquid crystal thermography maps to Thermography where the mapping does not involve liquid crystal. Similarly, cochlear implant subjects maps to Cochlear Implant where subjects is not involved. Normal partial matches provide good results for the part of the noun phrase involved.

Gapped partial match--Gapped partial matches involve a gap either in the noun phrase or Meta string or both. For the mapping of ambulatory monitoring to AMBULATORY CARDIAC MONITORING, the gap CARDIAC occurs in the Meta string. For the mapping of obstructive sleep apnea to Obstructive Apnea, the gap sleep occurs in the noun phrase. And for the mapping of continuous pump-driven hemofiltration to Continuous Arteriovenous Hemofiltration, gaps occur in both. Gapped partial matches often provide better results than normal partial matches because of their greater matching involvement. However, when the gap occurs in the Meta string, the string tends to be too specific.

Overmatch--An overmatch occurs when a match does not involve words at either or both ends of the Meta string. An overmatch is similar to a normal partial match except that part of the Meta string is uninvolved in the mapping. For example, the Meta string Postoperative Complications is an overmatch for ocular complications. The phrase application has many overmatches including Job Application, Heat/Cold Application and Medical Informatics Application. Overmatches almost always give poor results unless browsing is the object of the mapping.
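The three definitions above can be approximated with a purely lexical check. The following Python sketch is an illustration only, not the procedure used in the manual study: the function name classify_match is hypothetical, words are compared by exact lowercase equality (inflectional and spelling variation are not handled), and giving overmatch precedence over gapped partial match is an assumption. Because the definitions of simple and complex matches are not given here, a mapping that involves every word on both sides is reported simply as a full match.

```python
def classify_match(phrase: str, meta_string: str) -> str:
    """Classify a phrase-to-Meta-string mapping by lexical overlap alone."""
    p = phrase.lower().split()
    m = meta_string.lower().split()
    # Positions of phrase words found in the Meta string, and vice versa.
    p_hits = [i for i, w in enumerate(p) if w in m]
    m_hits = [i for i, w in enumerate(m) if w in p]
    if not m_hits:
        return "no match"
    if len(p_hits) == len(p) and len(m_hits) == len(m):
        return "full match"
    # Overmatch: a word at either end of the Meta string is uninvolved.
    if m_hits[0] != 0 or m_hits[-1] != len(m) - 1:
        return "overmatch"

    def contiguous(hits):
        return hits[-1] - hits[0] + 1 == len(hits)

    # Gapped: the involved words skip over a word in the phrase or Meta string.
    if not contiguous(p_hits) or not contiguous(m_hits):
        return "gapped partial match"
    return "normal partial match"


examples = [
    ("liquid crystal thermography", "Thermography"),
    ("ambulatory monitoring", "AMBULATORY CARDIAC MONITORING"),
    ("obstructive sleep apnea", "Obstructive Apnea"),
    ("ocular complications", "Postoperative Complications"),
]
for phrase, meta in examples:
    print(f"{phrase} -> {meta}: {classify_match(phrase, meta)}")
```

Under these assumptions the examples quoted in the definitions fall into the categories the text assigns to them.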


The categories above are listed in order of the strength of the mapping, a simple match being the strongest. It should be emphasized, however, that the semantic or conceptual quality of the mapping varies widely. Even simple matches can map text to unrelated Meta terms. For example, the noun phrase the numeric values maps to the Meta concept Values with semantic type Qualitative Concept. In Meta, Values is a term from psychology referring to social values; it is not a quantitative concept. Furthermore, even when Meta contains the correct concept, that concept may be ambiguous. For example, Meta contains two ventilation concepts, one related to air flow in buildings and the other related to respiration.


Some examples from the manual study illustrating the types of match just discussed are given below. In each case a phrase and one or more Meta terms are listed together with the type of match. Note that each phrase generally maps to more Meta terms than shown.

The results of manually mapping the 301 noun phrases to Meta are summarized in Table 1. [Note that 70 percent of the 113 partial matches involved the head of the noun phrase.]

Table 1. Summary of Noun Phrase Mappings to Meta

-----------------------------------------
| Lexical Mapping   | Count  | Percent  |
| Category          |        |          |
=========================================
| Simple match      | 91     | 30%      |
-----------------------------------------
| Complex match     | 24     | 8%       |
-----------------------------------------
| Partial match     | 113    | 38%      |
-----------------------------------------
| No match          | 73     | 24%      |
-----------------------------------------
| Total             | 301    | 100%     |
-----------------------------------------
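
The percentages in Table 1 follow directly from the counts. The short Python check below recomputes them; the variable names are illustrative, and rounding to the nearest whole percent is assumed.

```python
# Counts taken from Table 1.
counts = {
    "Simple match": 91,
    "Complex match": 24,
    "Partial match": 113,
    "No match": 73,
}
total = sum(counts.values())
# Round each share to the nearest whole percent (assumed rounding rule).
percents = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:15s} {n:4d} {percents[name]:3d}%")
print(f"{'Total':15s} {total:4d}")
```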


2. The Basic Mapping Strategy


The experience gained from the manual mapping exercise described above led naturally to the following strategy for accomplishing the mapping automatically. Perform the following steps for each textual utterance:

  1. Parse the text into noun phrases and perform the remaining steps for each phrase;
  2. Generate the variants for the noun phrase where a variant essentially consists of one or more noun phrase words together with all of its spelling variants, abbreviations, acronyms, synonyms, inflectional and derivational variants, and meaningful combinations of these;
  3. Form the candidate set of all Meta strings containing one of the variants;
  4. For each candidate, compute the mapping from the noun phrase and calculate the strength of the mapping using an evaluation function. Order the candidates by mapping strength; and
  5. Combine candidates involved with disjoint parts of the noun phrase, recompute the match strength based on the combined candidates, and select those having the highest score to form a set of best Meta mappings for the original noun phrase.

Descriptions of steps 2-5 of the mapping strategy are given in the next four sections.
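
The five steps can be read as a simple pipeline. The Python skeleton below is a sketch of that control flow only: map_utterance and its five callable parameters are hypothetical names, and the stand-in functions in the usage portion are deliberately trivial placeholders rather than MetaMap's parser, variant generator, or evaluation function.

```python
def map_utterance(utterance, parse, generate_variants, find_candidates,
                  evaluate, combine):
    """Run the five-step strategy; each step is supplied as a callable."""
    mappings = {}
    for phrase in parse(utterance):                      # step 1
        variants = generate_variants(phrase)             # step 2
        candidates = find_candidates(variants)           # step 3
        scored = sorted(((evaluate(phrase, c), c) for c in candidates),
                        reverse=True)                    # step 4
        mappings[phrase] = combine(phrase, scored)       # step 5
    return mappings


# Toy stand-ins, for illustration only.
TOY_META = ["Complications", "Eye", "Thermography"]
result = map_utterance(
    "ocular complications",
    parse=lambda text: [text],
    generate_variants=lambda phrase: set(phrase.lower().split()),
    find_candidates=lambda vs: [s for s in TOY_META if s.lower() in vs],
    evaluate=lambda phrase, cand: 1000 * len(cand.split()) // len(phrase.split()),
    combine=lambda phrase, scored: scored[:1],
)
print(result)
```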


3. Noun Phrase Variants


The Meta mapping algorithm begins by computing a set of variant generators for each noun phrase discovered by the parser. A variant generator is any meaningful subsequence of words in the phrase where a subsequence is meaningful if it is either a single word or occurs in the SPECIALIST lexicon. For example, the variant generators for the noun phrase of liquid crystal thermography are liquid crystal thermography, liquid crystal, liquid, crystal and thermography (prepositions, determiners, conjunctions, auxiliaries, modals, pronouns and punctuation are ignored). Note the multi-word generators. A simpler example which will be used throughout the sequel is based on the noun phrase ocular complications. Its generators are simply ocular and complications.
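
As an illustration of this definition, the Python sketch below enumerates variant generators for a phrase. It rests on several assumptions: subsequences are taken to be contiguous, the IGNORED set is a tiny stand-in for the ignored word classes, and toy_lexicon is a hypothetical two-entry substitute for the SPECIALIST lexicon.

```python
# Tiny stand-in for prepositions, determiners, conjunctions, etc.
IGNORED = {"of", "the", "a", "an", "and", "or", "in", "with"}


def variant_generators(phrase, lexicon):
    """Return single words plus multi-word subsequences found in the lexicon."""
    words = [w for w in phrase.lower().split() if w not in IGNORED]
    generators = []
    # Longest subsequences first, left to right within each length.
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if length == 1 or candidate in lexicon:
                if candidate not in generators:
                    generators.append(candidate)
    return generators


toy_lexicon = {"liquid crystal thermography", "liquid crystal"}
print(variant_generators("of liquid crystal thermography", toy_lexicon))
print(variant_generators("ocular complications", toy_lexicon))
```

With this toy lexicon the sketch reproduces the generators listed above for both example phrases.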


The approach taken in computing variants is a canonicalization approach. This simply means that a variant represents not only itself but all of its inflectional and spelling variants. Collapsing inflectional and spelling variants results in significant computational savings. Variants are computed for each of the variant generators according to the scheme pictured in Figure 1.

(Figure)

The computation for each generator proceeds as follows:

  1. Compute all acronyms, abbreviations and synonyms of the generator. This results in the three sets Generator, Acronyms/Abbreviations, and Synonyms which are highlighted with boxes in Figure 1;
  2. Augment the elements of the three sets by computing their derivational variants and the synonyms of the derivational variants;
  3. For each member of the Acronyms/Abbreviations set, compute synonyms; and
  4. For each member of the Synonyms set, compute acronyms/abbreviations.
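
The four steps above can be sketched in Python as set operations over lookup tables. Everything named here is an assumption made for illustration: compute_variants and lookup are hypothetical helpers, the three small tables are invented stand-ins rather than data from the actual knowledge sources, and the sketch tracks neither variant distances nor computation histories.

```python
def lookup(table, terms):
    """Union of the table entries for every term in terms."""
    found = set()
    for term in terms:
        found |= table.get(term, set())
    return found


def compute_variants(generator, acronyms, synonyms, derivations):
    # Step 1: the Generator, Acronyms/Abbreviations and Synonyms sets.
    gen = {generator}
    aa = lookup(acronyms, gen)
    syn = lookup(synonyms, gen)

    # Step 2: add derivational variants and the synonyms of those variants.
    def augment(terms):
        derived = lookup(derivations, terms)
        return terms | derived | lookup(synonyms, derived)

    gen, aa, syn = augment(gen), augment(aa), augment(syn)
    # Step 3: synonyms of each acronym/abbreviation.
    syn_of_aa = lookup(synonyms, aa)
    # Step 4: acronyms/abbreviations of each synonym.
    aa_of_syn = lookup(acronyms, syn)
    return gen | aa | syn | syn_of_aa | aa_of_syn


# Invented toy tables, for illustration only.
toy_acronyms = {"magnetic resonance imaging": {"mri"}}
toy_synonyms = {"ocular": {"eye"}}
toy_derivations = {"ocular": {"oculus"}}

print(sorted(compute_variants("ocular", toy_acronyms, toy_synonyms, toy_derivations)))
```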

The issue of whether to recursively generate variants of a given type is handled as follows:


The variants computed for the generator ocular are shown in Figure 2.

(Figure)

Following each variant is its variant distance score, a rough measure of how much it varies from its generator (see Section 5) and the history of how it was computed. For example,


The variant generation algorithm described here is knowledge intensive. It uses the following knowledge sources:


4. Meta Candidates


The Meta candidates for a noun phrase consist of the set of all Meta strings containing at least one of the variants computed for the phrase. The candidates are easily found by using a version of the Meta word index, an index from words to all Meta strings containing them. The Meta candidates for the noun phrase ocular complications are shown in Figure 3.

(Figure)

When a string is not, itself, the preferred name for a Meta concept, the preferred name appears in parentheses following the string. The candidates are ordered according to the evaluation function described in the next section. The best candidates are Complications and complications <1> both of which are simple matches involving the head of the phrase. The remaining candidates are variants of ocular and are listed in order of similarity to ocular.
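
A word index of the kind described in this section can be sketched as an inverted index. The Python below is a minimal illustration under stated assumptions: build_word_index and find_candidates are hypothetical names, toy_meta is a three-string stand-in for Meta, and only single-word variants are looked up.

```python
from collections import defaultdict


def build_word_index(meta_strings):
    """Map each lowercase word to the set of strings containing it."""
    index = defaultdict(set)
    for string in meta_strings:
        for word in string.lower().split():
            index[word].add(string)
    return index


def find_candidates(variants, index):
    """Collect every string that contains at least one variant."""
    found = set()
    for variant in variants:
        found |= index.get(variant.lower(), set())
    return found


toy_meta = ["Complications", "Eye", "Thermography"]
index = build_word_index(toy_meta)
candidates = find_candidates({"ocular", "complications", "eye"}, index)
print(sorted(candidates))
```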


5. The Evaluation Function


The evaluation function computes a measure of the quality of the match between a phrase and a Meta candidate. For normal MetaMap operation the evaluation function is based on four components: centrality, variation, coverage, and cohesiveness. A normalized value between 0 (the weakest match) and 1 (the strongest match) is computed for each of these components. A weighted average is computed in which the coverage and cohesiveness components each receive twice the weight of the centrality and variation components. The result is normalized to a value between 0 and 1000, 0 indicating no match at all and 1000 indicating an identical match (except for capitalization). When MetaMap is used for browsing (e.g., for term processing), the coverage and cohesiveness components are both replaced by a single component, involvement. Each of the evaluation function components is discussed below.


The final evaluation for Eye is the weighted average (0 + 2/3 + 2*(5/6) + 2*(3/4))/6 which normalizes to 638. Similarly, the final evaluation for Complications is (1 + 1 + 2*(5/6) + 2*(3/4))/6 which normalizes to 861.
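
The two worked values above can be reproduced with a few lines of Python. This is a sketch of the weighted average only; the function name evaluate is illustrative, the component values are taken as given rather than computed, and truncating (rather than rounding) the scaled result is an assumption, chosen because 23/36 scales to 638.9 while the reported score is 638.

```python
from fractions import Fraction


def evaluate(centrality, variation, coverage, cohesiveness):
    """Weighted average of the four components, scaled to 0..1000."""
    weighted = (centrality + variation + 2 * coverage + 2 * cohesiveness) / 6
    # Truncation is assumed; it reproduces the values quoted in the text.
    return int(weighted * 1000)


eye = evaluate(0, Fraction(2, 3), Fraction(5, 6), Fraction(3, 4))
complications = evaluate(1, 1, Fraction(5, 6), Fraction(3, 4))
print(eye, complications)
```

Exact fractions are used so that the truncation step is not affected by floating-point error.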


6. The Final Mapping


The final step in the mapping algorithm is straightforward. It consists of examining combinations of Meta candidates which participate in matches with disjoint parts of the noun phrase. The evaluation function is applied to the combined candidates, and the best ones form the final mapping result. The best mappings for ocular complications are shown in Figure 4.

(Figure)

The centrality, variation, coverage and cohesiveness values for the mapping are 1, 2/3, 1 and 1, respectively. The final evaluation of the mapping is the weighted average (1 + 2/3 + 2*1 + 2*1)/6 which normalizes to 861 and is reported as a confidence value in the figure.
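
The combination step in this section can be illustrated by enumerating sets of candidates that involve disjoint parts of the phrase. The Python sketch below covers only that enumeration; rescoring the combinations is left out. The name disjoint_combinations and the word-position sets assigned to each candidate are assumptions made for the example.

```python
from itertools import combinations


def disjoint_combinations(candidates):
    """candidates maps a Meta string to the phrase word positions it matches."""
    names = sorted(candidates)
    result = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            positions = [candidates[name] for name in combo]
            covered = frozenset().union(*positions)
            # Disjoint exactly when no position is counted twice.
            if sum(len(p) for p in positions) == len(covered):
                result.append(combo)
    return result


# Word positions in "ocular complications": 0 = ocular, 1 = complications.
toy_candidates = {
    "Complications": frozenset({1}),
    "complications <1>": frozenset({1}),
    "Eye": frozenset({0}),
}
combos = disjoint_combinations(toy_candidates)
for combo in combos:
    print(combo)
```

Complications and complications <1> never appear together because both involve the same word of the phrase.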


7. MetaMap Control Options


MetaMap behavior is controlled by several option flags each of which has a short version (e.g., -p) and a long version (e.g., --plain_syntax). With the exception of the --threshold option, each option is a toggle switch. Specifying a default option toggles it off; specifying a non-default option toggles it on. The options are described in the following sections.

7.1 The default options


MetaMap's default behavior is defined by the options: -t (--tag_text), -l (--stop_large_n), -b (--best_mappings_only), -p (--plain_syntax), -c (--candidates), -s (--semantic_types), and -m (--mappings). Each of these options is defined below.
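
The toggle behavior described in Section 7 can be modeled as flipping set membership. The Python sketch below is a model of that rule only, not MetaMap's option-handling code: effective_options is a hypothetical function, options are identified by their long names, example_non_default is a placeholder rather than a real MetaMap option, and the non-toggle --threshold option is not modeled.

```python
# Long names of the default options listed in Section 7.1.
DEFAULT_OPTIONS = frozenset({
    "tag_text", "stop_large_n", "best_mappings_only", "plain_syntax",
    "candidates", "semantic_types", "mappings",
})


def effective_options(specified):
    """Start from the defaults and flip each option named by the user."""
    active = set(DEFAULT_OPTIONS)
    for option in specified:
        active ^= {option}  # on -> off for defaults, off -> on otherwise
    return active


# "example_non_default" is a placeholder, not a real MetaMap option.
active = effective_options(["plain_syntax", "example_non_default"])
print(sorted(active))
```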

7.2 Processing options


Processing options control MetaMap's internal behavior.

7.3 Output options

7.4 Miscellaneous options