Gapped partial match--Gapped partial matches involve a gap in the noun phrase, in the
Meta string, or in both. For the mapping of ambulatory monitoring to AMBULATORY CARDIAC
MONITORING, the gap CARDIAC occurs in the Meta string. For the mapping of
obstructive sleep apnea to Obstructive Apnea, the gap sleep occurs in the noun phrase. And for
the mapping of continuous pump-driven hemofiltration to Continuous Arteriovenous
Hemofiltration, gaps occur in both. Gapped partial matches often provide better results than normal
partial matches because more words are involved in the match. However, when the gap occurs
in the Meta string, the string tends to be too specific.
Overmatch--An overmatch occurs when a match does not involve words at either or both ends
of the Meta string. An overmatch is similar to a normal partial match except that part of the
Meta string is uninvolved in the mapping. For example, the Meta string Postoperative
Complications is an overmatch for ocular complications. The phrase application has many
overmatches including Job Application, Heat/Cold Application and Medical Informatics
Application. Overmatches almost always give poor results unless browsing is the object of the
mapping.
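The distinctions drawn above can be sketched in a few lines of Python. This is only an
illustration: the representation of a match as lists of matched word positions, and the
function names, are assumptions made here and are not taken from MetaMap itself.

```python
def has_gap(positions):
    """Return True if the matched word positions are not contiguous."""
    ordered = sorted(positions)
    return any(b - a > 1 for a, b in zip(ordered, ordered[1:]))


def is_overmatch(meta_positions, meta_length):
    """Return True if the Meta string has unmatched words at its start or end."""
    return min(meta_positions) > 0 or max(meta_positions) < meta_length - 1


def describe(phrase_positions, meta_positions, meta_length):
    """Summarize where gaps occur and whether the match is an overmatch."""
    return {
        "gap_in_phrase": has_gap(phrase_positions),
        "gap_in_meta": has_gap(meta_positions),
        "overmatch": is_overmatch(meta_positions, meta_length),
    }


# ambulatory monitoring -> AMBULATORY CARDIAC MONITORING (gap in the Meta string)
print(describe([0, 1], [0, 2], 3))
# obstructive sleep apnea -> Obstructive Apnea (gap in the noun phrase)
print(describe([0, 2], [0, 1], 2))
# ocular complications -> Postoperative Complications (overmatch)
print(describe([1], [1], 2))
```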
2. The Basic Mapping Strategy
The experience gained from the manual mapping exercise described above led naturally to
the following strategy for accomplishing the mapping automatically. Perform the following steps
for each textual utterance:
- Parse the text into noun phrases and perform the remaining steps for each phrase;
- Generate the variants for the noun phrase, where a variant essentially consists of one or more
noun phrase words together with all of their spelling variants, abbreviations, acronyms,
synonyms, inflectional and derivational variants, and meaningful combinations of these;
- Form the candidate set of all Meta strings containing one of the variants;
- For each candidate, compute the mapping from the noun phrase and calculate the strength of
the mapping using an evaluation function. Order the candidates by mapping strength; and
- Combine candidates involved with disjoint parts of the noun phrase, recompute the match
strength based on the combined candidates, and select those having the highest score to form a
set of best Meta mappings for the original noun phrase.
Descriptions of steps 2-5 of the mapping strategy are given in the next four sections.
The Meta mapping algorithm begins by computing a set of variant generators for each
noun phrase discovered by the parser. A variant generator is any meaningful subsequence of
words in the phrase where a subsequence is meaningful if it is either a single word or occurs in the
SPECIALIST lexicon. For example, the variant generators for the noun phrase of liquid crystal
thermography are liquid crystal thermography, liquid crystal, liquid, crystal and thermography
(prepositions, determiners, conjunctions, auxiliaries, modals, pronouns and punctuation are
ignored). Note the multi-word generators. A simpler example which will be used throughout the
sequel is based on the noun phrase ocular complications. Its generators are simply ocular and
complications.
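A minimal sketch of generator computation follows. It assumes that a "subsequence" means a
contiguous run of words, and it uses a tiny hypothetical stand-in for the SPECIALIST lexicon
and a hard-coded list of ignored words (the real system decides what to ignore by syntactic
category). None of these names come from MetaMap.

```python
# Hypothetical stand-ins: multi-word lexicon entries and words to ignore.
TOY_MULTIWORD_LEXICON = {"liquid crystal thermography", "liquid crystal"}
TOY_IGNORED_WORDS = {"of", "the", "a", "an", "and", "or", "with"}


def variant_generators(phrase):
    """Return the meaningful contiguous word sequences of a phrase, longest first."""
    words = [w for w in phrase.lower().split() if w not in TOY_IGNORED_WORDS]
    generators = []
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            # A sequence is meaningful if it is a single word or a lexicon entry.
            meaningful = length == 1 or candidate in TOY_MULTIWORD_LEXICON
            if meaningful and candidate not in generators:
                generators.append(candidate)
    return generators


print(variant_generators("of liquid crystal thermography"))
print(variant_generators("ocular complications"))
```

With the toy lexicon above, the two calls reproduce the generator lists given in the text.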
The approach taken in computing variants is a canonicalization approach. This simply
means that a variant represents not only itself but all of its inflectional and spelling variants.
Collapsing inflectional and spelling variants results in significant computational savings. Variants are
computed for each of the variant generators according to the scheme pictured in Figure 1. The
computation for each generator proceeds as follows:
- Compute all acronyms, abbreviations and synonyms of the generator. This results in the three
sets Generator, Acronyms/Abbreviations, and Synonyms which are highlighted with boxes in
Figure 1;
- Augment the elements of the three sets by computing their derivational variants and the
synonyms of the derivational variants;
- For each member of the Acronyms/Abbreviations set, compute synonyms; and
- For each member of the Synonyms set, compute acronyms/abbreviations.
The issue of whether to recursively generate variants of a given type is handled as follows:
- Acronyms and abbreviations are not recursively generated since doing so almost always
produces incorrect results. For example, the abbreviation na of sodium has expansions nurse's
aide and nuclear antigen which are unrelated to sodium; and
- Derivational variants and synonyms are recursively generated since this often produces
meaningful variants.
The variants computed for the generator ocular are shown in Figure 2. Following each
variant is its variant distance score, a rough measure of how much it varies from its generator (see
Section 5) and the history of how it was computed. For example,
- oculus (with variant distance 3 and history "d") is simply a derivational variant of the generator
ocular;
- optical (with variant distance 7 and history "ssd") is a derivational variant of a synonym (optic)
of a synonym (eye) of ocular; and
- vision (with variant distance 9 and history "ssds") is a synonym of the derivational variant
optical described above.
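The way histories and distances accumulate can be illustrated with a small breadth-first
expansion. This is a simplified sketch, not the scheme of Figure 1: the relation table is a
hypothetical toy knowledge base, the code "a" for an acronym/abbreviation step is an assumption
(only "s" and "d" appear in the examples above), and the step costs are those of Table 2.

```python
from collections import deque

# Step costs from Table 2: synonym or acronym/abbreviation = 2, derivational = 3.
STEP_COST = {"s": 2, "a": 2, "d": 3}

# Hypothetical toy knowledge base: word -> list of (step code, related word).
TOY_RELATIONS = {
    "ocular": [("d", "oculus"), ("s", "eye")],
    "eye": [("s", "optic")],
    "optic": [("d", "optical")],
    "optical": [("s", "vision")],
    "sodium": [("a", "na")],
    "na": [("a", "nurse's aide"), ("a", "nuclear antigen")],
}


def generate_variants(generator):
    """Map each variant to (distance, history), keeping the first derivation found."""
    variants = {generator: (0, "")}
    queue = deque([generator])
    while queue:
        word = queue.popleft()
        distance, history = variants[word]
        for code, related in TOY_RELATIONS.get(word, []):
            # Acronyms/abbreviations are not generated recursively.
            if code == "a" and "a" in history:
                continue
            if related in variants:
                continue
            variants[related] = (distance + STEP_COST[code], history + code)
            queue.append(related)
    return variants


for variant, (distance, history) in generate_variants("ocular").items():
    print(variant, distance, history)
```

Run on sodium, the same function yields na but not nurse's aide or nuclear antigen, because
the second acronym/abbreviation step is blocked.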
The variant generation algorithm described here is knowledge intensive. It uses the
following knowledge sources:
- the SPECIALIST lexicon and a table of canonical forms derived from it;
- a SPECIALIST knowledge base of acronyms and abbreviations;
- a SPECIALIST knowledge base containing rules of derivational morphology; and
- two knowledge bases of synonyms: one obtained by extracting synonyms from Dorland's
Illustrated Medical Dictionary, and a supplemental synonym knowledge base developed for use
with SPECIALIST.
4. Meta Candidates
The Meta candidates for a noun phrase consist of the set of all Meta strings containing at
least one of the variants computed for the phrase. The candidates are easily found by using a
version of the Meta word index, an index from words to all Meta strings containing them. The Meta
candidates for the noun phrase ocular complications are shown in Figure 3. When a string is not
itself the preferred name for a Meta concept, the preferred name appears in parentheses following
the string. The candidates are ordered according to the evaluation function described in the next
section. The best candidates are Complications and complications <1>, both of which are simple
matches involving the head of the phrase. The remaining candidates are variants of ocular and
are listed in order of similarity to ocular.
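Candidate retrieval through a word index can be sketched as follows. The list of Meta strings
is a hypothetical toy sample, the function names are illustrative, and the lookup handles
single-word variants only, which is a simplification of what is described above.

```python
from collections import defaultdict

# Hypothetical toy sample of Meta strings.
TOY_META_STRINGS = ["Complications", "Eye", "Vision", "Lung Cancer", "Job Application"]


def build_word_index(meta_strings):
    """Build an index from each lowercased word to the Meta strings containing it."""
    index = defaultdict(set)
    for string in meta_strings:
        for word in string.lower().split():
            index[word].add(string)
    return index


def find_candidates(variants, index):
    """Return every Meta string that contains at least one of the variants."""
    candidates = set()
    for variant in variants:
        candidates |= index.get(variant.lower(), set())
    return candidates


index = build_word_index(TOY_META_STRINGS)
print(sorted(find_candidates({"ocular", "eye", "vision", "complications"}, index)))
```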
The evaluation function computes a measure of the quality of the match between a phrase
and a Meta candidate. For normal MetaMap operation the evaluation function is based on four
components: centrality, variation, coverage, and cohesiveness. A normalized value between 0
(the weakest match) and 1 (the strongest match) is computed for each of these components. A
weighted average is computed in which the coverage and cohesiveness components receive twice
the weight of the centrality and variation components. The result is normalized to a value
between 0 and 1000, 0 indicating no match at all and 1000 indicating an identical match (except
for capitalization). When MetaMap is used for browsing (e.g., for term processing), the coverage
and cohesiveness components are both replaced by a single component, involvement. Each of the
evaluation function components is discussed below.
- Centrality: The centrality value is simply 1 if the string involves the head of the phrase and 0
otherwise. For the noun phrase ocular complications, Complications has centrality value 1
and Eye has value 0.
- Variation: The variation value estimates how much the variants in the Meta string differ from
the corresponding words in the phrase. It is computed by first determining the variation
distance for each variant in the Meta string. This distance is the sum of the distance values for
each step taken during variant generation. The values for each step are listed in Table 2. The
variation distance determines the variation value for the given variant according to the formula
V=4/(D+4). As the total distance value, D, increases from its minimum value of 0, V decreases
from a maximum value of 1 and is bounded below by 0. The final variation value for the
candidate is the average of the values for each of the variants. For ocular complications, Eye has a
variant distance value of 2 and hence a variation value of 2/3 (4/(2+4)). Complications has a
variant distance value of 0 and hence a variation value of 1.

Table 2. Variant Distances
-----------------------------------------------------
| Variant Type                    | Distance Value  |
=====================================================
| spelling                        | 0               |
-----------------------------------------------------
| inflectional                    | 1               |
-----------------------------------------------------
| synonym or acronym/abbreviation | 2               |
-----------------------------------------------------
| derivational                    | 3               |
-----------------------------------------------------

- Coverage: The coverage value indicates how much of the Meta string and the phrase are
involved in the match. In order to compute the value, the number of words participating in the
match is computed for both the Meta string and the phrase. These numbers are called the Meta
span and phrase span, respectively. Note, however, that gaps are ignored. The coverage value
for the Meta string is the Meta span divided by the length of the string. Similarly, the coverage
value for the phrase is the phrase span divided by the length of the phrase. The final coverage
value is the weighted average of the values for the Meta string and the phrase where the Meta
string is given twice the weight of the phrase. For ocular complications and either Eye or
Complications, the Meta span and phrase span are both 1, and the coverage value is 5/6
(2/3*(1/1) + 1/3*(1/2)).
- Cohesiveness: The cohesiveness value is similar to the coverage value but emphasizes the
importance of connected components. A connected component is a maximal sequence of
contiguous words participating in the match. The connected components for both the Meta string
and the phrase are computed. This information is abstracted by noting the size of each
component. This produces a set of connected component sizes for both the Meta string and the phrase.
The cohesiveness value for the Meta string is the sum of the squares of the connected Meta
string component sizes divided by the square of the length of the string. A similar cohesiveness
value is computed for the phrase. The final cohesiveness value is the weighted average of the
Meta string and phrase values where the Meta string is again given twice the weight of the
phrase. For ocular complications and either Eye or Complications, the connected component
sizes for both the Meta string and the phrase are {1} since one word from either the phrase or
Meta string participates in the match. The cohesiveness value is 3/4 (2/3*(1/1) + 1/3*(1/4)).
The final evaluation for Eye is the weighted average (0 + 2/3 + 2*(5/6) + 2*(3/4))/6 which
normalizes to 638. Similarly, the final evaluation for Complications is
(1 + 1 + 2*(5/6) + 2*(3/4))/6 which normalizes to 861.
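The arithmetic for Eye and Complications can be checked with a short Python sketch. The
function names are illustrative and not part of MetaMap. Exact fractions are used to avoid
rounding noise, and truncating the normalized score to an integer is an assumption, chosen
because it reproduces the value 638 given above.

```python
from fractions import Fraction


def variation(distances):
    """Average of V = 4/(D + 4) over the variant distances D."""
    values = [Fraction(4, d + 4) for d in distances]
    return sum(values) / len(values)


def coverage(meta_span, meta_length, phrase_span, phrase_length):
    """Weighted average of span/length, with the Meta string weighted twice."""
    return (Fraction(2, 3) * Fraction(meta_span, meta_length)
            + Fraction(1, 3) * Fraction(phrase_span, phrase_length))


def cohesiveness(meta_sizes, meta_length, phrase_sizes, phrase_length):
    """Weighted average of (sum of squared component sizes) / length squared."""
    meta = Fraction(sum(s * s for s in meta_sizes), meta_length ** 2)
    phrase = Fraction(sum(s * s for s in phrase_sizes), phrase_length ** 2)
    return Fraction(2, 3) * meta + Fraction(1, 3) * phrase


def score(centrality, variation_value, coverage_value, cohesiveness_value):
    """Weighted average scaled to 0..1000 (truncated; truncation is an assumption)."""
    weighted = (centrality + variation_value
                + 2 * coverage_value + 2 * cohesiveness_value) / 6
    return int(weighted * 1000)


# Both candidates match one word of the two-word phrase "ocular complications".
cov = coverage(1, 1, 1, 2)            # 5/6
coh = cohesiveness([1], 1, [1], 2)    # 3/4
print(score(0, variation([2]), cov, coh))  # Eye
print(score(1, variation([0]), cov, coh))  # Complications
```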
- Involvement: The involvement value is a rough approximation of the coverage and
cohesiveness values. The strict word order implied by the matchmap is no longer followed. The
involvement value for the phrase is the proportion of phrase words which can map to a Meta
word whether or not they do according to the matchmap. For example, given the phrase
Advanced cancer of the lung with words [advanced, cancer, lung] and the Meta string "Lung
Cancer" with words [lung, cancer], the matchmap maps lung to lung, but does not map cancer
because of word order. The phrase involvement value here is 2/3 as opposed to the coverage
value of 1/3. Similarly, the involvement value for the Meta string is the proportion of words
which can be mapped to from the phrase. For the current example, the Meta involvement value
is 2/2 or 1 rather than 1/2 for coverage. Thus the final involvement value for this example is the
weighted average (2/3 + 1)/2 or 0.83.
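The involvement example can be reproduced with the sketch below. It assumes that a word "can
map" when the identical word occurs on the other side (the real system would also consider
variants), and it combines the two values with equal weights, as in the worked example. The
function name is illustrative.

```python
from fractions import Fraction


def involvement(phrase_words, meta_words):
    """Order-independent involvement: share of words on each side that can map."""
    phrase_set, meta_set = set(phrase_words), set(meta_words)
    phrase_value = Fraction(sum(w in meta_set for w in phrase_words), len(phrase_words))
    meta_value = Fraction(sum(w in phrase_set for w in meta_words), len(meta_words))
    # Equal weighting of the two sides, following the worked example in the text.
    return (phrase_value + meta_value) / 2


value = involvement(["advanced", "cancer", "lung"], ["lung", "cancer"])
print(value, round(float(value), 2))
```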
The final step in the mapping algorithm is straightforward. It consists of examining
combinations of Meta candidates which participate in matches with disjoint parts of the noun phrase.
The evaluation function is applied to the combined candidates, and the best ones form the final
mapping result. The best mappings for ocular complications are shown in Figure 4. The
centrality, variation, coverage and cohesiveness values for the mapping are 1, 2/3, 1 and 1, respectively.
The final evaluation of the mapping is the weighted average (1 + 2/3 + 2*1 + 2*1)/6 = 17/18, which
normalizes to 944 and is reported as a confidence value in the figure.
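The combination step can be sketched as an exhaustive search over candidate subsets that cover
disjoint phrase positions. Everything here is illustrative: the candidate representation, the
function names, and in particular the scoring function passed in the example, which merely
counts covered phrase words as a stand-in for re-applying the evaluation function. The search
is exponential in the number of candidates and is meant only to show the idea.

```python
from itertools import combinations


def disjoint_combinations(candidates):
    """Yield every non-empty combination whose phrase positions do not overlap.

    candidates: list of (name, set of phrase word positions covered).
    """
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            positions = [p for _, covered in combo for p in covered]
            if len(positions) == len(set(positions)):
                yield combo


def best_mappings(candidates, score):
    """Return all disjoint combinations that share the highest score."""
    combos = list(disjoint_combinations(candidates))
    top = max(score(c) for c in combos)
    return [c for c in combos if score(c) == top]


def covered_words(combo):
    """Stand-in score (an assumption): number of phrase words covered."""
    return sum(len(covered) for _, covered in combo)


candidates = [("Eye", {0}), ("Complications", {1}), ("complications <1>", {1})]
for mapping in best_mappings(candidates, covered_words):
    print([name for name, _ in mapping])
```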
MetaMap behavior is controlled by several option flags, each of which has a short version
(e.g., -p) and a long version (e.g., --plain_syntax). With the exception of the --threshold
option, each option is a toggle switch. Specifying a default option toggles it off; specifying a
non-default option toggles it on. The options are described in the following sections.
7.1 The default options
MetaMap's default behavior is defined by the options: -t (--tag_text), -l (--stop_large_n),
-b (--best_mappings_only), -p (--plain_syntax), -c (--candidates), -s (--semantic_types),
and -m (--mappings). Each of these options is defined below.
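The toggle behavior amounts to a symmetric difference between the default option set and the
options given on the command line. The sketch below models only that rule; it is not MetaMap's
option parser, the function name is illustrative, and --threshold is left out because it takes
a value rather than acting as a toggle.

```python
# Long names of the default options listed above.
DEFAULT_OPTIONS = frozenset({
    "tag_text", "stop_large_n", "best_mappings_only", "plain_syntax",
    "candidates", "semantic_types", "mappings",
})


def effective_options(specified):
    """Toggle each specified option: defaults switch off, non-defaults switch on."""
    return DEFAULT_OPTIONS ^ set(specified)


# Specifying --candidates turns it off; specifying --variants turns it on.
print(sorted(effective_options(["candidates", "variants"])))
```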
7.2 Processing options
Processing options control MetaMap's internal behavior.
- -t (--tag_text) specifies that the SPECIALIST parser will use the results of the Xerox PARC
Part-of-Speech Tagger to assist in parsing. If a preprocessed tag file is specified on the
command line, tagging results are read from it; otherwise, tagging is done dynamically using the
server version of the tagger.
- -z (--term_processing) invokes MetaMap's browsing mode and causes it to use the
involvement metric rather than coverage and cohesiveness for evaluating Meta candidates. It is
normally used in conjunction with the --allow_overmatches and --allow_concept_gaps
options.
- -o (--allow_overmatches) causes MetaMap to retrieve Meta candidates which are
overmatches. This greatly increases the number of candidates retrieved and is consequently much
slower than MetaMap without overmatches. The --allow_overmatches option is appropriate
for browsing purposes.
- -g (--allow_concept_gaps) causes MetaMap to retrieve Meta candidates with gaps (such as
Unspecified childhood psychosis for unspecified psychosis). The --allow_concept_gaps
option is appropriate for browsing purposes.
- -a (--no_acros_abbrs) prevents the generation of acronym/abbreviation variants.
- -u (--unique_acros_abbrs_only) restricts the generation of acronym/abbreviation variants to
those acronyms and abbreviations with unique expansions.
- -l (--stop_large_n) prevents retrieval of Meta candidates based on either a two-character
word occurring in more than 1,000 Meta strings or a one-character word occurring in more
than 500 Meta strings.
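The --stop_large_n rule can be written as a small predicate. This is a sketch of the rule as
stated above, not MetaMap code; the function name and the way the count of Meta strings is
supplied are assumptions.

```python
def is_stopped(word, meta_string_count):
    """Return True if a short, very frequent word should not be used for retrieval."""
    if len(word) == 2:
        return meta_string_count > 1000
    if len(word) == 1:
        return meta_string_count > 500
    return False


print(is_stopped("na", 1500))   # two-character word in more than 1,000 strings
print(is_stopped("eye", 5000))  # longer words are never stopped by this rule
```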
7.3 Output options
- -b (--best_mappings_only) restricts the mappings displayed to the top-scoring ones.
- -r (--threshold <integer>) restricts output to candidates with an evaluation score at or above
the threshold.
- -j (--mesh_projection) is a special option for restricting MetaMap to the MeSH vocabulary.
- -q (--machine_output) causes output to take the form of Prolog clauses rather than
human-readable form. The --machine_output option affects all other output options.
- -p (--plain_syntax) and -x (--syntax) control the output form of the results of the
SPECIALIST minimal commitment parser. --plain_syntax simply outputs text; --syntax
outputs a Prolog-like structure showing details of the syntactic processing.
- -v (--variants) and -f (--full_variants) display summary (--variants) and detailed
(--full_variants) information regarding variant generation.
- -c (--candidates) causes the list of Meta candidates to be displayed.
- -n (--number_the_candidates) simply numbers the displayed candidates.
- -s (--semantic_types) causes the semantic types of Meta concepts to be displayed.
- -m (--mappings) causes mappings to be displayed.
7.4 Miscellaneous options
- -h (--help) displays MetaMap usage.
- -i (--info) causes system information to be displayed.
- -w (--warnings) enables the display of conditions which are noteworthy if not erroneous.