MAT uses a single, shared algorithm for comparing two documents to each
other, and for producing standard scoring statistics based on that
comparison. This algorithm underlies both the comparison view and MATScore, and will soon support a
general approach to reconciling conflicting annotations.
MAT's approach to scoring and comparison is extremely flexible
and powerful. It does not cover all possible cases of annotations
you might wish to compare, but it covers a large number of them.
MAT's scoring and comparison algorithm is meant to do something
reasonably intelligent in the most common cases.
So if your task looks like this:
<annotation_set_descriptors>
  <annotation_set_descriptor>
    <annotation label="PERSON"/>
    <annotation label="LOCATION"/>
    <annotation label="ORGANIZATION"/>
  </annotation_set_descriptor>
</annotation_set_descriptors>
or this:
<annotation_set_descriptors>
  <annotation_set_descriptor>
    <annotation label="PERSON"/>
    <annotation label="LOCATION"/>
    <annotation label="ORGANIZATION"/>
    <annotation label="RELATION" span="no"/>
    <attribute of_annotation="RELATION" type="annotation" name="arg">
      <label_restriction label="LOCATION"/>
    </attribute>
  </annotation_set_descriptor>
</annotation_set_descriptors>
you don't really have to know anything more about this algorithm
in order to compare and score, but if your task looks like this
(note that "RELATION" is spanned rather than spanless):
<annotation_set_descriptors>
  <annotation_set_descriptor>
    <annotation label="PERSON"/>
    <annotation label="LOCATION"/>
    <annotation label="ORGANIZATION"/>
    <annotation label="RELATION"/>
    <attribute of_annotation="RELATION" type="annotation" name="arg">
      <label_restriction label="LOCATION"/>
    </attribute>
  </annotation_set_descriptor>
</annotation_set_descriptors>
you probably do. And, of course, if you're not interested in the
defaults that MAT provides in this case, or if you want to have
multiple similarity or score configurations, you'll have to learn
about the algorithm in order to customize it.
The MAT comparison algorithm uses a set of configurable similarity
profiles to create similarity metrics for the Kuhn-Munkres
bipartite set algorithm, and executes that algorithm over relevant
subsets of the annotations, in a stratified manner that ensures
that annotations are compared in an appropriate order. Here's how
it works.
The MAT comparison algorithm begins with a set of similarity
profiles. These are declarative descriptions, for particular
labels, of how those labels should be compared: which dimensions of
the annotation to use, how each dimension should be compared, and
how the similarity of each dimension contributes to the similarity
of the whole.
The <similarity_profile> element is an immediate child of
the toplevel <task>
element. Here's an example:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
So consider a PERSON annotation over characters 10 - 20 with
nomtype = PRO in document 1, vs. an ORGANIZATION annotation over
characters 11 - 20 with nomtype = NAM in document 2. The labels
don't match (so the _label dimension contribution will be 0 of a
possible 2); the nomtype values don't match (so the nomtype
dimension contribution will be 0 of a possible 1); and the spans
overlap in 9 out of a possible 10 characters (so the _span
contribution will be 8 out of a possible 8, since the overlap is
.9 and .8 is the lower bound for what counts as a full match). So
the similarity of these two annotations, given this profile, would
be ((0 * 2) + (1 * 8) + (0 * 1)) / (2 + 8 + 1) = 8/11, or about .73.
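To make the arithmetic concrete, here's a minimal sketch in Python
of the computation just described. It's purely illustrative: the
dictionary representation and the function names are hypothetical,
not MAT's actual API.

def overlap(span1, span2, match_lower_bound=None):
    # Fraction of the combined span in which the two spans overlap.
    (s1, e1), (s2, e2) = span1, span2
    inter = max(0, min(e1, e2) - max(s1, s2))
    combined = max(e1, e2) - min(s1, s2)
    frac = inter / combined
    # Overlap at or above the lower bound counts as a full match.
    if match_lower_bound is not None and frac >= match_lower_bound:
        return 1.0
    return frac

def similarity(ann1, ann2):
    # Each dimension method returns a value in [0, 1]; the weighted
    # values are summed and normalized by the total weight.
    dims = [
        (2, 1.0 if ann1["label"] == ann2["label"] else 0.0),              # _label
        (8, overlap(ann1["span"], ann2["span"], match_lower_bound=0.8)),  # _span
        (1, 1.0 if ann1["nomtype"] == ann2["nomtype"] else 0.0),          # nomtype
    ]
    return sum(w * v for w, v in dims) / sum(w for w, _ in dims)

a1 = {"label": "PERSON", "span": (10, 20), "nomtype": "PRO"}
a2 = {"label": "ORGANIZATION", "span": (11, 20), "nomtype": "NAM"}
print(similarity(a1, a2))  # 8/11, about .73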
As you can see, you can specify special dimensions, like _span or _label, or attributes of the annotations listed in true_labels. We describe the defaults, and the currently available methods and dimensions, below.
Once we establish the similarities, we invoke the Kuhn-Munkres
bipartite set algorithm (otherwise known as the Hungarian
algorithm). Kuhn-Munkres finds the best match between two sets,
given similarity measures among the set elements; the similarity
profiles provide those measures. Once we run Kuhn-Munkres, we have
a pairing between the annotations in the two comparison sets,
along with a similarity measure for each best pair.
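Here's a minimal sketch of the pairing step, using
scipy.optimize.linear_sum_assignment as a stand-in for MAT's
internal Kuhn-Munkres implementation (the similarity argument would
be a function like the one sketched above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_pairing(anns1, anns2, similarity):
    if not anns1 or not anns2:
        return []  # nothing to pair
    # Build the similarity matrix between the two annotation sets.
    sim = np.array([[similarity(a, b) for b in anns2] for a in anns1])
    # With maximize=True, linear_sum_assignment solves the same
    # assignment problem as Kuhn-Munkres: the pairing with the
    # greatest total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    # Pairs with zero similarity are no better than leaving the
    # annotations unpaired (i.e., spurious or missing).
    return [(anns1[i], anns2[j], sim[i, j])
            for i, j in zip(rows, cols) if sim[i, j] > 0]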
When we run Kuhn-Munkres, we don't need to compare all the
annotations in one document with all the annotations in the other.
For instance, for spanned annotations, we stipulate that if two
annotations don't overlap at all, they can't form a pair. So we
can subdivide the spanned annotations into candidate overlap sets,
and run Kuhn-Munkres over many small sets instead of one big one.
In addition, if a region of text has annotations in one document
but none in the other, none of those annotations can be paired,
and we don't need to run Kuhn-Munkres at all for that subset of
annotations.
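A sketch of that subdivision: pool the spanned annotations from both
documents, sort them by start index, and close off a group whenever
the next annotation starts after every span seen so far has ended.
(The representation is the same hypothetical one used above.)

def overlap_subsets(annotations):
    # annotations: spanned annotations from *both* documents.
    # Returns the candidate overlap sets.
    groups, current, current_end = [], [], None
    for ann in sorted(annotations, key=lambda a: a["span"][0]):
        start, end = ann["span"]
        if current and start >= current_end:
            # No overlap with the current group: close it off.
            groups.append(current)
            current, current_end = [], None
        current.append(ann)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        groups.append(current)
    return groups

Kuhn-Munkres is then run once per group, over the annotations each
group contains from the two documents; groups whose annotations all
come from a single document can be skipped.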
If we're trying to compare spanless annotations, we have two
options. We first attempt to compute an "implied span", which
reaches from the start index of the earliest annotation that this
annotation "points to" (i.e., has as the value of an
annotation-valued attribute) to the end index of the latest
annotation that this annotation "points to". This computation is
performed recursively. We subdivide all spanless annotations for
which we can compute an implied span in the same way we subdivide
spanned annotations; all other annotations we subdivide by
label.
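A sketch of that computation, assuming a hypothetical representation
in which a spanless annotation keeps its annotation-valued
attributes in an annotation_attrs mapping (single-valued, for
brevity):

def implied_span(ann):
    # A spanned annotation's implied span is its own span.
    if ann.get("span") is not None:
        return ann["span"]
    # Otherwise, recurse through the annotations this one points to.
    spans = [implied_span(v)
             for v in ann.get("annotation_attrs", {}).values()]
    spans = [s for s in spans if s is not None]
    if not spans:
        return None  # no implied span: subdivide by label instead
    return (min(s for s, _ in spans), max(e for _, e in spans))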
The presence of annotations which have annotation-valued
attributes (e.g., relations and events) poses a problem for the
general case. Such annotations imply that what we should be doing
is finding a best graph alignment, rather than a best set
alignment. We've chosen not to deal with this most complex case.
Instead, we require that the similarity algorithm be
stratified: all annotations which are "pointed to" must be
paired before the annotations that "point to" them are paired. The
output of the earlier stratum is used as input to the later
stratum; by default, annotation-valued attributes cannot match if
their values were not paired on a previous stratum.
This stratification is enforced on a label-by-label basis, not an
annotation-by-annotation basis. The analysis is static: if an
annotation label is listed in the label restriction of an
annotation-valued attribute in your task, it has to be paired
before the annotation that bears the annotation-valued attribute.
E.g. in:
<annotation label="PERSON">
<annotation label="EVENT">
<attribute of_annotation="EVENT" type="annotation" name="arg1">
<label_restriction label="PERSON"/>
</attribute>
the PERSON annotations must be guaranteed to be paired before the
EVENT annotations.
This guarantee can be imposed in one of two ways. Each stratum is
internally processed so that spanned annotations are paired first,
and by default, all annotations are in a single stratum. So if
your task contains some spanned and some spanless annotations, and
you say nothing about the stratification, the spanned annotations
will be paired before the spanless annotations. If that default
isn't enough (e.g., in the example above), you can define strata
explicitly:
<similarity_profile>
  <stratum true_labels="PERSON"/>
  <stratum true_labels="EVENT"/>
  ...
</similarity_profile>
The order of the <stratum> elements determines the stratum
order.
The stratification requirement allows us to continue to use a set
alignment algorithm, rather than a graph alignment algorithm, but
also leads to a couple of restrictions. First, the overall
annotation pairing isn't a true global best pair, because we can't
use the annotations that "point to" a given annotation as a
candidate comparison dimension. Second, annotation descriptions
which have cycles can't be compared:
<annotation label="PERSON">
<annotation label="EVENT">
<attribute of_annotation="EVENT" type="annotation" name="arg1">
<label_restriction label="PERSON"/>
<label_restriction label="EVENT"/>
</attribute>
There's nothing incoherent about such an annotation schema; it
just can't be compared or scored using the algorithm in MAT.
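The static analysis behind stratification amounts to a topological
sort of a label dependency graph, and a cyclic schema like the one
above is precisely the case where that sort fails. Here's a minimal
sketch using Python's standard graphlib module (Python 3.9+); the
dependency-graph representation is hypothetical:

from graphlib import CycleError, TopologicalSorter

def stratum_order(points_to):
    # points_to maps each label to the labels listed in the label
    # restrictions of its annotation-valued attributes. A label's
    # targets must be paired first, so they precede it in the order.
    return list(TopologicalSorter(points_to).static_order())

print(stratum_order({"EVENT": {"PERSON"}, "PERSON": set()}))
# ['PERSON', 'EVENT']

try:
    stratum_order({"EVENT": {"PERSON", "EVENT"}, "PERSON": set()})
except CycleError:
    print("cyclic schema: can't be compared or scored")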
Each similarity profile consists of zero or more <stratum>
declarations, and zero or more <tag_profile> declarations.
The labels referenced in the similarity profiles are true, not
effective, labels.
The <stratum> declarations are present to enforce the
stratification requirements of the comparison algorithm. If your
profile contains no <stratum> declarations, the comparison
algorithm will assume that all the labels are in the same stratum,
and will try to process the spanned labels first, and then the
spanless labels. If this processing order ends up violating the stratification requirements,
the comparison algorithm will raise an error. If you declare at
least one <stratum>, and your strata do not mention all your
content labels, the comparison algorithm may also raise an error.
So if you declare strata, be sure to mention all your
content labels. You define a stratum like this:
<similarity_profile>
  <stratum true_labels="label1,label2,..."/>
  ...
</similarity_profile>
Each <tag_profile> element defines how a set of true
labels is compared. All the labels listed must have
the dimensions specified; if a dimension is an annotation
attribute, the attribute must have the same type and aggregation
in each annotation. Here's our example again:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
Here, both PERSON and ORGANIZATION must be spanned annotations
(since the _span dimension is present) and both must have a
"nomtype" attribute, and the type and aggregation of that
attribute must be the same in each (e.g., it can't be a string for
PERSON and a set of integers for ORGANIZATION).
If the annotations have attributes you want to compare, but the
attributes have different names, you can use the
<attr_equivalences> element to address that:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <attr_equivalences name="nomtype" equivalences="p_nomtype,o_nomtype"/>
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
So in this example, if PERSON has the p_nomtype attribute, and
ORGANIZATION has the o_nomtype attribute, you can treat them as
the same attribute for the purposes of comparison. Note that if
you use an attribute equivalence, the dimensions must refer to the
name of the equivalence.
If the algorithm has to compare two annotations which are not in
the same <tag_profile>, it will use only the non-attribute
dimensions for comparison, and use the weight of the attribute
dimensions as "dead weight" (i.e., as if the comparison score for
those dimensions is 0). It will compare annotation A to annotation
B using annotation A's profile, then compare them using
annotation B's profile, and take the smaller of the two. In other
words, the comparison scores of annotations which are not in the
same <tag_profile> are strongly penalized.
Each dimension must have a name and a weight. You can also
specify a method; every dimension has a default method, and a
default means of applying the specified method to aggregate types
(lists and sets). At the moment, only set aggregates have an
implemented default comparison method (we haven't had much use for
the list types yet). If you don't specify a method, the default
method will be used. Some methods have optional parameters which
can also be specified in the <dimension> element, as shown in
the example above for _span. Each method returns a value between 0
and 1, and this value is multiplied by the weight to determine the
total contribution to the similarity that this dimension makes.
It's possible to define your own dimension methods, but
describing this process is beyond the scope of this documentation.
The available dimensions are:

_label

This dimension compares the two annotation labels. If the annotation
has an effective label, that label will be used; if the annotation
label is mentioned in an equivalence class when the scorer is
invoked (see the --equivalence_class option of MATScore), the equivalence class name
will be used. The default _label method is "label_equality". This
method accepts an optional "true_residue" option, which is a float
between 0 and 1; if there are effective label values that don't
match, but the underlying true labels do, this value will be
returned as the similarity. Examples:
<dimension name="_label" weight="2"/>
<dimension name="_label" method="label_equality" weight="2"/>
<dimension name="_label" weight="2" true_residue=".5"/>
_span

This dimension compares the two spans. It can only be used with
spanned annotations; it cannot be used to compare the "implied
span" computed when subdividing the annotations for the
Kuhn-Munkres algorithm. The default
_span method is "overlap". This method returns the fraction of the
combined span of the two annotations in which they overlap; so if
one annotation covers characters 10 - 15, and another annotation
covers characters 12 - 20, the combined span is 10 - 20, and the
overlap is 12 - 15, so the method returns .3. This method accepts
two optional parameters: "overlap_match_lower_bound", which specifies
an overlap value above which 1.0 will be returned, and
"overlap_mismatch_upper_bound", which specifies an overlap value
below which 0 will be returned. Examples:
<dimension name="_span" method="overlap" weight="4"/> <dimension name="_span" weight="8" overlap_match_lower_bound=".8"/>
Attribute dimensions

Every annotation attribute is a dimension. If a dimension name
isn't one of the special cases listed here, it's interpreted as an
attribute (or as an attribute equivalence, if such an equivalence
is declared). If one of your attributes has the same name as one
of the special cases documented here, you won't be able to refer
to it in your similarity description.
For attributes of type other than "annotation", the default
comparison method is called "equality", and simply returns 1 if
the two values are equal, 0 otherwise. If the attribute is a set
aggregation, the value is the ratio of the size of the
intersection of the set values over the size of the union of the
set values.
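That ratio is the familiar Jaccard index; a two-line sketch (the
behavior for two empty sets is an assumption here):

def set_equality_similarity(values1, values2):
    # |intersection| / |union| of the two set values.
    if not values1 and not values2:
        return 1.0  # assumption: two empty sets count as equal
    return len(values1 & values2) / len(values1 | values2)

print(set_equality_similarity({"a", "b"}, {"b", "c"}))  # 1/3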
For attributes of type "annotation", the default comparison
method is called "similarity", and it returns 0 if the annotation
values were not paired in a previous stratum, and the computed
similarity value otherwise. If the attribute is a set aggregation,
the Kuhn-Munkres algorithm is invoked to compare the sets, using
the "similarity" comparison method to provide the similarity
weights, and the overall similarity is the sum of the similarities
found by the Kuhn-Munkres algorithm, divided by the maximum
possible similarity.
Example:
<dimension name="nomtype" weight="1"/>
Comma-separated attribute sets

If a dimension name contains a comma, it's treated as a set of
attributes to be considered together. Such a dimension must
have a method specified (and there will be no aggregate handler
for that method). At the moment, the only known method for this
case is "_annotation_set_similarity". This method is intended to
be used in the case where there are two or more annotation-valued
attributes which are interchangeable, as in the case of a
symmetric relation like "related-to". The implementation of this
method is essentially identical to the method for comparing
annotation-valued attributes which are set aggregations. Example:
<dimension name="arg1,arg2" weight="2" method="_annotation_set_similarity"/>
_nonannotation_attribute_remainder

This special method is used primarily in the default profiles
specified below. It looks at all the attributes that are not
annotation-valued and are not otherwise declared as dimensions in
the profile, and compares them. If there are no such attributes, this
dimension is ignored; i.e., its weight will be treated as
zero. Otherwise, the similarity is the sum of the similarities of
these otherwise-undeclared attributes divided by the number of
attributes.
_annotation_attribute_remainder

This special method is used primarily in the default profiles
specified below. It looks at all the annotation-valued attributes
that are not otherwise declared as dimensions in the profile,
and collapses all the values for these attributes into a single
set for each annotation (including the members of any set
aggregations), and compares those sets using Kuhn-Munkres. The
idea here is that when comparing relations in a default manner,
the actual names of the arguments aren't nearly as important as
they are in the nonannotation case, and should be ignored.
As in the _nonannotation_attribute_remainder case above, if there
are no attributes which meet the requirements, this dimension's
weight will be treated as zero.
If an annotation label is not mentioned in a <tag_profile>
element, and it's a spanned label, it will be assigned the profile
<dimension name="_label" weight=".1" true_residue=".5"/>
<dimension name="_span" weight=".9"/>
<dimension name="_nonannotation_attribute_remainder" weight=".1"/>
<dimension name="_annotation_attribute_remainder" weight=".1"/>
If the label is not spanned, it will be assigned the profile
<dimension name="_label" weight=".2" true_residue=".5"/>
<dimension name="_nonannotation_attribute_remainder" weight=".2"/>
<dimension name="_annotation_attribute_remainder" weight=".6"/>
on the assumption that spanless annotations are typically used as
relations.
You can define as many similarity profiles as you like, naming
them via the name attribute:
<similarity_profile name="...">
  ...
</similarity_profile>
If you leave one of the profiles unnamed, it will be the default
profile for the task.
The similarities among the annotations are used to feed the MAT
scorer. At the moment, annotation pairs which have similarity
scores of 1.0 are treated as matching; pairs with scores less than
1.0 are treated as clashing; and annotations which are not paired
are treated as spurious or missing, as appropriate. There is
currently no facility in the MAT scorer for treating a similarity
score less than 1.0 as a partial match.
By default, the scorer reports numbers for each true or effective
label, and for all labels aggregated together, for each document
and for all documents aggregated together. You can modify this
behavior using the <score_profile> element, and its child
elements.
The <score_profile> element is an immediate child of the
toplevel <task> element.
Here's an example:
<score_profile>
  <aggregation name="loc elements" true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY"/>
  <attr_decomposition attrs="nomtype" true_labels="PERSON"/>
  <label_limitation true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY,PERSON"/>
</score_profile>
The <aggregation> element can be used to specify a
label under which the scores for a list of true labels will be
aggregated. This aggregation is in addition to the defaults, and
sits "in between" the individual labels and the aggregation of all
labels.
In addition to aggregations, you can specify decompositions of
various labels. You can specify a decomposition by attribute (as
shown here), or by use of a method (using the
<partition_decomposition> element, not exemplified here).
The scorer will report a separate score for each set of attribute
values for the attributes specified, for the specified labels. So
in the example we've been using on this page, if the nomtype
attribute has three values NAM, NOM and PRO, the scorer will give
you a separate subscore for PERSON annotations where nomtype=NAM,
etc.
The scorer gives you the option of
ignoring labels, using the --ignore parameter. But this parameter
doesn't simply cause the score output to ignore the annotations;
it removes them from consideration entirely, as if they aren't
present at all. This can lead to problems if, e.g., you're scoring
relations, and you only want to see the relation scores, but you
can't ignore the entity annotations, since they're a crucial
ingredient in the relation comparison. You can use the
<label_limitation> element to list just those labels you
want to see in the score output.
As with <similarity_profile>, you can define as many score
profiles as you like, naming them via the name attribute:
<score_profile name="...">
  ...
</score_profile>
If you leave one of the profiles unnamed, it will be the default
profile for the task.
The inspiration for configurable comparison is due to the scorer
used for the ACE 2005 evaluation. The inspiration for using
Kuhn-Munkres as a basis for a general-purpose comparison algorithm
is due to
Xiaoqiang Luo, "On Coreference Resolution Performance Metrics", Proceedings
of Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing (HLT/EMNLP),
pages 25–32, Vancouver, October 2005.