This document describes the output of the MATScore tool. There are several spreadsheets which can be produced: tag-level scores, token-level scores, character-level scores, "pseudo-token"-level scores, and details. By default, only the tag-level scores are produced.
The scorer uses a sophisticated pairing
algorithm to determine which annotation pairs should
generate the scores.
Throughout the scorer, we use the notion of effective label, which is
described elsewhere in this documentation. Whenever
possible, the scorer will use effective labels to display its
scores.
The four score tables have the following columns:
similarity profile | The similarity profile used to generate the similarity scores for the annotations.
score profile | The score profile used to group the output scores.
file | The file basename of the document being scored.
test docs | The number of test (hypothesis) documents. This value will be the same for all rows.
tag | The true or effective label which is being scored in this row. The final row will be a cumulative score, with label "<all>".
tag subset | Optional. This column lists the particular subset of the tag instances to be scored, if such a decomposition is described in the score profile.
test toks | The number of tokens in the test documents. This value will be the same for all rows.
match | The number of elements for this true or effective label whose pairs have a perfect similarity score.
refclash | The number of elements which bear this true or effective label in the reference document and are paired with annotations in the corresponding hypothesis document, but do not have a perfect similarity score. The scorer does not yet provide the option of reporting this value as the sum of the similarity scores rather than as a count of elements.
missing | The number of elements which bear this true or effective label in the reference document but are not paired with any element in the corresponding hypothesis document.
refonly | refclash + missing
reftotal | refonly + match
hypclash | The number of elements which bear this true or effective label in the hypothesis document and are paired with annotations in the corresponding reference document, but do not have a perfect similarity score. The scorer does not yet provide the option of reporting this value as the sum of the similarity scores rather than as a count of elements.
spurious | The number of elements which bear this true or effective label in the hypothesis document but are not paired with any element in the corresponding reference document.
hyponly | hypclash + spurious
hyptotal | hyponly + match
precision | match / hyptotal
recall | match / reftotal
fmeasure | 2 * ((precision * recall) / (precision + recall))
For tag-level scores, the elements counted in the match,
refclash, missing, hypclash and spurious columns are annotations;
for the other scores, the elements counted are the basic elements
for the table (tokens, pseudo-tokens, or characters).
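To make the relationships among these columns concrete, here is a minimal sketch, in Python, of how the derived columns follow from the five basic counts. The function and its zero-denominator handling are illustrative assumptions, not MATScore's implementation:

    def derived_columns(match, refclash, missing, hypclash, spurious):
        # Illustrative only: reproduces the column definitions above.
        refonly = refclash + missing
        reftotal = refonly + match
        hyponly = hypclash + spurious
        hyptotal = hyponly + match
        # Guard against empty totals; reporting 0.0 in that case is an
        # assumption of this sketch, not documented MATScore behavior.
        precision = match / hyptotal if hyptotal else 0.0
        recall = match / reftotal if reftotal else 0.0
        fmeasure = (2 * (precision * recall) / (precision + recall)
                    if (precision + recall) else 0.0)
        return {"refonly": refonly, "reftotal": reftotal,
                "hyponly": hyponly, "hyptotal": hyptotal,
                "precision": precision, "recall": recall,
                "fmeasure": fmeasure}

    # For example, with 8 matches, 1 refclash, 1 missing, 1 hypclash and
    # 2 spurious, reftotal is 10, hyptotal is 11, precision is 8/11,
    # recall is 8/10, and fmeasure is roughly 0.762.
    print(derived_columns(8, 1, 1, 1, 2))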
When the user requests confidence data in MATScore via the --compute_confidence_data option, the scorer adds three columns (mean, variance and standard deviation) to the spreadsheet for each of the computed metrics (precision, recall, f-measure). These columns appear immediately to the right of the column for the metric.
The token, character and pseudo-token tables each use a different basic element for their element counts. Because these elements are fixed across the reference and hypothesis documents, there are no span clashes in these score tables. The "test toks" column will be labeled "test chars" and "test pseudo-toks" in the last two spreadsheets.
The fixed-span score tables have some additions to the core
column set. The additional columns are:
tag_sensitive_accuracy | (test toks - refclash - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were tagged correctly, including those which were not tagged at all)
tag_sensitive_error_rate | 1 - tag_sensitive_accuracy
tag_blind_accuracy | (test toks - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were properly assigned a tag - any tag)
tag_blind_error_rate | 1 - tag_blind_accuracy
If the user requests confidence data, it will be reported for all
four of these additional columns.
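As a rough illustration of these four columns (the function and the example counts below are invented for this sketch, not MATScore output), the same counts plus the token total yield:

    def fixed_span_columns(test_toks, refclash, missing, spurious):
        # Illustrative sketch of the additional fixed-span columns.
        tag_sensitive_accuracy = (test_toks - refclash - missing - spurious) / test_toks
        tag_blind_accuracy = (test_toks - missing - spurious) / test_toks
        return {"tag_sensitive_accuracy": tag_sensitive_accuracy,
                "tag_sensitive_error_rate": 1 - tag_sensitive_accuracy,
                "tag_blind_accuracy": tag_blind_accuracy,
                "tag_blind_error_rate": 1 - tag_blind_accuracy}

    # E.g. 1000 test tokens with 12 refclash, 5 missing and 8 spurious
    # tokens gives tag_sensitive_accuracy 0.975 and tag_blind_accuracy 0.987.
    print(fixed_span_columns(1000, 12, 5, 8))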
The token-level score elements are generated by whatever
tokenizer was used to tokenize the scored documents. If no
tokenizer was used, you won't be able to produce token-level
scores. Character scores, on the other hand, are always available,
because the characters themselves serve as the basic elements.
We've included character-level scoring to provide sub-tag-level
granularity in situations where tokenization hasn't been performed
or isn't available for some reason (although nothing stops you
from using these methods alongside token-level scores). In
addition, in the interest of producing something more "token-like"
in the absence of actual tokenization, we've designed a notion of
"pseudo-token". To compute the pseudo-tokens for a document, we
collect the set of start and end indices for the content
annotations in both the reference and hypothesis documents, order
the indices, and count the whitespace-delimited tokens in each
span, including the edge spans of the document. This count will
be, at minimum, the number of whitespace-delimited tokens in the
document as a whole, but may be greater, if annotation boundaries
don't abut whitespace.
For example, consider this deeply artificial example:
ref: the future <NP>President of the United State</NP>s
hyp: the<NP> future President of the Unit</NP>ed States
In this example, the annotation boundaries split both "United" (by the hypothesis) and "States" (by the reference), so the example contains nine pseudo-tokens, rather than the seven whitespace-delimited tokens of the raw text.
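The following sketch reproduces that computation for the example above; the function and the span representation are illustrative, not part of the MATScore API:

    def pseudo_token_count(text, ref_spans, hyp_spans):
        # Collect the start and end indices of the content annotations in
        # both the reference and hypothesis documents, add the document
        # edges, and order them.
        boundaries = sorted({0, len(text)} |
                            {i for span in ref_spans + hyp_spans for i in span})
        # Count the whitespace-delimited tokens in each resulting span,
        # including the edge spans of the document.
        return sum(len(text[left:right].split())
                   for left, right in zip(boundaries, boundaries[1:]))

    text = "the future President of the United States"
    ref_spans = [(11, 40)]  # <NP>President of the United State</NP>
    hyp_spans = [(3, 32)]   # <NP> future President of the Unit</NP>
    print(pseudo_token_count(text, ref_spans, hyp_spans))
    # Prints 9: the annotation boundaries split "United" and "States", so
    # the 7 whitespace-delimited tokens become 9 pseudo-tokens.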
The granularity of pseudo-tokens is hopefully more informative
than character granularity for those languages which are
substantially whitespace-delimited, without having to make any
complex, and perhaps irrelevant, decisions about tokenization.
Using both the whitespace boundaries and the annotation boundaries
as region delimiters allows us to deal with the minimum level of
granularity that the pair of documents in question requires to
account for all the annotation contrasts. We recognize that this
is a novel approach, but we hope it will be useful.
Note: unlike token and
character scores, the number of pseudo-tokens is a function of the
overlaps between the reference and hypothesis. Therefore, the
actual number of pseudo-tokens in the document will vary slightly
depending on the performance and properties of your tagger. Do not be alarmed by this.
The detail spreadsheet is intended to provide a span-by-span
assessment of the scoring inputs. It has the following columns:
file | the name of the hypothesis from which the entry is drawn
type | one of missing, spurious, match (the meaning of these values should be clear from the preceding discussion), or one of the error causes described above
refid | the ID of the annotation in the reference document, if the ID exists (used for cross-referencing with annotation attribute values)
hypid | the ID of the annotation in the hypothesis document, if the ID exists (used for cross-referencing with annotation attribute values)
refdescription | the description of the annotation in the reference document
hypdescription | the description of the annotation in the hypothesis document
reflabel | the label on the annotation in the reference document
refstart | the start index, in characters, of the annotation in the reference document, if spanned
refend | the end index, in characters, of the annotation in the reference document, if spanned
hyplabel | the label on the annotation in the hypothesis document
hypstart | the start index, in characters, of the annotation in the hypothesis document, if spanned
hypend | the end index, in characters, of the annotation in the hypothesis document, if spanned
refcontent | the text between the start and end indices of the annotation in the reference document, if spanned
hypcontent | the text between the start and end indices of the annotation in the hypothesis document, if spanned
In addition to these columns, if the task contains an
explicitly-defined similarity profile (i.e., not the default)
which specifies dimensions other than the label and span, the
mismatch type associated with each dimension will be listed, one
per column, immediately after the "hypend" column.