Scoring Engine

Description

The scoring engine compares two tagged files, or two directories of tagged files. Typically, one input is the hypothesis (an automatically tagged file) and the other is the reference (a gold-standard tagged file). But this tool can be used to compare any two inputs.

Note: In version 1.3, no document which the scoring engine examines may have overlapping content annotations. If a reference or hypothesis document has overlapping content annotations, the results of the scoring engine are undefined (and the engine may actually fail). This shortcoming will be addressed in version 2.0.

There are several spreadsheets which can be produced: tag-level scores, token-level scores, character-level scores, "pseudo-token"-level scores, and details. By default, only the tag-level scores are produced.

All score tables

The four score tables have the following columns:

• tag: The label being scored in this row. The final row is a cumulative score, with label "". If the --task option is specified (see below), the task.xml file may specify an "alias" for a tag plus some attribute subset (e.g., for named entity, an ENAMEX tag with attribute "type" = "PERSON" might have the alias "PERSON").
• test docs: The number of test (hypothesis) documents. This value is the same for all rows.
• test toks: The number of tokens in the test documents. This value is the same for all rows.
• match: The number of elements for this tag which occur with the same label and the same span extent in the hypothesis document and its corresponding reference document.
• refclash: The number of elements bearing this tag in the reference document which overlap with a tag in the corresponding hypothesis document, but do not match the tag, the span extent, or both. A count in this column is mirrored by a corresponding count from the point of view of the hypothesis document. Because only the tag-level scores permit span mismatches, this score reduces to tag clash in all other score tables.
• missing: The number of elements bearing this tag in the reference document which do not overlap with any tagged span in the corresponding hypothesis document.
• refonly: refclash + missing
• reftotal: refonly + match
• hypclash: The number of elements bearing this tag in the hypothesis document which overlap with a tag in the corresponding reference document, but do not match the tag, the span extent, or both. A count in this column is mirrored by a corresponding count from the point of view of the reference document. Because only the tag-level scores permit span mismatches, this score reduces to tag clash in all other score tables.
• spurious: The number of elements bearing this tag in the hypothesis document which do not overlap with any tagged span in the corresponding reference document.
• hyponly: hypclash + spurious
• hyptotal: hyponly + match
• precision: match / hyptotal
• recall: match / reftotal
• fmeasure: 2 * ((precision * recall) / (precision + recall))

For tag-level scores, the elements counted in the match, refclash, missing, hypclash and spurious columns are span annotations; for the other scores, the elements counted are the basic elements for the table (tokens, pseudo-tokens, or characters).
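The derived columns above can be sketched in a few lines of Python. This is an illustration of the column definitions, not the scorer's actual code; the function name and the sample counts are hypothetical.

```python
# Sketch of how the derived columns follow from the five raw counts.
# All names mirror the column definitions above; the counts are made up.
def score_row(match, refclash, missing, hypclash, spurious):
    refonly = refclash + missing
    reftotal = refonly + match
    hyponly = hypclash + spurious
    hyptotal = hyponly + match
    # Guard against empty documents (no reference or hypothesis elements).
    precision = match / hyptotal if hyptotal else 0.0
    recall = match / reftotal if reftotal else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

# Hypothetical row: reftotal = 10, hyptotal = 11.
p, r, f = score_row(match=8, refclash=1, missing=1, hypclash=1, spurious=2)
```

Note that precision is computed against the hypothesis total and recall against the reference total, so a tagger that overgenerates loses precision while one that undergenerates loses recall.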

Confidence data

The user can also request confidence data via the --compute_confidence_data option. To compute confidence data, the scorer produces 1000 alternative score sets. Each score set is created by making M random selections of file scores from the core set of M file scores. (This procedure will naturally have multiple copies of some documents and no copies of others in each score set, which is the source of the variation for this computation.) The scorer then computes the overall metrics for each alternative score set, and computes the mean and variance over the 1000 instances of each of the precision, recall, and fmeasure metrics. This "sampling with replacement" yields a more stable mean and variance. This procedure adds three columns (mean, variance and standard deviation) to the spreadsheet for each of the metrics; these columns appear immediately to the right of the column for the metric.
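The resampling procedure above can be sketched as follows. This is a minimal illustration of sampling with replacement, assuming each file's scores have been reduced to (match, reftotal, hyptotal) triples; the function names and data shapes are illustrative, not the scorer's internals.

```python
import random
import statistics

def bootstrap_metric(file_scores, metric, n_sets=1000, seed=0):
    """Draw M file-score records with replacement from the core set of
    M records, recompute the overall metric for each alternative score
    set, and report mean, variance, and standard deviation over the
    n_sets instances. `metric` aggregates a list of records into one
    score."""
    rng = random.Random(seed)
    m = len(file_scores)
    samples = []
    for _ in range(n_sets):
        resample = [rng.choice(file_scores) for _ in range(m)]
        samples.append(metric(resample))
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)
    return mean, var, var ** 0.5

def overall_recall(rows):
    # Each row is a hypothetical (match, reftotal, hyptotal) triple.
    match = sum(r[0] for r in rows)
    reftotal = sum(r[1] for r in rows)
    return match / reftotal if reftotal else 0.0
```

Because some files are duplicated and others omitted in each resample, the spread of the 1000 aggregate scores reflects how sensitive the overall metric is to the particular documents in the corpus.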

Optional tag-level score columns

The tag-level scores, unlike the other three scores, admit span variation. You may optionally obtain a detailed breakdown of the tag-level errors using the --tag_span_details option. This option inserts the following columns before the refclash and hypclash columns:

• reftagclash: The number of span annotations bearing this tag in the reference document whose span matches an annotation in the hypothesis which has a different tag. Annotation pairs which trigger this column also trigger the corresponding hyptagclash column for the corresponding hypothesis tag.
• refovermark: The number of span annotations in the reference whose span completely covers (but does not match) an annotation in the hypothesis which has the same tag. Annotation pairs which trigger this column also trigger the corresponding hypundermark column for the same tag.
• refundermark: The number of span annotations in the reference whose span is completely covered by (but does not match) an annotation in the hypothesis which has the same tag. Annotation pairs which trigger this column also trigger the corresponding hypovermark column for the same tag.
• refoverlap: The number of span annotations in the reference whose span overlaps an annotation in the hypothesis which has the same tag, but does not qualify for refovermark or refundermark. Annotation pairs which trigger this column also trigger the corresponding hypoverlap column for the same tag.
• reftagplusovermark: The number of span annotations in the reference whose span completely covers (but does not match) an annotation in the hypothesis which has a different tag. Annotation pairs which trigger this column also trigger the corresponding hyptagplusundermark column for the corresponding hypothesis tag.
• reftagplusundermark: The number of span annotations in the reference whose span is completely covered by (but does not match) an annotation in the hypothesis which has a different tag. Annotation pairs which trigger this column also trigger the corresponding hyptagplusovermark column for the corresponding hypothesis tag.
• reftagplusoverlap: The number of span annotations in the reference whose span overlaps an annotation in the hypothesis which has a different tag, but does not qualify for reftagplusovermark or reftagplusundermark. Annotation pairs which trigger this column also trigger the corresponding hyptagplusoverlap column for the corresponding hypothesis tag.

refclash is the sum of these seven columns.

• hyptagclash: The number of span annotations bearing this tag in the hypothesis document whose span matches an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagclash column for the corresponding reference tag.
• hypovermark: The number of span annotations in the hypothesis whose span completely covers (but does not match) an annotation in the reference which has the same tag. Annotation pairs which trigger this column also trigger the corresponding refundermark column for the same tag.
• hypundermark: The number of span annotations in the hypothesis whose span is completely covered by (but does not match) an annotation in the reference which has the same tag. Annotation pairs which trigger this column also trigger the corresponding refovermark column for the same tag.
• hypoverlap: The number of span annotations in the hypothesis whose span overlaps an annotation in the reference which has the same tag, but does not qualify for hypovermark or hypundermark. Annotation pairs which trigger this column also trigger the corresponding refoverlap column for the same tag.
• hyptagplusovermark: The number of span annotations in the hypothesis whose span completely covers (but does not match) an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagplusundermark column for the corresponding reference tag.
• hyptagplusundermark: The number of span annotations in the hypothesis whose span is completely covered by (but does not match) an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagplusovermark column for the corresponding reference tag.
• hyptagplusoverlap: The number of span annotations in the hypothesis whose span overlaps an annotation in the reference which has a different tag, but does not qualify for hyptagplusovermark or hyptagplusundermark. Annotation pairs which trigger this column also trigger the corresponding reftagplusoverlap column for the corresponding reference tag.

hypclash is the sum of these seven columns.

Fixed-span scores (token, character, pseudo-token)

The token, character and pseudo-token tables each use a different basic element for their element counts. Because these elements are fixed across the reference and hypothesis documents, there are no span clashes in these score tables. The "test toks" column will be labeled "test chars" and "test pseudo-toks" in the last two spreadsheets.

The fixed-span score tables have some additions to the core column set. The additional columns are:

• tag_sensitive_accuracy: (test toks - refclash - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were tagged correctly, including those which were not tagged at all)
• tag_sensitive_error_rate: 1 - tag_sensitive_accuracy
• tag_blind_accuracy: (test toks - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were properly assigned a tag - any tag)
• tag_blind_error_rate: 1 - tag_blind_accuracy

If the user requests confidence data, it will be reported for all four of these additional columns.
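The four additional columns follow directly from the formulas above. This is an illustrative sketch with hypothetical counts, not the scorer's code.

```python
# Sketch of the four additional fixed-span (token-level) columns.
# The counts below are made up for illustration.
def fixed_span_scores(test_toks, refclash, missing, spurious):
    tag_sensitive_accuracy = (
        (test_toks - refclash - missing - spurious) / test_toks
    )
    tag_blind_accuracy = (test_toks - missing - spurious) / test_toks
    return {
        "tag_sensitive_accuracy": tag_sensitive_accuracy,
        "tag_sensitive_error_rate": 1 - tag_sensitive_accuracy,
        "tag_blind_accuracy": tag_blind_accuracy,
        "tag_blind_error_rate": 1 - tag_blind_accuracy,
    }

# 100 tokens, 4 with the wrong tag, 3 missed, 2 spurious:
# tag_sensitive_accuracy = 0.91, tag_blind_accuracy = 0.95.
scores = fixed_span_scores(test_toks=100, refclash=4, missing=3, spurious=2)
```

The gap between tag_blind_accuracy and tag_sensitive_accuracy isolates the errors due purely to wrong labels on correctly located tokens.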

Pseudo-token scores

The token-level score elements are generated by whatever tokenizer was used to tokenize the scored documents. If no tokenizer was used, you won't be able to produce token-level scores. Character scores, on the other hand, are always available, because the characters themselves serve as the basic elements.

We've included character-level scoring to provide sub-tag-level granularity in situations where tokenization hasn't been performed or isn't available for some reason (although nothing stops you from using these methods alongside token-level scores). In addition, in the interest of producing something more "token-like" in the absence of actual tokenization, we've designed a notion of "pseudo-token". To compute the pseudo-tokens for a document, collect the set of start and end indices for the content annotations in both the reference and hypothesis documents, order the indices, and count the whitespace-delimited tokens in each span, including the edge spans of the document. This count will be, at minimum, the number of whitespace-delimited tokens in the document as a whole, but may be greater, if annotation boundaries don't abut whitespace.

For example, consider this deeply artificial example:

ref: the future <NP>President of the United State</NP>s
hyp: the<NP> future President of the Unit</NP>ed States

The pseudo-tokens in this document are computed as follows:

• First, find all the annotations. In the reference document, there's an NP annotation at 11 - 40; in the hypothesis, an NP annotation at 3 - 32.
• Next, order all the indices. The sequence here is [3, 11, 32, 40].
• Now, tokenize each interval. There's 1 token from 0 - 3, 1 token from 3 - 11, 4 tokens from 11 - 32 ("President of the Unit"), 2 tokens from 32 - 40 ("ed State"), and 1 token at the end ("s").
• Add them up. The total number of pseudo-tokens is 9: 1 spurious, 4 match, 2 missing, and 2 not involved in any annotation.
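The steps above can be sketched as a short Python function. This is an illustration of the definition, not the scorer's implementation; the function name is hypothetical, and the annotation spans are passed in as a combined list of (start, end) pairs from both documents.

```python
def pseudo_token_count(text, annotations):
    """Count pseudo-tokens: collect the start and end indices of all
    content annotations (reference and hypothesis combined), order
    them, and count whitespace-delimited tokens within each interval,
    including the edge spans of the document."""
    boundaries = sorted({i for start, end in annotations for i in (start, end)})
    edges = [0] + boundaries + [len(text)]
    total = 0
    for start, end in zip(edges, edges[1:]):
        # str.split() with no argument splits on runs of whitespace.
        total += len(text[start:end].split())
    return total

text = "the future President of the United States"
# ref NP at 11-40 and hyp NP at 3-32, as in the example above:
print(pseudo_token_count(text, [(11, 40), (3, 32)]))  # → 9
```

With no annotations at all, the count reduces to the ordinary whitespace token count (7 for this sentence), which is the minimum the definition guarantees.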

The granularity of pseudo-tokens is hopefully more informative than character granularity for those languages which are substantially whitespace-delimited, without having to make any complex, and perhaps irrelevant, decisions about tokenization. Using both the whitespace boundaries and the annotation boundaries as region delimiters allows us to deal with the minimum level of granularity that the pair of documents in question requires to account for all the annotation contrasts. We recognize that this is a novel approach, but we hope it will be useful.

Note: unlike token and character scores, the number of pseudo-tokens is a function of the overlaps between the reference and hypothesis. Therefore, the actual number of pseudo-tokens in the document will vary slightly depending on the performance and properties of your tagger. Do not be alarmed by this.

Details

The detail spreadsheet is intended to provide a span-by-span assessment of the scoring inputs.

• file: the name of the hypothesis file from which the entry is drawn
• type: one of missing, spurious, match, tagclash, overlap, overmark, undermark, tagplusovermark, tagplusundermark (the meaning of these values should be clear from the preceding discussion; they are from the point of view of the hypothesis document, i.e., overmark corresponds to hypovermark above, etc.)
• reflabel: the label on the span in the reference document
• refstart: the start index, in characters, of the span in the reference document
• refend: the end index, in characters, of the span in the reference document
• hyplabel: the label on the span in the hypothesis document
• hypstart: the start index, in characters, of the span in the hypothesis document
• hypend: the end index, in characters, of the span in the hypothesis document
• refcontent: the text between the start and end indices in the reference document
• hypcontent: the text between the start and end indices in the hypothesis document

Usage

Example 2

Let's say that instead of printing a table to standard output, you want to produce CSV output with embedded formulas, and you want all three spreadsheets.

Unix:

% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
    --csv_output_dir $PWD --details --by_token

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref --csv_output_dir %CD% --details --by_token

This invocation will not produce any table on standard output, but will leave three files in the current directory: bytag.csv, bytoken.csv, and details.csv.

Example 3

Let's say you have two directories full of files. /path/to/hyp contains files of the form file<n>.txt.json, and /path/to/ref contains files of the form file<n>.json. You want to compare the corresponding files to each other, and you want tag and token scoring, but not details, and you intend to view the spreadsheet in OpenOffice.

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
    --ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
    --csv_output_dir $PWD --oo_separator --by_token

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref --ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" --csv_output_dir %CD% --oo_separator --by_token

For each file in /path/to/hyp, this invocation will prepare a candidate filename to look for in /path/to/ref by removing the .txt.json suffix and adding the .json suffix. The current directory will contain bytag.csv and bytoken.csv.

Example 4

Let's say that you're in the same situation as example 3, but you want confidence information included in the output spreadsheets:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
    --ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
    --csv_output_dir $PWD --oo_separator --by_token --compute_confidence_data

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref --ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" --csv_output_dir %CD% --oo_separator --by_token --compute_confidence_data

Example 5

Let's say that you're in the same situation as example 3, but your documents contain lots of tags, and you're only interested in scoring the tags listed in the "Named Entity" task. Furthermore, you're going to import the data into a tool other than Excel, so you want the values calculated for you rather than having embedded equations:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
    --ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
    --csv_output_dir $PWD --no_csv_formulas --by_token --task "Named Entity"

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref --ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" --csv_output_dir %CD% --no_csv_formulas --by_token --task "Named Entity"

Example 6

Let's say you're in the same situation as example 3, but your reference documents are XML inline documents, and are of the form file<n>.xml. Do this:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
    --ref_fsuff_off '.txt.json' --ref_fsuff_on '.xml' \
    --csv_output_dir $PWD --oo_separator --by_token --ref_file_type xml-inline

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref --ref_fsuff_off ".txt.json" --ref_fsuff_on ".xml" --csv_output_dir %CD% --oo_separator --by_token --ref_file_type xml-inline

Note that --ref_fsuff_on has changed, in addition to adding the --ref_file_type option.