The scoring engine compares two tagged files, or two directories
of tagged files. Typically, one input is the hypothesis (an
automatically tagged file) and the other is the reference (a
gold-standard tagged file). But this tool can be used to compare
any two inputs.
Note: In version 1.3, no
document which the scoring engine examines may have overlapping
content annotations. If a reference or hypothesis document has
overlapping content annotations, the results of the scoring engine
are undefined (and the engine may actually fail). This shortcoming
will be addressed in version 2.0.
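If you want to pre-screen your data for this condition, a check along the following lines is sufficient. This is a minimal sketch, assuming you have already extracted the content annotation spans as (start, end) character offsets; it is not part of MATScore.

def has_overlapping_spans(spans):
    # spans: an iterable of (start, end) pairs in character offsets.
    ordered = sorted(spans)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:   # the next span starts before the previous one ends
            return True
    return False

print(has_overlapping_spans([(0, 5), (5, 10)]))   # False: the spans merely abut
print(has_overlapping_spans([(0, 6), (5, 10)]))   # True: these spans overlap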
There are several spreadsheets which can be produced: tag-level
scores, token-level scores, character-level scores,
"pseudo-token"-level scores, and details. By default, only the
tag-level scores are produced.
The four score tables have the following columns:
tag |
The label which is being
scored in this row. The final row will be a cumulative
score, with label "<all>". If the --task option is
specified (see below), the task.xml file may specify an
"alias" for a tag plus some attribute subset (e.g., for
named entity, an ENAMEX tag with attribute "type" =
"PERSON", with an alias of "PERSON"). |
test docs |
The number of test
(hypothesis) documents. This value will be the same for all
rows. |
test toks |
The number of tokens in the
test documents. This value will be the same for all rows. |
match |
The number of elements for
this tag which occur with the same label and same span
extent in the hypothesis document and its corresponding
reference document. |
refclash |
The number of elements which bear this tag in the reference document and overlap with a tag in the corresponding hypothesis document, but do not match the tag, or the span extent, or both. Note that a count in one of these columns will be mirrored by a corresponding count from the point of view of the hypothesis document. Because only the tag-level scores permit span mismatches, this score reduces to tag clash in all other score tables. |
missing |
The number of elements which
bear this tag in the reference document but do not overlap
with any tagged span in the corresponding hypothesis
document. |
refonly |
refclash + missing |
reftotal |
refonly + match |
hypclash |
The number of elements which bear this tag in the hypothesis document and overlap with a tag in the corresponding reference document, but do not match the tag, or the span extent, or both. Note that a count in one of these columns will be mirrored by a corresponding count from the point of view of the reference document. Because only the tag-level scores permit span mismatches, this score reduces to tag clash in all other score tables. |
spurious |
The number of elements which bear this tag in the hypothesis document but do not overlap with any tagged span in the corresponding reference document. |
hyponly |
hypclash + spurious |
hyptotal |
hyponly + match |
precision |
match / hyptotal |
recall |
match / reftotal |
fmeasure |
2 * ((precision * recall) /
(precision + recall)) |
For tag-level scores, the elements counted in the match,
refclash, missing, hypclash and spurious columns are span
annotations; for the other scores, the elements counted are the
basic elements for the table (tokens, pseudo-tokens, or
characters).
The user can also request confidence data via the --compute_confidence_data option. To compute confidence data, the scorer produces 1000 alternative score sets. Each score set is created by making M random selections of file scores from the core set of M file scores. (This procedure will naturally have multiple copies of some documents and no copies of others in each score set, which is the source of the variation for this computation.) The scorer then computes the overall metrics for each alternative score set, and computes the mean and variance over the 1000 instances of each of the precision, recall, and fmeasure metrics. This "sampling with replacement" yields a more stable mean and variance. This procedure adds three columns (mean, variance and standard deviation) to the spreadsheet for each of the metrics; these columns appear immediately to the right of the column for the metric.
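The resampling procedure is an ordinary bootstrap over per-file scores. Here is an illustrative sketch (hypothetical Python; each "file score" is simplified to a (match, reftotal, hyptotal) triple, which is not the scorer's actual representation):

import random

def bootstrap_fmeasure(file_scores, iterations=1000):
    # file_scores: one (match, reftotal, hyptotal) triple per scored file.
    samples = []
    m = len(file_scores)
    for _ in range(iterations):
        # Draw M file scores, with replacement, from the M originals.
        chosen = [random.choice(file_scores) for _ in range(m)]
        match = sum(c[0] for c in chosen)
        reftotal = sum(c[1] for c in chosen)
        hyptotal = sum(c[2] for c in chosen)
        p = match / hyptotal if hyptotal else 0.0
        r = match / reftotal if reftotal else 0.0
        samples.append(2 * p * r / (p + r) if (p + r) else 0.0)
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance, variance ** 0.5   # mean, variance, standard deviation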
The tag-level scores, unlike the other three scores, admit span
variation. You may optionally obtain a detailed breakdown of the
tag level errors using the --tag_span_details option. This option
inserts the following columns before the refclash and hypclash
columns:
reftagclash |
The number of span
annotations which bear this tag in the reference document
whose span matches an annotation in the hypothesis which has
a different tag. Annotation pairs which trigger this column
also trigger the corresponding hyptagclash column for the
corresponding hypothesis tag. |
refovermark |
The number of span
annotations in the reference whose span completely covers
(but does not match) an annotation in the hypothesis which
has the same tag. Annotation pairs which trigger this column
also trigger the corresponding hypundermark column for the
same tag. |
refundermark |
The number of span annotations in the reference whose span is completely covered by (but does not match) an annotation in the hypothesis which has the same tag. Annotation pairs which trigger this column also trigger the corresponding hypovermark column for the same tag. |
refoverlap |
The number of span
annotations in the reference whose span overlaps an
annotation in the hypothesis which has the same tag, but
does not qualify for refovermark or refundermark. Annotation
pairs which trigger this column also trigger the
corresponding hypoverlap column for the same tag. |
reftagplusovermark |
The number of span annotations in the reference whose span completely covers (but does not match) an annotation in the hypothesis which has a different tag. Annotation pairs which trigger this column also trigger the corresponding hyptagplusundermark column for the corresponding hypothesis tag. |
reftagplusundermark |
The number of span annotations in the reference whose span is completely covered by (but does not match) an annotation in the hypothesis which has a different tag. Annotation pairs which trigger this column also trigger the corresponding hyptagplusovermark column for the corresponding hypothesis tag. |
reftagplusoverlap |
The number of span annotations in the reference whose span overlaps an annotation in the hypothesis which has a different tag, but does not qualify for reftagplusovermark or reftagplusundermark. Annotation pairs which trigger this column also trigger the corresponding hyptagplusoverlap column for the corresponding hypothesis tag. |
refclash is the sum of these seven columns.
hyptagclash |
The number of span annotations which bear this tag in the hypothesis document whose span matches an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagclash column for the corresponding reference tag. |
hypovermark |
The number of span annotations in the hypothesis whose span completely covers (but does not match) an annotation in the reference which has the same tag. Annotation pairs which trigger this column also trigger the corresponding refundermark column for the same tag. |
hypundermark |
The number of span annotations in the hypothesis whose span is completely covered by (but does not match) an annotation in the reference which has the same tag. Annotation pairs which trigger this column also trigger the corresponding refovermark column for the same tag. |
hypoverlap |
The number of span annotations in the hypothesis whose span overlaps an annotation in the reference which has the same tag, but does not qualify for hypovermark or hypundermark. Annotation pairs which trigger this column also trigger the corresponding refoverlap column for the same tag. |
hyptagplusovermark |
The number of span annotations in the hypothesis whose span completely covers (but does not match) an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagplusundermark column for the corresponding reference tag. |
hyptagplusundermark |
The number of span annotations in the hypothesis whose span is completely covered by (but does not match) an annotation in the reference which has a different tag. Annotation pairs which trigger this column also trigger the corresponding reftagplusovermark column for the corresponding reference tag. |
hyptagplusoverlap |
The number of span annotations in the hypothesis whose span overlaps an annotation in the reference which has a different tag, but does not qualify for hyptagplusovermark or hyptagplusundermark. Annotation pairs which trigger this column also trigger the corresponding reftagplusoverlap column for the corresponding reference tag. |
hypclash is the sum of these seven columns.
The token, character and pseudo-token tables each use a different basic element for their element counts. Because these elements are fixed across the reference and hypothesis documents, there are no span clashes in these score tables. The "test toks" column will be labeled "test chars" and "test pseudo-toks" in the last two spreadsheets.
The fixed-span score tables have some additions to the core
column set. The additional columns are:
tag_sensitive_accuracy |
(test toks - refclash -
missing - spurious)/test toks (essentially, the fraction of
tokens in the reference which were tagged correctly,
including those which were not tagged at all) |
tag_sensitive_error_rate |
1 - tag_sensitive_accuracy |
tag_blind_accuracy |
(test toks - missing - spurious)/test toks (essentially, the fraction of tokens in the reference which were properly assigned a tag - any tag) |
tag_blind_error_rate |
1 - tag_blind_accuracy |
If the user requests confidence data, it will be reported for all
four of these additional columns.
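For concreteness, here is a sketch of the two accuracies and their error rates (hypothetical Python, using the column names above as variables):

def fixed_span_accuracies(test_toks, refclash, missing, spurious):
    tag_sensitive_accuracy = (test_toks - refclash - missing - spurious) / test_toks
    tag_blind_accuracy = (test_toks - missing - spurious) / test_toks
    return {'tag_sensitive_accuracy': tag_sensitive_accuracy,
            'tag_sensitive_error_rate': 1 - tag_sensitive_accuracy,
            'tag_blind_accuracy': tag_blind_accuracy,
            'tag_blind_error_rate': 1 - tag_blind_accuracy}

# E.g., 1000 tokens, 20 refclash, 10 missing, 15 spurious:
# tag_sensitive_accuracy = 955/1000 = 0.955, tag_blind_accuracy = 975/1000 = 0.975.
print(fixed_span_accuracies(1000, 20, 10, 15))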
The token-level score elements are generated by whatever tokenizer
was used to tokenize the scored documents. If no tokenizer was
used, you won't be able to produce token-level scores. Character
scores, on the other hand, are always available, because the
characters themselves serve as the basic elements.
We've included character-level scoring to provide sub-tag-level
granularity in situations where tokenization hasn't been performed
or isn't available for some reason (although nothing stops you
from using these methods alongside token-level scores). In
addition, in the interest of producing something more "token-like"
in the absence of actual tokenization, we've designed a notion of
"pseudo-token". To compute the pseudo-tokens for a document,
collect the set of start and end indices for the content
annotations in both the reference and hypothesis documents, order
the indices, and count the whitespace-delimited tokens in each
span, including the edge spans of the document. This count will
be, at minimum, the number of whitespace-delimited tokens in the
document as a whole, but may be greater, if annotation boundaries
don't abut whitespace.
Consider this deeply artificial example:
ref: the future <NP>President of the United State</NP>s
hyp: the<NP> future President of the Unit</NP>ed States
The pseudo-tokens in this document are computed as follows:
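Here is a minimal sketch of that computation (illustrative Python, using plain (start, end) character offsets; the offsets are our reading of the artificial example above, not output from the scorer):

def pseudo_tokens(text, spans):
    # spans: content annotation spans from BOTH the reference and the hypothesis.
    # The region boundaries are the document edges plus every annotation start/end.
    boundaries = sorted({0, len(text)} | {i for span in spans for i in span})
    regions = zip(boundaries, boundaries[1:])
    # Each region contributes its whitespace-delimited tokens.
    return [tok for (s, e) in regions for tok in text[s:e].split()]

text = "the future President of the United States"
ref_np = (11, 40)   # "President of the United State"
hyp_np = (3, 32)    # " future President of the Unit"
print(pseudo_tokens(text, [ref_np, hyp_np]))
# ['the', 'future', 'President', 'of', 'the', 'Unit', 'ed', 'State', 's']
# 9 pseudo-tokens, versus 7 whitespace-delimited tokens in the document.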
The granularity of pseudo-tokens is hopefully more informative
than character granularity for those languages which are
substantially whitespace-delimited, without having to make any
complex, and perhaps irrelevant, decisions about tokenization.
Using both the whitespace boundaries and the annotation boundaries
as region delimiters allows us to deal with the minimum level of
granularity that the pair of documents in question requires to
account for all the annotation contrasts. We recognize that this
is a novel approach, but we hope it will be useful.
Note: unlike token and
character scores, the number of pseudo-tokens is a function of the
overlaps between the reference and hypothesis. Therefore, the
actual number of pseudo-tokens in the document will vary slightly
depending on the performance and properties of your tagger. Do not be alarmed by this.
The detail spreadsheet is intended to provide a span-by-span
assessment of the scoring inputs.
file |
the name of the hypothesis
from which the entry is drawn |
type |
one of missing, spurious,
match (the meaning of these values should be clear from the
preceding discussion), tagclash, overlap, overmark,
undermark, tagplusovermark, tagplusundermark (from the point
of view of the hypothesis document; i.e., overmark
corresponds to hypovermark above, etc.) |
reflabel |
the label on the span in the
reference document |
refstart |
the start index, in
characters, of the span in the reference document |
refend |
the end index, in characters,
of the span in the reference document |
hyplabel |
the label on the span in the hypothesis document |
hypstart |
the start index, in characters, of the span in the hypothesis document |
hypend |
the end index, in characters, of the span in the hypothesis document |
refcontent |
the text between the start
and end indices in the reference document |
hypcontent |
the text between the start
and end indices in the hypothesis document |
Unix:
% $MAT_PKG_HOME/bin/MATScore
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd
Usage: MATScore [options]
--task <task> |
Optional. If specified, the
scorer will use the tags (or tag+attributes) specified in
the named task. |
--content_annotations
ann,ann,ann... |
Optional. If no task is
specified, the scorer will try to use the metadata in the
document to determine which annotations are content
annotations and which are token annotations. If this
metadata is absent (e.g., if the 'metadata' slot in a
mat-json document is unpopulated), the scorer requires
additional, external information. Use this flag to provide a
comma-separated sequence of annotation labels which should
be treated as content annotations. Ignored if --task is
present. |
--token_annotations
ann,ann,ann... |
Optional. If no task is
specified, the scorer will try to use the metadata in the
document to determine which annotations are content
annotations and which are token annotations. If this
metadata is absent (e.g., if the 'metadata' slot in a
mat-json document is unpopulated), the scorer requires
additional, external information. Use this flag to provide a
comma-separated sequence of annotation labels which should
be treated as token annotations. Ignored if --task is
present. |
--equivalence_class
equivlabel oldlabel,oldlabel,... |
Optional and repeatable. In
some cases, you may wish to collapse two or more labels into
a single equivalence class when you run the scorer. The
first argument to this parameter is the label for the
equivalence class; the second argument is a comma-separated
sequence of existing annotation labels. Note: when you're
specifying the existing labels, and you want to refer to an
attribute set, use the value of the 'name' attribute of the
attr_set from the task.xml file. |
--ignore label,label,... |
Optional. In some cases, you
may wish to ignore some labels entirely. The value of this
parameter is a comma-separated sequence of annotation
labels. If an annotation in the reference or hypothesis
bears this label, it will be as if the annotation is simply
not present. Note: when you're specifying the annotation
labels, and you want to refer to an attribute set, use the
value of the 'name' attribute of the attr_set from the
task.xml file. |
--file <file> |
The hypothesis file to
evaluate. Must be paired with --ref_file. Either this or
--dir must be specified. |
--dir <dir> |
A directory of files to
evaluate. Must be paired with --ref_dir. Either this or
--file must be specified. |
--file_re <re> |
A Python regular expression
to filter the basenames of hypothesis files when --dir is
used. Optional. The expression should match the entire
basename. |
--file_type <t> |
The file type of the
hypothesis document(s). One of the readers. Default is
mat-json. |
--encoding <e> |
Hypothesis file character
encoding. Default is the default encoding of the file type.
Ignored for file types such as mat-json which have fixed
encodings. |
--ref_file <file> |
The reference file to compare
the hypothesis to. Must be paired with --file. Either this
or --ref_dir must be specified. |
--ref_dir <dir> |
A directory of files to
compare the hypothesis to. Must be paired with --dir. Either
this or --ref_file must be specified. |
--ref_fsuff_off <suff> |
When --ref_dir is used, each
qualifying file in the hypothesis dir is paired, by default,
with a file in the reference dir with the same basename.
This parameter specifies a suffix to remove from the
hypothesis file before searching for a pair in the reference
directory. If both this and --ref_fsuff_on are present, the
removal happens before the addition. |
--ref_fsuff_on <suff> |
When --ref_dir is used, each qualifying file in the hypothesis dir is paired, by default, with a file in the reference dir with the same basename. This parameter specifies a suffix to add to the hypothesis file before searching for a pair in the reference directory. If both this and --ref_fsuff_off are present, the removal happens before the addition. |
--ref_file_type <t> |
The file type of the reference document(s). One of the readers. Default is mat-json. |
--ref_encoding <e> |
Reference file character
encoding. Default is the default encoding of the file type.
Ignored for file types such as mat-json which have fixed
encodings. |
Note that all the CSV files created by the scorer are in UTF-8
encoding.
--tag_span_details |
By default, the tag scores,
like the other scores, present a single value for all the
mismatches. If this option is specified, the tag scores will
provide a detailed breakdown of the span errors, as
described above. |
--details |
If present, generate a
separate spreadsheet providing detailed alignments of
matches and errors. See this special note on viewing
CSV files containing natural language text. |
--confusability |
If present, generate a
separate spreadsheet providing a confusability matrix for
all annotation pairs which would be registered as a match or
a hyp/reftagclash. |
--by_token |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
token-level scores. Note:
in order for token-level scoring to work, the hypothesis
document must contain token annotations, and the content
annotation boundaries in both documents must align with
token annotation boundaries. If there are no token
annotations, no token-level scores will be generated; if one
or both documents contain token annotations but they're not
aligned with content annotations, the behavior is undefined. |
--by_pseudo_token |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
scores using what we call 'pseudo-tokens', which are
essentially the spans created by the union of whitespace
boundaries and span boundaries. For English and other
Roman-alphabet languages, this score should be very, very
close to the token-level score, without requiring the
overhead of having actual token annotations in the document. |
--by_character |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
character-level scores. For languages like Chinese, this score may
provide some useful sub-phrase metrics without requiring the
overhead of having token annotations in the document. |
--compute_confidence_data |
If present, the scorer will
compute means and variances for the various metrics provided
in the tag and token spreadsheets, if --csv_output_dir is
specified. |
--csv_output_dir <dir> |
By default, the scorer
formats text tables to standard output. If this flag is
present, the scores (if requested) will be written as CSV
files to <dir>/bytag.csv, <dir>/bytoken.csv,
<dir>/bypseudotoken.csv, <dir>/bychar.csv,
<dir>/details.csv, and <dir>/confusability.csv. |
--no_csv_formulas |
By default, the scorer
produces CSV files with spreadsheet equations for computed
values. If this flag is present, the CSV files will contain
actual values instead. |
--oo_separator |
By default, the scorer uses
Excel-style formula separators in its spreadsheet equations.
If this flag is present, the scorer will use OpenOffice
formula separators. (The formula formats are incompatible,
and the formulas will be recognized in either Excel or
OpenOffice, but not both.) |
The readers referenced in the --file_type and --ref_file_type options may introduce additional options, which are described here. These additional options must follow the --file_type and --ref_file_type options. The options for the reference file types are all prepended with a ref_ prefix; so for instance, to specify the --xml_input_is_overlay option for xml-inline reference documents, use the option --ref_xml_input_is_overlay.
Let's say you have two files, /path/to/ref and /path/to/hyp,
which you want to compare. The default settings will print a table
to standard output.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref
Let's say that instead of printing a table to standard output,
you want to produce CSV output with embedded formulas, and you
want all three spreadsheets.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
--csv_output_dir $PWD --details --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref \
--csv_output_dir %CD% --details --by_token
This invocation will not produce any table on standard output,
but will leave three files in the current directory: bytag.csv,
bytoken.csv, and details.csv.
Let's say you have two directories full of files. /path/to/hyp
contains files of the form file<n>.txt.json, and
/path/to/ref contains files of the form file<n>.json. You
want to compare the corresponding files to each other, and you
want tag and token scoring, but not details, and you intend to
view the spreadsheet in OpenOffice.
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --oo_separator --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref \
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" \
--csv_output_dir %CD% --oo_separator --by_token
For each file in /path/to/hyp, this invocation will prepare a
candidate filename to look for in /path/to/ref by removing the
.txt.json suffix and adding the .json suffix. The current
directory will contain bytag.csv and bytoken.csv.
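A sketch of this pairing logic (hypothetical Python; the scorer's own code may differ in details):

def reference_basename(hyp_basename, fsuff_off=None, fsuff_on=None):
    name = hyp_basename
    if fsuff_off and name.endswith(fsuff_off):
        name = name[:-len(fsuff_off)]   # --ref_fsuff_off: removal happens first
    if fsuff_on:
        name = name + fsuff_on          # --ref_fsuff_on: then the addition
    return name

print(reference_basename("file3.txt.json", ".txt.json", ".json"))   # file3.json
print(reference_basename("file3.txt.json", ".txt.json", ".xml"))    # file3.xml (as in example 6)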
Let's say that you're in the same situation as example 3, but
you want confidence information included in the output
spreadsheets:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --oo_separator --by_token --compute_confidence_data
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref \
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" \
--csv_output_dir %CD% --oo_separator --by_token --compute_confidence_data
Let's say that you're in the same situation as example 3, but
your documents contain lots of tags and you're only interested in
scoring the tags listed in the "Named Entity" task. Furthermore,
you're going to import the data into a tool other than Excel, so
you want the values calculated for you rather than having embedded
equations:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --no_csv_formulas --by_token --task "Named Entity"
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref \
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" \
--csv_output_dir %CD% --no_csv_formulas --by_token --task "Named Entity"
Let's say you're in the same situation as example 3, but your
reference documents are XML inline documents, and are of the form
file<n>.xml. Do this:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.xml' \
--csv_output_dir $PWD --oo_separator --by_token --ref_file_type xml-inline
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref \
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".xml" \
--csv_output_dir %CD% --oo_separator --by_token --ref_file_type xml-inline
Note that --ref_fsuff_on has changed, in addition to adding the
--ref_file_type option.