The scoring engine compares two tagged files, or two directories of
tagged files. Typically, one input is the hypothesis (an automatically
tagged file) and the other is the reference (a gold-standard tagged
file). But this tool can be used to compare any two inputs.
The scorer can produce three spreadsheets: tag-level scores,
token-level scores, and details. By default, only the tag-level
scores are produced.
The tag-level score table has the following columns:

tag
  The label which is being scored in this row. The final row is a
  cumulative score, with the label "<all>". If the --task option is
  specified (see below), the task.xml file may specify an "alias" for
  a tag plus some attribute subset (e.g., for named entity, an ENAMEX
  tag with attribute "type" = "PERSON", with an alias of "PERSON").

test docs
  The number of test (hypothesis) documents. This value is the same
  for all rows.

test toks
  The number of tokens in the test documents. This value is the same
  for all rows.

match
  The number of span annotations for this tag which occur with the
  same label and the same span extent in the hypothesis document and
  its corresponding reference document.

refclash
  The number of span annotations bearing this tag in the reference
  document which overlap a tagged span in the corresponding hypothesis
  document but differ from it in label, span extent, or both. Note
  that a count in this column may be mirrored by a corresponding
  count, from the point of view of the hypothesis document, in the
  hypclash column.

missing
  The number of span annotations bearing this tag in the reference
  document which do not overlap any tagged span in the corresponding
  hypothesis document.

refonly
  refclash + missing

reftotal
  refonly + match

hypclash
  The number of span annotations bearing this tag in the hypothesis
  document which overlap a tagged span in the corresponding reference
  document but differ from it in label, span extent, or both. Note
  that a count in this column may be mirrored by a corresponding
  count, from the point of view of the reference document, in the
  refclash column.

spurious
  The number of span annotations bearing this tag in the hypothesis
  document which do not overlap any tagged span in the corresponding
  reference document.

hyponly
  hypclash + spurious

hyptotal
  hyponly + match

precision
  match / hyptotal

recall
  match / reftotal

fmeasure
  2 * ((precision * recall) / (precision + recall))
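To make the relationships among the derived columns concrete, here is
a minimal Python sketch; the function and the example counts are
hypothetical illustrations, not part of MATScore itself:

def derive_scores(match, refclash, missing, hypclash, spurious):
    # Derived columns, exactly as defined in the table above.
    refonly = refclash + missing
    reftotal = refonly + match
    hyponly = hypclash + spurious
    hyptotal = hyponly + match
    precision = match / hyptotal if hyptotal else 0.0
    recall = match / reftotal if reftotal else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if (precision + recall) else 0.0)
    return precision, recall, fmeasure

# 90 matches, 5 refclash, 5 missing, 4 hypclash, 6 spurious:
print(derive_scores(90, 5, 5, 4, 6))   # -> approximately (0.9, 0.9, 0.9)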
The user can also request confidence information. To compute it, the
scorer produces 1000 alternative score sets. Each score set is created
by drawing M file scores, with replacement, from the core set of M
file scores. The scorer then computes the overall metrics for each
alternative score set, and computes the mean and variance over the
1000 instances of each of the precision, recall, and fmeasure metrics.
This sampling with replacement (bootstrap resampling) yields a more
stable mean and variance. The procedure adds three columns (mean,
variance, and standard deviation) to the spreadsheet for each of these
metrics; the columns appear immediately to the right of the column for
the metric.
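A sketch of this resampling, shown for fmeasure only; the per-file
score tuples and the function are hypothetical illustrations of the
procedure, not the scorer's actual internals:

import random
import statistics

def bootstrap_fmeasure(file_scores, n_sets=1000):
    # file_scores: one (match, reftotal, hyptotal) tuple per file.
    m = len(file_scores)
    fmeasures = []
    for _ in range(n_sets):
        # Draw m file scores with replacement from the m file scores.
        sample = [random.choice(file_scores) for _ in range(m)]
        match = sum(s[0] for s in sample)
        reftotal = sum(s[1] for s in sample)
        hyptotal = sum(s[2] for s in sample)
        p = match / hyptotal if hyptotal else 0.0
        r = match / reftotal if reftotal else 0.0
        fmeasures.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return (statistics.mean(fmeasures),
            statistics.variance(fmeasures),
            statistics.stdev(fmeasures))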
The token-level score table has the same columns as the tag-level
table, with some reinterpretations and additions. Each column in the
tag-level score table which counts span annotations has a
corresponding column in the token-level score table which counts the
tokens in those annotations. Note that, as a result, refclash and
hypclash can only reflect tag clashes, never extent clashes, because
the tokens and their extents are identical in the pair of documents.
The additional columns are:
tag_sensitive_accuracy
  (test toks - refclash - missing - spurious) / test toks
  (essentially, the fraction of tokens in the reference which were
  tagged correctly, including those which were not tagged at all)

tag_sensitive_error_rate
  1 - tag_sensitive_accuracy

tag_blind_accuracy
  (test toks - missing - spurious) / test toks (essentially, the
  fraction of tokens in the reference which were properly assigned a
  tag - any tag)

tag_blind_error_rate
  1 - tag_blind_accuracy
The user can also request confidence information, which is computed in
the same way as for tag-level scores; it is reported for all four of
these additional columns.
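To make the arithmetic concrete, a small worked example with invented
token counts:

test_toks = 1000
refclash, missing, spurious = 20, 10, 15

tag_sensitive_accuracy = (test_toks - refclash - missing - spurious) / test_toks
tag_blind_accuracy = (test_toks - missing - spurious) / test_toks

print(tag_sensitive_accuracy)      # 0.955
print(1 - tag_sensitive_accuracy)  # ~0.045 (tag_sensitive_error_rate)
print(tag_blind_accuracy)          # 0.975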
The detail spreadsheet is intended to provide a span-by-span
assessment of the scoring inputs. It has the following columns:

file
  the name of the hypothesis file from which the entry is drawn

type
  one of missing, spurious, spanclash, tagclash, bothclash, match
  (the meaning of these values should be clear from the preceding
  discussion)

reflabel
  the label on the span in the reference document

refstart
  the start index, in characters, of the span in the reference document

refend
  the end index, in characters, of the span in the reference document

hyplabel
  the label on the span in the hypothesis document

hypstart
  the start index, in characters, of the span in the hypothesis document

hypend
  the end index, in characters, of the span in the hypothesis document

refcontent
  the text between the start and end indices in the reference document

hypcontent
  the text between the start and end indices in the hypothesis document
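The type values follow directly from how an overlapping
reference/hypothesis span pair compares. A minimal sketch of that
classification (the function and the (label, start, end) tuples are
hypothetical; the scorer's actual alignment logic is internal to MAT):

def classify(ref, hyp):
    # None means no overlapping span was found on that side.
    if ref is None:
        return "spurious"       # hypothesis span, no reference overlap
    if hyp is None:
        return "missing"        # reference span, no hypothesis overlap
    same_label = ref[0] == hyp[0]
    same_extent = ref[1:] == hyp[1:]
    if same_label and same_extent:
        return "match"
    if same_label:
        return "spanclash"      # same label, different extent
    if same_extent:
        return "tagclash"       # same extent, different label
    return "bothclash"          # overlapping, but both differ

print(classify(("PERSON", 0, 5), ("PERSON", 0, 5)))   # match
print(classify(("PERSON", 0, 5), ("ORG", 0, 5)))      # tagclash
print(classify(("PERSON", 0, 5), ("PERSON", 0, 7)))   # spanclash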
Unix:
% $MAT_PKG_HOME/bin/MATScore
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd
Usage: MATScore [options]
--task <task>
  Optional. If specified, the scorer will use the tags (or
  tag+attributes) specified in the named task.

--content_annotations ann,ann,ann...
  Optional. If no task is specified, the scorer will try to use the
  metadata in the document to determine which annotations are content
  annotations and which are token annotations. If this metadata is
  absent (e.g., if the 'metadata' slot in a mat-json document is
  unpopulated), the scorer requires additional, external information.
  Use this flag to provide a comma-separated sequence of annotation
  labels which should be treated as content annotations. Ignored if
  --task is present.

--token_annotations ann,ann,ann...
  Optional. If no task is specified, the scorer will try to use the
  metadata in the document to determine which annotations are content
  annotations and which are token annotations. If this metadata is
  absent (e.g., if the 'metadata' slot in a mat-json document is
  unpopulated), the scorer requires additional, external information.
  Use this flag to provide a comma-separated sequence of annotation
  labels which should be treated as token annotations. Ignored if
  --task is present.

--file <file>
  The hypothesis file to evaluate. Must be paired with --ref_file.
  Either this or --dir must be specified.

--dir <dir>
  A directory of files to evaluate. Must be paired with --ref_dir.
  Either this or --file must be specified.

--file_re <re>
  A Python regular expression to filter the basenames of hypothesis
  files when --dir is used. Optional. The expression should match the
  entire basename.

--file_type <t>
  The file type of the hypothesis document(s). One of the readers.
  Default is mat-json.

--ref_file <file>
  The reference file to compare the hypothesis to. Must be paired
  with --file. Either this or --ref_dir must be specified.

--ref_dir <dir>
  A directory of files to compare the hypothesis files to. Must be
  paired with --dir. Either this or --ref_file must be specified.

--ref_fsuff_off <suff>
  When --ref_dir is used, each qualifying file in the hypothesis
  directory is paired, by default, with a file in the reference
  directory with the same basename. This parameter specifies a suffix
  to remove from the hypothesis filename before searching for a pair
  in the reference directory. If both this and --ref_fsuff_on are
  present, the removal happens before the addition.

--ref_fsuff_on <suff>
  When --ref_dir is used, each qualifying file in the hypothesis
  directory is paired, by default, with a file in the reference
  directory with the same basename. This parameter specifies a suffix
  to add to the hypothesis filename before searching for a pair in
  the reference directory. If both this and --ref_fsuff_off are
  present, the removal happens before the addition.

--ref_file_type <t>
  The file type of the reference document(s). One of the readers.
  Default is mat-json.

--details
  If present, generate a separate spreadsheet providing detailed
  alignments of matches and errors. See this special note on viewing
  CSV files containing natural language text.

--by_token
  By default, the scorer generates aggregate tag-level scores. If
  this flag is present, generate a separate spreadsheet showing
  aggregate token-level scores.

--compute_confidence_data
  If present, the scorer will compute means and variances for the
  various metrics provided in the tag and token spreadsheets, if
  --csv_output_dir is specified.

--csv_output_dir <dir>
  By default, the scorer formats text tables to standard output. If
  this flag is present, the scores will be written as CSV files to
  <dir>/bytag.csv, <dir>/bytoken.csv, and <dir>/details.csv.

--no_csv_formulas
  By default, the scorer produces CSV files with spreadsheet
  equations for computed values. If this flag is present, the CSV
  files will contain actual values instead.

--oo_separator
  By default, the scorer uses Excel-style formula separators in its
  spreadsheet equations. If this flag is present, the scorer will use
  OpenOffice formula separators instead. (The formula formats are
  incompatible; the formulas will be recognized in either Excel or
  OpenOffice, but not both.)

Note that all the CSV files created by the scorer are in UTF-8
encoding.
The readers referenced in the --file_type and --ref_file_type options
may introduce additional options, which are described here. These
additional options must follow the --file_type and --ref_file_type
options. The options for the reference file types are all prefixed
with ref_; so, for instance, to specify the --xml_input_is_overlay
option for xml-inline reference documents, use the option
--ref_xml_input_is_overlay.
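For instance, an invocation scoring a hypothesis against an xml-inline
reference document read as an overlay might look like this (the file
paths are placeholders):

Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref.xml \
--ref_file_type xml-inline --ref_xml_input_is_overlay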
Example 1. Let's say you have two files, /path/to/ref and
/path/to/hyp, which you want to compare. The default settings will
print a table to standard output.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref
Example 2. Let's say that instead of printing a table to standard
output, you want to produce CSV output with embedded formulas, and
you want all three spreadsheets.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
--csv_output_dir $PWD --details --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref ^
--csv_output_dir %CD% --details --by_token
This invocation will not produce any table on standard output, but
will leave three files in the current directory: bytag.csv,
bytoken.csv, and details.csv.
Example 3. Let's say you have two directories full of files.
/path/to/hyp contains files of the form file<n>.txt.json, and
/path/to/ref contains files of the form file<n>.json. You want to
compare the corresponding files to each other, you want tag and token
scoring but not details, and you intend to view the spreadsheets in
OpenOffice.
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --oo_separator --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --oo_separator --by_token
For each file in /path/to/hyp, this invocation will prepare a
candidate filename to look for in /path/to/ref by removing the
.txt.json suffix and adding the .json suffix. The current directory
will contain bytag.csv and bytoken.csv.
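The suffix manipulation is plain string surgery; a hypothetical
Python sketch of the pairing rule (the helper is illustrative, not
MATScore's actual code):

def ref_candidate(hyp_basename, fsuff_off=None, fsuff_on=None):
    base = hyp_basename
    if fsuff_off and base.endswith(fsuff_off):
        base = base[:-len(fsuff_off)]   # removal happens first
    if fsuff_on:
        base = base + fsuff_on          # then the addition
    return base

print(ref_candidate("file1.txt.json", ".txt.json", ".json"))   # file1.json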
Example 4. Let's say that you're in the same situation as example 3,
but you want confidence information included in the output
spreadsheets:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --oo_separator --by_token --compute_confidence_data
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --oo_separator --by_token --compute_confidence_data
Example 5. Let's say that you're in the same situation as example 3,
but your documents contain lots of tags and you're only interested in
scoring the tags listed in the "Named Entity" task. Furthermore,
you're going to import the data into a tool other than Excel, so you
want the values calculated for you rather than embedded as equations:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --no_csv_formulas --by_token --task "Named Entity"
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --no_csv_formulas --by_token --task "Named Entity"
Example 6. Let's say you're in the same situation as example 3, but
your reference documents are XML inline documents, of the form
file<n>.xml. Do this:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.xml' \
--csv_output_dir $PWD --oo_separator --by_token --ref_file_type xml-inline
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".xml" ^
--csv_output_dir %CD% --oo_separator --by_token --ref_file_type xml-inline
Note that, in addition to adding the --ref_file_type option, the
value of --ref_fsuff_on has changed.