Annotation Reporter

Description

The annotation reporter produces concordance-style reports on the content annotations in a given set of documents, either in CSV or text form. The CSV file contains the following columns:

file
the name of the document from which the entry is drawn
start
the start index, in characters, of the span in the document
end
the end index, in characters, of the span in the document
left context
the context to the left of the start index
text
the text in between the start and end indices
label
the label on the span in the document. If the annotation contains attributes and values, these will be represented in the label
right context
the context to the right of the end index

It's also possible to omit the left and right contexts, if you prefer. The text file contains the same columns, except that file, start, and end are collapsed into a single location column. It's also possible to interpolate document-level statistics such as file length and number of annotations per label into these reports.
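To make the relationship between these columns concrete, here is a minimal Python sketch of how one concordance row could be derived from a span. It is an illustration only, not the reporter's actual implementation: the function name concordance_row, the dictionary keys, and the clipping at document boundaries are assumptions. The default window of 32 matches the documented default for --concordance_window.

```python
def concordance_row(filename, text, start, end, label, window=32, omit_context=False):
    """Build one concordance-style row for the span text[start:end].

    Hypothetical helper for illustration; not part of MAT.
    """
    row = {
        "file": filename,
        "start": start,
        "end": end,
        "text": text[start:end],
        "label": label,
    }
    if not omit_context:
        # Clip the left window at the beginning of the document (assumption).
        row["left context"] = text[max(0, start - window):start]
        # Slicing past the end of a string is safe in Python.
        row["right context"] = text[end:end + window]
    return row


doc = "Alice Smith visited Boston last week."
row = concordance_row("doc1.json", doc, 20, 26, "LOCATION", window=10)
print(row["left context"], "|", row["text"], "|", row["right context"])
```

Passing omit_context=True leaves out the two context entries, which mirrors the effect described for omitting the left and right contexts.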

Because the CSV files contain language data, please consult this special note on how to view them.
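Since the CSV report is written in UTF-8, any program that reads it should state that encoding explicitly. The sketch below uses Python's standard csv module. The file name report.csv, the presence of a header row, and the exact header strings are assumptions made for illustration (check the file the reporter actually writes); the sample rows are invented so that the snippet is self-contained.

```python
import csv
import os
import tempfile

# Write a small stand-in for a reporter CSV (made-up rows; header names assumed).
sample_rows = [
    ["file", "start", "end", "left context", "text", "label", "right context"],
    ["doc1.json", "0", "6", "", "Zürich", "LOCATION", " is in Switzerland"],
    ["doc1.json", "13", "24", "Zürich is in ", "Switzerland", "LOCATION", ""],
]
path = os.path.join(tempfile.mkdtemp(), "report.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(sample_rows)

# Read it back, stating the encoding explicitly so non-ASCII text survives.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

for r in rows:
    print(r["label"], r["text"], r["start"], r["end"])
```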

Usage

Unix:

% $MAT_PKG_HOME/bin/MATReport

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd

Usage: MATReport [options]

Core options

--task <task>
Name of the task to use. Obligatory if --content_annotations is not used and more than one task is registered
--content_annotations ann,ann,ann...
Optional. If no task is specified, the reporter will try to use the metadata in the document to determine which annotations are content annotations. If this metadata is absent (e.g., if the 'metadata' slot in a mat-json document is unpopulated), the reporter requires additional, external information. Use this flag to provide a comma-separated sequence of annotation labels which should be treated as content annotations. Ignored if --task is present.

Input options

--input_files <file>
A glob-style pattern describing full pathnames to be reported on. May be specified with --input_dir. Can be repeated.
--input_dir <dir>
A directory, all of whose files will be reported on. Can be repeated. May be specified with --input_files.
--file_type <t>
The file type of the document(s). Must be one of the available readers. Default is mat-json.
--encoding <e>
The encoding of the input. The default is the appropriate default for the file type.
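As a rough illustration of which files a glob-style pattern such as /path/to/files/*.json would select, the sketch below applies Python's fnmatch module to a made-up list of paths. Whether the reporter's matching agrees with fnmatch in every detail is an assumption, and all paths shown are hypothetical.

```python
from fnmatch import fnmatch

pattern = "/path/to/files/*.json"
candidates = [
    "/path/to/files/file1.json",
    "/path/to/files/file2.json",
    "/path/to/files/notes.txt",
    "/path/to/other/file3.json",
]
# Keep only the paths that match the wildcard pattern.
matched = [p for p in candidates if fnmatch(p, pattern)]
print(matched)
```

On Unix, quote the pattern on the command line so that the shell passes it to the reporter unexpanded.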

Output options

--output_dir <dir>
The output directory for the reports. Will be created if it doesn't exist. Required.
--csv
Generate a CSV file in the output directory, with concordance-style data: file, location, content, left and right context, annotation label. At least one of this option and --txt must be provided. The CSV file will be in UTF-8 encoding. See this special note on viewing CSV files containing natural language text.
--txt
Generate a text file in the output directory, with concordance-style data, sorted first by annotation label and then by content. At least one of this option and --csv must be provided. The output file will be in UTF-8 encoding.
--concordance_window <i>
Use the specified value as the window size on each side of the concordance. Default is 32.
--omit_concordance_context
Omit the left and right concordance context from the output.
--file_csv
Generate a separate CSV file consisting of file-level statistics such as file size in characters and number of annotations of each type.
--interpolate_file_info
Instead of a separate CSV file for the file-level statistics, interpolate them into the concordance.
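The file-level statistics described for --file_csv amount to a per-document summary. The following Python sketch shows one way such a summary could be computed from a document's text and its (start, end, label) spans. The function name file_stats and the output keys are hypothetical, and the reporter's actual columns may differ.

```python
from collections import Counter


def file_stats(filename, text, spans):
    """Summarize one document: length in characters and annotation counts per label.

    `spans` is an iterable of (start, end, label) tuples. Hypothetical helper.
    """
    counts = Counter(label for _start, _end, label in spans)
    stats = {"file": filename, "file_size": len(text)}
    # One entry per label, in sorted order for stable output.
    for label in sorted(counts):
        stats[label] = counts[label]
    return stats


doc = "Alice Smith visited Boston and Chicago."
spans = [(0, 11, "PERSON"), (20, 26, "LOCATION"), (31, 38, "LOCATION")]
print(file_stats("doc1.json", doc, spans))
```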

Other options

The readers referenced in the --file_type option may introduce additional options, which are described here. These additional options must follow the --file_type option.

Examples

Example 1

Let's say you have a file, /path/to/file, whose annotations you want to view in a spreadsheet. You want the results to be written to /path/to/output.

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --csv --output_dir /path/to/output

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --csv --output_dir c:\path\to\output

Example 2

Let's say that you only want textual output, and you don't want the left and right context columns:

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --txt \
--output_dir /path/to/output --omit_concordance_context


Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --txt ^
--output_dir c:\path\to\output --omit_concordance_context

Example 3

Let's say you have a directory full of files. /path/to/files contains files of the form file<n>.json. You want to view them both in CSV and in text, and you want a smaller concordance window of 10 characters.

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.json' \
--csv --txt --output_dir /path/to/output --concordance_window 10

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.json" ^
--csv --txt --output_dir c:\path\to\output --concordance_window 10

This invocation reports on every file in /path/to/files that matches *.json, and writes both the CSV report and the text report to /path/to/output, with 10 characters of context on each side of each annotation.