The annotation reporter produces concordance-style reports on the
content annotations in a given set of documents, either in CSV or
text form. The CSV file contains the following columns:
file | the name of the document from which the entry is drawn
start | the start index, in characters, of the span in the document
end | the end index, in characters, of the span in the document
left context | the context to the left of the start index
text | the text in between the start and end indices
label | the label on the span in the document. If the annotation contains attributes and values, these will be represented in the label.
right context | the context to the right of the end index
It's also possible to omit the left and right contexts, if you
prefer. The text file contains the same columns, except that file,
start, and end are collapsed into a single location column.
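If you'd rather process the CSV report in a script than in a spreadsheet, something like the following sketch may help. It assumes that the CSV file begins with a header row whose names match the columns listed above exactly (file, start, end, text, label, and so on); check this against your own output before relying on it. The function names read_report and spans_with_label are illustrative and are not part of MAT.

```python
import csv

def read_report(path):
    """Return the rows of a concordance CSV as a list of dicts.

    Assumption: the first row of the file is a header naming the columns.
    """
    # The reports are UTF-8; utf-8-sig also tolerates a leading byte-order mark.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

def spans_with_label(rows, label):
    """Yield (file, start, end, text) for each row whose label matches exactly."""
    for row in rows:
        if row["label"] == label:
            yield row["file"], int(row["start"]), int(row["end"]), row["text"]
```

Note that if your annotations carry attributes and values, these are represented in the label column, so an exact comparison against the bare label name may not match; a prefix test may be more appropriate in that case.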
This tool also allows you, via the --partition_by_label option,
to generate CSV and text files for each content annotation label
in the document set. In these versions, the annotation ID is
reported in a column after the "end" column, and instead of the
"label" column, the file contains a column for each known
attribute of the annotation type.
It's also possible to interpolate document-level statistics such
as file length and number of annotations per label into these
reports.
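To make the "number of annotations per label" statistic concrete, here is a minimal sketch that tallies it from concordance rows. It is only an illustration of what the statistic means, not the tool's own implementation, and it assumes each row is a mapping with "file" and "label" keys (for instance, rows read from the CSV report with csv.DictReader). File length can't be recovered this way, since the concordance contains only the annotated spans and their contexts.

```python
from collections import Counter

def annotations_per_label(rows):
    """Count annotations for each (file, label) pair.

    rows: an iterable of mappings with at least "file" and "label" keys.
    """
    counts = Counter()
    for row in rows:
        counts[(row["file"], row["label"])] += 1
    return counts
```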
Because the CSV files contain language data, please consult this
special note on how to view
them.
Unix:
% $MAT_PKG_HOME/bin/MATReport
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd
Usage: MATReport [options]
--task <task> | Name of the task to use. Obligatory if neither --content_annotations nor --content_annotations_all are provided, and more than one task is registered.
--content_annotations ann,ann,ann... | Optional. If --task is not provided, the reporter requires additional, external information to determine which annotations are content annotations. Use this flag to provide a comma-separated sequence of annotation labels which should be treated as content annotations.
--content_annotations_all | Optional. If neither --task nor --content_annotations are provided, this flag will cause all labels in the document to be treated as content annotations.
--verbose | If present, the tool will provide detailed information on its progress.
--input_files <file> | A glob-style pattern describing full pathnames to be reported on. May be specified with --input_dir. Can be repeated.
--input_dir <dir> | A directory, all of whose files will be reported on. Can be repeated. May be specified with --input_files.
--file_type <t> | The file type of the document(s). One of the readers. Default is mat-json.
--encoding <e> | The encoding of the input. The default is the appropriate default for the file type.
--output_dir <dir> | The output directory for the reports. Will be created if it doesn't exist. Required.
--csv | Generate a CSV file in the output directory, with concordance-style data: file, location, content, left and right context, annotation label. At least one of this option or --txt must be provided. The CSV file will be in UTF-8 encoding. See this special note on viewing CSV files containing natural language text.
--txt | Generate a text file in the output directory, with concordance-style data, sorted first by annotation label and then by content. At least one of this option or --csv must be provided. The output file will be in UTF-8 encoding.
--concordance_window <i> | Use the specified value as the window size on each side of the concordance. Default is 32.
--omit_concordance_context | Omit the left and right concordance context from the output.
--file_csv | Generate a separate CSV file consisting of file-level statistics such as file size in characters and number of annotations of each type.
--interpolate_file_info | Instead of a separate CSV file for the file-level statistics, interpolate them into the concordance.
--include_spanless | By default, only spanned content annotations are produced. If this flag is present, spanless annotations (without position or left or right context, of course) will be included. If the spanless annotations refer to spanned annotations, the text context of the referred annotations will be inserted in the 'text' column.
--partition_by_label | If present, in addition to the standard output file report.csv and/or report.txt, the tool will generate a separate spreadsheet for each label, with a column for each attribute.
The readers referenced in the --file_type option may introduce
additional options, which are described here. These additional
options must follow the --file_type option.
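If you want to drive the reporter from a script rather than typing the command by hand, the options above can be assembled into an argument list. The following is a minimal sketch, not part of MAT: the function name build_report_command and its parameters are invented for illustration, and only a handful of the options are covered. It enforces the two requirements stated above, namely that --output_dir is required and that at least one of --csv or --txt must be given.

```python
import os

def build_report_command(input_files, output_dir, csv=True, txt=False,
                         window=None, omit_context=False, executable=None):
    """Build an argument list for MATReport from a few common options."""
    if not (csv or txt):
        raise ValueError("at least one of --csv or --txt must be provided")
    if not output_dir:
        raise ValueError("--output_dir is required")
    if executable is None:
        # Assumption: MAT_PKG_HOME is set, as in the command lines in this document.
        name = "MATReport.cmd" if os.name == "nt" else "MATReport"
        executable = os.path.join(os.environ.get("MAT_PKG_HOME", ""), "bin", name)
    cmd = [executable]
    for pattern in input_files:
        # --input_files can be repeated, once per glob-style pattern.
        cmd += ["--input_files", pattern]
    if csv:
        cmd.append("--csv")
    if txt:
        cmd.append("--txt")
    cmd += ["--output_dir", output_dir]
    if window is not None:
        cmd += ["--concordance_window", str(window)]
    if omit_context:
        cmd.append("--omit_concordance_context")
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True). Because no shell is involved in that case, a glob-style pattern is passed to the tool unexpanded, which is the same effect the quoting has on the Unix command line.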
Let's say you have a file, /path/to/file, whose annotations you
want to view in a spreadsheet. You want the results to be written
to /path/to/output.
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --csv --output_dir /path/to/output
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --csv --output_dir c:\path\to\output
Let's say that you only want textual output, and you don't want
the concordance columns:
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --txt \
--output_dir /path/to/output --omit_concordance_context
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --txt ^
--output_dir c:\path\to\output --omit_concordance_context
Let's say you have a directory full of files. /path/to/files
contains files of the form file<n>.json. You want to view
them both in CSV and in text, and you want a smaller concordance
window of 10 characters.
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.json' \
--csv --txt --output_dir /path/to/output --concordance_window 10
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.json" ^
--csv --txt --output_dir c:\path\to\output --concordance_window 10
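To see what a concordance window of 10 characters means for the left and right context columns, consider the following sketch. It simply slices up to the given number of characters on each side of the span; the function name concordance is illustrative, and the actual tool may treat details such as line breaks or document boundaries differently.

```python
def concordance(text, start, end, window=32):
    """Return (left context, span text, right context) for a span.

    start and end are character indices into text; window is the maximum
    number of characters of context kept on each side (32 is the tool's
    documented default).
    """
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    return left, text[start:end], right

# A span covering "fox", with a 10-character window on each side.
sentence = "The quick brown fox jumps over the lazy dog"
print(concordance(sentence, 16, 19, window=10))
# prints ('ick brown ', 'fox', ' jumps ove')
```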