The processing engine manages the execution of a sequence of steps
against a set of files.
Unix:
% $MAT_PKG_HOME/bin/MATEngine
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd
Usage: MATEngine [core options] [input/output/task options] [other options]
Options:
-h, --help show this help message and exit
Core options:
--other_app_dir=dir
additional directory to load a task from. Optional and repeatable.
--settings_file=file
a file of settings to use which overwrites existing settings. The file should be a
Bourne shell file which sets variables. Optional.
--task=task name of the task to use. Obligatory if the system knows of more than one task.
Known tasks are: ...
--debug Enable debug output.
...
If no arguments are provided to MATEngine, the help message above is
presented.
The complete list of options is presented once a task argument is
provided. Note that the core options must precede the input, output and
task options, which must precede any other options (this is because the
later options are added progressively as the earlier options are
discovered).
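The ordering constraint can be made explicit when a command line is assembled programmatically. The following is a minimal sketch, not part of MAT: `build_matengine_argv` is a hypothetical helper, and it assumes MAT_PKG_HOME is available as an environment variable (on Windows the executable would be MATEngine.cmd instead).

```python
import os

def build_matengine_argv(core, io_task, other=()):
    """Assemble an argument list with the three option groups in order.

    Each group is a sequence of (flag, value) pairs; use None as the value
    for flags that take no argument, such as --debug.
    """
    # Assumption: MAT_PKG_HOME is set in the environment; the literal
    # fallback is only a placeholder so the sketch runs anywhere.
    home = os.environ.get("MAT_PKG_HOME", "MAT_PKG_HOME")
    argv = [os.path.join(home, "bin", "MATEngine")]
    # Core options first, then input/output/task options, then the rest.
    for group in (core, io_task, other):
        for flag, value in group:
            argv.append(flag)
            if value is not None:
                argv.append(value)
    return argv

argv = build_matengine_argv(
    core=[("--task", "My Task"), ("--debug", None)],
    io_task=[
        ("--workflow", "All"),
        ("--steps", "zone,tokenize"),
        ("--input_file", "/path/to/my/document.txt"),
        ("--input_file_type", "raw"),
    ],
)
print(argv[1:])
```

Keeping the groups as separate arguments makes it hard to place a core option such as --task after an input or output option by accident.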
The MAT engine is embedded in a number of other locations, such as
the specification of workspace operations and the preprocessing and
test corpus processing in the experiment engine. Accordingly, we
describe both the command line options and their XML equivalents here
(with the exception of the core options immediately below, which don't
have any XML equivalents).
Command line option | Description |
---|---|
--other_app_dir <dir> | If present, a directory to look in to find a MAT application specification. This directory must contain a task.xml file which describes the application. This is only necessary if 'MATManagePluginDirs install' has not been called on the application directory. |
--task <s> | The name of a task, as specified in some task.xml file. Required. The known tasks are reported as the toplevel entries in the "Available applications" section after the usage string is printed. |
--settings_file | A file of settings to use which overwrites existing settings. The file should be a Bourne shell file which sets variables. |
--debug | Enable debug output. |
--subprocess_debug <i> | Set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity. |
--subprocess_statistics | Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled. |
Once a task argument is present, MATEngine summarizes the workflow
structure for the task before it prints out the full option list:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task 'Named Entity'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "Named Entity"
Error: workflow must be specified
Usage: MATEngine [core options] [input/output/task options] [other options]
Named Entity :
available workflows:
Hand annotation : steps zone, tokenize, tag
Review/repair : steps
Demo : steps zone, tokenize, tag
...
The remainder of the options can be grouped into a number of
categories.
The task options control what is done to each input file. A workflow
must be specified. You can either apply new steps (with the --steps
flag), or undo existing steps (with the --undo_through flag). If
neither is specified, the tool operates as a (somewhat expensive)
transducer between the input and output formats.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--workflow <s> | workflow | The name of a workflow, as specified in some task.xml file | Required. The known workflows for a given task are specified in the "available workflows" subsections in the listing of available applications printed after the usage string. |
--steps <s> | steps | A comma-concatenated sequence of workflow steps | Some ordered subset of the steps in the specified workflow. The steps for a given workflow are specified in the "available workflows" subsections in the listing of available applications printed after the usage string. If no steps are specified, none will be applied. |
--undo_through <s> | undo_through | A step in the current workflow | All possible steps already done in the document which follow this step are undone, including this step, before any of the steps in --steps are applied. You can use this flag in conjunction with --steps to rewind and then reapply operations. |
--print_steps <s> | print_steps | A comma-concatenated sequence of workflow steps | Some subset of the steps in the specified workflow. Verbose details about these steps will be printed. The steps should be concatenated with a comma. |
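The interaction between --undo_through and --steps can be modeled in a few lines. This is only an illustrative sketch of the behavior described in the table, not the engine's implementation; `plan_steps` and its arguments are hypothetical names.

```python
def plan_steps(workflow_steps, done, undo_through=None, steps=()):
    """Return the steps marked as done after an optional undo and redo.

    workflow_steps: ordered step names of the workflow.
    done: steps already applied to the document.
    """
    # Normalize the completed steps into workflow order.
    done = [s for s in workflow_steps if s in done]
    if undo_through is not None:
        if undo_through not in workflow_steps:
            raise ValueError("unknown step: %s" % undo_through)
        cut = workflow_steps.index(undo_through)
        # Undo the named step and every completed step that follows it.
        done = [s for s in done if workflow_steps.index(s) < cut]
    # New steps are applied only after the undo has finished.
    for s in steps:
        if s not in workflow_steps:
            raise ValueError("unknown step: %s" % s)
        if s not in done:
            done.append(s)
    return done

workflow = ["zone", "tokenize", "tag"]
# Rewind through tokenize, then tokenize again.
retokenized = plan_steps(workflow, ["zone", "tokenize"],
                         undo_through="tokenize", steps=["tokenize"])
# Undo only: tokenize and tag are both removed.
rewound = plan_steps(workflow, ["zone", "tokenize", "tag"],
                     undo_through="tokenize")
print(retokenized, rewound)
```

The point of the model is the ordering: the undo always completes before any step named in --steps is applied, which is what makes the rewind-and-reapply pattern work in a single invocation.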
The input options specify the input files. You can specify
individual files, or directories (possibly filtering their contents
using a regular expression). You must specify a file type. For raw
files, you can also specify an input character encoding.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--input_file <f> | | | The file to process. Either this or --input_dir must be specified. A single dash ('-') will cause the engine to read from standard input. |
--input_dir <d> | | | The directory to process. Either this or --input_file must be specified. |
--input_file_re <s> | | | If --input_dir is specified, a regular expression to match the filenames in the directory against. The pattern must cover the entire filename (and only the filename, not the full path). |
--input_encoding <e> | | | Input character encoding for raw files. Default is ascii. |
--input_file_type <t> | | | The file type of the input. One of the available readers and writers. Required. |
The output options specify how the result is saved. If you don't
specify any output options, the result will be ignored. You can specify
an output file for an input file, or an output directory and/or name
mapping for an input directory. You must also specify the output
format; usually, you'll want this to be one of the rich formats, but
"raw" is useful
in some rare circumstances. Finally, you can specify an output
character encoding for raw files.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--output_file <f> | | | Where to save the output. Optional. Must be paired with --input_file. A single dash ('-') will cause the engine to write to standard output. |
--output_dir <d> | | | Where to save the output. Optional. Must be paired with --input_dir. |
--output_fsuff <s> | | | The suffix to add to each filename when --output_dir is specified. If absent, the name of each file will be identical to the name of the file in the input directory. |
--output_file_type <t> | | | The type of the file to save. One of the available readers and writers. Required if either --output_file or --output_dir is specified. |
--output_encoding <e> | | | Output character encoding for raw files. Default is ascii. |
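The name mapping for directory processing amounts to the input filename plus the optional suffix, placed in the output directory. A minimal sketch of that mapping follows; `output_path` is a hypothetical helper written for illustration and is not part of MAT.

```python
import os

def output_path(input_path, output_dir, output_fsuff=""):
    """Map an input file to its output location.

    Mirrors the description of --output_dir and --output_fsuff: the output
    keeps the input filename and appends the suffix, if one is given.
    """
    name = os.path.basename(input_path) + output_fsuff
    return os.path.join(output_dir, name)

# Same names, different directory.
same_name = output_path("/path/to/my/documents/a.txt",
                        "/path/to/my/jsondocuments")
# Same directory, distinguished by an added suffix.
suffixed = output_path("/path/to/my/documents/a.txt",
                       "/path/to/my/documents", ".json")
print(same_name)
print(suffixed)
```

Note that the suffix is appended rather than substituted, so a.txt becomes a.txt.json, not a.json.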
The readers and writers described above may introduce additional
options, which are described in the documentation for the individual
readers and writers. These options must follow the input and output
options.
Finally, it's possible for individual step implementations to
contribute command-line arguments to MATEngine. These command-line
specifications override those found in the task.xml file. At the
moment, only steps having the MAT.JavaCarafe.CarafeTagStep (i.e.,
automated tagging with Carafe) implementation contribute command line
arguments. The general options for automated tagging are:
Command line option | XML attribute | Value | Description |
---|---|---|---|
--tagger_local | tagger_local | "yes" (XML) | Don't try to contact a remote tagger server; rather, start up a local command. |
--tagger_model <f> | tagger_model | string | Provide a tagger model file. Obligatory if no model is specified in the task step and no default model is present in the task. |
In addition, the Carafe tagger
provides other tagging options.
Let's say you have a task named "My Task", with a workflow named
"All" which contains steps "zone", "tokenize" and "tag" as in the sample task. In order to zone and tokenize
a raw document /path/to/my/document.txt and save the result to a MAT
JSON document:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json
Let's say you want to undo the tokenize step from the document above:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.notoks.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--undo_through "tokenize" --input_file c:\path\to\my\document.txt.json \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.notoks.json \
--output_file_type mat-json
Let's say you want to process the document as in example 1, but you
don't have any interest in saving the results (e.g., you're just
testing to see if it breaks):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw
Let's say you want to process the document as in example 1, and you
want to see the result, but you don't want to bother saving it to a
file:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file - --output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file - --output_file_type mat-json
Let's say you want to process the document as in example 1, but you
want to see the intermediate results:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json --print_steps 'zone,tokenize'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json --print_steps "zone,tokenize"
Let's say you have the output of example 1, but you want to
retokenize it. You can simultaneously specify the undo and redo steps:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --steps tokenize \
--input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.retoks.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--undo_through "tokenize" --steps tokenize \
--input_file c:\path\to\my\document.txt.json \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.retoks.json \
--output_file_type mat-json
Let's say you have a directory full of text files in
/path/to/my/documents which you want to process, and you want the
results to have the identical names, but in /path/to/my/jsondocuments:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/jsondocuments \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\jsondocuments \
--output_file_type mat-json
Let's say you want to process your documents as in example 7, but
you want to save them back to /path/to/my/documents, with an additional
suffix:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\documents \
--output_file_type mat-json --output_fsuff ".json"
Let's say you have a directory like the one that would be created in
example 8, with raw and MAT JSON documents intermixed. But all the
files you want to process end with ".txt":
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json' \
--input_file_re '.*[.]txt'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\documents \
--output_file_type mat-json --output_fsuff ".json" \
--input_file_re ".*[.]txt"
Note that the regular expression is a Python regular expression which
must match the entire filename. On Unix, enclose it in single quotes on
the command line to suppress any shell preprocessing; on Windows, use
double quotes as shown.
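To check a pattern before running the engine over a directory, the matching rule can be reproduced with Python's `re` module. The sketch below assumes the rule stated for --input_file_re (the pattern must cover the whole filename, without the directory part); `select_inputs` is a hypothetical helper, not MAT code.

```python
import os
import re

def select_inputs(paths, pattern):
    """Keep the paths whose filename is matched in full by the pattern."""
    rx = re.compile(pattern)
    # fullmatch requires the pattern to cover the entire filename;
    # basename drops the directory part before matching.
    return [p for p in paths if rx.fullmatch(os.path.basename(p))]

paths = [
    "/path/to/my/documents/a.txt",
    "/path/to/my/documents/a.txt.json",
    "/path/to/my/documents/notes.txt",
]
selected = select_inputs(paths, r".*[.]txt")
print(selected)
```

Because the whole filename has to match, a.txt.json is excluded even though it contains ".txt".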
Let's say your "tag" step in the "All" workflow is implemented as a
Carafe tag step. You can provide a Carafe model, and ensure that the
engine starts up Carafe itself rather than trying to contact MATWeb, as follows:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize,tag' --input_file /path/to/my/document.txt \
--tagger_local --tagger_model /path/to/my/model \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize,tag" --input_file c:\path\to\my\document.txt \
--tagger_local --tagger_model c:\path\to\my\model \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json
Note that this model overrides any model specified in the task file.
Like example 10, except on the output of example 1 (that is, zoning
and tokenization are already done):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'tag' --input_file /path/to/my/document.txt.json \
--tagger_local --tagger_model /path/to/my/model \
--input_file_type mat-json --output_file /path/to/my/document.txt.tagged.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "tag" --input_file c:\path\to\my\document.txt.json \
--tagger_local --tagger_model c:\path\to\my\model \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.tagged.json \
--output_file_type mat-json
Let's say that you have some XML documents which contain XML content
annotations, and you have an "align" step which will align the
annotation boundaries with token boundaries after you've tokenized the
document. Furthermore, you want all the tags which aren't names of
annotations in your task to be preserved in the signal:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' --input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--steps "zone,tokenize,align" --input_file c:\path\to\my\document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file c:\path\to\my\document.xml.json \
--output_file_type mat-json
Let's say that you want to do example 12 in two steps: first convert
to MAT JSON format, then process. To do the conversion, simply call
MATEngine without any steps:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document_unprocessed.xml.json \
--output_file_type mat-json
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' \
--input_file /path/to/my/document_unprocessed.xml.json \
--input_file_type mat-json \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--input_file c:\path\to\my\document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file c:\path\to\my\document_unprocessed.xml.json \
--output_file_type mat-json
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--steps "zone,tokenize,align" \
--input_file c:\path\to\my\document_unprocessed.xml.json \
--input_file_type mat-json \
--output_file c:\path\to\my\document.xml.json \
--output_file_type mat-json