In Tutorial 2, you used a command-line tool to build a model. In this tutorial, we'll use the command-line engine to process documents using, among other things, the model you built. We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed. Like Tutorial 2, we're going to do this tutorial in file mode. And because this tutorial involves the command line, make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
In this tutorial, we're going to make use of the models we built in
Tutorial 2, and we're also going to use MATEngine.
First, let's review some of the arguments to MATEngine.
In a shell:
Unix:
% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity'
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity"
The --task directive is the first, and most important, directive.
Unless you only have one task installed, you'll always need it. But
--task alone isn't enough: the engine reports an error and prints its
usage message. Here's some of what you'll see (we've edited the
help message for this example; see the MATEngine page for examples and
complete documentation).
Error: workflow must be specified
Usage: MATEngine [options] ...
Named Entity :
  available workflows:
    Hand annotation : steps zone, tokenize, tag
    Review/repair : steps
    Demo : steps zone, tokenize, tag
Input options:
  --input_file=file   The file to process. Either this or --input_dir must
                      be specified. A single dash ('-') will cause the
                      engine to read from standard input.
  --input_dir=dir     The directory to process. Either this or --input_file
                      must be specified.
  --input_encoding=encoding
                      Input character encoding for raw files. Default is
                      ascii.
  --input_file_type=raw | mat-json
                      The file type of the input. Either raw (a raw file) or
                      mat-json (a rich JSON file produced as the output of
                      this engine or the annotation tool). Required.
Output options:
  --output_file=file  Where to save the output. Optional. Must be paired
                      with --input_file. A single dash ('-') will cause the
                      engine to write to standard output.
  --output_dir=dir    Where to save the output. Optional. Must be paired
                      with --input_dir.
  --output_fsuff=suffix
                      The suffix to add to each filename when --output_dir
                      is specified. If absent, the name of each file will be
                      identical to the name of the file in the input
                      directory.
  --output_file_type=raw | mat-json
                      The type of the file to save. Either raw (a raw file)
                      or mat-json (a rich JSON file). Required if either
                      --output_file or --output_dir is specified.
  --output_encoding=encoding
                      Output character encoding for raw files. Default is
                      ascii.
Task options:
  --workflow=workflow
                      The name of a workflow, as specified in some task.xml
                      file. Required if more than one workflow is available.
                      See above for available workflows.
  --steps=step,step,...
                      Some ordered subset of the steps in the specified
                      workflow. The steps should be concatenated with a
                      comma. See above for available steps.
  --undo_through=step
                      A step in the current workflow. All possible steps
                      already done in the document which follow this step
                      are undone, including this step, before any of the
                      steps in --steps are applied. You can use this flag in
                      conjunction with --steps to rewind and then reapply
                      operations.
The input and output
options should be self-explanatory. Raw files are read and written in
a character encoding, which defaults to ASCII if you don't provide one.
The input always requires a file type ("raw" or "mat-json"); the output
requires one whenever you save it with --output_file or --output_dir.
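The pairing rules in the help message are easy to get wrong when you assemble a command by hand, so here's a minimal sketch in Python that restates them as a check. It is not part of MAT: check_io_options is a hypothetical helper, the dictionary keys simply mirror the option names, and the rules are only those spelled out in the help text above.

```python
def check_io_options(opts):
    """Return a list of problems with a dict of MATEngine-style I/O options.

    Hypothetical helper: it restates the rules from the help message
    and never calls MAT itself.
    """
    problems = []
    has_file = "input_file" in opts
    has_dir = "input_dir" in opts

    # "Either this or --input_dir must be specified."
    if not (has_file or has_dir):
        problems.append("either --input_file or --input_dir must be specified")

    # The input file type is marked "Required".
    if "input_file_type" not in opts:
        problems.append("--input_file_type is required")

    # Output options must be paired with the matching input option.
    if "output_file" in opts and not has_file:
        problems.append("--output_file must be paired with --input_file")
    if "output_dir" in opts and not has_dir:
        problems.append("--output_dir must be paired with --input_dir")

    # The output file type is required as soon as output is saved.
    saves_output = "output_file" in opts or "output_dir" in opts
    if saves_output and "output_file_type" not in opts:
        problems.append("--output_file_type is required when output is saved")

    return problems


ok = {
    "input_file": "voa2.txt",
    "input_file_type": "raw",
    "output_file": "voa2_txt.json",
    "output_file_type": "mat-json",
}
bad = {"input_dir": "raw", "input_file_type": "raw", "output_file": "out.json"}

print(check_io_options(ok))
for problem in check_io_options(bad):
    print(problem)
```

The second dictionary mixes --input_dir with --output_file and omits the output file type, so the check reports both problems.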
At the top of the help message, you'll see a listing for the "Named
Entity" task, showing you the named workflows and the steps in each
workflow. The step is the basic unit, and steps are ordered in
workflows. In order to do anything with the MATEngine, you need to
specify a workflow and some set of steps. For now, that's all you need
to know; the documentation on tasks, workflows and steps provides more
detail, as does the documentation on the sample task.
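To make the relationship between workflows and steps concrete, here's a small sketch in Python. The table is copied from the help message above; parse_steps is a hypothetical helper, not a MAT function, and rejecting out-of-order or repeated steps is our reading of "ordered subset" rather than documented engine behavior.

```python
# Workflows and their ordered steps, copied from the help message.
WORKFLOWS = {
    "Hand annotation": ["zone", "tokenize", "tag"],
    "Review/repair": [],
    "Demo": ["zone", "tokenize", "tag"],
}


def parse_steps(workflow, steps_arg):
    """Split a --steps value and check it is an ordered subset of the workflow."""
    available = WORKFLOWS[workflow]
    steps = [s.strip() for s in steps_arg.split(",") if s.strip()]

    unknown = [s for s in steps if s not in available]
    if unknown:
        raise ValueError(f"unknown steps for {workflow!r}: {unknown}")

    # Assumption: steps must appear in workflow order, without repeats.
    positions = [available.index(s) for s in steps]
    if positions != sorted(set(positions)):
        raise ValueError("steps must follow the workflow order without repeats")
    return steps


print(parse_steps("Demo", "zone,tokenize"))
print(parse_steps("Demo", "tag"))
```

Calling parse_steps("Demo", "tokenize,zone") raises a ValueError in this sketch, because the steps are listed out of workflow order.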
Back in Tutorial 1,
we used the UI to prepare a document for hand tagging, because it was
less complex than using the command-line engine. Now, we'll show you
how to do it from the command line.
In order to prepare a document for tagging, you can use either the
"Demo" or the "Hand annotation" workflow in the Named Entity task (the
meanings of the workflows and steps may be different in other tasks).
In this task, the first two steps are the same, and have the same
realization; "zone" marks the appropriate taggable regions in the
document, and "tokenize" identifies the word units in the document
(because the annotation and training engine uses words as its basic
elements). Let's prepare our raw document voa2.txt:
Unix:
% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'zone,tokenize' \
--input_file $PWD/sample/ne/resources/data/raw/voa2.txt --input_file_type raw \
--output_file ./voa2_txt.json --output_file_type mat-json
zone : voa2.txt
tokenize : voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" \
--input_file %CD%\sample\ne\resources\data\raw\voa2.txt --input_file_type raw \
--output_file %CD%\voa2_txt.json --output_file_type mat-json
zone : voa2.txt
tokenize : voa2.txt
So what we did here was apply the zone and tokenize steps, in the
Demo workflow in the "Named Entity" task, to the raw input file
voa2.txt, saving the result as a rich annotated document voa2_txt.json.
Notice that the command reports which steps it's applying.
Note that we can apply multiple steps in a single command; the only
reason we're preparing the document separately from tagging it is for
illustration.
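If you drive the engine from a script, it can help to assemble the argument list in one place. Here's a minimal sketch in Python that builds a single invocation covering all three steps of the Demo workflow. engine_args is a hypothetical helper, not part of MAT, and it uses only flags that appear in this tutorial (--tagger_local is discussed further on). The sketch prints the command rather than running it.

```python
import shlex


def engine_args(task, workflow, steps, input_file, input_type,
                output_file, output_type, extra=()):
    """Build an argument list for bin/MATEngine from flags shown in this tutorial."""
    return [
        "bin/MATEngine",
        "--task", task,
        "--workflow", workflow,
        "--steps", ",".join(steps),  # steps are joined with commas
        "--input_file", input_file,
        "--input_file_type", input_type,
        "--output_file", output_file,
        "--output_file_type", output_type,
        *extra,
    ]


args = engine_args(
    "Named Entity", "Demo", ["zone", "tokenize", "tag"],
    "sample/ne/resources/data/raw/voa2.txt", "raw",
    "voa2_txt.json", "mat-json",
    extra=["--tagger_local"],
)

# shlex.join quotes arguments containing spaces, such as the task name.
print(shlex.join(args))
```

To run the command for real, you could pass the list to subprocess.run from $MAT_PKG_HOME on Unix; on Windows the executable would be bin\MATEngine.cmd. This assumes the default model from Tutorial 2 is available for the tag step.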
If you want to review this document, the easiest way is to load it
into the UI; it should be identical to the output of step 2 in
Tutorial 3. You can also examine it in your favorite editor, but it'll
be fairly difficult to read, even if you're familiar with the MAT JSON
annotated file format.
In the same workflow, we'll now perform the "tag" step on the file
we just created.
First, let's see what happens when we try to zone and tokenize the
document again:
Unix:
% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'zone,tokenize' \
--input_file ./voa2_txt.json --input_file_type mat-json --output_file ./voa2_txt.json \
--output_file_type mat-json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" \
--input_file %CD%\voa2_txt.json --input_file_type mat-json --output_file %CD%\voa2_txt.json \
--output_file_type mat-json
You'll notice that the engine reports nothing, because the input
annotated document records the steps already applied to it, and the
engine won't repeat them.
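The idea can be illustrated with a toy model in Python. This is only a sketch of the behavior you just observed, not MAT's implementation: the document is a plain dictionary, and the "steps_done" key is an illustrative name rather than the real MAT JSON schema.

```python
def apply_steps(doc, requested):
    """Apply each requested step unless the document already records it.

    Returns the list of steps actually applied, which is what a run
    would report.
    """
    applied = []
    for step in requested:
        if step in doc["steps_done"]:
            continue  # already recorded, so there is nothing to report
        doc["steps_done"].append(step)
        applied.append(step)
    return applied


doc = {"steps_done": []}
first_run = apply_steps(doc, ["zone", "tokenize"])
second_run = apply_steps(doc, ["zone", "tokenize"])
print(first_run)   # both steps are applied on the first run
print(second_run)  # the second run has nothing left to do
```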
Next, let's review some more of the command line options (again,
we've edited down the options for the purposes of this discussion):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task 'Named Entity' --workflow Demo
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "Named Entity" --workflow Demo
Usage: MATEngine [options] ...
Options for step 'tokenize' (workflows Hand annotation, Align, Demo):
  --heap_size=HEAP_SIZE
                      If present, specifies the -Xmx argument for the Java JVM
Options for step 'tag' (workflows Demo):
  See also --heap_size in Options for step 'tokenize' (workflows Hand annotation, Align, Demo)
  --tagging_pre_models=TAGGING_PRE_MODELS
                      if present, a comma-separated list of glob-style patterns
                      specifying the models to include as pre-taggers.
  --tagger_local      don't try to contact a remote tagger server; rather,
                      start up a local command.
  --tagger_model=TAGGER_MODEL
                      provide a tagger model file. Obligatory if no model is
                      specified in the task step.
  --prior_adjust=PRIOR_ADJUST
                      Bias the Carafe tagger to favor precision (positive
                      values) or recall (negative values). Default is -1.0
                      (slight recall bias). Practical range of values is
                      usually +-6.0.
We can control the "tag" step with the command line options shown
here. Right now, the option we're interested in is --tagger_local,
because we don't want the engine to try to contact the Web server to
tag the document. In this step, we're going to take advantage of the
fact that we built a default model in Tutorial
2.
Unix:
% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'tag' --input_file ./voa2_txt.json \
--input_file_type mat-json --output_file ./voa2_txt.json --output_file_type mat-json \
--tagger_local
tag : voa2_txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" --input_file %CD%\voa2_txt.json \
--input_file_type mat-json --output_file %CD%\voa2_txt.json --output_file_type mat-json \
--tagger_local
tag : voa2_txt.json
Notice that it reports that the tag step is performed. If you try to
repeat this command, you'll see that nothing happens, because the
document "knows" it's been tagged.
If you load this document into the UI, you'll see that it looks
identical to the output of step 3 in Tutorial
3.
You can undo steps and redo them in the same command. Let's say, for
instance, you want to redo tagging, as in step 4 in Tutorial 3. You can use the --undo_through
directive to achieve this. In addition, we're going to use the other
model you built, in step 1 of Tutorial 2.
Unix:
% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'tag' --input_file ./voa2_txt.json \
--input_file_type mat-json --output_file ./voa2_txt.json --output_file_type mat-json \
--tagger_local --tagger_model /tmp/ne_model --undo_through tag
tag : voa2_txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" --input_file %CD%\voa2_txt.json \
--input_file_type mat-json --output_file %CD%\voa2_txt.json --output_file_type mat-json \
--tagger_local --tagger_model %TMP%\ne_model --undo_through tag
tag : voa2_txt.json
The --tagger_model directive allows us to specify an explicit model
to use, and the --undo_through directive undoes the step listed along
with any completed steps that follow it. You'll notice that if you omit
--undo_through, nothing will happen (because the document is already
tagged), but with --undo_through, the document is tagged again (because
--undo_through happens before --steps).
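The ordering of the two directives can be sketched as a toy planner in Python. It follows the wording of the --undo_through help text, but it is an illustration under our own assumptions, not the engine's code; plan_run and its return values are hypothetical.

```python
WORKFLOW = ["zone", "tokenize", "tag"]  # Demo workflow order from the help message


def plan_run(done, steps, undo_through=None):
    """Return (undone, applied, done_after) for one engine invocation.

    done: steps already recorded in the document
    steps: the value of --steps, as a list
    undo_through: the value of --undo_through, or None
    """
    done = [s for s in WORKFLOW if s in done]
    undone = []
    if undo_through is not None:
        cut = WORKFLOW.index(undo_through)
        # Undo the named step and every completed step after it,
        # before any of the requested steps are applied.
        undone = [s for s in done if WORKFLOW.index(s) >= cut]
        done = [s for s in done if WORKFLOW.index(s) < cut]
    # Steps still recorded as done are skipped.
    applied = [s for s in steps if s not in done]
    done_after = [s for s in WORKFLOW if s in done or s in applied]
    return undone, applied, done_after


tagged = ["zone", "tokenize", "tag"]
print(plan_run(tagged, ["tag"]))
print(plan_run(tagged, ["tag"], undo_through="tag"))
```

Without undo_through, the plan applies nothing because tag is already recorded. With it, tag is first undone and then applied again, which matches the two outcomes described above.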
Recall that we have a version of this file which has already been
tagged. We can treat that version as the reference file, and this
version we just tagged as the hypothesis file, and run the scoring tool:
Unix:
% cd $MAT_PKG_HOME
% bin/MATScore --file ./voa2_txt.json --ref_file ./sample/ne/resources/data/json/voa2.txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATScore.cmd --file %CD%\voa2_txt.json --ref_file %CD%\sample\ne\resources\data\json\voa2.txt.json
The scorer will print a table to standard output describing the
precision, recall, and F-measure at the tag level for this file
comparison. The scorer has a large range of options; see the
documentation for MATScore for details and
examples.
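If you haven't worked with these measures before, here's how they relate to each other, as a short Python sketch. The counts are invented purely for illustration and are not output from MATScore; the scorer applies its own criteria for deciding when a hypothesis annotation matches a reference annotation, and we're assuming the balanced F-measure here.

```python
def prf(match, ref_total, hyp_total):
    """Precision, recall and balanced F-measure from raw counts.

    match: hypothesis annotations that agree with the reference
    ref_total: annotations in the reference document
    hyp_total: annotations in the hypothesis document
    """
    precision = match / hyp_total if hyp_total else 0.0
    recall = match / ref_total if ref_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Invented counts: 16 tags in the hypothesis, 10 in the reference, 8 in agreement.
p, r, f = prf(match=8, ref_total=10, hyp_total=16)
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

Precision falls when the tagger proposes tags that aren't in the reference, and recall falls when it misses tags that are; the F-measure balances the two.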
If you're not planning on doing any other tutorials, and you don't
want the "Named Entity" task hanging around, remove it as shown in the
final step of Tutorial 1.