For the most part, you can't do anything substantial with the MAT
toolkit without defining a task.
A task is a set of activities, called workflows, which can be broken down into steps. Each task has a set of
annotations that its
activities share; each step in a task can participate in multiple
workflows, and each step makes the "same" contribution to each
workflow it participates in. All these concepts are interrelated,
and it's difficult to discuss one without the others, but we'll try
to describe them in the most sensible order.
Steps are atomic actions in your workflows. The most common type
of step adds a category of annotation; e.g., a tokenization step
adds token annotations. Each step has an implementation (either
self-contained, or a wrapper for an external tool) which is a
Python class. We provide a number of useful step implementations
which you can use in your workflow. If you want to define your own
steps, you'll have to consult the advanced topics.
Here are the step implementations, along with their common step
names, that MAT provides "out of the box":
Step implementation name | common step name | Description |
---|---|---|
MAT.PluginMgr.WholeZoneStep | zone | This step assigns a single zone annotation, with label "zone" and attribute "region_type" with value "body", to the entire document. This step also adds administrative SEGMENT annotations to track annotation progress. The options for this step are described immediately below. |
MAT.JavaCarafe.CarafeTokenizationStep | tokenize | This step runs the Carafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed. The options for this step are described here. |
MAT.JavaCarafe.CarafeTagStep | tag | This step runs the Carafe tagger, adding content tags to the document. The options for this step are described here. |
MAT.PluginMgr.TagStep | hand tag | This step is the parent of all tag steps. It serves as a placeholder implementation for hand annotation in those workflows that do not have automated content tagging, and to implement "undo" in automated tag steps. Any steps with this implementation must be designated by_hand="yes" in the task.xml file. This step has no available options. |
MAT.PluginMgr.AlignStep | align | This step is intended to work with documents which have been imported from other formats (e.g., XML inline), and which have content annotations which may not align with token boundaries. This step aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected in the UI annotation tool (and, in fact, by many trainable tagging engines, including Carafe). Insert a step with this implementation in your workflows which are intended to manage imported documents. This step has no available options. |
See the sample 'Named Entity' task
for a detailed example of how these steps are used in workflows.
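To make the behavior of the align step more concrete, here is a minimal sketch of boundary expansion in plain Python. It is not the MAT.PluginMgr.AlignStep implementation; the function name `align_to_tokens` and the representation of annotations and tokens as `(start, end)` character offsets are assumptions made for illustration only.

```python
def align_to_tokens(span, tokens):
    """Expand a (start, end) span outward to the enclosing token boundaries.

    `tokens` is a list of (start, end) character offsets, one per token.
    This is an illustrative sketch, not MAT's own alignment code.
    """
    start, end = span
    new_start, new_end = start, end
    for tok_start, tok_end in tokens:
        # The span starts strictly inside this token: move left to its start.
        if tok_start < start < tok_end:
            new_start = tok_start
        # The span ends strictly inside this token: move right to its end.
        if tok_start < end < tok_end:
            new_end = tok_end
    return new_start, new_end


signal = "New York City is big"
tokens = [(0, 3), (4, 8), (9, 13), (14, 16), (17, 20)]

# An imported annotation that cuts through "New" and "City".
aligned = align_to_tokens((2, 11), tokens)
print(signal[aligned[0]:aligned[1]])
```

In this sketch, an offset that already sits on a token boundary, or that falls in the whitespace between tokens, is left untouched; a real alignment step may well treat those cases differently.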
The options these step implementations accept can be specified
in the task.xml file or in the invocation of the MAT engine. The
one general-purpose step which has options is
MAT.PluginMgr.WholeZoneStep:
Command line option | XML attribute | Value | Description |
---|---|---|---|
--mark_gold | mark_gold | "yes" (XML) | If present, mark the document segments as gold-standard data (annotator = "GOLD_STANDARD", status = "reconciled") |
The UI also makes available a separate "mark gold" step, which
has no backend implementation.
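The effect of the zone step and its mark_gold option can be sketched roughly as follows. This is a toy model, not the MAT.PluginMgr.WholeZoneStep code: the dictionary-based document, the function names, the assumption that the SEGMENT annotation covers the same region as the zone, and the placeholder values used when mark_gold is absent are all assumptions made for illustration. Only the label "zone", the "region_type" attribute, and the gold-standard annotator and status values come from the description above.

```python
def new_doc(signal):
    """Build a toy document: the text plus an empty annotation list."""
    return {"signal": signal, "annotations": []}


def whole_zone_step(doc, mark_gold=False):
    """Add one zone over the whole document, plus one SEGMENT annotation."""
    end = len(doc["signal"])
    doc["annotations"].append({
        "label": "zone",
        "start": 0,
        "end": end,
        "attrs": {"region_type": "body"},
    })
    if mark_gold:
        # Values taken from the mark_gold description in the table above.
        seg_attrs = {"annotator": "GOLD_STANDARD", "status": "reconciled"}
    else:
        # Placeholder values; what MAT records here is an assumption.
        seg_attrs = {"annotator": None, "status": "non-gold"}
    doc["annotations"].append({
        "label": "SEGMENT",
        "start": 0,
        "end": end,
        "attrs": seg_attrs,
    })
    return doc


plain_doc = whole_zone_step(new_doc("Hello world."))
gold_doc = whole_zone_step(new_doc("Hello world."), mark_gold=True)
print(gold_doc["annotations"][1]["attrs"])
```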
Once you have a set of step implementations to draw from, you can
create mnemonic names for them and assemble them into workflows.
Four extremely common and obvious workflows are
If you create other, custom steps, you may have other workflows.
One quirk of the mnemonic names for steps is that they're global
to the task. The implementation of, say, "tokenize" can differ
from workflow to workflow, but when you apply different workflows
to a document, the document knows what's already been done by
virtue of the named steps that have been applied. So it's not a
good idea for the effect of step implementations to differ among
workflows. The implementations can provide different methods for achieving the same
effect (e.g., different automated taggers, or hand vs. automated
tagging), but they should not vary any further; the tags which are
added by any implementation of a named step should be the same.
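The reason step names must mean the same thing everywhere can be illustrated with a small model of how a document remembers what has been done to it. The names `apply_workflow` and `steps_done`, and the representation of a workflow as a list of (step name, implementation) pairs, are hypothetical and do not reflect MAT's actual engine or document format.

```python
def apply_workflow(doc, steps):
    """Run each (name, implementation) pair, skipping names already recorded.

    Returns the list of step names that were actually executed.
    """
    executed = []
    for name, implementation in steps:
        if name in doc["steps_done"]:
            # The document only knows the step *name*, not which
            # implementation produced the existing annotations.
            continue
        implementation(doc)
        doc["steps_done"].add(name)
        executed.append(name)
    return executed


def auto_tag(doc):
    doc["annotations"].append("tags from an automated tagger")


def hand_tag(doc):
    doc["annotations"].append("tags from a human annotator")


# Two workflows that share the step name "tag" but implement it differently.
automated_workflow = [("tag", auto_tag)]
hand_workflow = [("tag", hand_tag)]

doc = {"annotations": [], "steps_done": set()}
first_run = apply_workflow(doc, automated_workflow)
second_run = apply_workflow(doc, hand_workflow)
print(first_run, second_run)
```

In this model, the second workflow does nothing, because "tag" is already recorded on the document. That is harmless only if both implementations add the same kind of annotations, which is the constraint described above.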
In general, tasks provide a customization bundle for your use of
MAT. In this document, we've described two of the most prominent
customizations: defining steps and defining workflows (we've
discussed annotations
elsewhere). There are many other things you can customize:
For relevant examples of these, please consult "The sample tasks", "Creating a new task", "Creating a new demo", and the
documentation for the task XML and
the demo XML.