For the most part, you can't do anything substantial with the MAT
toolkit without defining a task.
A task is a set of activities, called workflows, which can be broken down into steps.
Each task has a set of annotations
that its activities share; each step in a task can participate in
multiple workflows, and each step makes the "same" contribution to each
workflow it participates in. All these concepts are interrelated, and
it's difficult to discuss one without the others, but we'll try to
describe them in the most sensible order.
Annotations, in MAT, are labeled spans of a document. These labeled
spans may also have attributes and values associated with them, and
each annotation has a category. The semantics of the labels are defined
by the MAT toolkit and by the task. (Other tools and infrastructures
have a richer notion of annotation, where attribute values are typed
and annotations can refer to other annotations instead of text spans;
MAT doesn't yet address either of these enhancements.)
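As a rough illustration of this notion of annotation, the sketch below models a labeled span with a category and untyped attribute-value pairs. The `Annotation` class and its field names are hypothetical and are not MAT's own data structures; character offsets are assumed for the span boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labeled span of a document (hypothetical model, not MAT's class)."""
    label: str      # e.g. "PERSON"
    category: str   # e.g. "content", "token", "zone", "untaggable"
    start: int      # character offset where the span begins (assumed)
    end: int        # character offset just past the end of the span
    attrs: dict = field(default_factory=dict)  # untyped attribute-value pairs

    def text(self, document: str) -> str:
        """Return the stretch of the document that this annotation covers."""
        return document[self.start:self.end]


doc = "Alice visited Paris."
person = Annotation("PERSON", "content", 0, 5)
place = Annotation("LOCATION", "content", 14, 19)
```

Here `person.text(doc)` and `place.text(doc)` recover the annotated strings from the document, which is all a span-based annotation needs to identify its content.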
Let's take the sample task as an example. Its annotation labels are:
Annotation | Category (see below) | Use |
---|---|---|
lex | token | each lex tag delimits a non-whitespace token (the basic element used in the Carafe trainer and tagger) |
untaggable | untaggable | denotes a region where no content annotations will be found |
zone | zone | delimits a contiguous zone in which content annotations will be found |
PERSON | content | a proper name of a person |
LOCATION | content | a proper name of a location |
ORGANIZATION | content | a proper name of an organization |
The semantics of the content labels are defined by the task. The
sample task is about
annotation of named entities, and the three content annotation labels
shown here have a standardized interpretation in named entity
annotation, for which detailed guidelines have been developed about
when to assign each of these annotation labels. Good guidelines are
crucial for maximizing agreement among human annotators when preparing
gold-standard corpora. When you develop your own task, you should
develop similar guidelines; this skill is quite sophisticated, and is
outside the scope of this documentation.
The non-content categories above have semantics assigned by the MAT
toolkit. In some cases, these semantics are crucial for using MAT
correctly. The non-content categories (known as "structure categories")
can be inherited from the root task, which is what most tasks will do.
This is what you should do, because there are a number of implicit
dependencies on the names of these structure categories. We attempt to
document those dependencies here.
Zones correspond to large chunks of the document where annotations
can be found. If you're using Carafe, you must have a label named
"zone" of category "zone", and that label must have an attribute called
"region_type". Carafe is given a set of "region_type" values that tell
it where to look when training and tagging, and a zone is processed only
if its value is in that set. MAT, by default, uses "body" as the value
for "region_type".
The simplest zoning of a document is to assign a single zone of
region_type "body" which encompasses the whole document, and there is a
zone step available which does this. If you want to get more
sophisticated (e.g., exclude HTML or XML tags), you'll have to consult
the advanced topics.
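The simplest zoning described above can be sketched as follows. This is an illustration only: the function names and the dictionary layout are assumptions, not the behavior of MAT's zone step implementation.

```python
def whole_document_zone(document, region_type="body"):
    """Build one zone annotation spanning the entire document (illustrative)."""
    return {
        "label": "zone",
        "category": "zone",
        "start": 0,
        "end": len(document),
        "attrs": {"region_type": region_type},
    }


def zones_to_process(zones, region_types=("body",)):
    """Keep only the zones whose region_type is among the requested values."""
    return [z for z in zones if z["attrs"].get("region_type") in region_types]


document = "Alice visited Paris."
zones = [whole_document_zone(document)]
selected = zones_to_process(zones)
```

With the default "body" value on both sides, the single zone is selected; asking for a different region type would select nothing.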
Untaggable annotations are the complement of the zone annotations. You
have to have a label of this category, and at the moment, it must be
named "untaggable". MAT uses these annotations to rule out
possible hand annotations in the hand annotation tool. They're
assigned automatically by the infrastructure of the zoner, if you
inherit from the default zone class in your implementation of steps
(see below).
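To make the "complement" relationship concrete, here is a minimal sketch that derives untaggable regions from a list of zone spans. It assumes spans are `(start, end)` character offsets; the function name is hypothetical and this is not the zoner's actual code.

```python
def untaggable_regions(doc_length, zones):
    """Return the (start, end) gaps of the document not covered by any zone."""
    gaps = []
    cursor = 0  # first offset not yet known to be inside a zone
    for start, end in sorted(zones):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < doc_length:
        gaps.append((cursor, doc_length))
    return gaps


# A 30-character document whose only zone covers offsets 6 through 20.
partial = untaggable_regions(30, [(6, 20)])
# A single zone covering the whole document leaves nothing untaggable.
whole = untaggable_regions(10, [(0, 10)])
```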
Tokens correspond, pretty much, to words. In MAT, token annotations
are used as the basis for most computation. When you hand-annotate your
documents, you are encouraged to require that token annotations are
present, so that the MAT annotation tool can determine the possible
boundaries of your proposed annotation. This is because the Carafe
trainer and tagger both use tokens (not characters) as the "atoms" of
computation, and as a result, any annotation whose boundaries do not
coincide with token boundaries must be modified or discarded during the
training phase (because the element at the relevant edge doesn't
correspond to an "atom" as far as Carafe is concerned). Similarly, you
must use the same automatic tokenization process for training and
tagging, so that the tagger sees the same atoms the model was trained on.
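The boundary constraint can be illustrated with a small check. The whitespace-based tokenizer below is only a stand-in for the Carafe tokenizer, and both function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Stand-in tokenizer: one (start, end) span per run of non-whitespace."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_token_aligned(span, tokens):
    """True if the span starts at a token start and ends at a token end."""
    starts = {start for start, _ in tokens}
    ends = {end for _, end in tokens}
    return span[0] in starts and span[1] in ends


doc = "Alice visited Paris."
tokens = whitespace_tokens(doc)
aligned = is_token_aligned((0, 5), tokens)      # "Alice" matches a token
misaligned = is_token_aligned((14, 19), tokens)  # "Paris" stops before the "."
```

Under this stand-in tokenization, "Paris." is a single token, so an annotation covering only "Paris" ends in the middle of an atom. That is the kind of annotation that would have to be modified or discarded during training.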
Carafe is one of many trainable annotation tools which rely on
tokens as atoms; however,
most hand annotation tools don't try to ensure in advance that
annotation boundaries match token boundaries, and the users of such
tools have to make accommodations later in their workflows.
The MAT toolkit comes with a default English tokenizer, which we
describe below when we talk about steps. There's nothing special about
this tokenizer; you can replace it with your own, as
long as you use the same tokenizer throughout your task. If you inherit
your structure annotations from the core task, and
use the default tokenizer, you don't have to think about this any
further. If you don't, your tokenizer and task have to make sure of
several things:
Steps are atomic actions in your workflows. The most common type of
step adds a category of annotation; e.g., a tokenization step adds
token annotations. Each step has an implementation, written in Python.
We provide a number of useful step implementations which you can use in
your workflow. If you want to define your own steps, you'll have to
consult the advanced topics.
Here are the steps that MAT provides "out of the box":
Step implementation name | Description |
---|---|
MAT.PluginMgr.WholeZoneStep | This step assigns a single zone annotation with label "zone" and attribute "region_type" with value "body", to the entire document. |
MAT.JavaCarafe.CarafeTokenizationStep | This step runs the Carafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed. |
MAT.JavaCarafe.CarafeTagStep | This step runs the Carafe tagger, adding content tags to the document. |
MAT.PluginMgr.HandAnnotationTagStep | This step is a placeholder for hand annotation in those workflows that do not have automated content tagging. |
MAT.PluginMgr.AlignStep | This step is intended to work with documents which have been imported from other formats (e.g., XML inline), which have content annotations which may not align with token boundaries. This step aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected in the UI annotation tool (and, in fact, by many trainable tagging engines, including Carafe). Insert a step with this implementation in your workflows which are intended to manage imported documents. |
See the sample task for a detailed
example of how these are used.
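The expansion performed by an alignment step can be sketched as follows. This is an assumption-laden illustration, not the code of MAT.PluginMgr.AlignStep: it only handles boundaries that fall strictly inside a token, assumes tokens do not overlap, and the function name is hypothetical.

```python
def align_to_tokens(span, tokens):
    """Expand a (start, end) span outward to the enclosing token boundaries."""
    start, end = span
    for tok_start, tok_end in tokens:
        if tok_start < start < tok_end:  # span starts inside this token
            start = tok_start
        if tok_start < end < tok_end:    # span ends inside this token
            end = tok_end
    return (start, end)


# Token spans for "Alice visited Paris." split on whitespace.
tokens = [(0, 5), (6, 13), (14, 20)]
expanded = align_to_tokens((8, 17), tokens)   # starts and ends mid-token
unchanged = align_to_tokens((0, 5), tokens)   # already aligned
```

Expanding rather than shrinking guarantees that no originally annotated character is dropped, at the cost of occasionally pulling in adjacent punctuation.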
Any step implemented with the MAT.PluginMgr.HandAnnotationTagStep is special, in that the user performs it, not the MAT engine. Such a step is mostly a placeholder for your annotation activity. Behind the scenes, the document is marked as being tagged as soon as you insert the first tag, but the UI doesn't advance you past this step while the document is still open. When you reopen the document, the UI will show the step as having already been performed.
Steps can also take key-value pair arguments which can be specified
in the task.xml file or in the invocation of the MAT engine. At the
moment, both MAT.JavaCarafe.CarafeTagStep and
MAT.JavaCarafe.CarafeTokenizationStep take key-value pair arguments.
See the discussion of the Carafe engine for a description of these options.
Once you have a set of step implementations to draw from, you can
create mnemonic names for them and assemble them into workflows. Three
extremely common and obvious workflows are
If you create other, custom steps, you may have other workflows.
One quirk of the mnemonic names for steps is that they're global to
the task. The implementation of, say, "tokenize" can differ from
workflow to workflow, but when you apply different workflows to a
document, the document knows what's already been done by virtue of the
named steps that have been applied. So it's not a good idea for the
effect of step implementations to differ among workflows; the
implementations can provide different methods
for achieving the same effect (e.g., different automated taggers, or
hand vs. automated tagging), but they should not vary any further.
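The consequence of task-global step names can be shown with a small model. The workflow names, the dictionary layout, and the `pending_steps` helper are hypothetical; only the step implementation names come from the table of provided steps. The sketch assumes a document records the mnemonic names of the steps already applied to it.

```python
# Two workflows that share mnemonic step names but implement "tag" differently.
WORKFLOWS = {
    "automated": [
        ("zone", "MAT.PluginMgr.WholeZoneStep"),
        ("tokenize", "MAT.JavaCarafe.CarafeTokenizationStep"),
        ("tag", "MAT.JavaCarafe.CarafeTagStep"),
    ],
    "hand": [
        ("zone", "MAT.PluginMgr.WholeZoneStep"),
        ("tokenize", "MAT.JavaCarafe.CarafeTokenizationStep"),
        ("tag", "MAT.PluginMgr.HandAnnotationTagStep"),
    ],
}


def pending_steps(workflow, steps_done):
    """List the step names in a workflow that the document has not had yet."""
    return [name for name, _impl in WORKFLOWS[workflow] if name not in steps_done]


steps_done = {"zone", "tokenize"}            # recorded on the document
before = pending_steps("hand", steps_done)   # hand tagging is still to do
steps_done.add("tag")                        # the annotator tags by hand
after = pending_steps("automated", steps_done)
```

Once "tag" is recorded, the automated workflow also considers the document tagged, even though a different implementation did the work. This is why implementations sharing a name should have the same effect.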
In general, tasks provide a customization bundle for your use of
MAT. In this document, we've described the three most prominent
customizations: defining steps, defining workflows, and defining
annotations. There are many other things you can customize:
For relevant examples of these, please consult "The sample task", "Creating a new task", "Creating a new demo", and the
documentation for the task XML and the demo XML.