The basic unit of document enrichment in MAT is the annotation. There are two
types of annotations in MAT: span
annotations and spanless
annotations. Span annotations are anchored to a
particular contiguous span in the document, and make some implicit
assertion about it; e.g., the span from character 10 to character
15 is a noun phrase. Spanless annotations are not anchored to a
span, and are used to make assertions about the entire document,
or to make assertions about other annotations; e.g., annotation 1
and annotation 2 refer to the same entity.
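To make the distinction concrete, here's a minimal sketch in plain Python (an illustrative model only; the class names and labels are invented, not the MAT document API):

```python
from dataclasses import dataclass, field

@dataclass
class SpanAnnotation:
    """Anchored to a contiguous character span; asserts something about it."""
    label: str
    start: int          # character offset, inclusive
    end: int            # character offset, exclusive
    attrs: dict = field(default_factory=dict)

@dataclass
class SpanlessAnnotation:
    """Anchored to nothing; asserts something about the whole document
    or about other annotations."""
    label: str
    attrs: dict = field(default_factory=dict)

# "The span from character 10 to character 15 is a noun phrase":
np1 = SpanAnnotation("NP", 10, 15)

# "Annotation 1 and annotation 2 refer to the same entity":
np2 = SpanAnnotation("NP", 40, 45)
coref = SpanlessAnnotation("COREF", {"refs": [np1, np2]})
```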
The task specification for each task contains one or more annotation set descriptors, which
define the annotations available in the task. In MAT 2.0, there's
very little reason, if any, to have multiple annotation set
descriptors defined; however, upcoming versions of MAT will
exploit this capability extensively.
In MAT, annotations have labels, and may have attributes and
values associated with them. Each attribute has a type and
an aggregation.
The types are string, int, float, boolean, and annotation (an attribute whose value is another annotation). See here
for examples of defining attributes of various types.
It is possible for any attribute to have a null value (Java and JavaScript null, Python None), or not to be present at all. These two conditions are intended to be equivalent, and the various document libraries mostly treat them that way; however, there may be cases in Python where accessing an attribute that has never been set raises a KeyError.
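A plain-Python sketch of the distinction, with an ordinary dict standing in for annotation attribute storage (an assumption about how the Python library behaves internally):

```python
attrs = {"type": None}        # "type" is present with a null value;
                              # "comment" has never been set

# Both cases look identical under .get():
assert attrs.get("type") is None
assert attrs.get("comment") is None

# ... but direct indexing tells them apart, which is where the
# KeyError mentioned above can come from:
assert attrs["type"] is None
try:
    attrs["comment"]
except KeyError:
    print("never-set attribute raised KeyError")
```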
The aggregations are none (a single value), set, and list.
MAT does not currently limit the possible combinations of these
types and aggregations, even if some of them are nonsensical
(e.g., a set of booleans). However, the annotation UI has not been
enriched with the ability to edit all 15 possible combinations;
some of the less common ones (e.g., sets of strings which are not
restricted by a choice list) have not been implemented yet. We're
committed to supporting all of them, eventually.
MAT also supports the notion of effective label. An
effective label is a notional label (e.g., "PERSON") which is
presented to the user in its notional form, but is implemented as
an annotation label plus an attribute value of a string attribute
which is specified to have an exhaustive list of choices. Labels
of this sort are also known as "ENAMEX-style" labels, after the
MUC convention of defining "PERSON" as "ENAMEX" + type="PER", and
"ORGANIZATION" and "LOCATION" analogously.
Simple span annotations are span annotations which have
either no attributes, or a single attribute/value pair which
defines the effective label. This notion is important because it
is the class of annotations which the Carafe engine can build models
for. While it's possible to use more complex engines with MAT (we
don't provide details on how to do it, but it's possible), MAT
"out of the box" has more limited capabilities for annotations
which are more complex than this (i.e., spanless annotations, or
annotations with more attributes). In these cases, Carafe will
happily build and apply models for the simple span subset of the
complex annotations it's given, but this capability isn't enough
to support the full tag-a-little,
learn-a-little loop, or the experiment harness.
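In code terms, the definition amounts to a small predicate (a sketch under the assumption that attributes are exposed as a dict; this is not Carafe's actual check):

```python
def is_simple_span(is_span, attrs, effective_label_attr="type"):
    """True iff the annotation is a span annotation with no attributes,
    or with exactly one attribute/value pair defining its effective
    label (e.g. ENAMEX + type="PER")."""
    if not is_span:
        return False                     # spanless annotations never qualify
    if not attrs:
        return True                      # bare label, e.g. a plain span tag
    return len(attrs) == 1 and effective_label_attr in attrs

assert is_simple_span(True, {})                               # no attributes
assert is_simple_span(True, {"type": "PER"})                  # effective label only
assert not is_simple_span(True, {"type": "PER", "id": 7})     # too many attributes
assert not is_simple_span(False, {})                          # spanless
```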
Here's a complete list of these limitations:
Annotation categories are one way of notifying MAT about the role
of the various annotations in the MAT task. Annotations in the
"content" category are, in some sense, what the task is about;
these are the annotations you'll be adding by hand, or using the Carafe engine to add
automatically.
The other categories above have semantics assigned by MAT, and these
semantics are crucial for using MAT correctly. The annotations in
the "token" and "zone" categories (known as "structure
categories") can be inherited from the root task, which is what
most tasks should do, since it's unusual to need to define your
own structural annotations. We attempt to document MAT's
dependencies on these categories below. The annotations in the
"admin" category are special; all tasks have these annotations.
The semantics of the content labels are defined by the task. For
instance, the sample task is about annotation of named entities,
and the three content annotation labels shown below have a
standardized interpretation in named entity annotation, for which
detailed guidelines have been developed about when to assign each
of these labels. You may not add annotations named
SEGMENT or VOTE, since these are reserved names of admin
annotations.
Good guidelines are crucial for maximizing agreement among human
annotators when preparing gold-standard corpora. When you develop
your own task, you should develop similar guidelines; this skill
is quite sophisticated, and is outside the scope of this
documentation.
Zones correspond to large chunks of the document where annotations can be found. The default zone annotation in MAT is the "zone" label, which has an attribute called "region_type", whose value is typically "body". (It's possible for the Carafe engine to use the zone attributes as a training feature, but we don't use that capability at the moment.) The implementation of your task provides the Carafe engine with information about the default zone annotation; you can change this implementation (and you must if you have a custom zone annotation) as described in the advanced topics.
The simplest zoning of a document is to assign a single zone of
region_type "body" which encompasses the whole document, and there
is a zone step available which does this. If you want to get more
sophisticated (e.g., exclude HTML or XML tags), you'll have to
consult the advanced
topics.
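Under that simplest scheme, the zone step's output amounts to something like this (an illustrative sketch; real MAT zone annotations are created through the task machinery, and the tuple representation here is invented):

```python
def zone_whole_document(doc_text):
    """One zone of region_type "body" spanning the entire document,
    which is what the simple zone step produces. Returns a
    (start, end, attrs) tuple rather than a real MAT annotation."""
    return [(0, len(doc_text), {"region_type": "body"})]

print(zone_whole_document("Some raw document text."))
# [(0, 23, {'region_type': 'body'})]
```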
Zone tags are also used in the UI to induce the creation of
untaggable regions, which are regions that have no zone
annotation. These regions are indicated visually by graying out
the document text. If the document has any zone annotations at
all, the UI will create these regions.
Tokens correspond, pretty much, to words. In MAT, token
annotations are used as the basis for most computation. When you
hand-annotate your documents, you are encouraged to require that
token annotations are present, so that the MAT annotation tool can
determine the possible boundaries of your proposed annotation.
This is because the Carafe trainer and tagger both use tokens (not
characters) as the "atoms" of computation, and as a result, any
annotation whose boundaries do not coincide with token boundaries
must be modified or discarded during the training phase (because
the element at the relevant edge doesn't correspond to an "atom"
as far as Carafe is concerned). Similarly, you must use the same
automatic tokenization process for training and tagging, for
obvious reasons.
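To see why alignment matters, here's a sketch of the kind of boundary adjustment that misaligned annotations force (an illustration of the general technique, not Carafe's actual algorithm, which may modify or discard such annotations differently):

```python
def snap_to_tokens(start, end, tokens):
    """Adjust an annotation span outward to the nearest token boundaries.
    `tokens` is a sorted list of (start, end) character spans. Returns
    None if the span overlaps no token, meaning the annotation would
    have to be discarded."""
    overlapping = [(s, e) for (s, e) in tokens if s < end and e > start]
    if not overlapping:
        return None
    return overlapping[0][0], overlapping[-1][1]

tokens = [(0, 3), (4, 9), (10, 15)]
print(snap_to_tokens(5, 12, tokens))   # (4, 15): edges snapped to token boundaries
print(snap_to_tokens(3, 4, tokens))    # None: whitespace only, no token overlap
```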
Carafe is one of many trainable annotation tools which rely on
tokens as atoms; however, most hand annotation tools don't try to
ensure in advance that annotation boundaries match token
boundaries, and the users of such tools have to make
accommodations later in their workflows.
The MAT toolkit comes with a default English tokenizer, which we
describe when we talk about steps.
There's nothing special about this tokenizer; you can replace it
with your own, as long as you use the same tokenizer throughout
your task. If you inherit your structure annotations from the core
task, and use the default tokenizer, you don't have to think about
this any further. If you don't, your tokenizer and task have to
ensure several things:
As an example, let's take a look at the annotations in the sample
'Named Entity' task.
| Annotation label | Category | Use |
|---|---|---|
| lex | token | each lex tag delimits a non-whitespace token (the basic element used in the Carafe trainer and tagger) |
| zone | zone | delimits a contiguous zone in which content annotations will be found |
| PERSON | content | a proper name of a person |
| LOCATION | content | a proper name of a location |
| ORGANIZATION | content | a proper name of an organization |
| SEGMENT | admin | administrative info about the progress of annotation (present in all tasks) |
| VOTE | admin | administrative info about reconciliation (present in all tasks) |
The final category of annotations, admin annotations, is crucial to the operation of
the 2.0 version of MAT. However, these annotations work behind the
scenes for you. If you don't care about the internal bookkeeping that
MAT uses, you may skip this section; all you really need to know
is that using the two admin annotations, SEGMENT and VOTE, MAT can
keep detailed track of how the annotation of various portions of a
document is progressing.
These admin annotations play a much smaller role in MAT 2.0
than we originally anticipated. We had originally expected MAT
2.0 to contain support for partial hand annotation and
correction (e.g., annotating only certain regions of documents),
for active learning, and for reconciliation workflows in
workspaces. However, our exploration of active learning has
convinced us that there's no evidence that it reduces annotator
time for document-oriented annotation tasks, and our extensions
for spanless annotation have forced us to reconsider our
original strategy for reconciliation. We've issued a
reconciliation tool for simple span tasks with MAT 2.0, but this
tool will be replaced in the next version of MAT with a more
general-purpose reconciliation tool based on the new comparison tool and the new scorer. The new reconciliation
workflows will still require SEGMENTs and VOTEs, but they'll
look different than they do now.
This section should be regarded as applying exclusively to MAT
2.0, with considerable future modifications planned.
There are currently two admin annotations: SEGMENT, which records
information about the progress of annotation, and VOTE, which is
used specifically during reconciliation.
Reconciliation is an advanced topic, and we'll talk about VOTE
annotations when we talk about reconciliation. Here, we describe
the role of the SEGMENT annotation.
The SEGMENT is a span annotation. It has two attributes,
"annotator" and "status". These attributes occur in the following
configurations of values:
| Value of the annotator attribute | Value of the status attribute | What it means |
|---|---|---|
| (null) | "non-gold" | The segment is untouched. No human or automated annotator has modified it. |
| "MACHINE" | "non-gold" | The segment has been annotated by an automated tool, but no human has marked it as completed. |
| (a user's name) | "non-gold" | A user "owns" this segment and has modified it in some way, but is not prepared to mark it as completed. If a user corrects a segment which an automated tool has annotated, the user now "owns" the segment. |
| (a user's name) | "human gold" | A user "owns" the segment and has marked it as complete. |
| (a user's name) | "reconciled" | A user "owns" the segment and has marked it complete, and the segment has been vetted by the reconciliation process. |
(Reconciliation adds a few more attributes to the SEGMENT annotation. We'll discuss those later.)
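Read as code, the table above reduces to a small decision function (a sketch of the semantics only, not MAT's implementation):

```python
def describe_segment(annotator, status):
    """Map a SEGMENT's (annotator, status) pair to the description
    in the table above."""
    if status == "reconciled":
        return "complete, and vetted by reconciliation"
    if status == "human gold":
        return "marked complete by " + annotator
    # status == "non-gold"
    if annotator is None:
        return "untouched"
    if annotator == "MACHINE":
        return "machine-annotated, not yet corrected by a human"
    return "modified by " + annotator + ", not yet marked complete"

print(describe_segment("MACHINE", "non-gold"))
print(describe_segment("jdoe", "human gold"))
```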
It's possible, then, to have partially annotated documents with
multiple segments, some gold and some not. The 2.0 version of MAT
does not yet expose this possibility in its UI (except during the
reconciliation process), but the infrastructure to represent it
is already there. These attribute values can be rolled up,
conceptually, to represent documents in a variety of states of
completion. The resulting document status values are used
extensively in workspaces.
"reconciled" |
documents all of whose SEGMENTs have status = "reconciled" |
"gold" |
documents all of whose SEGMENTS have status = "reconciled" or status = "human gold" (and at least one "human gold" SEGMENT) |
"partially gold" |
documents some (but not all) of whose SEGMENTs have status = "reconciled" or status = "human gold" |
"uncorrected" |
documents all of whose SEGMENTS have annotator = "MACHINE" |
"partially corrected" |
documents some of whose SEGMENTs have annotator != "MACHINE" and annotator != null (i.e., they've been touched by a human annotator) |
"unannotated" |
documents which have no content annotations in any segment and no SEGMENTs which are owned by any annotator |
All the relevant MAT tools are aware of the SEGMENT annotations.
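Here's a sketch of how the rollup in the table above might be computed (illustrative only, not the workspace implementation; the final catch-all case is our addition):

```python
def document_status(segments, has_content_annotations):
    """segments: the (annotator, status) pairs of a document's SEGMENTs;
    has_content_annotations: whether the document has any content
    annotations."""
    statuses = {status for _, status in segments}
    annotators = {annotator for annotator, _ in segments}
    if statuses == {"reconciled"}:
        return "reconciled"
    if statuses <= {"reconciled", "human gold"} and "human gold" in statuses:
        return "gold"
    if statuses & {"reconciled", "human gold"}:
        return "partially gold"
    if annotators - {None, "MACHINE"}:
        return "partially corrected"     # at least one human has touched a segment
    if annotators == {"MACHINE"}:
        return "uncorrected"
    if not has_content_annotations:
        return "unannotated"
    return "non-gold"                    # catch-all, not in the table above

print(document_status([("MACHINE", "non-gold")], True))    # uncorrected
print(document_status([("jdoe", "human gold")], True))     # gold
print(document_status([(None, "non-gold")], False))        # unannotated
```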