Annotations and annotation progress

Annotations

The basic unit of document enrichment in MAT is the annotation. There are two types of annotations in MAT: span annotations and spanless annotations. Span annotations are anchored to a particular contiguous span in the document, and make some implicit assertion about it; e.g., the span from character 10 to character 15 is a noun phrase. Spanless annotations are not anchored to a span, and are used to make assertions about the entire document, or to make assertions about other annotations; e.g., annotation 1 and annotation 2 refer to the same entity.
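For concreteness, here's a tiny Python sketch of the distinction (this is an illustrative representation, not the actual MAT document model):

    # Illustrative representation, not the actual MAT document model.
    # A span annotation is anchored to character offsets in the signal;
    # a spanless annotation refers to other annotations instead.

    np1 = {"label": "NP", "start": 10, "end": 15}     # span annotation
    np2 = {"label": "NP", "start": 40, "end": 52}     # span annotation

    # Spanless annotation: asserts that np1 and np2 refer to the same entity.
    coref = {"label": "COREF", "mentions": [np1, np2]}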

The task specification for each task contains one or more annotation set descriptors, which define the annotations available in the task. In MAT 2.0, there's very little reason, if any, to have multiple annotation set descriptors defined; however, upcoming versions of MAT will exploit this capability extensively.

Annotation attributes and values

In MAT, annotations have labels, and may have attributes and values associated with them. Each attribute has a type and an aggregation.

The types are: string, int, float, boolean, and annotation (i.e., a reference to another annotation).

See here for examples of defining attributes of various types.

It is possible for any attribute to have a null value (Java and JavaScript null, Python None), or not to be present at all. These conditions are intended to be equivalent, and the various document libraries mostly treat them that way; however, there may be times in Python where accessing an attribute which has never been set will raise a KeyError.
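Here's a minimal Python sketch of the intended equivalence, using a plain dict to stand in for an annotation's attributes (an illustrative assumption, not the MAT API):

    # A never-set attribute and an attribute explicitly set to None should
    # read the same way. dict.get() returns None in both cases; direct
    # indexing (attrs["comment"]) would raise KeyError when the key is absent.

    a1 = {"type": "PER"}                     # "comment" never set
    a2 = {"type": "PER", "comment": None}    # "comment" explicitly null

    assert a1.get("comment") is None
    assert a2.get("comment") is None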

The aggregations are: none (a single value), set, and list.

MAT does not currently limit the possible combinations of these types and aggregations, even if some of them are nonsensical (e.g., a set of booleans). However, the annotation UI has not been enriched with the ability to edit all 15 possible combinations; some of the less common ones (e.g., sets of strings which are not restricted by a choice list) have not been implemented yet. We're committed to supporting all of them, eventually.

MAT also supports the notion of effective label. An effective label is a notional label (e.g., "PERSON") which is presented to the user in its notional form, but is implemented as an annotation label plus an attribute value of a string attribute which is specified to have an exhaustive list of choices. Labels of this sort are also known as "ENAMEX-style" labels, after the MUC convention of defining "PERSON" as "ENAMEX" + type="PER", and "ORGANIZATION" and "LOCATION" analogously.
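As an illustration, here's a hedged Python sketch of how an effective label can be computed from a true label plus a choice-restricted string attribute (the helper and the display table are hypothetical, not part of MAT):

    # Hypothetical sketch: computing an "ENAMEX-style" effective label.
    # The true label is "ENAMEX"; "type" is a string attribute with an
    # exhaustive list of choices.

    DISPLAY = {"PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"}

    def effective_label(label, attrs):
        # Present ENAMEX + type="PER" to the user as "PERSON", and so on;
        # any other annotation is presented under its true label.
        if label == "ENAMEX" and attrs.get("type") in DISPLAY:
            return DISPLAY[attrs["type"]]
        return label

    assert effective_label("ENAMEX", {"type": "PER"}) == "PERSON"
    assert effective_label("zone", {"region_type": "body"}) == "zone"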

Simple span annotations vs. complex annotations

Simple span annotations are span annotations which have either no attributes, or a single attribute/value pair which defines the effective label. This notion is important because it is the class of annotations for which the Carafe engine can build models. While it's possible to use more complex engines with MAT (we don't provide details on how to do that here), MAT "out of the box" has more limited capabilities for annotations which are more complex than this (i.e., spanless annotations, or annotations with more attributes). In these cases, Carafe will happily build and apply models for the simple span subset of the complex annotations it's given, but this capability isn't enough to support the full tag-a-little, learn-a-little loop, or the experiment harness.
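A conceptual Python sketch of this definition (the annotation representation is illustrative, not MAT's):

    # Conceptual sketch: a span annotation is "simple" if it has no
    # attributes, or exactly one attribute that defines its effective label.

    def is_simple_span(ann, effective_label_attr=None):
        attrs = ann.get("attrs", {})
        if not attrs:
            return True
        return len(attrs) == 1 and effective_label_attr in attrs

    assert is_simple_span({"label": "PERSON", "start": 0, "end": 5})
    assert is_simple_span({"label": "ENAMEX", "start": 0, "end": 5,
                           "attrs": {"type": "PER"}},
                          effective_label_attr="type")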

Here's a complete list of these limitations:

Annotation categories

Annotation categories are one way of notifying MAT about the role of the various annotations in the MAT task. Annotations in the "content" category are, in some sense, what the task is about; these are the annotations you'll be adding by hand, or using the Carafe engine to add automatically.

The other categories have semantics assigned by MAT. These semantics are crucial for using MAT correctly. The annotations in the "token" and "zone" categories (known as "structure categories") can be inherited from the root task, and this is what you should do in most cases, since it's unusual to need to define your own structural annotations. We attempt to document MAT's dependencies on these categories below. The annotations in the "admin" category are special; all tasks will have these annotations.

Content annotations

The semantics of the content labels are defined by the task. For instance, the sample task is about annotation of named entities, and the three content annotation labels shown here have a standardized interpretation in named entity annotation, with detailed guidelines about when to assign each label. You may not add annotations named SEGMENT or VOTE, since these are reserved names of admin annotations.

Good guidelines are crucial for maximizing agreement among human annotators when preparing gold-standard corpora. When you develop your own task, you should develop similar guidelines; this skill is quite sophisticated, and is outside the scope of this documentation.

Zone annotations

Zones correspond to large chunks of the document where annotations can be found. The default zone annotation in MAT is the "zone" label, which has an attribute called "region_type", whose value is typically "body" (it's possible for the Carafe engine to use the zone attributes as a training feature, but we don't use that capability at the moment). The implementation of your task provides the Carafe engine with information about the default zone annotation; you can change this implementation (and you must, if you have a custom zone annotation) as described in the advanced topics.

The simplest zoning of a document is to assign a single zone of region_type "body" which encompasses the whole document, and there is a zone step available which does this. If you want to get more sophisticated (e.g., exclude HTML or XML tags), you'll have to consult the advanced topics.
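As a conceptual illustration (this is not the MAT API; the representation is hypothetical), the trivial zoning looks like this:

    # Conceptual sketch of the simplest zoning: one zone annotation of
    # region_type "body" spanning the entire document signal.

    def trivial_zones(signal):
        return [{"label": "zone", "start": 0, "end": len(signal),
                 "attrs": {"region_type": "body"}}]

    zones = trivial_zones("John Smith visited Boston.")
    assert zones[0]["end"] == 26    # the single zone covers the whole document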

Zone annotations are also used in the UI to create untaggable regions: regions of the document which have no zone annotation. These regions are indicated visually by graying out the document text. If the document has any zone annotations at all, the UI will create these regions.

Token annotations

Tokens correspond, pretty much, to words. In MAT, token annotations are used as the basis for most computation. When you hand-annotate your documents, we encourage you to require that token annotations be present, so that the MAT annotation tool can determine the possible boundaries of your proposed annotations. This is because the Carafe trainer and tagger both use tokens (not characters) as the "atoms" of computation; as a result, any annotation whose boundaries do not coincide with token boundaries must be modified or discarded during the training phase (because the element at the relevant edge doesn't correspond to an "atom" as far as Carafe is concerned). Similarly, you must use the same automatic tokenization process for training and tagging, for obvious reasons.
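Here's a small Python sketch of the alignment requirement (the offset representation is illustrative, not MAT's internal one):

    # Illustrative check: an annotation is usable for training only if its
    # start falls on some token's start and its end on some token's end.

    def aligned(ann_start, ann_end, tokens):
        starts = {s for s, e in tokens}
        ends = {e for s, e in tokens}
        return ann_start in starts and ann_end in ends

    tokens = [(0, 4), (5, 10), (11, 18)]    # (start, end) character offsets
    assert aligned(5, 18, tokens)           # spans whole tokens: usable
    assert not aligned(6, 18, tokens)       # starts mid-token: must be
                                            # modified or discarded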

Carafe is one of many trainable annotation tools which rely on tokens as atoms; however, most hand annotation tools don't try to ensure in advance that annotation boundaries match token boundaries, and the users of such tools have to make accommodations later in their workflows.

The MAT toolkit comes with a default English tokenizer, which we describe when we talk about steps. There's nothing special about this tokenizer; you can replace it with your own, as long as you use the same tokenizer throughout your task. If you inherit your structure annotations from the core task, and use the default tokenizer, you don't have to think about this any further. If you don't, your tokenizer and task have to make sure of several things:

A sample inventory of annotations

As an example, let's take a look at the annotations in the sample 'Named Entity' task.

Annotation label | Category | Use
lex | token | each lex tag delimits a non-whitespace token (the basic element used in the Carafe trainer and tagger)
zone | zone | delimits a contiguous zone in which content annotations will be found
PERSON | content | a proper name of a person
LOCATION | content | a proper name of a location
ORGANIZATION | content | a proper name of an organization
SEGMENT | admin | administrative info about the progress of annotation (present in all tasks)
VOTE | admin | administrative info about reconciliation (present in all tasks)

Admin annotations, SEGMENTs, and annotation progress

The final category of annotations, admin annotations, is crucial to the operation of the 2.0 version of MAT. However, both admin annotations work behind the scenes for you. If you don't care about the inner bookkeeping that MAT uses, you may skip this section; all you really need to know is that using the two admin annotations, SEGMENT and VOTE, MAT can keep detailed track of how the annotation of various portions of a document is progressing.

These admin annotations play a much smaller role in MAT 2.0 than we originally anticipated. We had originally expected MAT 2.0 to contain support for partial hand annotation and correction (e.g., annotating only certain regions of documents), for active learning, and for reconciliation workflows in workspaces. However, our exploration of active learning has convinced us that there's no evidence that it reduces annotator time for document-oriented annotation tasks, and our extensions for spanless annotation have forced us to reconsider our original strategy for reconciliation. We've issued a reconciliation tool for simple span tasks with MAT 2.0, but this tool will be replaced in the next version of MAT with a more general-purpose reconciliation tool based on the new comparison tool and the new scorer. The new reconciliation workflows will still require SEGMENTs and VOTEs, but they'll look different than they do now.

This section should be regarded as applying exclusively to MAT 2.0, with considerable future modifications planned.

The details

There are currently two admin annotations: SEGMENT, which records information about the progress of annotation, and VOTE, which is used specifically during reconciliation. Reconciliation is an advanced topic, and we'll talk about VOTE annotations when we talk about reconciliation. Here, we describe the role of the SEGMENT annotation.

SEGMENTs are a disjoint cover of the zone annotations in a document. These annotations capture a number of aspects of the progress of annotation.

The SEGMENT is a span annotation. It has two attributes, "annotator" and "status". These attributes occur in the following configurations of values:

Value of the annotator attribute | Value of the status attribute | What it means
(null) | "non-gold" | The segment is untouched. No human or automated annotator has modified it.
"MACHINE" | "non-gold" | The segment has been annotated by an automated tool, but no human has marked it as completed.
(a user's name) | "non-gold" | A user "owns" this segment and has modified the segment in some way, but is not prepared to mark it as completed. If a user corrects a segment which an automated tool has annotated, the user now "owns" the segment.
(a user's name) | "human gold" | A user "owns" the segment and has marked it as complete.
(a user's name) | "reconciled" | A user "owns" the segment and has marked it complete, and the segment has been vetted by the reconciliation process.

(Reconciliation adds a few more attributes to the SEGMENT annotation. We'll discuss those later.)
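To make the table concrete, here's a hypothetical walk through a SEGMENT's life cycle (the user name is invented; this is not MAT code):

    # Hypothetical life cycle of a SEGMENT, following the table above.

    seg = {"annotator": None, "status": "non-gold"}   # untouched

    seg["annotator"] = "MACHINE"      # an automated tool annotates it
    seg["annotator"] = "jdoe"         # a user corrects it and now "owns" it
    seg["status"] = "human gold"      # the user marks it as complete
    seg["status"] = "reconciled"      # reconciliation vets it

    assert seg == {"annotator": "jdoe", "status": "reconciled"}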

It's possible, then, to have partially annotated documents with multiple segments, some gold and some not. The 2.0 version of MAT does not yet expose this possibility in its UI (except during the reconciliation process), but the infrastructure to represent it is already there. These attribute values can be rolled up, conceptually, to represent documents which are in a variety of states of completion. These document status values are used extensively in workspaces.

"reconciled"
documents all of whose SEGMENTs have status = "reconciled"
"gold"
documents all of whose SEGMENTS have status = "reconciled" or status = "human gold" (and at least one "human gold" SEGMENT)
"partially gold"
documents some (but not all) of whose SEGMENTs have status = "reconciled" or status = "human gold"
"uncorrected"
documents all of whose SEGMENTS have annotator = "MACHINE"
"partially corrected"
documents some of whose SEGMENTs have annotator != "MACHINE" and annotator != null (i.e., they've been touched by a human annotator)
"unannotated"
documents which have no content annotations in any segment and no SEGMENTs which are owned by any annotator
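A conceptual Python sketch of this rollup (not MAT's actual implementation; the handling of mixed states the table doesn't cover is an assumption):

    # Conceptual rollup of SEGMENT attribute values into a document status.

    def document_status(segments, has_content_annotations):
        statuses = [s["status"] for s in segments]
        annotators = [s["annotator"] for s in segments]
        gold = [st in ("reconciled", "human gold") for st in statuses]
        if segments and all(st == "reconciled" for st in statuses):
            return "reconciled"
        if segments and all(gold) and "human gold" in statuses:
            return "gold"
        if any(gold):
            return "partially gold"
        if not has_content_annotations and not any(annotators):
            return "unannotated"
        if segments and all(a == "MACHINE" for a in annotators):
            return "uncorrected"
        if any(a not in (None, "MACHINE") for a in annotators):
            return "partially corrected"
        return "uncorrected"    # assumed fallback; not specified above

    segs = [{"annotator": "MACHINE", "status": "non-gold"},
            {"annotator": "jdoe", "status": "human gold"}]
    assert document_status(segs, has_content_annotations=True) == "partially gold"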

All the relevant MAT tools are aware of the SEGMENT annotations.