Tasks, Annotations, Workflows, and Steps

For the most part, you can't do anything substantial with the MAT toolkit without defining a task. A task is a set of activities, called workflows, each of which can be broken down into steps. Each task has a set of annotations that its activities share; each step in a task can participate in multiple workflows, and each step makes the "same" contribution to every workflow it participates in. These concepts are interrelated, and it's difficult to discuss any one of them without the others, but we'll try to describe them in the most sensible order.

Annotations

Annotations, in MAT, are labeled spans of a document. These labeled spans may also have attributes and values associated with them, and each annotation has a category. The semantics of the labels are defined by the MAT toolkit and by the task. (Other tools and infrastructures have a richer notion of annotation, where attribute values are typed and annotations can refer to other annotations instead of text spans; MAT doesn't yet address either of these enhancements.)
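To make the idea of a labeled span concrete, here is a minimal sketch in Python. It is not MAT's own annotation class: the Annotation dataclass, its field names, and the use of character offsets are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labeled span of a document (hypothetical model, not MAT's API)."""
    label: str      # e.g. "PERSON", "lex", "zone"
    category: str   # e.g. "content", "token", "zone", "untaggable"
    start: int      # character offset where the span begins (inclusive)
    end: int        # character offset where the span ends (exclusive)
    attrs: dict = field(default_factory=dict)  # attribute name -> value

    def text(self, document: str) -> str:
        """Return the stretch of the document that this span covers."""
        return document[self.start:self.end]


doc = "Alice Smith visited Paris."
person = Annotation("PERSON", "content", 0, 11)
place = Annotation("LOCATION", "content", 20, 25)
zone = Annotation("zone", "zone", 0, len(doc), {"region_type": "body"})
```

Here person.text(doc) yields "Alice Smith" and place.text(doc) yields "Paris"; the zone carries the "region_type" attribute discussed below.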

Let's take the sample task as an example. Its annotation labels are:

Annotation | Category (see below) | Use
lex | token | each lex tag delimits a non-whitespace token (the basic element used in the Carafe trainer and tagger)
untaggable | untaggable | denotes a region where no content annotations will be found
zone | zone | delimits a contiguous zone in which content annotations will be found
PERSON | content | a proper name of a person
LOCATION | content | a proper name of a location
ORGANIZATION | content | a proper name of an organization

Content annotations

The semantics of the content labels are defined by the task. The sample task is about annotation of named entities, and the three content annotation labels shown here have a standardized interpretation in named entity annotation, for which detailed guidelines have been developed about when to assign each of these annotation labels. Good guidelines are crucial for maximizing agreement among human annotators when preparing gold-standard corpora. When you develop your own task, you should develop similar guidelines; this skill is quite sophisticated, and is outside the scope of this documentation.

The non-content categories above have semantics assigned by the MAT toolkit. In some cases, these semantics are crucial for using MAT correctly. The non-content categories (known as "structure categories") can be inherited from the root task, which is what most tasks will do. This is what you should do, because there are a number of implicit dependencies on the names of these structure categories. We attempt to document those dependencies here.

Zone annotations

Zones correspond to large chunks of the document where annotations can be found. If you're using Carafe, you must have a label named "zone" of category "zone", and that label must have an attribute called "region_type". Carafe is passed a set of "region_type" values that tell it where to look when training and tagging, so the value on each zone you want processed must be among them. MAT, by default, uses "body" as the value for "region_type".

The simplest zoning of a document is to assign a single zone of region_type "body" which encompasses the whole document, and there is a zone step available which does this. If you want to get more sophisticated (e.g., exclude HTML or XML tags), you'll have to consult the advanced topics.
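The simplest zoning can be sketched in a few lines of Python. This is an illustration of the idea only, not the code of the zone step that MAT provides; the function name and the dictionary layout are assumptions.

```python
def whole_document_zone(document, region_type="body"):
    """Build one zone annotation spanning the entire document.

    Hypothetical helper: the dict layout is an assumption for this sketch.
    """
    return {
        "label": "zone",
        "category": "zone",
        "start": 0,
        "end": len(document),
        "attrs": {"region_type": region_type},
    }


zone = whole_document_zone("Hello world")
```

The resulting zone starts at offset 0, ends at the document's length, and carries the default "region_type" value "body".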

Untaggable annotations

These annotations are the complement of the zone annotations. Your task must define one such annotation label, and at the moment it must be named "untaggable", which is the default. MAT uses these annotations to rule out hand annotations in those regions in the hand annotation tool. They're assigned automatically by the zoner infrastructure if you inherit from the default zone class in your implementation of steps (see below).
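The "complement" relationship can be illustrated with a short Python sketch. The zoner infrastructure does this for you; the function below is a hypothetical stand-in that assumes zones are given as (start, end) character offsets with an exclusive end.

```python
def untaggable_spans(doc_length, zones):
    """Return the (start, end) gaps of the document not covered by any zone."""
    gaps = []
    cursor = 0  # first offset not yet known to be inside a zone
    for start, end in sorted(zones):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < doc_length:
        gaps.append((cursor, doc_length))
    return gaps


# "<p>Alice in Paris</p>": the markup is excluded, the text is one zone.
html_doc = "<p>Alice in Paris</p>"
gaps = untaggable_spans(len(html_doc), [(3, 17)])
```

For this document the gaps are the opening and closing tags; a single zone covering the whole document leaves no untaggable regions at all.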

Token annotations

Tokens correspond, pretty much, to words. In MAT, token annotations are used as the basis for most computation. When you hand-annotate your documents, you are encouraged to require that token annotations are present, so that the MAT annotation tool can determine the possible boundaries of your proposed annotation. This is because the Carafe trainer and tagger both use tokens (not characters) as the "atoms" of computation. As a result, any annotation whose boundaries do not coincide with token boundaries must be modified or discarded during the training phase, because the element at the relevant edge doesn't correspond to an "atom" as far as Carafe is concerned. Similarly, you must use the same automatic tokenization process for training and tagging; otherwise the tagger would see different atoms from the ones the model was trained on.
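The boundary requirement can be checked mechanically. The sketch below uses a whitespace-based stand-in tokenizer, which is an assumption for illustration; it is not the Carafe tokenizer, and a real tokenizer may split punctuation differently.

```python
import re


def tokenize(text):
    """Stand-in tokenizer: one (start, end) span per run of non-whitespace."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def is_token_aligned(start, end, tokens):
    """True if the span starts at some token's start and ends at some token's end."""
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return start in starts and end in ends


text = "Alice Smith visited Paris"
tokens = tokenize(text)
aligned = is_token_aligned(0, 11, tokens)      # covers "Alice Smith"
misaligned = is_token_aligned(0, 8, tokens)    # ends in the middle of "Smith"
```

A span covering "Alice Smith" passes the check, while one that ends partway through "Smith" does not, and it is the second kind that would have to be modified or discarded during training.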

Carafe is one of many trainable annotation tools which rely on tokens as atoms; however, most hand annotation tools don't try to ensure in advance that annotation boundaries match token boundaries, and the users of such tools have to make accommodations later in their workflows.

The MAT toolkit comes with a default English tokenizer, which we describe below when we talk about steps. There's nothing special about this tokenizer; you can replace it with your own, as long as you use the same tokenizer throughout your task. If you inherit your structure annotations from the core task, and use the default tokenizer, you don't have to think about this any further. If you don't, your tokenizer and task have to make sure of several things:

Steps

Steps are atomic actions in your workflows. The most common type of step adds a category of annotation; e.g., a tokenization step adds token annotations. Each step has an implementation, written in Python. We provide a number of useful step implementations which you can use in your workflow. If you want to define your own steps, you'll have to consult the advanced topics.

Here are the steps that MAT provides "out of the box":

Step implementation name | Description
MAT.PluginMgr.WholeZoneStep | This step assigns a single zone annotation with label "zone" and attribute "region_type" with value "body" to the entire document.
MAT.JavaCarafe.CarafeTokenizationStep | This step runs the Carafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed.
MAT.JavaCarafe.CarafeTagStep | This step runs the Carafe tagger, adding content tags to the document.
MAT.PluginMgr.HandAnnotationTagStep | This step is a placeholder for hand annotation in those workflows that do not have automated content tagging.
MAT.PluginMgr.AlignStep | This step is intended to work with documents which have been imported from other formats (e.g., XML inline) and whose content annotations may not align with token boundaries. It aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected by the UI annotation tool (and, in fact, by many trainable tagging engines, including Carafe). Insert a step with this implementation in any workflow intended to manage imported documents.

See the sample task for a detailed example of how these are used.
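The kind of alignment described for MAT.PluginMgr.AlignStep can be sketched as follows. This is one interpretation of "expanding to the nearest token boundaries" (grow the span to cover every token it overlaps), written as a hypothetical helper; it is not the step's actual implementation.

```python
def expand_to_tokens(start, end, tokens):
    """Grow (start, end) so that it covers every token it overlaps.

    tokens is a list of (start, end) offsets with exclusive ends.
    Whitespace at the edges that belongs to no token is left out.
    """
    overlapping = [(s, e) for s, e in tokens if s < end and e > start]
    if not overlapping:
        return (start, end)  # nothing to align against; leave unchanged
    return (min(s for s, _ in overlapping), max(e for _, e in overlapping))


# Tokens for "Alice Smith visited Paris"
tokens = [(0, 5), (6, 11), (12, 19), (20, 25)]
widened = expand_to_tokens(2, 9, tokens)      # cuts into "Alice" and "Smith"
unchanged = expand_to_tokens(6, 11, tokens)   # already aligned
```

A span that cuts into both "Alice" and "Smith" is widened to cover both tokens, while a span that already matches a token is returned as it was.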

Any step implemented with the MAT.PluginMgr.HandAnnotationTagStep is special, in that the user performs it, not the MAT engine. Such a step is mostly a placeholder for your annotation activity. Behind the scenes, the document is marked as being tagged as soon as you insert the first tag, but the UI doesn't advance you past this step while the document is still open. When you reopen the document, the UI will show the step as having already been performed.

Steps can also take key-value pair arguments which can be specified in the task.xml file or in the invocation of the MAT engine. At the moment, both MAT.JavaCarafe.CarafeTagStep and MAT.JavaCarafe.CarafeTokenizationStep take key-value pair arguments. See the discussion of the Carafe engine for a description of these options.

Workflows

Once you have a set of step implementations to draw from, you can create mnemonic names for them and assemble them into workflows. Three extremely common and obvious workflows are

If you create other, custom steps, you may have other workflows.

One quirk of the mnemonic names for steps is that they're global to the task. The implementation of, say, "tokenize" can differ from workflow to workflow, but when you apply different workflows to a document, the document knows what's already been done by virtue of the named steps that have been applied. So it's not a good idea for the effect of step implementations to differ among workflows; the implementations can provide different methods for achieving the same effect (e.g., different automated taggers, or hand vs. automated tagging), but they should not vary any further.
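A small Python sketch shows why step names being global matters. The workflow names "hand" and "auto", the step names "zone" and "tag", and the bookkeeping function are hypothetical; only the implementation names come from the table of provided steps, and they appear here as plain strings rather than as calls into MAT.

```python
# Each workflow is an ordered list of (mnemonic step name, implementation).
WORKFLOWS = {
    "hand": [
        ("zone", "MAT.PluginMgr.WholeZoneStep"),
        ("tokenize", "MAT.JavaCarafe.CarafeTokenizationStep"),
        ("tag", "MAT.PluginMgr.HandAnnotationTagStep"),
    ],
    "auto": [
        ("zone", "MAT.PluginMgr.WholeZoneStep"),
        ("tokenize", "MAT.JavaCarafe.CarafeTokenizationStep"),
        ("tag", "MAT.JavaCarafe.CarafeTagStep"),
    ],
}


def remaining_steps(workflow, steps_done):
    """List the step names in a workflow that the document has not yet recorded."""
    return [name for name, _impl in WORKFLOWS[workflow] if name not in steps_done]


# The document records step names only, not which implementation ran.
done = {"zone", "tokenize"}
todo_auto = remaining_steps("auto", done)
todo_hand = remaining_steps("hand", done)
```

Because the document records only the names, a document that was zoned and tokenized under one workflow is treated as zoned and tokenized under the other; both have only "tag" left to do. That is safe only if both implementations of a given name have the same effect.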

Other things you may find in tasks

In general, tasks provide a customization bundle for your use of MAT. In this document, we've described the three most prominent customizations: defining steps, defining workflows, and defining annotations. There are many other things you can customize:

For relevant examples of these, please consult "The sample task", "Creating a new task", "Creating a new demo", and the documentation for the task XML and the demo XML.