The task you use will contain a number of different sorts of
information. We'll talk about workspaces in a bit; right now,
we're going to focus on workflows, model building, and automated
tagging.
In MAT, each workflow consists of a series of steps.
These steps are global in the task; the workflows put subsets of
them in fixed orders, depending on your activity. In MAT 2.0, you
might encounter the following workflows:
There will be others, but these are the important ones.
In these workflows, you'll typically find these steps:
| step name | purpose | details |
|---|---|---|
| "zone" | a step for zoning the document | This step adds zone and admin annotations. The document zones are the areas that the subsequent steps should pay attention to. The simplest zone step simply marks the entire document as relevant. |
| "tokenize" | a step for tokenizing the document | This step adds token annotations. Tokens are basically words, and the automated engine that comes with MAT uses tokens, rather than characters, as its basis for analysis. If you're going to use the automated engine, either to build a model or to do automated annotation, you have to have tokens. MAT comes with a default tokenizer for English. |
| "hand tag" | a step for doing hand annotation | This step is available (obviously) only in the MAT UI, and in it you can add, by hand, the content annotations in your task. |
| "tag" | a step for doing automated annotation | This step allows you to apply previously created models to your document to add content annotations automatically. If you're in the UI, this step also gives you the opportunity to correct the output of automated tagging. |
| "mark gold" | a step for marking a document gold (i.e., done) | This step modifies the admin annotations: completing it indicates that the annotator judges that these annotations are complete and correct. |
There are other possible steps, but these are the ones you'll
encounter most frequently.
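As a rough mental model, a workflow is an ordered list of steps, each of which takes a document and adds annotations to it. The sketch below is purely illustrative: the function names and the document representation are invented for exposition, and MAT's real steps are considerably richer.

```python
# Illustrative sketch only: the step functions and the dict-based document
# representation here are invented; MAT's actual implementation differs.

def zone(doc):
    """The simplest zone step: mark the entire document as one relevant zone."""
    doc["zones"] = [(0, len(doc["text"]))]
    return doc

def tokenize(doc):
    """Add token annotations as (start, end) offsets; tokens are roughly words."""
    tokens, pos = [], 0
    for word in doc["text"].split():
        start = doc["text"].index(word, pos)
        tokens.append((start, start + len(word)))
        pos = start + len(word)
    doc["tokens"] = tokens
    return doc

def run_workflow(doc, steps):
    """A workflow is just a fixed sequence of steps applied in order."""
    for step in steps:
        doc = step(doc)
    return doc

doc = run_workflow({"text": "MAT tags documents."}, [zone, tokenize])
```

The key property this sketch captures is that the steps themselves are global to the task, while a workflow merely fixes which subset runs, and in what order.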
In tutorial 1, you saw how to apply these steps in the MAT UI, and in tutorial 5, you saw how to apply them on the command line using the MATEngine tool.
As we saw in tutorial 2, tutorial 3, and tutorial 5, we can build a model
using hand-annotated or hand-corrected documents, and apply these
models to other, unannotated documents.
The training engine that comes with MAT, Carafe, only works on what we've
called simple span annotations: spanned
annotations with labels or effective labels and no other
attributes. (The person who configured your task may have set up a
different engine, one which can build models for more complex
annotations; she'll tell you if she did that.) Roughly speaking,
Carafe analyzes the annotated documents and computes the
likelihoods of the various labels occurring in the various
contexts it encounters, as defined by a set of features (e.g.,
what the word is, whether it's capitalized, whether it's
alphanumeric, what words precede and follow) it extracts from the
documents it builds a model from. (The specific technique it uses
is conditional random fields.) You can then present a new
document to Carafe, and based on the features it finds in that new
document, it will insert annotations in the locations the model
predicts should be there.
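To make the feature idea concrete, here is a rough sketch of the kind of per-token features a CRF-style tagger might extract. These features are illustrative only; they are not Carafe's actual feature set.

```python
# Sketch of per-token feature extraction for a CRF-style tagger.
# The feature names and choices are invented for illustration.

def token_features(tokens, i):
    """Extract simple features for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),                      # what the word is
        "is_capitalized": word[0].isupper(),       # whether it's capitalized
        "is_alphanumeric": word.isalnum(),         # whether it's alphanumeric
        # context features: the words that precede and follow
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

feats = token_features(["Dr.", "Smith", "visited", "Boston"], 1)
```

A trained model associates patterns over features like these with label likelihoods; at tagging time, the same features are extracted from the new document and the most likely label sequence is chosen.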
In general, the more documents you use to train an engine like
Carafe, and the more exemplars of each annotation label it
finds in the training documents, and the greater the variety of
contexts those labels occur in, the
better a job the engine will do of predicting where the
annotations should be in new, unannotated documents.
These engines are not likely to do a perfect job. There are ways
to improve the engine's performance other than providing more
data; these engines, including Carafe, can be tuned in a wide
variety of ways. MAT doesn't help you do that. MAT is a tool for
corpus development and human annotator support; its goal is not to
help you produce the best automated tagging system. If you're
brave and know what you're doing, you can tune Carafe in all
sorts of ways, and MAT tries not to hinder you; but that's not
the point of the toolkit.
The other thing you need to know is that while Carafe only works
on simple span annotations, complex annotations won't cause it to
break; it simply ignores everything it can't handle. So if your
task has spanless annotations as well as spanned annotations with
lots of attributes, Carafe will happily build a model for the
spanned labels alone. You can use your complex annotated data to
train that simple model, use the model to insert the simple span
annotations automatically, and then add the remainder of the
annotations and attributes by hand.
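The winnowing just described, keeping only the spanned labels and discarding everything the engine can't use, can be sketched as follows. The record format is invented for illustration; MAT's actual document format differs.

```python
# Hypothetical annotation records, invented for illustration.
# Spanless annotations are represented with start/end of None.

annotations = [
    {"label": "PERSON", "start": 0, "end": 8, "attrs": {"role": "buyer"}},
    {"label": "TRANSACTION", "start": None, "end": None, "attrs": {}},
    {"label": "DATE", "start": 20, "end": 30, "attrs": {}},
]

# Keep only spanned annotations, and drop their attributes, leaving
# exactly the "simple span" data an engine like Carafe can train on.
simple_spans = [
    {"label": a["label"], "start": a["start"], "end": a["end"]}
    for a in annotations
    if a["start"] is not None and a["end"] is not None
]
```

The spanless TRANSACTION annotation and the role attribute are silently dropped; they would be restored by hand after the model inserts the simple spans.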