The task you use will contain a number of different sorts of
information. We'll talk about workspaces in a bit; right now,
we're going to focus on workflows, model building, and automated
tagging.
In MAT, each workflow consists of a series of steps.
These steps are global in the task; the workflows put subsets of
them in fixed orders, depending on your activity. In MAT 2.0, you
might encounter the following workflows:
There will be others, but these are the important ones.
In these workflows, you'll typically find these steps:
| step name | purpose | details |
|---|---|---|
| "zone" | a step for zoning the document | This step adds zone and admin annotations. The document zones are the areas that the subsequent steps should pay attention to. The simplest zone step simply marks the entire document as relevant. |
| "tokenize" | a step for tokenizing the document | This step adds token annotations. Tokens are basically words, and the automated engine that comes with MAT uses tokens, rather than characters, as its basis for analysis. If you're going to use the automated engine, either to build a model or to do automated annotation, you have to have tokens. MAT comes with a default tokenizer for English. |
| "hand tag" | a step for doing hand annotation | This step is available (obviously) only in the MAT UI, and in it you can add, by hand, the content annotations in your task. |
| "tag" | a step for doing automated annotation | This step allows you to apply previously created models to your document to add content annotations automatically. If you're in the UI, this step also gives you the opportunity to correct the output of automated tagging. |
| "mark gold" | a step for marking a document gold (i.e., done) | This step modifies the admin annotations: completing it indicates that the annotator judges that these annotations are complete and correct. |
There are other possible steps, but these are the ones you'll
encounter most frequently.
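As a rough mental model, a workflow is an ordered list of steps, each of which takes a document and adds annotations to it. The sketch below is purely illustrative: the function names and the document representation are invented for exposition, and MAT's real steps are considerably richer.

```python
# Illustrative sketch only: the step functions and the dict-based document
# representation here are invented; MAT's actual implementation differs.

def zone(doc):
    """The simplest zone step: mark the entire document as one relevant zone."""
    doc["zones"] = [(0, len(doc["text"]))]
    return doc

def tokenize(doc):
    """Add token annotations as (start, end) offsets; tokens are roughly words."""
    tokens, pos = [], 0
    for word in doc["text"].split():
        start = doc["text"].index(word, pos)
        tokens.append((start, start + len(word)))
        pos = start + len(word)
    doc["tokens"] = tokens
    return doc

def run_workflow(doc, steps):
    """A workflow is just a fixed sequence of steps applied in order."""
    for step in steps:
        doc = step(doc)
    return doc

doc = run_workflow({"text": "MAT tags documents."}, [zone, tokenize])
```

The key property this sketch captures is that the steps themselves are global to the task, while a workflow merely fixes which subset runs, and in what order.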
In tutorial 1, you saw how to apply these steps in the MAT UI, and in tutorial 5, you saw how to apply them on the command line using the MATEngine tool.
As we saw in tutorial 2, tutorial 3, and tutorial 5, we can build a model
using hand-annotated or hand-corrected documents, and apply these
models to other, unannotated documents.
The training engine that comes with MAT, Carafe, only works on what we've
called simple span annotations: spanned
annotations with labels or effective labels and no other
attributes. (The person who configured your task may have set up a
different engine, one which can build models for more complex
annotations; she'll tell you if she did that.) Roughly speaking,
Carafe analyzes the annotated documents and computes the
likelihoods of the various labels occurring in the various
contexts it encounters, as defined by a set of features (e.g.,
what the word is, whether it's capitalized, whether it's
alphanumeric, what words precede and follow) it extracts from the
documents it builds a model from. (The specific technique it uses
is conditional random fields.) You can then present a new
document to Carafe, and based on the features it finds in that new
document, it will insert annotations in the locations the model
predicts should be there.
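To make the feature idea concrete, here is a rough sketch of the kind of per-token features a CRF-style tagger might extract. These features are illustrative only; they are not Carafe's actual feature set.

```python
# Sketch of per-token feature extraction for a CRF-style tagger.
# The feature names and choices are invented for illustration.

def token_features(tokens, i):
    """Extract simple features for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),                      # what the word is
        "is_capitalized": word[0].isupper(),       # whether it's capitalized
        "is_alphanumeric": word.isalnum(),         # whether it's alphanumeric
        # context features: the words that precede and follow
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

feats = token_features(["Dr.", "Smith", "visited", "Boston"], 1)
```

A trained model associates patterns over features like these with label likelihoods; at tagging time, the same features are extracted from the new document and the most likely label sequence is chosen.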
In general, the more documents you use to train an engine like
Carafe, and the more exemplars of each annotation label it
finds in the training documents, and the greater the variety of
contexts those labels occur in, the
better a job the engine will do of predicting where the
annotations should be in new, unannotated documents.
These engines are not likely to do a perfect job. There are ways
to improve the engine's performance other than providing more
data; these engines, including Carafe, can be tuned in a wide
variety of ways. MAT doesn't help you do that. MAT is a tool for
corpus development and human annotator support; its goal is not to
help you produce the best automated tagging system. If you're
brave and know what you're doing, you can tune Carafe in all
sorts of ways, and MAT tries not to hinder you; but that's not
the point of the toolkit.
The other thing you need to know is that while Carafe only works
on simple span annotations, complex annotations won't cause it to
break; it simply ignores everything it can't handle. So if your
task has spanless annotations as well as spanned annotations with
lots of attributes, Carafe will happily build a model for the
spanned labels alone. You can use your complex annotated data to
train that simple model, use the model to insert the simple span
annotations automatically, and then add the remainder of the
annotations and attributes by hand.
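The winnowing just described, keeping only the spanned labels and discarding everything the engine can't use, can be sketched as follows. The record format is invented for illustration; MAT's actual document format differs.

```python
# Hypothetical annotation records, invented for illustration.
# Spanless annotations are represented with start/end of None.

annotations = [
    {"label": "PERSON", "start": 0, "end": 8, "attrs": {"role": "buyer"}},
    {"label": "TRANSACTION", "start": None, "end": None, "attrs": {}},
    {"label": "DATE", "start": 20, "end": 30, "attrs": {}},
]

# Keep only spanned annotations, and drop their attributes, leaving
# exactly the "simple span" data an engine like Carafe can train on.
simple_spans = [
    {"label": a["label"], "start": a["start"], "end": a["end"]}
    for a in annotations
    if a["start"] is not None and a["end"] is not None
]
```

The spanless TRANSACTION annotation and the role attribute are silently dropped; they would be restored by hand after the model inserts the simple spans.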