MAT's loosely-coupled suite of tools addresses each stage of the
process of creating corpora, building models, and assessing your
progress, and you can use each tool on its own. Typically, however, a
user will want to do all of these things as part of a larger goal:
building a gold standard corpus and a trained model. A gold standard
corpus is a body of annotated documents which are believed to be
completely and correctly annotated according to your set of desired
annotations. A trained model is a data object for your automated
annotation engine which has been constructed using your corpus, such
that it can be used to annotate additional documents with reasonable
accuracy. Here's how to use the various pieces of MAT to accomplish
this larger goal.
In the remainder of this document, we'll outline ten large-scale
steps involved in using MAT. Two illustrations are annotated with the
relevant steps: the TALLAL (tag-a-little, learn-a-little) loop
illustration, and a more detailed illustration of the lifecycle of
document sets. Please refer back to these illustrations as you read the
remainder of this document.
This step may already have been done for you, especially if you've
received this toolkit as a tarball. If not, the MAT documentation
describes how to do this in simple cases.
Next, you need to make a decision about how you want to keep track
of your documents (what they're named, whether they've been completely
annotated, where they live in the file system, etc.). You can manage
them yourself, which gives you complete control but also adds some
logistical overhead; we call this file mode. Or, you can let MAT manage
them for you, which takes away some control but also handles the
logistical overhead; we call this workspace mode. To get a feeling for
what's involved in each, you may want to study the documentation on
each mode.
Next, you should get yourself organized.
If you chose file mode, you should set aside directories for storing
the annotated documents, and develop some heuristic for keeping track
of which documents you've finished.
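One possible heuristic for file mode is to keep one directory per
annotation state and move each document forward as you work on it. The
following is a minimal sketch of that idea in Python. It is not part of
MAT; the state names ("raw", "in_progress", "done") and all function
names are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Hypothetical state names; MAT does not require or define these.
STATES = ("raw", "in_progress", "done")


def init_layout(root: Path) -> None:
    """Create one directory per annotation state under root."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def move_document(root: Path, name: str, src: str, dst: str) -> Path:
    """Move a document from one state directory to another."""
    if src not in STATES or dst not in STATES:
        raise ValueError(f"unknown state: {src!r} or {dst!r}")
    target = root / dst / name
    (root / src / name).rename(target)
    return target


def status(root: Path) -> dict:
    """Return a mapping from each state to its sorted document names."""
    return {
        state: sorted(p.name for p in (root / state).iterdir() if p.is_file())
        for state in STATES
    }
```

With this layout, the contents of the "done" directory are exactly the
documents you've finished, so no separate bookkeeping file is needed.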
If you chose workspace mode, you should create your workspace and
import some documents. The documentation on MATWorkspaceEngine should
help you. You can import more documents later if you need to.
In order to "seed" the TALLAL loop,
you'll need to hand-annotate some documents, and in order to do that,
you'll need to ensure that the documents are ready to be annotated. MAT
insists that in order to hand-annotate documents, at the very least
they must be tokenized - that
is, the words must be identified.
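To make the idea of tokenization concrete, here is a minimal sketch in
Python that identifies words and punctuation marks along with their
character offsets. This is only an illustration of the concept; it is
not MAT's tokenizer, and the pattern and function name are assumptions
made for this example.

```python
import re
from typing import List, Tuple

# Illustrative pattern only: a run of word characters, or a single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, token) triples with offsets into text."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_PATTERN.finditer(text)]
```

Recording offsets rather than just the token strings means each token
can be located in the original text, which is what a span annotation
needs to line up with.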
In file mode, you can do the preparatory steps using either the
command line or the UI. For hand annotation, it's probably easiest to
use the UI, since you'll need to load the documents into the UI anyway.
You'll need to select the appropriate workflow, which will likely be
named "Hand annotation".
In workspace mode, you can either apply the "tagprep" operation to the
relevant documents in the workspace, or open the workspace in the UI,
select the appropriate document from the "raw, unprocessed" folder, and
perform the "Prepare for hand tagging" operation on the document.
Assuming the document has been loaded into the UI, this step is the
same in both modes. The documentation on using
the UI describes hand annotation (among other things).
Be sure to save the document when you're done. In workspace mode,
you can save or mark the document as complete; in file mode, you should
press the "Save mat-json" button. You don't have to complete the hand
annotation in one sitting; you can always open the document again and
do more annotation.
Once you've annotated some documents, you can build a model from
those documents. You'll be able to use this model to automatically tag
more documents, which will reduce the time it takes to complete the
annotation.
In file mode, use MATModelBuilder. If your task is configured
appropriately, you'll have the option of saving the model as the
default model for the task.
In workspace mode, perform the "modelbuild" operation on the command
line using MATWorkspaceEngine. For this operation, you have the option
of performing the next step, automatic tagging, at the same time.
Next, you'll want to automatically tag your next batch of documents.
In file mode, use MATEngine on the command line to process a directory
of files, or use MATEngine or the UI to process one file at a time. On
the command line, you'll have to know which steps in which workflow you
need to perform to automatically tag; this depends on the configuration
of your task. In the UI, you'll have to know which workflow to choose.
In workspace mode, perform the "autotag" operation on the command line
using MATWorkspaceEngine. If there are no files currently in "raw,
unprocessed", you'll need to import some more files into your
workspace. See also the documentation on files and workspaces.
This step is very similar to step 5 above. The only difference is
that in workspace mode, you'll be starting in the "autotagged" folder
rather than the "in process" folder.
At this point, you have two paths to creating correct, complete
documents: either completely by hand, through steps 4 and 5, or
semi-automatically, through steps 7 and 8. You also know how to build a
model, which you can do at any time based on your correct, complete
documents.
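The interplay of the two paths can be sketched as a loop. The Python
below is a conceptual illustration only: it does not call MAT, and the
function name and the three callables (hand_annotate, build_model,
correct) are hypothetical stand-ins for the hand annotation, model
building, and hand correction activities described above.

```python
from typing import Callable, Dict, List

Annotated = Dict[str, str]  # document name -> annotation (placeholder type)


def annotation_loop(
    raw_docs: List[str],
    seed_size: int,
    batch_size: int,
    hand_annotate: Callable[[str], str],
    build_model: Callable[[Annotated], Callable[[str], str]],
    correct: Callable[[str, str], str],
) -> Annotated:
    """Seed a corpus by hand, then alternate model building and correction."""
    if seed_size < 1 or batch_size < 1:
        raise ValueError("seed_size and batch_size must be at least 1")
    gold: Annotated = {}
    pending = list(raw_docs)
    # First path: annotate the seed documents completely by hand.
    for doc in pending[:seed_size]:
        gold[doc] = hand_annotate(doc)
    pending = pending[seed_size:]
    # Second path: rebuild the model, autotag a batch, correct by hand.
    while pending:
        model = build_model(gold)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for doc in batch:
            guess = model(doc)
            gold[doc] = correct(doc, guess)
    return gold
```

Each pass through the loop trains on every document completed so far,
so the automatic tags for later batches should need less correction
than those for earlier ones.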
At any point after you have some correct, complete documents, you
can find out how you're doing. (This process ought to be built into the
model building stage in workspace mode, but it isn't yet.) To do this,
you can use MATExperimentEngine.
You'll configure an XML file which describes which documents you want
to use as your corpus, how many alternative models you want to build,
and which runs you want to perform, and you'll get back detailed,
Excel-compatible spreadsheets describing the performance of your
automated annotation tool.