Working With the Toolkit

MAT's loosely coupled suite of tools addresses each stage of the process of creating corpora, building models, and assessing your progress. With MAT, you can carry out any one of these stages on its own.

Typically, however, you'll want to do all of these things as part of a larger goal: building a gold standard corpus and a trained model. A gold standard corpus is a body of documents which are believed to be completely and correctly annotated according to your set of desired annotations. A trained model is a data object for your automated annotation engine, constructed from your corpus, which can be used to annotate additional documents with reasonable accuracy. Here's how to use the various pieces of MAT to accomplish this larger goal.

In the remainder of this document, we'll outline ten large-scale steps involved in using MAT. We've annotated the TALLAL loop illustration with those steps which are relevant:

We also present a more detailed illustration of the lifecycle of document sets, again annotated with those steps which are relevant:

Please refer back to these illustrations as you read the remainder of this document.

Step 1: configure and install your task

This step may already have been done for you, especially if you've received this toolkit as a tarball. If not, we provide documentation about how to do this in simple cases (see here and here).

Step 2: choose an interaction mode

Next, you need to make a decision about how you want to keep track of your documents (what they're named, whether they've been completely annotated, where they live in the file system, etc.). You can manage them yourself, which gives you complete control but also adds some logistical overhead; we call this file mode. Or, you can let MAT manage them for you, which takes away some control but also addresses the logistical overhead; we call this workspace mode. In order to get a feeling for what's involved in each, you may want to study the following documentation sections:

Step 3: organize your files

File mode

If you chose file mode, you should set aside directories for storing the annotated documents, and develop some heuristic for keeping track of which documents you've finished.
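One possible heuristic, sketched in Python, is to keep one directory per stage and move each document forward as you finish with it. The directory names (raw, in_progress, done) and the helper functions here are hypothetical; MAT does not require or provide this layout, and any scheme that tells you which documents are finished will do.

```python
import shutil
from pathlib import Path

# Hypothetical stage names; MAT itself does not prescribe this layout.
STAGES = ("raw", "in_progress", "done")


def init_layout(root: Path) -> None:
    """Create one subdirectory per stage under root."""
    for stage in STAGES:
        (root / stage).mkdir(parents=True, exist_ok=True)


def promote(root: Path, name: str, src: str, dst: str) -> Path:
    """Move a document from one stage directory to another."""
    if src not in STAGES or dst not in STAGES:
        raise ValueError(f"unknown stage: {src!r} or {dst!r}")
    target = root / dst / name
    shutil.move(str(root / src / name), str(target))
    return target


def status(root: Path) -> dict[str, list[str]]:
    """List the document names currently sitting in each stage."""
    return {
        stage: sorted(p.name for p in (root / stage).iterdir() if p.is_file())
        for stage in STAGES
    }
```

With a layout like this, the contents of the "done" directory are exactly the documents you would later hand to the model builder.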

Workspace mode

If you chose workspace mode, you should create your workspace and import some documents. The documentation on MATWorkspaceEngine should help you. If you want to import more documents later, you can do that.

Step 4: prepare documents for hand annotation

In order to "seed" the TALLAL loop, you'll need to hand-annotate some documents, and to do that, you'll need to ensure that the documents are ready to be annotated. At the very least, MAT requires that documents be tokenized before they can be hand-annotated; that is, the words must be identified.
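To make "the words must be identified" concrete, here is a minimal Python sketch of tokenization that records each token with its character offsets. It is only an illustration: the regular expression is an assumption chosen for brevity, and it is not the tokenizer that MAT applies in its own tokenization step.

```python
import re

# Illustrative pattern only: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, token) triples with character offsets into text."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]
```

Recording offsets rather than bare strings is what allows an annotation to be anchored to token boundaries in the original text.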

File mode

You can do the preparatory steps either using the command line or the UI. For hand annotation, it's probably easiest to do it in the UI, since you'll need to load the documents into the UI anyway. You'll need to select the appropriate workflow, which will likely be named "Hand annotation". The following documentation sections are relevant:

Workspace mode

You can either apply the "tagprep" operation to the relevant documents in the workspace, or open the workspace in the UI, select the appropriate document from the "raw, unprocessed" folder, and perform the "Prepare for hand tagging" operation on the document. The following documentation sections are relevant:

Step 5: hand annotate the document

Assuming the document has been loaded into the UI, this step is the same in both modes. The documentation on using the UI describes hand annotation (among other things).

Be sure to save the document when you're done. In workspace mode, you can either save the document or mark it as complete; in file mode, you should press the "Save mat-json" button. You don't have to complete the hand annotation in one sitting; you can always open the document again and do more annotation.

Step 6: build a model

Once you've annotated some documents, you can build a model from those documents. You'll be able to use this model to automatically tag more documents, which will reduce the time it takes to complete the annotation.

File mode

Use the MATModelBuilder. If your task is configured appropriately, you'll have the option of saving the model as the default model for the task.

Workspace mode

On the command line, perform the "modelbuild" operation using the MATWorkspaceEngine. For this operation, you have the option of performing the next step (automatic tagging) at the same time.

Step 7: automatically tag some documents

File mode

On the command line, use the MATEngine to process a directory of files, or use MATEngine or the UI to process one file at a time. On the command line, you'll have to know which steps in which workflow you need to perform to automatically tag; this depends on the configuration of your task. In the UI, you'll have to know which workflow to choose. See also:

Workspace mode

On the command line, perform the "autotag" operation using the MATWorkspaceEngine. If there are no files currently in "raw, unprocessed", you'll need to import some more files into your workspace. See also the documentation on files and workspaces.

Step 8: hand correct the documents

This step is very similar to step 5 above. The only difference is that in workspace mode, you'll be starting in the "autotagged" folder rather than the "in process" folder.

Step 9: lather, rinse, repeat

At this point, you have two paths to creating correct, complete documents: either completely by hand, through steps 4 and 5, or semi-automatically, through steps 7 and 8. You also know how to build a model, which you can do at any time based on your correct, complete documents.
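The overall shape of the loop can be sketched in Python as follows. This is a schematic only: the four callables stand in for the hand annotation, model building, automatic tagging, and hand correction steps described above, and none of the names here are part of MAT. In practice you perform these steps with the MAT tools and UI rather than in a single script.

```python
from typing import Callable, Iterable, Optional

Doc = str       # placeholder for a document
Model = object  # placeholder for a trained model


def tallal_loop(
    raw_docs: Iterable[Doc],
    seed_size: int,
    batch_size: int,
    hand_annotate: Callable[[Doc], Doc],
    build_model: Callable[[list[Doc]], Model],
    autotag: Callable[[Model, Doc], Doc],
    hand_correct: Callable[[Doc], Doc],
) -> tuple[list[Doc], Optional[Model]]:
    """Grow a gold corpus batch by batch, rebuilding the model as it grows."""
    if seed_size < 1 or batch_size < 1:
        raise ValueError("seed_size and batch_size must be at least 1")
    pending = list(raw_docs)

    # Steps 4 and 5: seed the loop with documents annotated entirely by hand.
    gold = [hand_annotate(d) for d in pending[:seed_size]]
    pending = pending[seed_size:]

    # Step 6: build a first model from the seed documents.
    model = build_model(gold) if gold else None

    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        # Steps 7 and 8: tag automatically, then correct by hand.
        gold.extend(hand_correct(autotag(model, d)) for d in batch)
        # Step 6 again: rebuild from the larger gold corpus.
        model = build_model(gold)

    return gold, model
```

The point of the sketch is that each rebuilt model is trained on more correct, complete documents than the last, so the automatic tagging in later batches should need less hand correction.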

Step 10: check your progress

At any point after you have some correct, complete documents, you can find out how you're doing. (This process ought to be built into the model building stage in workspace mode, but it isn't yet.) To do this, you can use MATExperimentEngine. You'll configure an XML file which describes which documents you want to use as your corpus, how many alternative models you want to build, and which runs you want to perform, and you'll get back detailed, Excel-compatible spreadsheets describing the performance of your automated annotation tool.
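As a rough guide to the kind of numbers such a report contains, the Python sketch below computes precision, recall, and F-measure for one document by exact span matching. It is an illustration under stated assumptions, not MATExperimentEngine's scorer: the span representation and the exact-match criterion are chosen here for simplicity, and the real tool may match and aggregate annotations differently.

```python
from typing import NamedTuple

Span = tuple[int, int, str]  # (start offset, end offset, label); an assumed representation


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score(gold: set[Span], predicted: set[Span]) -> Scores:
    """Exact-match scoring: a predicted span counts only if offsets and label all match."""
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    # F-measure is the harmonic mean of precision and recall (0 when nothing matches).
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return Scores(precision, recall, f1)
```

Here `gold` would hold the spans from a correct, complete document and `predicted` the spans the model produced for the same text.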