Customizing a Task: Advanced Topics

Using a different annotation tool

Let's say that you have your own annotation tool, and you'd rather use that tool than the MAT tool. You'll have to accommodate the ways in which this tool's assumptions differ from MAT's.

One of the most crucial assumptions involves tokens. While it's possible to use portions of MAT without tokens, we don't guarantee that the entire suite will work if you don't provide them. If tokens are present, we try to enforce the generalization that content annotations are synchronized with token boundaries. That is, the only character indexes in a document which are eligible to be the start index of a content annotation are the start indexes of token annotations, and the only character indexes which are eligible to be the end index are the end indexes of token annotations. The reason for this generalization is that the Carafe engine treats tokens as the basic atomic elements for training and tagging, and those basic elements must be consistent with hand annotation, so that the hand-annotated data is appropriate for training.

In MAT, this assumption is enforced by the standard configuration of the MAT hand annotation tool; you should not configure a workflow to allow hand annotation unless tokenization has already been done. While it is possible to hand annotate without tokens present, we don't recommend it: your training data may not align with the Carafe engine's notion of atomic elements, which will render that portion of your data unusable.

If your hand annotation tool respects token boundaries, and you're willing to ensure that tokenization happens before hand annotation, all you need to do is write out the MAT document format, for which we provide Python and Java libraries. But most hand annotation tools do not expect, respect, or enforce token annotations during other hand annotation phases. If you want to use such a tool, you should insert the Align step into your workflow after the hand annotation step. This step expands each content annotation to the boundaries of whatever tokens it overlaps at its edges.
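The effect of such an alignment step can be sketched in plain Python. The function below is an illustration of the idea only, not MAT's implementation: it represents annotations and tokens as bare (start, end) character spans and widens each annotation to the edges of the tokens it overlaps.

```python
# Illustrative sketch of token-boundary alignment. The tuple-based
# representation here is a stand-in, not MAT's document API.

def align_to_tokens(annotations, tokens):
    """annotations, tokens: lists of (start, end) character spans.
    Returns annotations expanded to the enclosing token edges."""
    aligned = []
    for (a_start, a_end) in annotations:
        # Widen the start to the start of the token it falls inside.
        for (t_start, t_end) in tokens:
            if t_start <= a_start < t_end:
                a_start = t_start
                break
        # Widen the end to the end of the token it falls inside.
        for (t_start, t_end) in tokens:
            if t_start < a_end <= t_end:
                a_end = t_end
                break
        aligned.append((a_start, a_end))
    return aligned

# With tokens (0,3), (4,8), (9,13), a hand annotation (5, 11) that
# clips two tokens at its edges expands to (4, 13).
print(align_to_tokens([(5, 11)], [(0, 3), (4, 8), (9, 13)]))
# → [(4, 13)]
```

An annotation that already starts and ends on token boundaries passes through unchanged, which is why running such a step is harmless on already-aligned data.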

Make a special workflow for document preparation

Under construction.

Make a special workspace folder for these documents

Under construction.

Using a different training and tagging engine

It is possible, in MAT, to use a different training and tagging engine than the default Carafe engine. This capability will be documented here in a future release.

Adding workspace folders

MAT comes with a few predefined workspace folders, and a means for moving documents between them. Under some circumstances, you might want to add a folder. In this example, let's suppose that your task includes a summarization capability that produces new, summarized documents from documents that are already tagged, and that you want to save these summarized documents in your workspace.

In order to do this, you'll have to specialize the core task object, in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.

You'll customize your task implementation in your task directory in python/<file>.py, where <file> is a name of your choice. When you refer to this implementation in your task.xml file, you'll refer to it as "<file>.<classname>", in the class attribute of your <task> element. For instance, if you place the following in the file python/MyPlugin.py:

from MAT.PluginMgr import PluginTaskDescriptor

class MyTaskDescriptor(PluginTaskDescriptor):
    ...

you'll refer to it in task.xml as follows:

<task name="My Task" class="MyPlugin.MyTaskDescriptor">
...
</task>

To add a workspace folder, you'll add a workspaceCustomize method to your task descriptor:

class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)

Here, we add a folder named "summarized", which we can't import documents directly into.

Next, we need to add some behavior. In our example, we want to be able to apply a summarize action to documents in the completed folder in the workspace, and have the results land in the summarized folder. So our Python really looks like this:

from MAT.PluginMgr import PluginTaskDescriptor
from MAT.Workspace import WorkspaceOperation

class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
        workspace.folders["completed"].addOperation("summarize", SummarizationOperation)

class SummarizationOperation(WorkspaceOperation):

    name = "summarize"
    ...

We won't describe the full implementation of operations here; see MAT_PKG_HOME/lib/mat/python/MAT/Workspace.py for examples.

Using task settings

Under construction.

Creating your own steps

Steps in MAT are written in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.

Steps should be defined in your task directory in python/<file>.py, where <file> is a name of your choice. When you refer to those steps in your task.xml file, you'll refer to them as "<file>.<stepname>".

Here's the skeleton of a step:

from MAT.PluginMgr import PluginStep

class MyStep(PluginStep):

    def do(self, annotSet, **kw):
        # ... make modifications to annotSet
        return annotSet

The annotSet is a rich document, defined in MAT_PKG_HOME/lib/mat/python/MAT/Document.py. This class has methods to add and modify annotations, which is mostly what steps do. For examples of how steps use this API, see the class descendants of PluginStep in MAT_PKG_HOME/lib/mat/python/MAT/PluginMgr.py.

Most steps work by side effect, although it's possible to return a different document than the one you were handed, and MAT will recognize that as a new document. Most toolchains will not take advantage of this capability.

Steps have three methods: undo(), do() and doBatch(). By default, doBatch() calls do() for each document it's passed. You can define a special doBatch() if you have batch-level operations (e.g., if tagging every document in a directory is faster than calling the tagger for each document). All three methods can have keyword arguments, which are defined by an "argList" class variable (see PluginMgr.py for examples). Every keyword argument passed to the engine is passed to every step, so your function signature must always end with **kw.
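The convention that every keyword reaches every step can be sketched without MAT itself. The classes and engine loop below are stand-ins for illustration, not MAT's actual API; the point is the shape of the do() signatures.

```python
# Stand-in classes illustrating the **kw convention: the engine passes
# the same keyword set to every step, so each do() ends with **kw to
# absorb the arguments it doesn't care about. None of these names are
# MAT's own.

class ZoneStep:
    def do(self, doc, **kw):  # silently ignores 'model'
        return doc + " [zoned]"

class TagStep:
    def do(self, doc, model="default.model", **kw):
        return doc + " [tagged with %s]" % model

def run_engine(doc, steps, **kw):
    # The engine hands every keyword to every step.
    for step in steps:
        doc = step.do(doc, **kw)
    return doc

print(run_engine("text", [ZoneStep(), TagStep()], model="my.model"))
# → text [zoned] [tagged with my.model]
```

Without the trailing **kw, the ZoneStep above would raise a TypeError when handed the model keyword intended for the tagger, which is exactly why MAT requires the signature to end that way.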

Clean step

A common action might be to ensure that all files are in ASCII format with Unix line endings. Here's how you'd do that in your task:

from MAT.PluginMgr import CleanStep

class MyCleanStep(CleanStep):

    def do(self, annotSet, **kw):
        return self.truncateToUnixAscii(annotSet)

The truncateToUnixAscii method is defined on CleanStep, so you should inherit from there.

Note: because this step changes the signal of the document, it must be the first step in any workflow, and it cannot be undone; undoing any step that inherits from CleanStep will raise an error.
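The kind of transformation truncateToUnixAscii performs can be approximated in plain Python. The helper below is a stand-in for illustration, not MAT's implementation; it also shows why the note above matters, since normalizing "\r\n" shortens the signal and shifts every later character index.

```python
# Illustrative sketch of ASCII/Unix normalization: convert Windows and
# old-Mac line endings to \n, then force the text into the ASCII range,
# replacing anything that doesn't fit. Not MAT's truncateToUnixAscii.

def to_unix_ascii(signal):
    # Normalize line endings to Unix first.
    signal = signal.replace("\r\n", "\n").replace("\r", "\n")
    # Replace non-ASCII characters rather than raising an error.
    return signal.encode("ascii", "replace").decode("ascii")

print(to_unix_ascii("caf\u00e9\r\nbar"))
# → caf?
# → bar
```

Because the output can be shorter than the input, any annotation indexes computed before this step would be invalid afterward, which is why a signal-changing step must run first and cannot be undone.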

Zone step

You may want to establish a single zone in your document, in between <TXT> and </TXT>. Here's how you'd do that:

import re

from MAT.PluginMgr import ZoneStep

class MyZoneStep(ZoneStep):

    TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)

    def do(self, annotSet, **kw):
        # The signal contains <DOC> and <TXT>, and everything
        # in between the <TXT> tags is fair game.
        m = self.TXT_RE.search(annotSet.signal)
        if m is not None:
            self.addZones(annotSet, [(m.start(1), m.end(1), "body")])
        else:
            self.addZones(annotSet, [(0, len(annotSet.signal), "body")])
        return annotSet

The addZones method is defined on the ZoneStep class. For its implementation and more examples of its use, see PluginMgr.py.

Tokenize step

Under construction.

Defining your own reader/writer

It's not too difficult to define your own reader/writer. The file MAT_PKG_HOME/lib/mat/python/MAT/XMLIO.py provides a good example. The template looks like this:

from MAT.DocumentIO import declareDocumentIO, DocumentFileIO, SaveError
from MAT.Document import LoadError

class MyIO(DocumentFileIO):

    def deserialize(self, s, annotDoc):
        ...

    def writeToUnicodeString(self, annotDoc):
        ...

declareDocumentIO("my-io", MyIO, True, True)

The arguments to deserialize() are the input data from the file and an annotated document to populate; see XMLIO.py for an example of how to populate it. writeToUnicodeString() should return a Unicode string which serializes the annotated document passed in. In order to do this, you'll have to familiarize yourself with the API for manipulating documents and annotations, which is not documented but reasonably easy to understand from the source code. Once you do all this, the file type name you assign to the class via the call to declareDocumentIO() will be globally available.
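The division of labor between the two methods can be illustrated with a toy format that doesn't depend on MAT at all. Here the "document" is just a dict, and the "word/TAG" format is invented for the example; neither is MAT's API.

```python
# A toy round-trip: deserialize parses raw data into a signal plus
# annotations, and the writer inverts it. Illustration only; MAT's
# AnnotatedDoc has a richer API for this.

def deserialize(s):
    # Parse "word/TAG" pairs into a signal and (start, end, tag) spans.
    words, annots, pos = [], [], 0
    for chunk in s.split():
        word, tag = chunk.rsplit("/", 1)
        words.append(word)
        annots.append((pos, pos + len(word), tag))
        pos += len(word) + 1
    return {"signal": " ".join(words), "annots": annots}

def write_to_unicode_string(doc):
    # Invert deserialize: re-attach each tag to its span of the signal.
    return " ".join("%s/%s" % (doc["signal"][s:e], tag)
                    for (s, e, tag) in doc["annots"])

doc = deserialize("John/PER runs/O")
print(write_to_unicode_string(doc))
# → John/PER runs/O
```

A reader/writer pair should round-trip like this: serializing what you just deserialized ought to reproduce the original data, which makes a good sanity test for your real implementation.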

You can also define command-line arguments which will be accepted by the tools when this file type is used. XMLIO.py also exemplifies this.

Defining your own experiment engine iterator

In some cases, you might want to define your own experiment engine iterator, if the default iterators we provide aren't adequate. For instance, you may have two attributes in your training engine which you want to iterate on in tandem, rather than over the cross-product of those values. While providing a guide to this is beyond the scope of this documentation, we can provide some hints.

First, look in MAT_PKG_HOME/lib/mat/python/MAT/Bootstrap.py. This is where the core iterator behavior is defined. Look at the implementations of the CorpusSizeIterator, ValueIterator and IncrementIterator classes. Each of these classes has a __call__ method which loops through the possible values for the iterator. These methods are Python generators; they provide their successive values using the "yield" statement. The __call__ method is passed a subdirectory name and a dictionary of keywords that will be used to configure the TrainingRunInstance or TestRunInstance, and on each iteration it yields an augmented subdirectory name which encodes the iteration value, together with a new, modified dictionary of keywords. Note that the iterator has to copy the relevant keyword dictionaries for each new iteration, so that its iterative changes don't "bleed" from one iteration to the next.
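The generator pattern just described can be sketched in a self-contained way. The class and attribute names below are invented for illustration; they are not MAT's actual iterator classes, but they show the yield-per-value shape and the dictionary copying that prevents bleed.

```python
# Illustrative iterator: for each value, yield an augmented subdirectory
# name encoding the value, plus a *copied* keyword dictionary, so that
# changes never bleed across iterations or back into the caller's dict.

class ValueSweepIterator:
    def __init__(self, attribute, values):
        self.attribute = attribute
        self.values = values

    def __call__(self, subdir, kw):
        for v in self.values:
            # Copy before modifying: each iteration owns its own dict.
            new_kw = dict(kw)
            new_kw[self.attribute] = v
            yield "%s_%s_%s" % (subdir, self.attribute, v), new_kw

base = {"engine": "carafe"}
for subdir, kw in ValueSweepIterator("gaussian_prior", [1.0, 2.0])("run", base):
    print(subdir, kw)
# The caller's base dictionary is untouched afterward.
```

If new_kw were the same object as kw, the value set in one iteration would still be present in the next, which is exactly the "bleed" the real iterators guard against.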

Next, look in MAT_PKG_HOME/lib/mat/python/MAT/CarafeTrain.py. Again, look at the implementation of the CorpusSizeIterator, ValueIterator and IncrementIterator classes. These are specializations of the classes in Bootstrap.py, and the primary purpose of the specialization is to make the iterator settings available to the experiment XML. You'll see that each of these classes has a class-level "argList" declaration which consists of a list of Option objects. These Option objects are special versions of the Option objects in Python's optparse library which have been extended to work not only with command-line invocations but also with XML invocations. The "dest" attribute of each Option should match a keyword in the __init__ method for the class.

You'll want to place your customized iterator in a file in <your_task_directory>/python. If you put it in MyIterator.py, and you name the class MyIterator, you can refer to it in the "type" attribute of the <iterator> element in your experiment XML as "MyIterator.MyIterator".