Customizing a Task: Advanced Topics

Using a different annotation tool

Let's say that you have your own annotation tool, and you'd rather use that tool than the MAT tool. You'll have to make accommodations for how this tool's assumptions differ from those in MAT.

One of the most crucial assumptions concerns tokens. While it's possible to use portions of MAT without tokens, we don't guarantee that the entire suite will work if you don't provide them. If tokens are present, we try to enforce the generalization that content annotations are synchronized with token boundaries. That is, the only character indexes in a document which are eligible to be the start index of a content annotation are the start indexes of token annotations, and the only character indexes which are eligible to be the end index are the end indexes of token annotations. The reason for this generalization is that the Carafe engine treats tokens as the basic atomic elements for training and tagging, and those basic elements must be consistent with hand annotation, so that the hand-annotated data is appropriate for training.
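
The synchronization rule can be illustrated with a minimal sketch in plain Python. It does not use the MAT API: annotations and tokens are represented here as simple (start, end) tuples, and the function name is_synchronized is hypothetical.

```python
def is_synchronized(annotation, tokens):
    """Return True if the annotation starts at the start index of some
    token and ends at the end index of some token.

    Both the annotation and the tokens are (start, end) character
    offsets; this representation is an assumption for illustration.
    """
    start, end = annotation
    token_starts = {t_start for t_start, _ in tokens}
    token_ends = {t_end for _, t_end in tokens}
    return start in token_starts and end in token_ends


# "John Smith left." tokenized as: John / Smith / left / .
tokens = [(0, 4), (5, 10), (11, 15), (15, 16)]

print(is_synchronized((0, 10), tokens))  # "John Smith": both edges on token boundaries
print(is_synchronized((0, 8), tokens))   # "John Smi": ends inside the token "Smith"
```

An annotation that fails this check is the kind of hand annotation that may not line up with the engine's atomic elements.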

In MAT, this assumption is enforced by the standard configuration of the MAT hand annotation tool. You should not configure a workflow to allow hand annotation unless tokenization is already done; while it is possible to hand annotate without tokens present, we don't recommend it. If you do, your training data may not align with the Carafe engine's notion of atomic elements, which will render that portion of your data unusable.

If your hand annotation tool respects token boundaries, and you're willing to ensure that tokenization happens before hand annotation, all you need to do is write out the MAT document format, for which we provide Python and Java libraries. But most hand annotation tools do not expect, respect, or enforce token annotation during other hand annotation phases. If you want to use such a tool, you should try inserting the Align step into your workflow after the hand annotation step. This step expands each content annotation to the boundaries of whatever tokens it overlaps at its edges.
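
The expansion that the Align step performs can be sketched roughly as follows. This is an illustration under stated assumptions, not the MAT implementation: annotations and tokens are plain (start, end) tuples, and align_to_tokens is a hypothetical name.

```python
def align_to_tokens(annotation, tokens):
    """Expand an annotation so that each edge falling strictly inside a
    token is moved outward to that token's boundary.

    Illustrative only; the real Align step may handle edge cases
    (e.g. annotations that start in whitespace) differently.
    """
    start, end = annotation
    new_start, new_end = start, end
    for t_start, t_end in tokens:
        # Left edge falls inside this token: move it to the token start.
        if t_start < start < t_end:
            new_start = t_start
        # Right edge falls inside this token: move it to the token end.
        if t_start < end < t_end:
            new_end = t_end
    return (new_start, new_end)


# "John Smith left." tokenized as: John / Smith / left / .
tokens = [(0, 4), (5, 10), (11, 15), (15, 16)]

# An annotation covering "ith le" is widened to cover "Smith left".
print(align_to_tokens((7, 13), tokens))
# An annotation already on token boundaries is left unchanged.
print(align_to_tokens((0, 10), tokens))
```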

Make a special workflow for document preparation

Under construction.

Make a special workspace folder for these documents

Under construction.

Using a different training and tagging engine

It is possible, in MAT, to use a different training and tagging engine than the default Carafe engine. When we get around to documenting how to do this, it will be documented here.

Adding workspace folders

MAT comes with a few predefined workspace folders, and a means for moving documents between them. Under some circumstances, you might want to add a folder. In this example, let's suppose that your task includes a summarization capability that produces new, summarized documents from documents that are already tagged, and that you want to save these summarized documents in your workspace.

In order to do this, you'll have to specialize the core task object, in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.

You'll customize your task implementation in your task directory in python/<file>.py, where <file> is a name of your choice. When you refer to this implementation in your task.xml file, you'll refer to it as "<file>.<classname>", in the class attribute of your <task> element. For instance, if you place the following in the file python/MyPlugin.py:

from MAT.PluginMgr import PluginTaskDescriptor

class MyTaskDescriptor(PluginTaskDescriptor):
    ...

you'll refer to it in task.xml as follows:

<task name="My Task" class="MyPlugin.MyTaskDescriptor">
...
</task>

To add a workspace folder, you'll add a workspaceCustomize method to your task descriptor:

class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)

Here, we add a folder named "summarized". Because importTarget is False, documents cannot be imported directly into it.

Next, we need to add some behavior. In our example, we want to be able to apply a summarize action to documents in the completed folder in the workspace, and have the results land in the summarized folder. So our Python really looks like this:

from MAT.PluginMgr import PluginTaskDescriptor
from MAT.Workspace import WorkspaceOperation

class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
        workspace.folders["completed"].addOperation("summarize", SummarizationOperation)

class SummarizationOperation(WorkspaceOperation):

    name = "summarize"
    ...

We won't describe the full implementation of operations here; see MAT_PKG_HOME/lib/mat/python/MAT/Workspace.py for examples.

Using task settings

Under construction.

Creating your own steps

Steps in MAT are written in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.

Your steps should be defined in your task directory in python/<file>.py, where <file> is a name of your choice. When you refer to those steps in your task.xml file, you'll refer to them as "<file>.<stepname>".

Here's the skeleton of a step:

from MAT.PluginMgr import PluginStep

class MyStep(PluginStep):

    def do(self, annotSet, **kw):
        # ... make modifications to annotSet
        return annotSet

The annotSet is a rich document, defined in MAT_PKG_HOME/lib/mat/python/MAT/Document.py. This class has methods to add and modify annotations, which is mostly what steps do. For examples of how steps use this API, see the class descendants of PluginStep in MAT_PKG_HOME/lib/mat/python/MAT/PluginMgr.py.

Most steps work by side effect, although it's possible to return a different document than the one you were handed, and MAT will recognize that as a new document. Most toolchains will not take advantage of this capability.

Steps have three methods: undo(), do() and doBatch(). By default, doBatch() calls do() for each document it's passed. You can define a special doBatch() if you have batch-level operations (e.g., if tagging every document in a directory is faster than calling the tagger for each document). All three methods can have keyword arguments, which are defined by an "argList" class variable (see PluginMgr.py for examples). Every keyword argument passed to the engine is passed to every step, so your function signature must always end with **kw.
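
The relationship between do(), doBatch() and **kw can be sketched with a stand-in base class. Everything below is hypothetical and self-contained: StandInStep is not MAT's PluginStep, documents are represented as plain strings, and the doBatch() signature and return value are assumptions made for illustration (see PluginMgr.py for the real ones).

```python
class StandInStep(object):
    """Hypothetical stand-in for a step base class; NOT MAT's PluginStep."""

    def do(self, annotSet, **kw):
        return annotSet

    def doBatch(self, annotSets, **kw):
        # Default behavior described above: call do() once per document,
        # forwarding every keyword argument unchanged.
        return [self.do(a, **kw) for a in annotSets]


class UppercaseStep(StandInStep):

    def do(self, annotSet, suffix="", **kw):
        # 'suffix' is a hypothetical keyword argument this step uses.
        # **kw absorbs keyword arguments that were meant for other steps,
        # which is why the signature must end with it.
        return annotSet.upper() + suffix


step = UppercaseStep()
# 'tagger_model' stands for an argument intended for some other step.
results = step.doBatch(["one", "two"], suffix="!", tagger_model="ignored")
print(results)
```

Without **kw in the signature, the call above would fail with a TypeError on the unexpected tagger_model argument.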

Clean step

A common action might be to ensure that all files are in ASCII format with Unix line endings. Here's how you'd do that in your task:

from MAT.PluginMgr import CleanStep

class MyCleanStep(CleanStep):

    def do(self, annotSet, **kw):
        return self.truncateToUnixAscii(annotSet)

The truncateToUnixAscii method is defined on CleanStep, so your step should inherit from CleanStep.

Note: because this step changes the signal of the document, it must be the first step in any workflow, and it cannot be undone; undoing any step that inherits from CleanStep will raise an error.

Zone step

You may want to establish a single zone in your document, in between <TXT> and </TXT>. Here's how you'd do that:

import re

from MAT.PluginMgr import ZoneStep

class MyZoneStep(ZoneStep):

    TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)

    # AMS drops attribute values, not the attribute itself.
    # That should probably be fixed. In any case, I'll get
    # none of the n attribute values.

    def do(self, annotSet, **kw):
        # There's <DOC> and <TXT>, and
        # everything in between the <TXT> is fair game.
        m = self.TXT_RE.search(annotSet.signal)
        if m is not None:
            self.addZones(annotSet, [(m.start(1), m.end(1), "body")])
        else:
            self.addZones(annotSet, [(0, len(annotSet.signal), "body")])

        return annotSet

The addZones method is defined on the ZoneStep class. For its implementation and more examples of its use, see PluginMgr.py.
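
The offsets that this step hands to addZones can be checked outside MAT with a short sketch. It uses only the standard re module and the same regular expression; find_body_zone is a hypothetical helper that returns the (start, end, label) tuple instead of calling addZones.

```python
import re

TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)


def find_body_zone(signal):
    """Return the (start, end, label) tuple the zone step would compute."""
    m = TXT_RE.search(signal)
    if m is not None:
        # Group 1 is the text strictly between <TXT> and </TXT>.
        return (m.start(1), m.end(1), "body")
    # No <TXT> element: the whole signal is the body.
    return (0, len(signal), "body")


signal = "<DOC>\n<TXT>\nHello world.\n</TXT>\n</DOC>"
start, end, label = find_body_zone(signal)
print(repr(signal[start:end]))

# Without any markup, the zone spans the whole string.
print(find_body_zone("no markup here"))
```

Note that re.S lets "." match newlines, so the zone can span multiple lines, and re.I makes the tag match case-insensitive. Because ".*" is greedy, a document with several <TXT> elements would yield one zone running from the first <TXT> to the last </TXT>.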

Tokenize step

Under construction.

Defining your own reader/writer

It's not too difficult to define your own reader/writer. The file MAT_PKG_HOME/lib/mat/python/MAT/XMLIO.py provides a good example. The template looks like this:

from MAT.DocumentIO import declareDocumentIO, DocumentFileIO, SaveError
from MAT.Document import LoadError

class MyIO(DocumentFileIO):

    def deserialize(self, s, annotDoc):
        ...

    def writeToUnicodeString(self, annotDoc):
        ...

declareDocumentIO("my-io", MyIO, True, True)

The arguments to deserialize() are the input data from the file and an annotated document to populate; see XMLIO.py for an example of how to populate it. writeToUnicodeString() should return a Unicode string which serializes the annotated document passed in. In order to do this, you'll have to familiarize yourself with the API for manipulating documents and annotations, which is not documented but reasonably easy to understand from the source code. Once you do all this, the file type name you assign to the class via the call to declareDocumentIO() will be globally available.
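
The division of labor between the two methods can be sketched with stand-ins. Nothing below is MAT code: ToyDoc is a hypothetical replacement for the annotated document class, ToyIO does not inherit from DocumentFileIO, and the JSON layout is a toy format invented for this illustration, not MAT's document format.

```python
import json


class ToyDoc(object):
    """Hypothetical stand-in for an annotated document; not MAT's class."""

    def __init__(self):
        self.signal = ""
        self.annotations = []  # list of (start, end, label) tuples


class ToyIO(object):
    """Hypothetical reader/writer following the two-method template."""

    def deserialize(self, s, annotDoc):
        # Populate the document that was passed in, rather than
        # creating and returning a new one.
        data = json.loads(s)
        annotDoc.signal = data["signal"]
        annotDoc.annotations = [tuple(a) for a in data["annotations"]]

    def writeToUnicodeString(self, annotDoc):
        # Return a string which serializes the document.
        return json.dumps({"signal": annotDoc.signal,
                           "annotations": [list(a) for a in annotDoc.annotations]})


io = ToyIO()
doc = ToyDoc()
io.deserialize('{"signal": "John left.", "annotations": [[0, 4, "PERSON"]]}', doc)
print(doc.annotations)

# Writing and re-reading should reproduce the same document contents.
copy = ToyDoc()
io.deserialize(io.writeToUnicodeString(doc), copy)
print(copy.signal == doc.signal and copy.annotations == doc.annotations)
```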

You can also define command-line arguments which will be accepted by the tools when this file type is used. XMLIO.py also exemplifies this.