Let's say that you have your own annotation tool, and you'd
rather use that tool than the MAT tool.
First, you should be aware that you'll probably only be able to
interact with MAT in file mode; the MAT Web server provides
crucial support for editing files in workspaces, and replicating
that support would require extensive (and undocumented)
modifications to your annotation tool. While it's possible to
open and save workspace files using MATWorkspaceEngine with the
--debug option, it's exceptionally clumsy.
Second, you'll have to make accommodations for how your tool's
tokenization assumptions differ from those in MAT.
MAT's default Carafe tagging and training engine treats tokens
as the basic atomic elements for training and tagging. Most
hand annotation tools do not expect, respect, or enforce explicit
tokenization, and this frequently leads to tiny mismatches between
the explicit tokenization that a tool like Carafe can digest and
the implicit tokenization of the annotation tool. We address these
mismatches by supporting workflows which contain tokenization as
an explicit step, and the MAT hand annotation tool enforces the
generalization that spanned content annotations are synchronized
with token boundaries. That is, the only character indexes in a
document which are eligible to be the start index of a content
annotation are the start indexes of token annotations, and the
only character indexes which are eligible to be the end index are
the end indexes of token annotations.
So if you're using the MAT UI, and you're planning on using your
documents for training Carafe, you should configure your hand
annotation workflow to require tokenization; otherwise, your
training data may not align with the Carafe engine's notion of
atomic elements, which will render that portion of your data
unusable.
If your hand annotation tool respects explicit tokenization, and
you're willing to ensure that tokenization happens before hand
annotation, you can either set up your annotation tool to read and
write the MAT document format
(for which we provide Python and Java libraries), or produce a reader/writer which can
understand your tool's format.
If your hand annotation tool does not respect explicit tokenization, you should try inserting the Align step into your workflow after hand annotation applies. This step expands content annotations to the boundary of whatever tokens the content annotation overlaps at its edges.
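To make the effect concrete, here is a minimal, standalone sketch of
the expansion the Align step performs, written against plain
(start, end) tuples rather than the MAT annotation API; the function
name and data layout are purely illustrative.
def align_to_tokens(content_span, token_spans):
    # Expand a content annotation outward to the boundaries of whatever
    # tokens it overlaps at its edges.
    start, end = content_span
    for tok_start, tok_end in token_spans:
        if tok_start <= start < tok_end:
            start = tok_start
        if tok_start < end <= tok_end:
            end = tok_end
    return start, end

# E.g., a content span (3, 10) over tokens (0, 4), (5, 9), (9, 14)
# is expanded to (0, 14).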
Under construction.
It is possible, in MAT, to use a different training and tagging
engine than the default Carafe engine. We haven't yet documented
how to do this; when we do, the documentation will appear here.
Some of the following customizations require you to customize
your task implementation. In this section, we show you how to do
that, in preparation for some of the customizations to follow.
In order to do this, you'll have to specialize the core task object, in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.
The way we customize the task implementation is to add a new
Python class, and then refer to it. In your task directory, create
a file named python/<file>.py, where <file> is a name
of your choice. Then, add the following code:
from MAT.PluginMgr import PluginTaskDescriptor

class <myclassname>(PluginTaskDescriptor):
    pass
where <myclassname> is a name of your choice. This creates
a Python subclass of the general task. ("pass" is a placeholder
for content in the class definition; we'll remove it when we have
something to put there).
Now, in your task.xml file, you must modify your task definition
to refer to this class using the "class" attribute, as follows:
<task name="<yourtaskname>" class="<file>.<myclassname>">
...
</task>
So let's say your task is named "My task", and you've named your
file python/MyModule.py, and in that file you have this
definition:
class MyTask(PluginTaskDescriptor):
    ...
In your task.xml file, you should now have this:
<task name="My task" class="MyModule.MyTask">
...
</task>
MAT comes with a default zone
annotation, which you'll typically inherit when you define your task. If you choose to
add or provide your own zone annotations, and you don't want the
default zone annotation to be used when MAT zones your document,
unfortunately the only way to change this at the moment requires
you to customize
your task implementation and redefine the getTrueZoneInfo
method.
If your zone annotation has an attribute to distinguish between
types of regions, you can specify the attribute and its recognized
values (default value first):
class MyTask(PluginTaskDescriptor):

    def getTrueZoneInfo(self):
        return "myzonetag", "myregionattr", ["myregionvalue", "mynondefaultregionvalue"]
Or, if there's no such attribute, return None in those positions:
class MyTask(PluginTaskDescriptor):

    def getTrueZoneInfo(self):
        return "myzonetag", None, None
MAT comes with a few predefined workspace folders, and a means
for moving documents between them. Under some circumstances, you
might want to add a folder. In this example, let's suppose that
your task includes a summarization capability that produces new,
summarized documents from documents that are already tagged, and
that you want to save these summarized documents in your
workspace.
First, customize
your task implementation.
Now, to add a workspace folder, you'll add a workspaceCustomize
method to your task descriptor:
class MyTask(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
Here, we add a folder named "summarized", which we can't import
documents directly into.
Next, we need to add some behavior. In our example, we want to be
able to apply a summarize action to documents in the completed
folder in the workspace, and have the results land in the
summarized folder. So our Python really looks like this:
from MAT.PluginMgr import PluginTaskDescriptor
from MAT.Workspace import WorkspaceOperation

class MyTask(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
        workspace.folders["completed"].addOperation("summarize", SummarizationOperation)

class SummarizationOperation(WorkspaceOperation):

    name = "summarize"

    ...
We won't describe the full implementation of operations here; see
MAT_PKG_HOME/lib/mat/python/MAT/Workspace.py for examples.
Under construction.
Steps in MAT are written in Python. This documentation is not a
Python tutorial, and will not document the API of all the classes
involved; in order to proceed much further, you should know
Python, and you should be brave enough to wade through the MAT
source code.
They should be defined in your task directory in
python/<file>.py, where <file> is a name of your
choice. When you refer to those steps in your task.xml file,
you'll refer to them as "<file>.<stepname>".
Here's the skeleton of a step:
from MAT.PluginMgr import PluginStep

class MyStep(PluginStep):

    def do(self, annotSet, **kw):
        # ... make modifications to annotSet
        return annotSet
The annotSet is a rich document, defined in
MAT_PKG_HOME/lib/mat/python/MAT/Document.py. This class has
methods to add and modify annotations, which is mostly what steps
do. For examples of how steps use this API, see the class
descendants of PluginStep in
MAT_PKG_HOME/lib/mat/python/MAT/PluginMgr.py.
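As a purely illustrative sketch of that pattern, the step below marks
every all-uppercase stretch of the signal as a content annotation. The
createAnnotation call and the "NAME" label are assumptions made for
this example; check the annotation methods in Document.py (and the
annotation types your task actually defines) before copying it.
import re

from MAT.PluginMgr import PluginStep

class UppercaseNameStep(PluginStep):

    CAP_RE = re.compile(r"\b[A-Z]{2,}\b")

    def do(self, annotSet, **kw):
        # annotSet.signal is the document text; for each all-caps match,
        # add a (hypothetical) "NAME" content annotation by side effect.
        for m in self.CAP_RE.finditer(annotSet.signal):
            annotSet.createAnnotation(m.start(), m.end(), "NAME")
        return annotSet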
Most steps work by side effect, although it's possible to return
a different document than the one you were handed, and MAT will
recognize that as a new document. Most toolchains will not take
advantage of this capability.
Steps have three methods: undo(), do() and doBatch(). By default,
doBatch() calls do() for each document it's passed. You can define
a special doBatch() if you have batch-level operations (e.g., if
tagging every document in a directory is faster than calling the
tagger for each document). All three methods can have keyword
arguments, which are defined by an "argList" class variable (see
PluginMgr.py for examples). Every keyword argument passed to the
engine is passed to every step, so your function signature must
always end with **kw.
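For instance, a step that accepts its own keyword argument and
overrides doBatch() might be sketched as follows; the my_margin
keyword and the doBatch() signature shown here are assumptions made
for illustration, and in a real step the keyword would be declared in
the argList class variable (see PluginMgr.py for examples).
from MAT.PluginMgr import PluginStep

class MyTagStep(PluginStep):

    # In a real step, my_margin would be declared in an "argList" class
    # variable so it can be set from the command line or experiment XML.

    def do(self, annotSet, my_margin = 0, **kw):
        # Every keyword passed to the engine reaches every step, so the
        # signature must end with **kw even if this step ignores them.
        # ... tag annotSet, perhaps using my_margin ...
        return annotSet

    def doBatch(self, annotSets, **kw):
        # By default doBatch() just calls do() on each document; override
        # it only when one batch call is cheaper than many single calls.
        return [self.do(annotSet, **kw) for annotSet in annotSets]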
A common action might be to ensure that all files are in ASCII
format with Unix line endings. Here's how you'd do that in your
task:
from MAT.PluginMgr import CleanStep

class MyCleanStep(CleanStep):

    def do(self, annotSet, **kw):
        return self.truncateToUnixAscii(annotSet)
The truncateToUnixAscii method is defined on CleanStep, so you
should inherit from there.
Note: because this step
changes the signal of the document, it must be the first step in
any workflow, and it cannot be undone; undoing any step that
inherits from CleanStep will raise an error.
You may want to establish a single zone in your document, in
between <TXT> and </TXT>. Here's how you'd do that:
import re

from MAT.PluginMgr import ZoneStep

class MyZoneStep(ZoneStep):

    TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)

    # AMS drops attribute values, not the attribute itself.
    # That should probably be fixed. In any case, I'll get
    # none of the n attribute values.

    def do(self, annotSet, **kw):
        # There's <DOC> and <TXT>, and
        # everything in between the <TXT> is fair game.
        m = self.TXT_RE.search(annotSet.signal)
        if m is not None:
            self.addZones(annotSet, [(m.start(1), m.end(1), "body")])
        else:
            self.addZones(annotSet, [(0, len(annotSet.signal), "body")])
        return annotSet
The addZones method is defined on the ZoneStep class, and makes
use of the getTrueZoneInfo method (which you have to specialize if
you're changing the default zone annotation). For its
implementation and more examples of its use, see PluginMgr.py.
Under construction.
In some cases, you might want to define your own experiment
engine iterator, if the default iterators we provide aren't
adequate. For instance, you may have two attributes in your
training engine which you want to iterate on in tandem, rather
than over the cross-product of those values. While providing a
guide to this is beyond the scope of this documentation, we can
provide some hints.
First, look in MAT_PKG_HOME/lib/mat/python/MAT/Bootstrap.py. This
is where the core iterator behavior is defined. Look at the
implementations of the CorpusSizeIterator, ValueIterator and
IncrementIterator classes. These classes each have a
__call__ method which loops through the possible values for the
iterator. These methods are Python generators; they provide their
successive values using the "yield" statement. The __call__ method
is passed a subdirectory name and a dictionary of keywords that
will be used to configure the TrainingRunInstance or
TestRunInstance, and it yields on each iteration an augmented
subdirectory name which encodes the iteration value, and a new,
modified dictionary of keywords. Note that the iterator has to
copy the relevant keyword dictionaries for each new iteration, so
that its iterative changes don't "bleed" from one iteration to the
next.
Next, look in MAT_PKG_HOME/lib/mat/python/MAT/CarafeTrain.py.
Again, look at the implementation of the CorpusSizeIterator,
ValueIterator and IncrementIterator classes. These are
specializations of the classes in Bootstrap.py, and the primary
purpose of the specialization is to make the iterator settings
available to the experiment XML. You'll see that each of these
classes has a class-level "argList" declaration which consists of
a list of Option objects. These Option objects are special
versions of the Option objects in Python's optparse library which
have been extended to work not only with command-line invocations
but also with XML invocations. The "dest" attribute of each Option
should match a keyword in the __init__ method for the class.
You'll want to place your customized iterator in a file in
<your_task_directory>/python. If you put it in
MyIterator.py, and you name the class MyIterator, you can refer to
it in the "type" attribute of the <iterator> element in your
experiment XML as "MyIterator.MyIterator".
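Here's a hedged sketch of what such an iterator might look like,
under the assumptions just described. The class is shown standalone
for clarity, and the "width" and "depth" keywords are invented for
the example; in practice you'd copy the base class, constructor
arguments, and argList declaration from the iterators in
CarafeTrain.py.
# python/MyIterator.py (illustrative only)

class MyIterator:

    def __init__(self, widths, depths):
        # Two hypothetical training-engine settings we want to vary in
        # tandem, rather than over their cross-product.
        self.pairs = list(zip(widths, depths))

    def __call__(self, subdir, kwDict):
        # The experiment engine passes a subdirectory name and a dictionary
        # of keywords used to configure the TrainingRunInstance or
        # TestRunInstance. Yield an augmented subdirectory name that encodes
        # the iteration values, plus a *copy* of the keyword dictionary, so
        # changes don't bleed from one iteration to the next.
        for width, depth in self.pairs:
            newKw = dict(kwDict)
            newKw["width"] = width
            newKw["depth"] = depth
            yield "%s_w%s_d%s" % (subdir, width, depth), newKw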