Let's say that you have your own annotation tool, and you'd
rather use that tool than the MAT tool.
First, you should be aware that you'll probably only be able to
interact with MAT in file mode; the MAT Web server provides
crucial support for editing files in workspaces, and replicating
that support would require extensive (and undocumented)
modifications to your annotation tool. While it's possible to
open and save workspace files using MATWorkspaceEngine with the
--debug option, it's exceptionally clumsy.
Second, you'll have to make accommodations for how your tool's
tokenization assumptions differ from those in MAT.
MAT's default Carafe tagging and training engine treats tokens
as the basic atomic elements for training and tagging. Most
hand annotation tools do not expect, respect, or enforce explicit
tokenization, and this frequently leads to tiny mismatches between
the explicit tokenization that a tool like Carafe can digest and
the implicit tokenization of the annotation tool. We address these
mismatches by supporting workflows which contain tokenization as
an explicit step, and the MAT hand annotation tool enforces the
generalization that spanned content annotations are synchronized
with token boundaries. That is, the only character indexes in a
document which are eligible to be the start index of a content
annotation are the start indexes of token annotations, and the
only character indexes which are eligible to be the end index are
the end indexes of token annotations.
So if you're using the MAT UI, and you're planning on using your
documents for training Carafe, you should configure your hand
annotation workflow to require tokenization; otherwise, your
training data may not align with the Carafe engine's notion of
atomic elements, which will render that portion of your data
unusable.
If your hand annotation tool respects explicit tokenization, and
you're willing to ensure that tokenization happens before hand
annotation, you can either set up your annotation tool to read and
write the MAT document format
(for which we provide Python and Java libraries), or produce a reader/writer which can
understand your tool's format.
If your hand annotation tool does not respect explicit tokenization, you should try inserting the Align step into your workflow after hand annotation applies. This step expands content annotations to the boundary of whatever tokens the content annotation overlaps at its edges.
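To make the effect concrete, here is a minimal, standalone sketch of
the expansion the Align step performs, written against plain
(start, end) tuples rather than the MAT annotation API; the function
name and data layout are purely illustrative.
def align_to_tokens(content_span, token_spans):
    # Expand a content annotation outward to the boundaries of whatever
    # tokens it overlaps at its edges.
    start, end = content_span
    for tok_start, tok_end in token_spans:
        if tok_start <= start < tok_end:
            start = tok_start
        if tok_start < end <= tok_end:
            end = tok_end
    return start, end

# E.g., a content span (3, 10) over tokens (0, 4), (5, 9), (9, 14)
# is expanded to (0, 14).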
Under construction.
It is possible, in MAT, to use a different training and tagging
engine than the default Carafe engine. We haven't yet documented
how to do this; when we do, the documentation will appear here.
Some of the following customizations require you to customize
your task implementation. In this section, we show you how to do
that, in preparation for some of the customizations to follow.
In order to do this, you'll have to specialize the core task object, in Python. This documentation is not a Python tutorial, and will not document the API of all the classes involved; in order to proceed much further, you should know Python, and you should be brave enough to wade through the MAT source code.
The way we customize the task implementation is to add a new
Python class, and then refer to it. In your task directory, create
a file named python/<file>.py, where <file> is a name
of your choice. Then, add the following code:
from MAT.PluginMgr import PluginTaskDescriptor

class <myclassname>(PluginTaskDescriptor):
    pass
where <myclassname> is a name of your choice. This creates
a Python subclass of the general task. ("pass" is a placeholder
for content in the class definition; we'll remove it when we have
something to put there).
Now, in your task.xml file, you must modify your task definition
to refer to this class using the "class" attribute, as follows:
<task name="<yourtaskname>" class="<file>.<myclassname>">
...
</task>
So let's say your task is named "My task", and you've named your
file python/MyModule.py, and in that file you have this
definition:
class MyTask(PluginTaskDescriptor):
    ...
In your task.xml file, you should now have this:
<task name="My task" class="MyModule.MyTask">
...
</task>
MAT comes with a default zone
annotation, which you'll typically inherit when you define your task. If you choose to
add or provide your own zone annotations, and you don't want the
default zone annotation to be used when MAT zones your document,
unfortunately the only way to change this at the moment requires
you to customize
your task implementation and redefine the getTrueZoneInfo
method.
If your zone annotation has an attribute to distinguish between
types of regions, you can specify the attribute and its recognized
values (default value first):
class MyTask(PluginTaskDescriptor):

    def getTrueZoneInfo(self):
        return "myzonetag", "myregionattr", ["myregionvalue", "mynondefaultregionvalue"]
Or, if there's no such attribute, return None in those positions:
class MyTask(PluginTaskDescriptor):

    def getTrueZoneInfo(self):
        return "myzonetag", None, None
MAT comes with a few predefined workspace folders, and a means
for moving documents between them. Under some circumstances, you
might want to add a folder. In this example, let's suppose that
your task includes a summarization capability that produces new,
summarized documents from documents that are already tagged, and
that you want to save these summarized documents in your
workspace.
First, customize
your task implementation.
Now, to add a workspace folder, you'll add a workspaceCustomize
method to your task descriptor:
class MyTask(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
Here, we add a folder named "summarized", which we can't import
documents directly into.
Next, we need to add some behavior. In our example, we want to be
able to apply a summarize action to documents in the completed
folder in the workspace, and have the results land in the
summarized folder. So our Python really looks like this:
from MAT.PluginMgr import PluginTaskDescriptor
from MAT.Workspace import WorkspaceOperation

class MyTask(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
        workspace.folders["completed"].addOperation("summarize", SummarizationOperation)

class SummarizationOperation(WorkspaceOperation):

    name = "summarize"

    ...
We won't describe the full implementation of operations here; see
MAT_PKG_HOME/lib/mat/python/MAT/Workspace.py for examples.
Under construction.
Steps in MAT are written in Python. This documentation is not a
Python tutorial, and will not document the API of all the classes
involved; in order to proceed much further, you should know
Python, and you should be brave enough to wade through the MAT
source code.
They should be defined in your task directory in
python/<file>.py, where <file> is a name of your
choice. When you refer to those steps in your task.xml file,
you'll refer to them as "<file>.<stepname>".
Here's the skeleton of a step:
from MAT.PluginMgr import PluginStep

class MyStep(PluginStep):

    def do(self, annotSet, **kw):
        # ... make modifications to annotSet
        return annotSet
The annotSet is a rich document, defined in
MAT_PKG_HOME/lib/mat/python/MAT/Document.py. This class has
methods to add and modify annotations, which is mostly what steps
do. For examples of how steps use this API, see the class
descendants of PluginStep in
MAT_PKG_HOME/lib/mat/python/MAT/PluginMgr.py.
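As a purely illustrative sketch of that pattern, the step below marks
every all-uppercase stretch of the signal as a content annotation. The
createAnnotation call and the "NAME" label are assumptions made for
this example; check the annotation methods in Document.py (and the
annotation types your task actually defines) before copying it.
import re

from MAT.PluginMgr import PluginStep

class UppercaseNameStep(PluginStep):

    CAP_RE = re.compile(r"\b[A-Z]{2,}\b")

    def do(self, annotSet, **kw):
        # annotSet.signal is the document text; for each all-caps match,
        # add a (hypothetical) "NAME" content annotation by side effect.
        for m in self.CAP_RE.finditer(annotSet.signal):
            annotSet.createAnnotation(m.start(), m.end(), "NAME")
        return annotSet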
Most steps work by side effect, although it's possible to return
a different document than the one you were handed, and MAT will
recognize that as a new document. Most toolchains will not take
advantage of this capability.
Steps have three methods: undo(), do() and doBatch(). By default,
doBatch() calls do() for each document it's passed. You can define
a special doBatch() if you have batch-level operations (e.g., if
tagging every document in a directory is faster than calling the
tagger for each document). All three methods can have keyword
arguments, which are defined by an "argList" class variable (see
PluginMgr.py for examples). Every keyword argument passed to the
engine is passed to every step, so your function signature must
always end with **kw.
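For instance, a step that accepts its own keyword argument and
overrides doBatch() might be sketched as follows; the my_margin
keyword and the doBatch() signature shown here are assumptions made
for illustration, and in a real step the keyword would be declared in
the argList class variable (see PluginMgr.py for examples).
from MAT.PluginMgr import PluginStep

class MyTagStep(PluginStep):

    # In a real step, my_margin would be declared in an "argList" class
    # variable so it can be set from the command line or experiment XML.

    def do(self, annotSet, my_margin = 0, **kw):
        # Every keyword passed to the engine reaches every step, so the
        # signature must end with **kw even if this step ignores them.
        # ... tag annotSet, perhaps using my_margin ...
        return annotSet

    def doBatch(self, annotSets, **kw):
        # By default doBatch() just calls do() on each document; override
        # it only when one batch call is cheaper than many single calls.
        return [self.do(annotSet, **kw) for annotSet in annotSets]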
A common action might be to ensure that all files are in ASCII
format with Unix line endings. Here's how you'd do that in your
task:
from MAT.PluginMgr import CleanStep

class MyCleanStep(CleanStep):

    def do(self, annotSet, **kw):
        return self.truncateToUnixAscii(annotSet)
The truncateToUnixAscii method is defined on CleanStep, so you
should inherit from there.
Note: because this step
changes the signal of the document, it must be the first step in
any workflow, and it cannot be undone; undoing any step that
inherits from CleanStep will raise an error.
You may want to establish a single zone in your document, in
between <TXT> and </TXT>. Here's how you'd do that:
import re

from MAT.PluginMgr import ZoneStep

class MyZoneStep(ZoneStep):

    TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)

    # AMS drops attribute values, not the attribute itself.
    # That should probably be fixed. In any case, I'll get
    # none of the n attribute values.

    def do(self, annotSet, **kw):
        # There's <DOC> and <TXT>, and
        # everything in between the <TXT> is fair game.
        m = self.TXT_RE.search(annotSet.signal)
        if m is not None:
            self.addZones(annotSet, [(m.start(1), m.end(1), "body")])
        else:
            self.addZones(annotSet, [(0, len(annotSet.signal), "body")])
        return annotSet
The addZones method is defined on the ZoneStep class, and makes
use of the getTrueZoneInfo method (which you have to specialize if
you're changing the default zone annotation). For its
implementation and more examples of its use, see PluginMgr.py.
Under construction.
In some cases, you might want to define your own experiment
engine iterator, if the default iterators we provide aren't
adequate. For instance, you may have two attributes in your
training engine which you want to iterate on in tandem, rather
than over the cross-product of those values. While providing a
guide to this is beyond the scope of this documentation, we can
provide some hints.
First, look in MAT_PKG_HOME/lib/mat/python/MAT/Bootstrap.py. This
is where the core iterator behavior is defined. Look at the
implementations of the CorpusSizeIterator, ValueIterator and
IncrementIterator classes. These classes each have a
__call__ method which loops through the possible values for the
iterator. These methods are Python generators; they provide their
successive values using the "yield" statement. The __call__ method
is passed a subdirectory name and a dictionary of keywords that
will be used to configure the TrainingRunInstance or
TestRunInstance, and it yields on each iteration an augmented
subdirectory name which encodes the iteration value, and a new,
modified dictionary of keywords. Note that the iterator has to
copy the relevant keyword dictionaries for each new iteration, so
that its iterative changes don't "bleed" from one iteration to the
next.
Next, look in MAT_PKG_HOME/lib/mat/python/MAT/CarafeTrain.py.
Again, look at the implementation of the CorpusSizeIterator,
ValueIterator and IncrementIterator classes. These are
specializations of the classes in Bootstrap.py, and the primary
purpose of the specialization is to make the iterator settings
available to the experiment XML. You'll see that each of these
classes has a class-level "argList" declaration which consists of
a list of Option objects. These Option objects are special
versions of the Option objects in Python's optparse library which
have been extended to work not only with command-line invocations
but also with XML invocations. The "dest" attribute of each Option
should match a keyword in the __init__ method for the class.
You'll want to place your customized iterator in a file in
<your_task_directory>/python. If you put it in
MyIterator.py, and you name the class MyIterator, you can refer to
it in the "type" attribute of the <iterator> element in your
experiment XML as "MyIterator.MyIterator".
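Here's a hedged sketch of what such an iterator might look like,
under the assumptions just described. The class is shown standalone
for clarity, and the "width" and "depth" keywords are invented for
the example; in practice you'd copy the base class, constructor
arguments, and argList declaration from the iterators in
CarafeTrain.py.
# python/MyIterator.py (illustrative only)

class MyIterator:

    def __init__(self, widths, depths):
        # Two hypothetical training-engine settings we want to vary in
        # tandem, rather than over their cross-product.
        self.pairs = list(zip(widths, depths))

    def __call__(self, subdir, kwDict):
        # The experiment engine passes a subdirectory name and a dictionary
        # of keywords used to configure the TrainingRunInstance or
        # TestRunInstance. Yield an augmented subdirectory name that encodes
        # the iteration values, plus a *copy* of the keyword dictionary, so
        # changes don't bleed from one iteration to the next.
        for width, depth in self.pairs:
            newKw = dict(kwDict)
            newKw["width"] = width
            newKw["depth"] = depth
            yield "%s_w%s_d%s" % (subdir, width, depth), newKw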