Let's say that you have your own annotation tool, and you'd rather
use that tool than the MAT tool. You'll have to accommodate the ways
in which this tool's assumptions differ from MAT's.
One of the most crucial assumptions concerns tokens.
While it's possible to use portions of MAT without tokens, we don't
guarantee that the entire suite will work if you don't provide them.
If tokens are present, we try
to enforce the generalization that content annotations
are synchronized with token boundaries. That is, the only character
indexes in a document which are eligible to be the start index of a content
annotation are the start indexes of token annotations, and the only
character indexes which are eligible to be the end index are the end
indexes of token annotations. The reason for this generalization is
that the Carafe engine treats tokens as the basic atomic elements for
training and tagging, and those elements must be consistent
with the hand annotation, so that the hand-annotated data is appropriate
for training.
In MAT, this assumption is enforced by the
standard configuration of the MAT hand annotation tool; you should not
configure a workflow to allow hand annotation unless tokenization has
already been done, and while it is possible to hand annotate without
tokens present, we don't recommend it. If you do, your
training data may not align with the Carafe engine's notion of atomic
elements, which will render that portion of your data unusable.
If your hand annotation tool respects token boundaries, and you're
willing to ensure that tokenization happens before hand annotation, all
you need to do is be able to write out the MAT document format, for which we
provide Python and Java libraries.
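As an illustration, a minimal sketch of writing a document out with the
Python library might look like the following. The names AnnotatedDoc,
createAnnotation, getDocumentIO and "mat-json" are assumptions drawn
from our reading of Document.py and DocumentIO.py, so check them
against your version:

from MAT.Document import AnnotatedDoc
from MAT.DocumentIO import getDocumentIO

# Build a document around a signal and add one content annotation
# whose boundaries coincide with a token ("John"). In a real
# exporter, the signal and annotations come from your own tool.
doc = AnnotatedDoc(signal=u"John lives in Boston.")
doc.createAnnotation(0, 4, "PERSON")

# Serialize in the MAT document format; "mat-json" is assumed to be
# the declared name of that format.
getDocumentIO("mat-json").writeToTarget(doc, "out.json")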
But most hand annotation tools do not expect, respect, or enforce token
annotations during other hand annotation phases. If you want to use such
a tool, you should try inserting the Align step into your workflow
after the hand annotation step. This step expands content annotations to
the boundaries of whatever tokens the content annotation overlaps at its
edges.
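What this might look like in your task.xml is sketched below. The
element and step names here are hypothetical, patterned after the
<task> fragments shown later in this section, so check them against
your own task definition:

<!-- Hypothetical sketch: the workflow element and the step names
     ("zone", "tokenize", "hand annotation", "align") are assumptions
     about your task, not a verbatim MAT schema. -->
<workflow name="hand_annotation_with_align">
  <step name="zone"/>
  <step name="tokenize"/>
  <step name="hand annotation"/>
  <step name="align"/>
</workflow>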
Under construction.
It is possible, in MAT, to use a different training and tagging
engine than the default Carafe engine. When we get around to
documenting how to do this, the documentation will appear here.
MAT comes with a few predefined workspace folders, and a means for
moving documents between them. Under some circumstances, you might want
to add a folder. In this example, let's suppose that your task includes
a summarization capability that produces new, summarized documents from
documents that are already tagged, and that you want to save these
summarized documents in your workspace.
In order to do this, you'll have to specialize the core task object,
in Python. This documentation is not a Python tutorial, and will not
document the API of all the classes involved; in order to proceed much
further, you should know Python, and
you should be brave enough to wade through the MAT source code.
You'll customize your task implementation in your task directory in
python/<file>.py, where <file> is a name of your choice.
When you refer to this implementation in your task.xml file, you'll
refer to it as "<file>.<classname>", in the class attribute
of your <task> element. For instance, if you place the following
in the file python/MyPlugin.py:
from MAT.PluginMgr import PluginTaskDescriptor

class MyTaskDescriptor(PluginTaskDescriptor):
    ...
you'll refer to it in task.xml as follows:
<task name="My Task" class="MyPlugin.MyTaskDescriptor">
...
</task>
To add a workspace folder, you'll add a workspaceCustomize method to
your task descriptor:
class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
Here, we add a folder named "summarized", which we can't import
documents directly into.
Next, we need to add some behavior. In our example, we want to be
able to apply a summarize action to documents in the completed folder
in the workspace, and have the results land in the summarized folder.
So our Python really looks like this:
from MAT.PluginMgr import PluginTaskDescriptor
from MAT.Workspace import WorkspaceOperation

class MyTaskDescriptor(PluginTaskDescriptor):

    def workspaceCustomize(self, workspace, create = False):
        workspace.addFolder("summarized", create = create,
                            description = "summarized versions of annotated documents",
                            importTarget = False)
        workspace.folders["completed"].addOperation("summarize", SummarizationOperation)

class SummarizationOperation(WorkspaceOperation):

    name = "summarize"

    ...
We won't describe the full implementation of operations here; see
MAT_PKG_HOME/lib/mat/python/MAT/Workspace.py for examples.
Under construction.
Steps in MAT are written in Python. This documentation is not a
Python tutorial, and will not document the API of all the classes
involved; in order to proceed much further, you should know Python, and
you should be brave enough to wade through the MAT source code.
They should be defined in your task directory in
python/<file>.py, where <file> is a name of your choice.
When you refer to those steps in your task.xml file, you'll refer to
them as "<file>.<stepname>".
Here's the skeleton of a step:
from MAT.PluginMgr import PluginStep

class MyStep(PluginStep):

    def do(self, annotSet, **kw):
        # ... make modifications to annotSet ...
        return annotSet
The annotSet is a rich document, defined in
MAT_PKG_HOME/lib/mat/python/MAT/Document.py. This class has methods to
add and modify annotations, which is mostly what steps do. For examples
of how steps use this API, see the class descendants of PluginStep in
MAT_PKG_HOME/lib/mat/python/MAT/PluginMgr.py.
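To give a feel for it, here's a hypothetical do() body that adds a
single annotation. The createAnnotation call is an assumption about the
Document.py API, so verify the method name and signature in the source:

from MAT.PluginMgr import PluginStep

class TrivialTagStep(PluginStep):

    def do(self, annotSet, **kw):
        # Hypothetical sketch: label the first whitespace-delimited
        # word of the signal. createAnnotation(start, end, label) is
        # an assumption; check Document.py for the real API.
        words = annotSet.signal.split()
        if words:
            annotSet.createAnnotation(0, len(words[0]), "TEST")
        return annotSet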
Most steps work by side effect, although it's possible to return a
different document than the one you were handed, and MAT will recognize
that as a new document. Most toolchains will not take advantage of this
capability.
Steps have three methods: undo(), do() and doBatch(). By default,
doBatch() calls do() for each document it's passed. You can define a
special doBatch() if you have batch-level operations (e.g., if tagging
every document in a directory is faster than calling the tagger for
each document). All three methods can have keyword arguments, which are
defined by an "argList" class variable (see PluginMgr.py for examples).
Every keyword argument passed to the engine is passed to every step, so
your function signature must always end with **kw.
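Here's a sketch of what this can look like. The import location of the
Option class is an assumption (PluginMgr.py has the real argList
examples), and the doBatch() signature should be checked there as well:

from MAT.PluginMgr import PluginStep
from MAT.Operation import Option  # assumed location; see PluginMgr.py

class MyTaggerStep(PluginStep):

    # Declares a --case_sensitive flag; its value reaches do() and
    # doBatch() as the case_sensitive keyword argument.
    argList = [Option("--case_sensitive", dest="case_sensitive",
                      action="store_true",
                      help="match case when tagging")]

    def do(self, annotSet, case_sensitive=False, **kw):
        # ... tag a single document, honoring case_sensitive ...
        return annotSet

    def doBatch(self, annotSets, case_sensitive=False, **kw):
        # Override this only if tagging a whole batch is cheaper than
        # per-document calls; the per-document loop shown here is what
        # the default already does.
        return [self.do(a, case_sensitive=case_sensitive, **kw)
                for a in annotSets]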
A common action might be to ensure that all files are in ASCII
format with Unix line endings. Here's how you'd do that in your task:
from MAT.PluginMgr import CleanStep

class MyCleanStep(CleanStep):

    def do(self, annotSet, **kw):
        return self.truncateToUnixAscii(annotSet)
The truncateToUnixAscii method is defined on CleanStep, which is why
your step inherits from it rather than from PluginStep.
Note: because this step
changes the signal of the document, it must be the first step in any
workflow, and it cannot be undone; undoing any step that inherits from
CleanStep will raise an error.
You may want to establish a single zone in your document, in between
<TXT> and </TXT>. Here's how you'd do that:
import re

from MAT.PluginMgr import ZoneStep

class MyZoneStep(ZoneStep):

    TXT_RE = re.compile("<TXT>(.*)</TXT>", re.I | re.S)

    def do(self, annotSet, **kw):
        # Everything between <TXT> and </TXT> is fair game; if there's
        # no <TXT> element, zone the entire signal instead.
        m = self.TXT_RE.search(annotSet.signal)
        if m is not None:
            self.addZones(annotSet, [(m.start(1), m.end(1), "body")])
        else:
            self.addZones(annotSet, [(0, len(annotSet.signal), "body")])
        return annotSet
The addZones method is defined on the ZoneStep class. For its
implementation and more examples of its use, see PluginMgr.py.
Under construction.
It's not too difficult to define your own reader/writer. The file
MAT_PKG_HOME/lib/mat/python/MAT/XMLIO.py provides a good example. The
template looks like this:
from MAT.DocumentIO import declareDocumentIO, DocumentFileIO, SaveError
from MAT.Document import LoadError

class MyIO(DocumentFileIO):

    def deserialize(self, s, annotDoc):
        ...

    def writeToUnicodeString(self, annotDoc):
        ...

declareDocumentIO("my-io", MyIO, True, True)
The arguments to deserialize() are the input data from the file and
an annotated document to populate; see XMLIO.py for an example of how
to populate it. writeToUnicodeString() should return a Unicode string
which serializes the annotated document passed in. In order to do this,
you'll have to familiarize yourself with the API for manipulating
documents and annotations, which is not documented but reasonably easy
to understand from the source code. Once you do all this, the file type
name you assign to the class via the call to declareDocumentIO() will
be globally available.
You can also define command-line arguments which will be accepted by
the tools when this file type is used. XMLIO.py also exemplifies this.
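For instance, once "my-io" is declared, you'd expect to be able to
select it anywhere the tools ask for a file type. A hypothetical
MATEngine invocation (check the flag names against your MATEngine
documentation) might be:

MATEngine --task "My Task" --workflow Demo --steps "zone,tokenize" \
    --input_file in.xml --input_file_type my-io \
    --output_file out.json --output_file_type mat-json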
In some cases, you might want to define your own experiment engine
iterator, if the default iterators we provide aren't adequate. For
instance, you may have two attributes in your training engine which you
want to iterate on in tandem, rather than over the cross-product of
those values. While providing a guide to this is beyond the scope of
this documentation, we can provide some hints.
First, look in MAT_PKG_HOME/lib/mat/python/MAT/Bootstrap.py. This is
where the core iterator behavior is defined. Look at the
implementations of the CorpusSizeIterator, ValueIterator and
IncrementIterator classes. These classes each have a __call__ method
which loops through the possible values for the iterator. These methods
are Python generators; they provide their successive values using the
"yield" statement. The __call__ method is passed in a subdirectory name
and a dictionary of keywords that will be used to configure the
TrainingRunInstance or TestRunInstance, and it yields on each iteration
an augmented subdirectory name which encodes the iteration value, and a
new, modified dictionary of keywords. Note that the iterator has to
copy the relevant keyword dictionaries for each new iteration, so that
its iterative changes don't "bleed" from one iteration to the next.
Next, look in MAT_PKG_HOME/lib/mat/python/MAT/CarafeTrain.py. Again,
look at the implementation of the CorpusSizeIterator, ValueIterator and
IncrementIterator classes. These are specializations of the classes in
Bootstrap.py, and the primary purpose of the specialization is to make
the iterator settings available to the experiment XML. You'll see that
each of these classes has a class-level "argList" declaration which
consists of a list of Option objects. These Option objects are special
versions of the Option objects in Python's optparse library which have
been extended to work not only with command-line invocations but also
with XML invocations. The "dest" attribute of each Option should match
a keyword in the __init__ method for the class.
You'll want to place your customized iterator in a file in
<your_task_directory>/python. If you put it in MyIterator.py, and
you name the class MyIterator, you can refer to it in the "type"
attribute of the <iterator> element in your experiment XML as
"MyIterator.MyIterator".