If you haven't received an already-customized version of MAT, and
you want to do something besides the default named entity task, you're
going to want to define your own task. This document describes how to
do that for simple tasks.
Create a directory. This directory might ultimately have various
subdirectories; for instance, custom Python code must live in files in
a python/ subdirectory, and custom Javascript code should live in files
in a js/ subdirectory. But you don't need to know about those right now.
Most of what we're going to talk about in this document is the
task.xml file. You can get a better idea of what this file consists of
by looking at the task documentation, the documentation on the sample
task, and the documentation on the task XML itself.
For now, just open an empty file named task.xml and save it in the
directory you created in step 1.
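If you'd rather script this setup than do it by hand, the two steps so
far can be sketched in Python as follows. This is only an illustration
using the standard library; the directory name "widget_task" and the
function name create_task_dir are hypothetical, and nothing here is part
of MAT itself.

```python
from pathlib import Path
import tempfile


def create_task_dir(task_dir):
    """Create the task directory and an empty task.xml inside it."""
    root = Path(task_dir)
    # parents=True builds any missing intermediate directories;
    # exist_ok=True makes the call safe to repeat.
    root.mkdir(parents=True, exist_ok=True)
    task_xml = root / "task.xml"
    # touch() creates the file if it is missing and leaves any
    # existing content alone.
    task_xml.touch()
    return task_xml


# Demonstration in a throwaway location; "widget_task" is a
# hypothetical directory name chosen for this example.
with tempfile.TemporaryDirectory() as scratch:
    created = create_task_dir(Path(scratch) / "widget_task")
    created_name = created.name
    created_size = created.stat().st_size
```

In real use you would call create_task_dir with the path where you want
your task to live, rather than a temporary directory.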
Create a top-level <task> element, and give your task a name:
<task name="Widget Annotation">
<tags>
</tags>
<workflows>
</workflows>
</task>
If you had a need for customizing the task class, you'd add a class
attribute to the task; one reason you might do this is to add a new
folder to a workspace. See the advanced
documentation for a hint.
For historical reasons, the subelements of <task> shown here are
the obligatory elements. We've set up <tags> and
<workflows> with separate closing tags because you'll almost
certainly be populating these elements.
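Before going further, it can be worth confirming that the skeleton is
well-formed XML and contains the elements shown above. The following is
a minimal sketch using Python's standard xml.etree.ElementTree module.
It is not MAT's own validation, and the function name check_skeleton and
its list of required elements are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The skeleton from this step, copied as a string.
SKELETON = """
<task name="Widget Annotation">
  <tags>
  </tags>
  <workflows>
  </workflows>
</task>
"""


def check_skeleton(xml_text, required=("tags", "workflows")):
    """Return a list of problems; an empty list means none were found."""
    # fromstring raises xml.etree.ElementTree.ParseError if the text
    # is not well-formed XML.
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "task":
        problems.append("top-level element is <%s>, not <task>" % root.tag)
    if not root.get("name"):
        problems.append("<task> has no name attribute")
    for child in required:
        if root.find(child) is None:
            problems.append("missing <%s>" % child)
    return problems


skeleton_problems = check_skeleton(SKELETON)
```

To check your own file, read it with open("task.xml").read() and pass
the result to check_skeleton.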
Let's assume that you're going to use the default MAT automated
tools (tagger, trainer, tokenizer). Then you'll want to inherit the
structural annotations from the core task, and define your content
annotations:
<tags inherit_structure="yes">
<tag name="TAG1" category="content">
<ui css="background-color: blue"/>
</tag>
<tag name="TAG2" category="content">
<ui css="background-color: red"/>
</tag>
</tags>
You can customize your annotation declaration in a number of ways.
See the task XML use cases for
examples.
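As a quick way to review what you've declared, the tag block above can
be read back with Python's standard XML parser. This sketch simply lists
each content tag with its CSS styling; the function name
content_tag_styles is hypothetical, and the code makes no claims about
how MAT itself interprets these declarations.

```python
import xml.etree.ElementTree as ET

# The tag declarations from this step, copied as a string.
TAGS = """
<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: red"/>
  </tag>
</tags>
"""


def content_tag_styles(xml_text):
    """Map each content tag name to the css attribute of its <ui> child."""
    root = ET.fromstring(xml_text)
    styles = {}
    for tag in root.findall("tag"):
        # Only tags declared with category="content" are of interest here.
        if tag.get("category") != "content":
            continue
        ui = tag.find("ui")
        # A tag without a <ui> child is recorded with no styling.
        styles[tag.get("name")] = ui.get("css") if ui is not None else None
    return styles


tag_styles = content_tag_styles(TAGS)
```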
Another top-level element in the <task> holds the settings that
specify how the model is built:
<model_config class="MAT.JavaCarafe.CarafeModelBuilder"/>
<default_model>default_model</default_model>
This uses the Carafe engine, with
the default feature specification which Carafe provides, and
instructions to save default models to the file
"default_model" in your task directory. If you want to use the faster,
but possibly less-well-performing (and slightly less reliable) PSA
training method, use this model_config specification:
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
This specifies 6 iterations of periodic stepsize adjustment.
There are lots of ways of customizing the Carafe model builder. See
MATModelBuilder and the Carafe engine documentation for more details
about these settings.
There are some circumstances under which you don't need to configure model building; e.g., if you only intend to use the MAT tool for hand annotation or to score documents.
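If you want to confirm which model-building settings your task file
actually declares, the model_config element can be summarized with a few
lines of Python. This is a sketch using only the standard library; the
function name summarize_model_config is hypothetical, and the code only
reports the attributes as written, without knowing which settings MAT or
Carafe accept.

```python
import xml.etree.ElementTree as ET

# The PSA variant of the model configuration, copied as a string.
MODEL_CONFIG = """
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
  <build_settings training_method="psa" max_iterations="6"/>
</model_config>
"""


def summarize_model_config(xml_text):
    """Return the builder class and any build_settings attributes."""
    root = ET.fromstring(xml_text)
    settings = root.find("build_settings")
    return {
        "class": root.get("class"),
        # attrib is a mapping of attribute name to string value; a
        # missing <build_settings> child yields an empty dict.
        "build_settings": dict(settings.attrib) if settings is not None else {},
    }


model_summary = summarize_model_config(MODEL_CONFIG)
```

Note that attribute values come back as strings, so max_iterations is
"6" rather than the integer 6.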
Other top-level elements in the <task> are the descriptions of
the workflows and steps you'll use. Right now, MAT is somewhat limited
in its default steps and flexibility; without customization, a limited
number of steps are available, and these steps can be organized into
only a limited range of workflows. We believe that these steps and
workflows are sufficient for the most common tasks a user might have;
and, unfortunately, at the moment it's quite difficult to describe how
to extend these options in any great detail. See the advanced customization notes
for what's available.
We recommend the following workflow and step blocks, as described in
the sample task.
<workflows>
<workflow name="Hand annotation">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" pretty_name="hand tag" by_hand="yes"/>
</workflow>
<workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag"/>
</workflow>
<workflow name="Align">
<step name="zone"/>
<step name="tokenize"/>
<step name="align"/>
</workflow>
</workflows>
<step_implementations>
<step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
<step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
<step name="align" class="MAT.PluginMgr.AlignStep"/>
<step name="tag" tagging_step="yes" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
<!-- for undo -->
<step name="tag" tagging_step="yes" class="MAT.PluginMgr.TagStep"/>
</step_implementations>
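One easy mistake when editing these two blocks is to name a step in a
workflow without giving it an implementation. The sketch below
cross-checks the two blocks using Python's standard XML parser. It is a
simple sanity check written for this tutorial, not MAT's own validation:
it only compares step names, and it ignores the workflows attribute that
restricts an implementation to particular workflows. The function name
unimplemented_steps is hypothetical.

```python
import xml.etree.ElementTree as ET

# The workflow and step blocks from this step, wrapped in <task>.
TASK = """
<task name="Widget Annotation">
  <workflows>
    <workflow name="Hand annotation">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag" pretty_name="hand tag" by_hand="yes"/>
    </workflow>
    <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag"/>
    </workflow>
    <workflow name="Align">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="align"/>
    </workflow>
  </workflows>
  <step_implementations>
    <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
    <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
    <step name="align" class="MAT.PluginMgr.AlignStep"/>
    <step name="tag" tagging_step="yes" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
    <!-- for undo -->
    <step name="tag" tagging_step="yes" class="MAT.PluginMgr.TagStep"/>
  </step_implementations>
</task>
"""


def unimplemented_steps(xml_text):
    """Return (workflow, step) pairs whose step name has no implementation."""
    root = ET.fromstring(xml_text)
    # Collect every step name that appears under <step_implementations>.
    implemented = {s.get("name") for s in root.findall("step_implementations/step")}
    missing = []
    for workflow in root.findall("workflows/workflow"):
        for step in workflow.findall("step"):
            if step.get("name") not in implemented:
                missing.append((workflow.get("name"), step.get("name")))
    return missing


missing_steps = unimplemented_steps(TASK)
```

An empty result means every step named in a workflow has at least one
implementation with the same name.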
If you intend to use workspace mode, you should also define your
workspace implementations. The workspace block that corresponds to the
workflows and steps described immediately above looks like this:
<workspace>
<operation name="autotag">
<settings workflow="Demo" steps="zone,tokenize,tag"/>
</operation>
<operation name="modelbuild">
<settings/>
</operation>
<operation name="tagprep">
<settings workflow="Hand annotation" steps="zone,tokenize"/>
</operation>
</workspace>
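Because each workspace operation refers to a workflow and a list of
steps by name, it's easy for the workspace block to drift out of sync
with the workflow block. The following sketch checks that each named
workflow exists and contains the listed steps. It is an illustration
using Python's standard library, not a MAT facility; the function name
check_workspace and its message format are hypothetical, and only the
two workflows that the workspace block refers to are reproduced here.

```python
import xml.etree.ElementTree as ET

# The workspace block from this step, together with the two workflows
# it refers to, wrapped in <task>.
TASK = """
<task name="Widget Annotation">
  <workflows>
    <workflow name="Hand annotation">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag" pretty_name="hand tag" by_hand="yes"/>
    </workflow>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag"/>
    </workflow>
  </workflows>
  <workspace>
    <operation name="autotag">
      <settings workflow="Demo" steps="zone,tokenize,tag"/>
    </operation>
    <operation name="modelbuild">
      <settings/>
    </operation>
    <operation name="tagprep">
      <settings workflow="Hand annotation" steps="zone,tokenize"/>
    </operation>
  </workspace>
</task>
"""


def check_workspace(xml_text):
    """Return a list of mismatches between workspace operations and workflows."""
    root = ET.fromstring(xml_text)
    # Map each workflow name to the list of step names it declares.
    workflows = {
        wf.get("name"): [s.get("name") for s in wf.findall("step")]
        for wf in root.findall("workflows/workflow")
    }
    problems = []
    for op in root.findall("workspace/operation"):
        settings = op.find("settings")
        # Operations such as modelbuild name no workflow; skip them.
        if settings is None or settings.get("workflow") is None:
            continue
        wf_name = settings.get("workflow")
        if wf_name not in workflows:
            problems.append("%s: unknown workflow %s" % (op.get("name"), wf_name))
            continue
        for step in settings.get("steps", "").split(","):
            if step and step not in workflows[wf_name]:
                problems.append(
                    "%s: step %s not in workflow %s" % (op.get("name"), step, wf_name)
                )
    return problems


workspace_problems = check_workspace(TASK)
```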
Use the MATManagePluginDirs tool to ensure that MAT knows about your
task directory. If <dir> is your task directory:
Unix:
% $MAT_PKG_HOME/bin/MATManagePluginDirs install <dir>
Windows native:
> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd install <dir>
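If you install task directories from a script, the two command lines
above can be assembled in Python. This sketch only builds the argument
list from the commands shown here; it does not run anything. The
function name plugin_install_command is hypothetical, and the example
paths are placeholders rather than real installation locations.

```python
import ntpath
import posixpath
import sys


def plugin_install_command(task_dir, mat_pkg_home, windows=None):
    """Build the MATManagePluginDirs install command as an argument list."""
    if windows is None:
        # Default to the platform this script is running on.
        windows = sys.platform.startswith("win")
    if windows:
        # Windows native: %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd
        tool = ntpath.join(mat_pkg_home, "bin", "MATManagePluginDirs.cmd")
    else:
        # Unix: $MAT_PKG_HOME/bin/MATManagePluginDirs
        tool = posixpath.join(mat_pkg_home, "bin", "MATManagePluginDirs")
    return [tool, "install", task_dir]


# Placeholder paths, for illustration only.
unix_cmd = plugin_install_command("/home/me/widget_task", "/opt/mat", windows=False)
win_cmd = plugin_install_command("C:\\tasks\\widget", "C:\\MAT", windows=True)
```

In a real script you would typically take mat_pkg_home from
os.environ["MAT_PKG_HOME"] and hand the resulting list to
subprocess.run(cmd, check=True).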
Tasks are highly customizable, in ways that we'll never have enough
time to document. See the advanced
documentation for what we've been able to write down about these
other customizations, or work your way through the source code.