If you haven't received an already-customized version of MAT, and
you want to do something besides the default named entity task,
you're going to want to define your own task. This document
describes how to do that for simple tasks.
Create a directory. This directory might ultimately have various
subdirectories; for instance, custom Python code must live in
files in a python/ subdirectory, and custom JavaScript code should
live in files in a js/ subdirectory. But you don't need to know
about those right now.
Most of what we're going to talk about in this document is the
task.xml file. You can get a better idea of what this file
consists of by looking at the task
documentation, the documentation on the sample 'Named Entity' task, and the
documentation on the task XML and annotation set descriptor XML
itself.
For now, just create an empty file named task.xml and save it
in the directory you created in step 1.
Create a top-level <task> element, and give your task a
name:
<task name="Widget Annotation">
  <annotation_set_descriptors>
    <annotation_set_descriptor name="content" category="content">
    </annotation_set_descriptor>
  </annotation_set_descriptors>
  <annotation_display>
  </annotation_display>
  <workflows>
  </workflows>
</task>
If you need to customize the task class, add a class attribute
to the <task> element; one reason you might do this is to
add a new folder to a workspace. See the advanced documentation
for a hint.
For historical reasons, the <workflows> element shown here
is obligatory. We've added <annotation_set_descriptors>
because you're definitely going to be defining them. Notice that
the descriptor has the name and category attributes both set to
"content"; at the moment, these are the only settings you should
use when declaring your annotations. We've also added
<annotation_display>, because you're going to want that too.
Let's assume that you're going to use the default MAT automated
tools (tagger, trainer, tokenizer). Then you'll want to inherit
the zone and
token annotations from the core task:
<annotation_set_descriptors inherit="category:zone,category:token">
...
</annotation_set_descriptors>
Next, you should define your labels:
<annotation_set_descriptors inherit="category:zone,category:token">
  <annotation_set_descriptor name="content" category="content">
    <annotation label="TAG1"/>
    <annotation label="TAG2"/>
  </annotation_set_descriptor>
</annotation_set_descriptors>
You can customize your annotation declaration in a number of
ways. See the annotation set
descriptor XML use cases for examples.
Next, you'll want to associate display behavior with your
annotations for the UI, using <annotation_display>:
<annotation_display>
  <label name="TAG1" accelerator="1" css="background-color: blue"/>
  <label name="TAG2" accelerator="2" css="background-color: green"/>
</annotation_display>
Examples of customizing your annotation display can be found in
the task XML use cases.
Another top-level element in the <task> holds the settings that
specify how the model is built:
<model_config class="MAT.JavaCarafe.CarafeModelBuilder"/>
<default_model>default_model</default_model>
This uses the Carafe engine with the default feature specification
that Carafe provides, and saves default models to the file
"default_model" in your task directory. If you want to use the
faster, but possibly less-well-performing (and slightly less
reliable) periodic stepsize adjustment training method, add a
<build_settings> element to the <model_config> specification:
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
  <build_settings training_method="psa" max_iterations="6"/>
</model_config>
This specifies 6 iterations of periodic stepsize adjustment.
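As a minimal sketch, the two fragments above can be combined so that the task uses periodic stepsize adjustment and still records where its default model is saved. This assumes that <default_model> remains a sibling of <model_config>, exactly as in the first fragment; nothing else is changed.

```xml
<!-- Sketch: Carafe model builder with PSA training (6 iterations). -->
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
  <build_settings training_method="psa" max_iterations="6"/>
</model_config>
<!-- Assumption: unchanged from the first fragment; names the file
     "default_model" in the task directory. -->
<default_model>default_model</default_model>
```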
There are lots of ways of customizing the Carafe model builder.
See MATModelBuilder and the Carafe engine documentation for
more details about these settings.
There are some circumstances under which you don't need to configure model building; e.g., if you only intend to use the MAT tool for hand annotation or to score documents.
Other top-level elements in the <task> describe the workflows
and steps you'll use. Right now, MAT is somewhat limited in its
default steps and flexibility; without customization, only a
limited number of steps are available, and these steps can be
organized into only a limited range of workflows. The MAT
documentation provides a summary of the available steps and
workflows, along with additional details. We believe that these
steps and workflows are sufficient for the most common tasks a
user might have; unfortunately, at the moment it's quite
difficult to describe in any great detail how to extend these
options. See the advanced customization notes for what's
available.
We recommend the following workflow and step blocks, as described
in the sample 'Named Entity' task:
<workflows>
  <workflow name="Hand annotation">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag" pretty_name="hand tag" by_hand="yes"/>
  </workflow>
  <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
  <workflow name="Demo" hand_annotation_available_at_end="yes">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag"/>
  </workflow>
  <workflow name="Align">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="align"/>
  </workflow>
</workflows>
<step_implementations>
  <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
  <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
  <step name="align" class="MAT.PluginMgr.AlignStep"/>
  <step name="tag" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
  <!-- for undo -->
  <step name="tag" class="MAT.PluginMgr.TagStep"/>
</step_implementations>
If you intend to use workspace
mode, you should also define your workspace implementations.
The workspace block that corresponds to the workflows and steps
described immediately above looks like this:
<workspace>
  <operation name="autotag">
    <settings workflow="Demo" steps="tag"/>
  </operation>
  <operation name="modelbuild">
    <settings/>
  </operation>
  <operation name="import">
    <settings workflow="Hand annotation" steps="zone,tokenize"/>
  </operation>
</workspace>
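Putting the fragments from this document together gives a task.xml along the following lines. Treat it as a sketch rather than a tested configuration: it assumes that <step_implementations>, <model_config>, <default_model> and <workspace> are all direct children of <task>, and that the relative order of the top-level elements does not matter. TAG1 and TAG2 are the placeholder labels used earlier.

```xml
<!-- Sketch of a complete task.xml assembled from the fragments above.
     Assumptions: every block is a direct child of <task>, and the
     order of the top-level elements is not significant. -->
<task name="Widget Annotation">
  <!-- Inherit zone and token annotations; declare the content labels. -->
  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      <annotation label="TAG1"/>
      <annotation label="TAG2"/>
    </annotation_set_descriptor>
  </annotation_set_descriptors>
  <!-- UI display behavior for each label. -->
  <annotation_display>
    <label name="TAG1" accelerator="1" css="background-color: blue"/>
    <label name="TAG2" accelerator="2" css="background-color: green"/>
  </annotation_display>
  <!-- Recommended workflows and step implementations. -->
  <workflows>
    <workflow name="Hand annotation">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag" pretty_name="hand tag" by_hand="yes"/>
    </workflow>
    <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag"/>
    </workflow>
    <workflow name="Align">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="align"/>
    </workflow>
  </workflows>
  <step_implementations>
    <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
    <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
    <step name="align" class="MAT.PluginMgr.AlignStep"/>
    <step name="tag" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
    <!-- for undo -->
    <step name="tag" class="MAT.PluginMgr.TagStep"/>
  </step_implementations>
  <!-- Model building with the default Carafe settings. -->
  <model_config class="MAT.JavaCarafe.CarafeModelBuilder"/>
  <default_model>default_model</default_model>
  <!-- Only needed if you intend to use workspace mode. -->
  <workspace>
    <operation name="autotag">
      <settings workflow="Demo" steps="tag"/>
    </operation>
    <operation name="modelbuild">
      <settings/>
    </operation>
    <operation name="import">
      <settings workflow="Hand annotation" steps="zone,tokenize"/>
    </operation>
  </workspace>
</task>
```

If you only plan to annotate by hand or score documents, the <model_config> and <default_model> lines can presumably be left out, and the <workspace> block matters only for workspace mode.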
Use the MATManagePluginDirs tool to ensure that MAT knows about
your task directory. If <dir> is your task directory:
Unix:
% $MAT_PKG_HOME/bin/MATManagePluginDirs install <dir>
Windows native:
> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd install <dir>
Tasks are highly customizable, in ways that we'll never have
enough time to document. See the advanced documentation
for what we've been able to write down about these other
customizations, or work your way through the source code.