The sample task can be found in MAT_PKG_HOME/sample/ne. Like all
tasks, it has a file named task.xml at its root. The format of this
file is described in the task XML
documentation. It has no Python or JavaScript customizations, so it has
none of the corresponding subdirectories. See "Creating
a task" for a description of the subdirectory structure of the
task.
In addition to the task.xml file, the sample task contains a
demo.xml file which describes a demo of the automated capability in
this task. We describe both of those files in detail here.
Here, we describe in detail the
makeup of the task.xml file for the sample task. We've numbered the
lines to indicate our progress through the file.
1 <task name="Named Entity">
The file contains a single task declaration, which must be named.
2 <tags inherit_structure="yes">
3 <tag name="PERSON" category="content">
4 <ui css="background-color: CCFF66" accelerator="P"/><!-- # light green -->
5 </tag>
6 <tag name="LOCATION" category="content">
7 <ui css="background-color: FF99CC" accelerator="L"/><!-- # pink -->
8 </tag>
9 <tag name="ORGANIZATION" category="content">
10 <ui css="background-color: 99CCFF" accelerator="O"/><!-- # light blue -->
11 </tag>
12 </tags>
The file contains a block of tag declarations. Here, we have
inherited the structure tags (i.e., zone and lex) from the root task,
and defined our own
content tags. Each of the content tags must have a name. The <ui>
subelement describes the visual features of the tag. For instance, the
PERSON tag will display as light green, and the tagging menu will
support the "P" keyboard accelerator for annotating a selected span
with the PERSON tag.
The tag block is obligatory.
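To make the structure of the tag block concrete, here is a minimal sketch that reads it with Python's standard xml.etree.ElementTree module and collects the accelerator and CSS for each content tag. This is not how MAT itself loads task.xml; the function name read_content_tags is hypothetical, and the XML is the listing above with the line numbers and comments removed.

```python
import xml.etree.ElementTree as ET

# The <tags> block from the listing above, without line numbers.
TAGS_XML = """
<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: CCFF66" accelerator="P"/>
  </tag>
  <tag name="LOCATION" category="content">
    <ui css="background-color: FF99CC" accelerator="L"/>
  </tag>
  <tag name="ORGANIZATION" category="content">
    <ui css="background-color: 99CCFF" accelerator="O"/>
  </tag>
</tags>
"""


def read_content_tags(xml_text):
    """Return {tag name: (accelerator, css)} for each content tag.

    Hypothetical helper for illustration; not part of MAT.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for tag in root.findall("tag"):
        if tag.get("category") != "content":
            continue
        ui = tag.find("ui")
        accelerator = ui.get("accelerator") if ui is not None else None
        css = ui.get("css") if ui is not None else None
        result[tag.get("name")] = (accelerator, css)
    return result


content_tags = read_content_tags(TAGS_XML)
for name, (accelerator, css) in content_tags.items():
    print(name, accelerator, css)
```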
13 <workflows>
14 <workflow name="Hand annotation">
15 <step name="zone"/>
16 <step name="tokenize"/>
17 <step name="tag" pretty_name="hand tag" by_hand="yes"/>
18 </workflow>
19 <workflow name="Tokenless hand annotation">
20 <step name="zone"/>
21 <step name="tag" pretty_name="hand tag" by_hand="yes"/>
22 </workflow>
23 <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
24 <workflow name="Demo" hand_annotation_available_at_end="yes">
25 <step name="zone"/>
26 <step name="tokenize"/>
27 <step name="tag"/>
28 </workflow>
29 <workflow name="Align">
30 <step name="zone"/>
31 <step name="tokenize"/>
32 <step name="align"/>
33 </workflow>
34 </workflows>
The file contains five workflow definitions.
The implementation of these steps is found immediately below.
The workflows block is obligatory.
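As a quick way to inspect the workflows block, the following sketch lists each workflow with its ordered steps. It uses only the Python standard library, and workflow_steps is a hypothetical helper rather than a MAT function.

```python
import xml.etree.ElementTree as ET

# The <workflows> block from the listing above, without line numbers.
WORKFLOWS_XML = """
<workflows>
  <workflow name="Hand annotation">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag" pretty_name="hand tag" by_hand="yes"/>
  </workflow>
  <workflow name="Tokenless hand annotation">
    <step name="zone"/>
    <step name="tag" pretty_name="hand tag" by_hand="yes"/>
  </workflow>
  <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
  <workflow name="Demo" hand_annotation_available_at_end="yes">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag"/>
  </workflow>
  <workflow name="Align">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="align"/>
  </workflow>
</workflows>
"""


def workflow_steps(xml_text):
    """Return {workflow name: [step names in order]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        wf.get("name"): [step.get("name") for step in wf.findall("step")]
        for wf in root.findall("workflow")
    }


workflows = workflow_steps(WORKFLOWS_XML)
for name, steps in workflows.items():
    print(name, steps)
```

Note that the Review/repair workflow declares no steps of its own, so its step list is empty.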
35 <step_implementations>
36 <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
37 <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
38 <step name="align" class="MAT.PluginMgr.AlignStep"/>
39 <step name="tag" tagging_step="yes" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
40 <!-- for undo -->
41 <step name="tag" tagging_step="yes" class="MAT.PluginMgr.TagStep"/>
42 </step_implementations>
The file defines implementations for the steps in the
workflow. The implementations are essentially mappings from simple
names to Python classes which implement the steps. The classes
referenced here are described in the documentation on tasks. A step implementation can be
restricted to particular workflows, as the first implementation of the tag step is
here (it applies only to the Demo workflow). Step implementations can also be designated as
tagging steps, which are the only steps that support the "by_hand" attribute
in the workflows. If a step is designated as a
tagging step and is marked by_hand in a workflow, it is automatically assigned the
PluginMgr.HandAnnotationTagStep implementation.
Every step must have an implementation, which is why there are two tagging step implementations: the first is restricted to the Demo workflow, so the second, unrestricted implementation of "tag" serves as the default. Remember that step names are global to the task, as described in the documentation on tasks, so it's not a good idea for the effect of a step to differ among workflows. Here, the two tag steps differ in how they achieve their effect (one is automated and one is by hand), but otherwise, their effect is identical.
The step implementations block is optional.
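The selection rules described above can be sketched as follows. This is only an illustration of the prose, not MAT's actual resolution code: the function resolve_step is hypothetical, the order of the checks is an assumption, and treating the workflows attribute as a comma-separated list is also an assumption (the sample only ever names a single workflow).

```python
import xml.etree.ElementTree as ET

# The <step_implementations> block from the listing above.
STEP_IMPLS_XML = """
<step_implementations>
  <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
  <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
  <step name="align" class="MAT.PluginMgr.AlignStep"/>
  <step name="tag" tagging_step="yes" workflows="Demo"
        class="MAT.JavaCarafe.CarafeTagStep"/>
  <!-- for undo -->
  <step name="tag" tagging_step="yes" class="MAT.PluginMgr.TagStep"/>
</step_implementations>
"""

# Class name exactly as it is written in the text above.
HAND_STEP = "PluginMgr.HandAnnotationTagStep"


def resolve_step(impls_root, step_name, workflow, by_hand=False):
    """Pick an implementation class name for a step (hypothetical sketch)."""
    candidates = [s for s in impls_root.findall("step")
                  if s.get("name") == step_name]
    # A tagging step marked by_hand in the workflow gets the hand annotation step.
    if by_hand and any(s.get("tagging_step") == "yes" for s in candidates):
        return HAND_STEP
    # Assumption: an implementation restricted to this workflow wins.
    for s in candidates:
        restricted_to = s.get("workflows")
        if restricted_to is not None:
            if workflow in [w.strip() for w in restricted_to.split(",")]:
                return s.get("class")
    # Otherwise fall back to the unrestricted (default) implementation.
    for s in candidates:
        if s.get("workflows") is None:
            return s.get("class")
    return None


impls = ET.fromstring(STEP_IMPLS_XML)
print(resolve_step(impls, "tag", "Demo"))
print(resolve_step(impls, "tag", "Hand annotation", by_hand=True))
print(resolve_step(impls, "tag", "Review/repair"))
```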
43 <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
44 <build_settings training_method="psa" max_iterations="6"/>
45 </model_config>
46 <model_config config_name="alt_model_build" class="MAT.JavaCarafe.CarafeModelBuilder"/>
47 <default_model>default_model</default_model>
The settings for building a model are defined here. We use the
Carafe engine, which falls back to its default feature spec when no
feature spec file is specified. The default settings use
periodic stepsize adjustment (training_method="psa"), and
we designate a file named "default_model" as the default
location for models built for this task with MATModelBuilder (see the
--save_as_default_model flag). We also have a second, non-default block
of settings, named "alt_model_build", which doesn't use periodic
stepsize adjustment.
The model build settings block is optional, as is the default model.
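The difference between the default and the named settings block can be seen by reading them side by side. In this sketch (standard library only; model_configs is a hypothetical helper), the unnamed block is stored under the key None, and a block without a build_settings child yields an empty dictionary. The enclosing task element is included only so the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# Lines 43-47 of the listing above, wrapped in the enclosing <task> element.
MODEL_XML = """
<task name="Named Entity">
  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    <build_settings training_method="psa" max_iterations="6"/>
  </model_config>
  <model_config config_name="alt_model_build"
                class="MAT.JavaCarafe.CarafeModelBuilder"/>
  <default_model>default_model</default_model>
</task>
"""


def model_configs(xml_text):
    """Return ({config name or None: build settings}, default model name)."""
    root = ET.fromstring(xml_text)
    configs = {}
    for config in root.findall("model_config"):
        settings = config.find("build_settings")
        # No <build_settings> child means no explicit settings for this block.
        configs[config.get("config_name")] = (
            dict(settings.attrib) if settings is not None else {}
        )
    return configs, root.findtext("default_model")


configs, default_model = model_configs(MODEL_XML)
print(configs[None])               # settings of the default block
print(configs["alt_model_build"])  # the named block has no explicit settings
print(default_model)
```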
48 <workspace>
49 <operation name="autotag">
50 <settings workflow="Demo" steps="zone,tokenize,tag"/>
51 </operation>
52 <operation name="modelbuild">
53 <settings/>
54 </operation>
55 <operation name="tagprep">
56 <settings workflow="Hand annotation" steps="zone,tokenize"/>
57 </operation>
58 </workspace>
We define the behavior of the operations in the workspaces for this task here. For a list of predefined folders, see the workspace documentation. Each folder has a set of operations and expected possible settings. In this case, the autotag operation takes settings which are equivalent to the flags to MATEngine, and the tagprep operation does the same. So the autotag operation is equivalent to invoking MATEngine on a document using the Demo workflow defined above and performing three steps (zone, tokenize, tag), while tagprep uses the "Hand annotation" workflow and applies two steps (zone, tokenize). The modelbuild operation, on the other hand, specifies no settings at all; everything it needs is inherited from the model build settings block immediately above.
The workspace block is optional.
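The operation settings can be read into plain dictionaries to see what each operation would pass along. This is a sketch using only the standard library; operation_settings is a hypothetical helper, and splitting the steps attribute on commas simply mirrors how the steps are written in the sample.

```python
import xml.etree.ElementTree as ET

# The <workspace> block from the listing above, without line numbers.
WORKSPACE_XML = """
<workspace>
  <operation name="autotag">
    <settings workflow="Demo" steps="zone,tokenize,tag"/>
  </operation>
  <operation name="modelbuild">
    <settings/>
  </operation>
  <operation name="tagprep">
    <settings workflow="Hand annotation" steps="zone,tokenize"/>
  </operation>
</workspace>
"""


def operation_settings(xml_text):
    """Return {operation name: settings dict}, with steps split into a list."""
    root = ET.fromstring(xml_text)
    operations = {}
    for op in root.findall("operation"):
        settings = op.find("settings")
        attrs = dict(settings.attrib) if settings is not None else {}
        if "steps" in attrs:
            attrs["steps"] = attrs["steps"].split(",")
        operations[op.get("name")] = attrs
    return operations


ops = operation_settings(WORKSPACE_XML)
for name, settings in ops.items():
    print(name, settings)
```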
60 </task>
And finally, we're done.
Here, we describe in detail the makeup of the demo.xml file for the sample task. We've numbered the lines to indicate our progress through the file.
1 <demo name="Named Entity Identification">
Each demo has a name, which will be the title of the demo page.
2 <description>
3 <![CDATA[
4 <p>This demo shows the simple named entity identification capability
5 provided in the MAT sample task.
6 ]]>
7 </description>
Each demo has an HTML description. This description can be arbitrary
HTML. In order to force the XML parser to ignore the tag structure, and
treat the content as an unanalyzed string, we wrap it in an XML
<![CDATA[...]]> section.
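The effect of the CDATA wrapper can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree (not MAT's own loader): with the wrapper, the description parses and its HTML comes back as plain text; without it, the unclosed p element makes the fragment ill-formed and parsing fails.

```python
import xml.etree.ElementTree as ET

# The description from the listing above, with its CDATA wrapper.
WITH_CDATA = """<description>
<![CDATA[
<p>This demo shows the simple named entity identification capability
provided in the MAT sample task.
]]>
</description>"""

# The same content with the wrapper removed, for comparison.
WITHOUT_CDATA = """<description>
<p>This demo shows the simple named entity identification capability
provided in the MAT sample task.
</description>"""

description = ET.fromstring(WITH_CDATA)
html = description.text.strip()
print(html)                    # the HTML, returned as an unanalyzed string
print(len(list(description)))  # no child elements were created

try:
    ET.fromstring(WITHOUT_CDATA)
    parsed_without_cdata = True
except ET.ParseError:
    # The unclosed <p> no longer matches </description>.
    parsed_without_cdata = False
print(parsed_without_cdata)
```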
8 <activity name="Tag" enable_blank_document="yes">
9 <description>Automatically locate named entities in the document.</description>
10 <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
Each demo can have a number of activities the user can perform. In
most cases, there will be only one (i.e., tag the document), but if the
task has been extensively customized, there may be more. The
enable_blank_document attribute makes it possible for the user to type
in arbitrary text.
Each activity has a description, which the user will see, and
settings for MATEngine which dictate how
to process the document.
11 <sample_document description="Sample news article #1"
12 file_type = "raw"
13 relative_location="resources/data/raw/voa1.txt"/>
14 <sample_document description="Sample news article #2"
15 file_type = "raw"
16 relative_location="resources/data/raw/voa2.txt"/>
Each activity can have a number of sample documents (one of which
might be a blank document if enable_blank_document is used). Each
sample document has a description, which the user sees in a drop-down
menu; a location for the document (which should be a relative pathname
within the task directory); and a file type, which indicates whether the
document is a raw or mat-json document.
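To illustrate how the sample document entries fit together, this sketch reads the activity and joins each relative_location onto the task directory. It uses only the standard library; sample_documents is a hypothetical helper, and the task directory is written as the literal string MAT_PKG_HOME/sample/ne from the start of this section rather than a resolved path.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# The <activity> element from the listing above, without line numbers.
ACTIVITY_XML = """
<activity name="Tag" enable_blank_document="yes">
  <description>Automatically locate named entities in the document.</description>
  <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
  <sample_document description="Sample news article #1"
                   file_type = "raw"
                   relative_location="resources/data/raw/voa1.txt"/>
  <sample_document description="Sample news article #2"
                   file_type = "raw"
                   relative_location="resources/data/raw/voa2.txt"/>
</activity>
"""


def sample_documents(xml_text, task_dir):
    """Return [(description, file type, path under task_dir)] for an activity."""
    activity = ET.fromstring(xml_text)
    documents = []
    for doc in activity.findall("sample_document"):
        location = PurePosixPath(task_dir) / doc.get("relative_location")
        documents.append(
            (doc.get("description"), doc.get("file_type"), str(location))
        )
    return documents


docs = sample_documents(ACTIVITY_XML, "MAT_PKG_HOME/sample/ne")
for description, file_type, location in docs:
    print(description, file_type, location)
```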
17 </activity>
18 </demo>
And now, we're done.