The sample tasks can be found in MAT_PKG_HOME/sample/ne. This
directory, like all task directories, has a file named task.xml at
its root. The format of this file is described in the task XML documentation. The sample task has no
Python or JavaScript customizations, so it has none of the
corresponding subdirectories. See "Creating
a task" for a description of the subdirectory structure of
the task.
This task.xml file contains two tasks: "Named Entity" and
"Enhanced Named Entity". The first task is a simple span task; it
contains spanned annotations without any complex attribute
structure. This task is used for Tutorials 1 - 7, and for a
variety of other examples throughout this documentation. The
second task is a complex task, containing both spanned and
spanless annotations and multiple attributes, some of which take
other annotations as their values. This second task is used for Tutorial 8, as well as the UI
documentation on editing
annotations and spanless
annotations.
In addition to the task.xml file, the sample task contains a
demo.xml file which describes a demo of the automated capability
for the "Named Entity" task. We describe both of these
files in detail here.
Here, we describe in detail the makeup of the task.xml file for
the sample task. We've numbered the lines to indicate our progress
through the file.
1 <tasks>
2 <task name='Named Entity'>
The file typically contains a single task declaration, with <task> as the top-level element. However, if you wish to declare multiple tasks in the same task.xml file, it can instead contain multiple <task> elements within a <tasks> element. Here, we define two tasks. Each task must be named.
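For comparison, a task.xml that declares only one task can drop the <tasks> wrapper and use <task> as its top-level element. The following is a minimal sketch, not taken from the sample distribution: the task name "My Task" is hypothetical, and the body reuses only element and attribute names that appear in this sample file.

```xml
<!-- Sketch only: a single-task task.xml with <task> as the top-level element. -->
<!-- 'My Task' is a hypothetical name; the contents mirror the sample task. -->
<task name='My Task'>
  <annotation_set_descriptors all_annotations_known='no'
                              inherit='category:zone,category:token'>
    <annotation_set_descriptor category='content' name='content'>
      <annotation label='PERSON'/>
    </annotation_set_descriptor>
  </annotation_set_descriptors>
  <annotation_display>
    <label name='PERSON' accelerator='P' css='background-color: #CCFF66'/>
  </annotation_display>
  <workflows>
    <workflow name='Tokenless hand annotation'>
      <step name='zone'/>
      <step pretty_name='hand tag' name='tag' by_hand='yes'/>
    </workflow>
  </workflows>
  <step_implementations>
    <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
    <step name='tag' class='MAT.PluginMgr.TagStep'/>
  </step_implementations>
</task>
```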
3 <annotation_set_descriptors all_annotations_known='no'
4 inherit='category:zone,category:token'>
5 <annotation_set_descriptor category='content' name='content'>
6 <annotation label='PERSON'/>
7 <annotation label='LOCATION'/>
8 <annotation label='ORGANIZATION'/>
9 </annotation_set_descriptor>
10 </annotation_set_descriptors>
11 <annotation_display>
12 <label name='PERSON' accelerator='P' css='background-color: #CCFF66'/>
13 <label name='LOCATION' accelerator='L' css='background-color: #FF99CC'/>
14 <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF'/>
15 </annotation_display>
Each task contains a block of annotation declarations. Here, we
have inherited the zone and token category tags from the root
task, and defined our own content tags, PERSON, LOCATION and
ORGANIZATION. In a separate <annotation_display> block, we
define the display properties of these tags. For instance, the
PERSON tag will display as light green (defined here in
hexadecimal), and the tagging menu will support the "P" keyboard
accelerator for annotating a selected span with the PERSON tag.
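To illustrate how the two blocks pair up, the sketch below adds a fourth content label. The DATE label, its "D" accelerator and its color are hypothetical and are not part of the sample task; the point is only that a label declared with <annotation label='...'/> gets its display properties from the <label> element whose name attribute has the same value. The fragment is wrapped in a <task> element so that it is well-formed, and the other parts of the task are omitted.

```xml
<!-- Sketch only: DATE is a hypothetical label, not part of the sample task. -->
<task name='Named Entity'>
  <annotation_set_descriptors all_annotations_known='no'
                              inherit='category:zone,category:token'>
    <annotation_set_descriptor category='content' name='content'>
      <annotation label='PERSON'/>
      <annotation label='LOCATION'/>
      <annotation label='ORGANIZATION'/>
      <!-- Hypothetical addition: declare the new label here... -->
      <annotation label='DATE'/>
    </annotation_set_descriptor>
  </annotation_set_descriptors>
  <annotation_display>
    <label name='PERSON' accelerator='P' css='background-color: #CCFF66'/>
    <label name='LOCATION' accelerator='L' css='background-color: #FF99CC'/>
    <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF'/>
    <!-- ...and give it an accelerator and a color here (both values invented). -->
    <label name='DATE' accelerator='D' css='background-color: #FFFF99'/>
  </annotation_display>
</task>
```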
16 <workflows>
17 <workflow name='Hand annotation'>
18 <step name='zone'/>
19 <step name='tokenize'/>
20 <step pretty_name='hand tag' name='tag' by_hand='yes'/>
21 </workflow>
22 <workflow name='Tokenless hand annotation'>
23 <step name='zone'/>
24 <step pretty_name='hand tag' name='tag' by_hand='yes'/>
25 </workflow>
26 <workflow hand_annotation_available_at_end='yes' name='Review/repair'/>
27 <workflow hand_annotation_available_at_end='yes' name='Demo'>
28 <step name='zone'/>
29 <step name='tokenize'/>
30 <step name='tag'/>
31 </workflow>
32 <workflow name='Align'>
33 <step name='zone'/>
34 <step name='tokenize'/>
35 <step name='align'/>
36 </workflow>
37 </workflows>
The task contains five workflow definitions.
The implementation of these steps is found immediately below.
The <workflows> element is obligatory.
38 <step_implementations>
39 <step name='tokenize' class='MAT.JavaCarafe.CarafeTokenizationStep'/>
40 <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
41 <step name='align' class='MAT.PluginMgr.AlignStep'/>
42 <step workflows='Demo' name='tag' class='MAT.JavaCarafe.CarafeTagStep'/>
43 <step name='tag' class='MAT.PluginMgr.TagStep'/>
44 </step_implementations>
The task defines implementations for the steps in the workflows.
The implementations are essentially mappings from simple names to
Python classes which implement the steps. The classes referenced
here are described in the documentation on tasks. A step implementation
can be restricted to particular workflows, as the first implementation
of the tag step is here (it applies only to the Demo workflow). If a
step is designated as a by_hand step in a workflow, it will be assigned
the PluginMgr.TagStep implementation automatically.
Every step must have an implementation, which is why there are two implementations of the tag step: the second one, which is not restricted to any workflow, serves as the default. Remember that step names are global to the task, as described in the documentation on tasks, so it's not a good idea for the effect of a step's implementations to differ among workflows. Here, the two tag steps differ in how they achieve their effect (one is automated and one is by hand), but the effect itself is identical.
The <step_implementations> element is optional.
45 <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
46 <build_settings training_method='psa' max_iterations='6'/>
47 </model_config>
48 <model_config config_name='alt_model_build'
49 class='MAT.JavaCarafe.CarafeModelBuilder'/>
50 <default_model>default_model</default_model>
The settings for building a model are defined here. We use the Carafe engine, which uses its
default feature spec in the absence of a specified feature spec
file. We use periodic stepsize adjustment (training_method='psa'),
and we assign a location (a file named "default_model") as the
default location of models built for this task with MATModelBuilder
(see the --save_as_default_model flag). We also have a second,
non-default block of settings, named "alt_model_build", which
doesn't use periodic stepsize adjustment.
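Further named configurations would presumably follow the same pattern. The sketch below is hypothetical: the config name "long_training" and the value 20 are invented for illustration, and only attributes that already appear in the sample file are used. The fragment is wrapped in a <task> element so that it is well-formed.

```xml
<task name='Named Entity'>
  <!-- Default configuration, as in the sample task. -->
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='6'/>
  </model_config>
  <!-- Hypothetical named configuration: 'long_training' and 20 are invented. -->
  <model_config config_name='long_training'
                class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='20'/>
  </model_config>
  <default_model>default_model</default_model>
</task>
```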
51 <workspace>
52 <operation name='autotag'>
53 <settings steps='tag' workflow='Demo'/>
54 </operation>
55 <operation name='modelbuild'>
56 <settings/>
57 </operation>
58 <operation name='import'>
59 <settings steps='zone,tokenize' workflow='Hand annotation'/>
60 </operation>
61 </workspace>
We define the behavior of the workspace operations for this task here. For a list of predefined folders, see the workspace documentation. Each folder has a set of operations and expected possible settings. In this case, the autotag operation takes settings which are equivalent to the flags to MATEngine; the import operation does the same. So the autotag operation is equivalent to invoking MATEngine on a document using the Demo workflow defined above and performing one step (tag), while the import operation uses a different workflow (Hand annotation) and applies two steps (zone and tokenize). The modelbuild operation, on the other hand, specifies no settings at all; everything it needs is inherited from the model build settings block immediately above.
The <workspace> element is optional.
62 </task>
63 <task name='Enhanced Named Entity'>
At this point, we end the first task and begin the second one.
64 <annotation_set_descriptors all_annotations_known='no'
65 inherit='category:zone,category:token'>
66 <annotation_set_descriptor category='content' name='content'>
67 <annotation label='PERSON'/>
68 <annotation label='LOCATION'/>
69 <annotation label='ORGANIZATION'/>
70 <attribute of_annotation="PERSON,LOCATION,ORGANIZATION" name="nomtype">
71 <choice>Proper name</choice>
72 <choice>Noun</choice>
73 <choice>Pronoun</choice>
74 </attribute>
75 <attribute of_annotation="LOCATION" name="is_political_entity" type="boolean"/>
76 <annotation label="LOCATED_EVENT"/>
77 <attribute of_annotation="LOCATED_EVENT" type="annotation" name="actor">
78 <label_restriction label="PERSON"/>
79 </attribute>
80 <attribute of_annotation="LOCATED_EVENT" type="annotation" name="location">
81 <label_restriction label="LOCATION"/>
82 <label_restriction label="ORGANIZATION"/>
83 </attribute>
84 <annotation label="PERSON_COREF" span="no"/>
85 <attribute of_annotation="PERSON_COREF" type="annotation" aggregation="set" name="mentions">
86 <label_restriction label="PERSON"/>
87 </attribute>
88 <annotation label="LOCATION_RELATION" span="no"/>
89 <attribute of_annotation="LOCATION_RELATION" type="annotation" name="located">
90 <label_restriction label="ORGANIZATION"/>
91 <label_restriction label="PERSON"/>
92 </attribute>
93 <attribute of_annotation="LOCATION_RELATION" type="annotation" name="location">
94 <label_restriction label="LOCATION"/>
95 </attribute>
96 </annotation_set_descriptor>
97 </annotation_set_descriptors>
This annotation definition block is much more complex than the
one in the "Named Entity" task. In addition to the three labels we
saw previously, we also have three other labels: "LOCATED_EVENT"
(spanned) and "PERSON_COREF" and "LOCATION_RELATION" (spanless).
We also have several attributes, of different types. Most notable
is the "mentions" attribute of the "PERSON_COREF" annotation,
which takes sets of annotations as its value.
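The same pattern could be reused for other set-valued, spanless annotations. The sketch below is hypothetical: the ORGANIZATION_COREF label does not exist in the sample task and is modeled directly on PERSON_COREF. It assumes, as the element name suggests, that <label_restriction> limits which annotation labels may fill the attribute.

```xml
<annotation_set_descriptor category='content' name='content'>
  <annotation label='ORGANIZATION'/>
  <!-- Hypothetical spanless annotation, modeled on PERSON_COREF. -->
  <annotation label='ORGANIZATION_COREF' span='no'/>
  <!-- type='annotation' with aggregation='set': the value is a set of annotations. -->
  <attribute of_annotation='ORGANIZATION_COREF' type='annotation'
             aggregation='set' name='mentions'>
    <!-- Assumed to restrict the members of the set to ORGANIZATION annotations. -->
    <label_restriction label='ORGANIZATION'/>
  </attribute>
</annotation_set_descriptor>
```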
98 <annotation_display>
99 <label name='PERSON' accelerator='P' css='background-color: #CCFF66' edit_immediately="yes"/>
100 <label name='LOCATION' accelerator='L' css='background-color: #FF99CC' edit_immediately="yes"/>
101 <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF' edit_immediately="yes"/>
102 <label name='PERSON_COREF' accelerator='C' css='background-color: lightgreen' edit_immediately="yes"/>
103 <label name='LOCATED_EVENT' accelerator='E' css='background-color: pink' edit_immediately="yes"/>
104 <label name='LOCATION_RELATION' accelerator='R' css='background-color: orange' edit_immediately="yes"/>
105 </annotation_display>
The annotation display block is also somewhat more complex; we
see here that all of the annotations are marked to be edited
immediately upon creation.
106 <workflows>
107 <workflow name='Hand annotation'>
108 <step name='zone'/>
109 <step name='tokenize'/>
110 <step pretty_name='hand tag' name='tag' by_hand='yes'/>
111 </workflow>
112 <workflow name='Tokenless hand annotation'>
113 <step name='zone'/>
114 <step pretty_name='hand tag' name='tag' by_hand='yes'/>
115 </workflow>
116 <workflow hand_annotation_available_at_end='yes' name='Review/repair'/>
117 <workflow hand_annotation_available_at_end='yes' name='Demo'>
118 <step name='zone'/>
119 <step name='tokenize'/>
120 <step name='tag'/>
121 </workflow>
122 <workflow name='Align'>
123 <step name='zone'/>
124 <step name='tokenize'/>
125 <step name='align'/>
126 </workflow>
127 </workflows>
The workflows in this task are identical to those in the "Named
Entity" task. Because the Carafe tagger only operates on the
simple span subset of this (or any) task, the "Demo" workflow will
only apply the spanned labels, not the attributes associated with
them, and won't apply the spanless labels at all.
128 <step_implementations>
129 <step name='tokenize' class='MAT.JavaCarafe.CarafeTokenizationStep'/>
130 <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
131 <step name='align' class='MAT.PluginMgr.AlignStep'/>
132 <step workflows='Demo' name='tag' class='MAT.JavaCarafe.CarafeTagStep'/>
133 <step name='tag' class='MAT.PluginMgr.TagStep'/>
134 </step_implementations>
135 <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
136 <build_settings training_method='psa' max_iterations='6'/>
137 </model_config>
138 <model_config config_name='alt_model_build'
139 class='MAT.JavaCarafe.CarafeModelBuilder'/>
140 <default_model>default_enhanced_model</default_model>
The step implementations and model configuration are the same as
those in the "Named Entity" task, except that the default model is
named default_enhanced_model. Because the Carafe tagger only
operates on the simple span subset of this (or any) task, the
model builder will only train models for the spanned labels, not
the attributes associated with them, and won't build a model for
the spanless labels at all.
The workspace configuration is identical to that of the "Named Entity" task (and, of course, the caveats about Carafe for model building and automated tagging apply here as well).
141 <workspace>
142 <operation name='autotag'>
143 <settings steps='tag' workflow='Demo'/>
144 </operation>
145 <operation name='modelbuild'>
146 <settings/>
147 </operation>
148 <operation name='import'>
149 <settings steps='zone,tokenize' workflow='Hand annotation'/>
150 </operation>
151 </workspace>
152 </task>
153 </tasks>
Here, we describe in detail the makeup of the demo.xml file for the sample task. We've numbered the lines to indicate our progress through the file.
1 <demo name="Named Entity Identification">
Each demo has a name, which will be the title of the demo page.
2 <description>
3 <![CDATA[
4 <p>This demo shows the simple named entity identification capability
5 provided in the MAT sample task.
6 ]]>
7 </description>
Each demo has an HTML description. This description can be
arbitrary HTML. In order to force the XML parser to ignore the tag
structure, and treat the content as an unanalyzed string, we wrap
it in an XML <![CDATA[...]]> section.
8 <activity name="Tag" enable_blank_document="yes">
9 <description>Automatically locate named entities in the document.</description>
10 <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
Each demo can have a number of activities the user can perform.
In most cases, there will be only one (i.e., tag the document),
but if the task has been extensively customized, there may be
more. The enable_blank_document attribute makes it possible for
the user to type in arbitrary text.
Each activity has a description, which the user will see, and
settings for MATEngine which dictate
how to process the document.
11 <sample_document description="Sample news article #1"
12 file_type = "raw"
13 relative_location="resources/data/raw/voa1.txt"/>
14 <sample_document description="Sample news article #2"
15 file_type = "raw"
16 relative_location="resources/data/raw/voa2.txt"/>
Each activity can have a number of sample documents. Each sample document has a description, which the user sees in a drop-down menu; a location for the document (which should be a relative pathname within the task directory); and a file type, which says whether the document is a raw or mat-json document. If enable_blank_document is set, the user will see an additional entry for a blank document she can type into.
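A sample document in the other file type would be declared the same way. In the sketch below, the second entry is hypothetical: its description and its path (resources/data/json/voa1.txt.json) are invented for illustration and may not correspond to a file in the sample task. The "mat-json" value is the file type mentioned above.

```xml
<activity name="Tag" enable_blank_document="yes">
  <description>Automatically locate named entities in the document.</description>
  <engine_settings task="Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- As in the sample demo.xml: a raw text document. -->
  <sample_document description="Sample news article #1"
                   file_type="raw"
                   relative_location="resources/data/raw/voa1.txt"/>
  <!-- Hypothetical entry: the description and path are invented. -->
  <sample_document description="Sample document in mat-json format"
                   file_type="mat-json"
                   relative_location="resources/data/json/voa1.txt.json"/>
</activity>
```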
17 </activity>
18 </demo>
And now, we're done.