The sample tasks

The sample tasks can be found in MAT_PKG_HOME/sample/ne. This directory, like all task directories, has a file named task.xml at its root. The format of this file is described in the task XML documentation. The sample task has no Python or JavaScript customizations, so its directory has none of the corresponding subdirectories. See "Creating a task" for a description of the subdirectory structure of a task.

This task.xml file contains two tasks: "Named Entity" and "Enhanced Named Entity". The first task is a simple span task; it contains spanned annotations without any complex attribute structure. This task is used for Tutorials 1 - 7, and for a variety of other examples throughout this documentation. The second task is a complex task, containing both spanned and spanless annotations and multiple attributes, some of which take other annotations as their values. This second task is used for Tutorial 8, as well as the UI documentation on editing annotations and spanless annotations.

In addition to the task.xml file, the sample task directory contains a demo.xml file, which describes a demo of the automated tagging capability for the "Named Entity" task (the task named in the demo's engine settings). We describe both files in detail here.

The task.xml file

Here, we describe in detail the makeup of the task.xml file for the sample task. We've numbered the lines to indicate our progress through the file.

The "Named Entity" task

     1	<tasks>
     2	  <task name='Named Entity'>

The file typically contains a single task declaration, with <task> as the top-level element. However, if you wish to declare multiple tasks in the same task.xml file, it can instead contain multiple <task> elements within a top-level <tasks> element. Here, we define two tasks. Each task must be named.
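
To make the two possible layouts concrete, here is a minimal sketch in Python that lists the task names in either form of the file. It uses only the standard xml.etree.ElementTree module; the function name task_names and the two abridged XML strings are our own illustrations, not part of MAT.

```python
import xml.etree.ElementTree as ET

# Abridged stand-ins for the two layouts described above (illustrative only).
SINGLE = "<task name='Named Entity'/>"
MULTI = """<tasks>
  <task name='Named Entity'/>
  <task name='Enhanced Named Entity'/>
</tasks>"""

def task_names(xml_text):
    """Return the task names, whether the root is <task> or <tasks>."""
    root = ET.fromstring(xml_text)
    # A lone <task> root is treated as a one-element list of tasks.
    tasks = [root] if root.tag == "task" else root.findall("task")
    names = []
    for task in tasks:
        name = task.get("name")
        if not name:
            # The text above says each task must be named.
            raise ValueError("each task must be named")
        names.append(name)
    return names

print(task_names(SINGLE))
print(task_names(MULTI))
```

This is a hypothetical helper for inspecting a file, not the loader MAT itself uses.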

     3	    <annotation_set_descriptors all_annotations_known='no'
     4	                                inherit='category:zone,category:token'>
     5	      <annotation_set_descriptor category='content' name='content'>
     6	        <annotation label='PERSON'/>
     7	        <annotation label='LOCATION'/>
     8	        <annotation label='ORGANIZATION'/>
     9	      </annotation_set_descriptor>
    10	    </annotation_set_descriptors>
    11	    <annotation_display>
    12	      <label name='PERSON' accelerator='P' css='background-color: #CCFF66'/>
    13	      <label name='LOCATION' accelerator='L' css='background-color: #FF99CC'/>
    14	      <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF'/>
    15	    </annotation_display>

Each task contains a block of annotation declarations. Here, we have inherited the zone and token category tags from the root task, and defined our own content tags, PERSON, LOCATION and ORGANIZATION. In a separate <annotation_display> block, we define the display properties of these tags. For instance, the PERSON tag will display as light green (defined here in hexadecimal), and the tagging menu will support the "P" keyboard accelerator for annotating a selected span with the PERSON tag.
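
As an illustration of how the display block relates labels to accelerators and styles, the following sketch reads the <annotation_display> element shown above into two lookup tables. The function names display_table and accelerator_map are hypothetical; only the element and attribute names come from the listing.

```python
import xml.etree.ElementTree as ET

# The <annotation_display> block from the listing above, without line numbers.
DISPLAY = """<annotation_display>
  <label name='PERSON' accelerator='P' css='background-color: #CCFF66'/>
  <label name='LOCATION' accelerator='L' css='background-color: #FF99CC'/>
  <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF'/>
</annotation_display>"""

def display_table(xml_text):
    """Map each label name to its (accelerator, css) pair."""
    root = ET.fromstring(xml_text)
    return {
        label.get("name"): (label.get("accelerator"), label.get("css"))
        for label in root.findall("label")
    }

def accelerator_map(table):
    """Invert the table: accelerator key -> label name."""
    return {acc: name for name, (acc, _css) in table.items() if acc}

table = display_table(DISPLAY)
print(table["PERSON"])
print(accelerator_map(table))
```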

Both the <annotation_set_descriptors> and <annotation_display> elements are optional.

    16	    <workflows>
    17	      <workflow name='Hand annotation'>
    18	        <step name='zone'/>
    19	        <step name='tokenize'/>
    20	        <step pretty_name='hand tag' name='tag' by_hand='yes'/>
    21	      </workflow>
    22	      <workflow name='Tokenless hand annotation'>
    23	        <step name='zone'/>
    24	        <step pretty_name='hand tag' name='tag' by_hand='yes'/>
    25	      </workflow>
    26	      <workflow hand_annotation_available_at_end='yes' name='Review/repair'/>
    27	      <workflow hand_annotation_available_at_end='yes' name='Demo'>
    28	        <step name='zone'/>
    29	        <step name='tokenize'/>
    30	        <step name='tag'/>
    31	      </workflow>
    32	      <workflow name='Align'>
    33	        <step name='zone'/>
    34	        <step name='tokenize'/>
    35	        <step name='align'/>
    36	      </workflow>
    37	    </workflows>

The task contains five workflow definitions.

The first workflow has three steps: zone, tokenize, and tag; the last of these steps is marked as a hand task (i.e., it's not done by an automated process). This workflow allows you to prepare a document for hand tagging, and leaves a step for hand tagging itself.
The second workflow is the same as the first, except that it omits tokenization. This is typically not a good idea, because the Carafe engine should be given explicit tokenization. However, if you're using the annotation capabilities of MAT without the Carafe engine (say, you're just hand-annotating a corpus), this should be fine.
The third workflow has no steps, but hand annotation is available as a final option. This workflow allows you to correct already-annotated documents.
The fourth workflow has three steps, like the first, but its final step is automated. This workflow is intended for automated processing of documents; it uses a specified model, rather than hand annotation, to tag the document.
The fifth workflow has three steps, but in this case, the workflow is intended for documents which have content tags but nothing else. These documents were most likely prepared by other tools. The third step, instead of doing hand or automated tagging, aligns the content annotations with the token boundaries.
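
The five workflows just described can be summarized programmatically. The sketch below parses the <workflows> block from the listing and reports, for each workflow, its steps, which steps are marked by_hand, and whether hand annotation is available at the end. The function name summarize_workflows is our own; nothing here calls MAT itself.

```python
import xml.etree.ElementTree as ET

# The <workflows> block from the listing above, without line numbers.
WORKFLOWS = """<workflows>
  <workflow name='Hand annotation'>
    <step name='zone'/>
    <step name='tokenize'/>
    <step pretty_name='hand tag' name='tag' by_hand='yes'/>
  </workflow>
  <workflow name='Tokenless hand annotation'>
    <step name='zone'/>
    <step pretty_name='hand tag' name='tag' by_hand='yes'/>
  </workflow>
  <workflow hand_annotation_available_at_end='yes' name='Review/repair'/>
  <workflow hand_annotation_available_at_end='yes' name='Demo'>
    <step name='zone'/>
    <step name='tokenize'/>
    <step name='tag'/>
  </workflow>
  <workflow name='Align'>
    <step name='zone'/>
    <step name='tokenize'/>
    <step name='align'/>
  </workflow>
</workflows>"""

def summarize_workflows(xml_text):
    """Map workflow name -> its steps as (name, by_hand) pairs and its end flag."""
    root = ET.fromstring(xml_text)
    summary = {}
    for wf in root.findall("workflow"):
        steps = [(s.get("name"), s.get("by_hand") == "yes")
                 for s in wf.findall("step")]
        summary[wf.get("name")] = {
            "steps": steps,
            "hand_at_end": wf.get("hand_annotation_available_at_end") == "yes",
        }
    return summary

for name, info in summarize_workflows(WORKFLOWS).items():
    print(name, info)
```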

The implementation of these steps is found immediately below.

The <workflows> element is obligatory.

    38	    <step_implementations>
    39	      <step name='tokenize' class='MAT.JavaCarafe.CarafeTokenizationStep'/>
    40	      <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
    41	      <step name='align' class='MAT.PluginMgr.AlignStep'/>
    42	      <step workflows='Demo' name='tag' class='MAT.JavaCarafe.CarafeTagStep'/>
    43	      <step name='tag' class='MAT.PluginMgr.TagStep'/>
    44	    </step_implementations>

The task defines implementations for the steps in its workflows. The implementations are essentially mappings from simple names to the Python classes which implement the steps. The classes referenced here are described in the documentation on tasks. A step implementation can be limited to particular workflows, as the first implementation of the tag step is here (it applies only to the Demo workflow). If a step is designated as a by_hand step in a workflow, it will be assigned the PluginMgr.TagStep implementation automatically.

Every step must have an implementation, which is why there are two tagging step implementations: the second implementation for "tag" serves as the default for workflows other than Demo. Remember, the step names are global to the task, as described in the documentation on tasks, so it's not a good idea for the effect of a step's implementations to differ among workflows. Here, the two tag steps differ in the means by which they achieve the effect (one is automated and one is by hand), but otherwise their effect is identical.
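
One way to picture the workflow-limited lookup described above is the following sketch. It is not MAT's actual resolution code: the function name resolve_step_class is hypothetical, and we assume that a workflow-specific implementation takes precedence over the unrestricted one and that the workflows attribute may hold a comma-separated list.

```python
import xml.etree.ElementTree as ET

# The <step_implementations> block from the listing above, without line numbers.
IMPLS = """<step_implementations>
  <step name='tokenize' class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
  <step name='align' class='MAT.PluginMgr.AlignStep'/>
  <step workflows='Demo' name='tag' class='MAT.JavaCarafe.CarafeTagStep'/>
  <step name='tag' class='MAT.PluginMgr.TagStep'/>
</step_implementations>"""

def resolve_step_class(xml_text, workflow, step):
    """Pick the class name for a step in a workflow.

    Assumption: an implementation limited to the given workflow wins;
    otherwise the implementation with no 'workflows' attribute is used.
    """
    root = ET.fromstring(xml_text)
    default = None
    for impl in root.findall("step"):
        if impl.get("name") != step:
            continue
        limited = impl.get("workflows")
        if limited is None:
            default = impl.get("class")
        elif workflow in [w.strip() for w in limited.split(",")]:
            return impl.get("class")
    if default is None:
        raise KeyError("no implementation for step %r" % step)
    return default

print(resolve_step_class(IMPLS, "Demo", "tag"))
print(resolve_step_class(IMPLS, "Hand annotation", "tag"))
```

Under these assumptions, the tag step resolves to the Carafe tagger in the Demo workflow and to the hand tagging step everywhere else, which matches the description above.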

The <step_implementations> element is optional.

    45	    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    46	      <build_settings training_method='psa' max_iterations='6'/>
    47	    </model_config>
    48	    <model_config config_name='alt_model_build'
    49	                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    50	    <default_model>default_model</default_model>

The settings for building a model are defined here. We use the Carafe engine, which uses its default feature spec in the absence of a specified feature spec file. We use periodic stepsize adjustment (training_method='psa') and set max_iterations to 6, and we assign a location (a file named "default_model") as the default location of models built for this task with MATModelBuilder (see the --save_as_default_model flag). We also have a second, non-default block of settings, named "alt_model_build", which doesn't use periodic stepsize adjustment.

The <model_config> element is optional, as is <default_model>.
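
The difference between the default and the named settings block can be seen by reading them side by side. In this sketch, which wraps the listing's elements in an abridged <task> element, the function name model_configs is our own, and we key the default block by None simply because it has no config_name attribute.

```python
import xml.etree.ElementTree as ET

# Model-related elements from the listing above, inside an abridged <task>.
TASK = """<task name='Named Entity'>
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='6'/>
  </model_config>
  <model_config config_name='alt_model_build'
                class='MAT.JavaCarafe.CarafeModelBuilder'/>
  <default_model>default_model</default_model>
</task>"""

def model_configs(xml_text):
    """Return ({config_name or None: settings}, default_model_name)."""
    root = ET.fromstring(xml_text)
    configs = {}
    for mc in root.findall("model_config"):
        build = mc.find("build_settings")
        configs[mc.get("config_name")] = {
            "class": mc.get("class"),
            # No <build_settings> child means no explicit build settings.
            "build_settings": dict(build.attrib) if build is not None else {},
        }
    return configs, root.findtext("default_model")

configs, default_model = model_configs(TASK)
print(configs[None]["build_settings"])
print(configs["alt_model_build"]["build_settings"])
print(default_model)
```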

    51	    <workspace>
    52	      <operation name='autotag'>
    53	        <settings steps='tag' workflow='Demo'/>
    54	      </operation>
    55	      <operation name='modelbuild'>
    56	        <settings/>
    57	      </operation>
    58	      <operation name='import'>
    59	        <settings steps='zone,tokenize' workflow='Hand annotation'/>
    60	      </operation>
    61	    </workspace>

We define the behavior of the workspace operations for this task here. For a list of predefined folders, see the workspace documentation. Each folder has a set of operations and expected possible settings. In this case, the autotag operation takes settings which are equivalent to the flags to MATEngine; the import operation does the same. So the autotag operation is equivalent to invoking MATEngine on a document using the Demo workflow defined above and performing one step (tag), while import uses a different workflow (Hand annotation) and applies two steps (zone and tokenize). The modelbuild operation, on the other hand, specifies no settings at all; everything it needs is inherited from the model build settings block immediately above.

The <workspace> element is optional.
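
To show what each operation contributes, the following sketch collects the settings of every workspace operation into a dictionary, splitting the comma-separated steps value into a list. The function name operation_settings is hypothetical, and the sketch only reads the XML; it does not invoke MATEngine.

```python
import xml.etree.ElementTree as ET

# The <workspace> block from the listing above, without line numbers.
WORKSPACE = """<workspace>
  <operation name='autotag'>
    <settings steps='tag' workflow='Demo'/>
  </operation>
  <operation name='modelbuild'>
    <settings/>
  </operation>
  <operation name='import'>
    <settings steps='zone,tokenize' workflow='Hand annotation'/>
  </operation>
</workspace>"""

def operation_settings(xml_text):
    """Map operation name -> settings dict, with 'steps' split into a list."""
    root = ET.fromstring(xml_text)
    operations = {}
    for op in root.findall("operation"):
        settings = op.find("settings")
        attrs = dict(settings.attrib) if settings is not None else {}
        if "steps" in attrs:
            attrs["steps"] = attrs["steps"].split(",")
        operations[op.get("name")] = attrs
    return operations

for name, settings in operation_settings(WORKSPACE).items():
    print(name, settings)
```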

The "Enhanced Named Entity" task

    62	  </task>
    63	  <task name='Enhanced Named Entity'>

At this point, we end the first task and begin the second one.

    64	    <annotation_set_descriptors all_annotations_known='no'
    65	                                inherit='category:zone,category:token'>
    66	      <annotation_set_descriptor category='content' name='content'>
    67	        <annotation label='PERSON'/>
    68	        <annotation label='LOCATION'/>
    69	        <annotation label='ORGANIZATION'/>
    70	        <attribute of_annotation="PERSON,LOCATION,ORGANIZATION" name="nomtype">
    71	          <choice>Proper name</choice>
    72	          <choice>Noun</choice>
    73	          <choice>Pronoun</choice>
    74	        </attribute>
    75	        <attribute of_annotation="LOCATION" name="is_political_entity" type="boolean"/>
    76	        <annotation label="LOCATED_EVENT"/>
    77	        <attribute of_annotation="LOCATED_EVENT" type="annotation" name="actor">
    78	          <label_restriction label="PERSON"/>
    79	        </attribute>
    80	        <attribute of_annotation="LOCATED_EVENT" type="annotation" name="location">
    81	          <label_restriction label="LOCATION"/>
    82	          <label_restriction label="ORGANIZATION"/>
    83	        </attribute>
    84	        <annotation label="PERSON_COREF" span="no"/>
    85	        <attribute of_annotation="PERSON_COREF" type="annotation" aggregation="set" name="mentions">
    86	          <label_restriction label="PERSON"/>
    87	        </attribute>
    88	        <annotation label="LOCATION_RELATION" span="no"/>
    89	        <attribute of_annotation="LOCATION_RELATION" type="annotation" name="located">
    90	          <label_restriction label="ORGANIZATION"/>
    91	          <label_restriction label="PERSON"/>
    92	        </attribute>
    93	        <attribute of_annotation="LOCATION_RELATION" type="annotation" name="location">
    94	          <label_restriction label="LOCATION"/>
    95	        </attribute>
    96	      </annotation_set_descriptor>
    97	    </annotation_set_descriptors>

This annotation definition block is much more complex than the one in the "Named Entity" task. In addition to the three labels we saw previously, we also have three other labels: "LOCATED_EVENT" (spanned) and "PERSON_COREF" and "LOCATION_RELATION" (spanless). We also have several attributes, of different types. Most notable is the "mentions" attribute of the "PERSON_COREF" annotation, which takes sets of annotations as its value.
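
The structure of this block is easier to see when it is tabulated. The sketch below separates spanned from spanless labels (treating span="no" as the spanless marker, as in the listing) and lists the attribute names declared for each label. The function name describe_annotations is our own.

```python
import xml.etree.ElementTree as ET

# The annotation declarations from the listing above, without line numbers.
DESCRIPTORS = """<annotation_set_descriptors all_annotations_known='no'
                            inherit='category:zone,category:token'>
  <annotation_set_descriptor category='content' name='content'>
    <annotation label='PERSON'/>
    <annotation label='LOCATION'/>
    <annotation label='ORGANIZATION'/>
    <attribute of_annotation="PERSON,LOCATION,ORGANIZATION" name="nomtype">
      <choice>Proper name</choice>
      <choice>Noun</choice>
      <choice>Pronoun</choice>
    </attribute>
    <attribute of_annotation="LOCATION" name="is_political_entity" type="boolean"/>
    <annotation label="LOCATED_EVENT"/>
    <attribute of_annotation="LOCATED_EVENT" type="annotation" name="actor">
      <label_restriction label="PERSON"/>
    </attribute>
    <attribute of_annotation="LOCATED_EVENT" type="annotation" name="location">
      <label_restriction label="LOCATION"/>
      <label_restriction label="ORGANIZATION"/>
    </attribute>
    <annotation label="PERSON_COREF" span="no"/>
    <attribute of_annotation="PERSON_COREF" type="annotation" aggregation="set" name="mentions">
      <label_restriction label="PERSON"/>
    </attribute>
    <annotation label="LOCATION_RELATION" span="no"/>
    <attribute of_annotation="LOCATION_RELATION" type="annotation" name="located">
      <label_restriction label="ORGANIZATION"/>
      <label_restriction label="PERSON"/>
    </attribute>
    <attribute of_annotation="LOCATION_RELATION" type="annotation" name="location">
      <label_restriction label="LOCATION"/>
    </attribute>
  </annotation_set_descriptor>
</annotation_set_descriptors>"""

def describe_annotations(xml_text):
    """Return (spanned labels, spanless labels, {label: [attribute names]})."""
    root = ET.fromstring(xml_text)
    spanned, spanless = [], []
    for ann in root.iter("annotation"):
        (spanless if ann.get("span") == "no" else spanned).append(ann.get("label"))
    attributes = {}
    for attr in root.iter("attribute"):
        # of_annotation may name several labels, separated by commas.
        for owner in attr.get("of_annotation").split(","):
            attributes.setdefault(owner, []).append(attr.get("name"))
    return spanned, spanless, attributes

spanned, spanless, attributes = describe_annotations(DESCRIPTORS)
print(spanned)
print(spanless)
print(attributes)
```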

    98	    <annotation_display>
    99	      <label name='PERSON' accelerator='P' css='background-color: #CCFF66' edit_immediately="yes"/>
   100	      <label name='LOCATION' accelerator='L' css='background-color: #FF99CC' edit_immediately="yes"/>
   101	      <label name='ORGANIZATION' accelerator='O' css='background-color: #99CCFF' edit_immediately="yes"/>
   102	      <label name='PERSON_COREF' accelerator='C' css='background-color: lightgreen' edit_immediately="yes"/>
   103	      <label name='LOCATED_EVENT' accelerator='E' css='background-color: pink' edit_immediately="yes"/>
   104	      <label name='LOCATION_RELATION' accelerator='R' css='background-color: orange' edit_immediately="yes"/>
   105	    </annotation_display>

The annotation display block is also somewhat more complex; we see here that all of the annotations are marked to be edited immediately upon creation.

   106	    <workflows>
   107	      <workflow name='Hand annotation'>
   108	        <step name='zone'/>
   109	        <step name='tokenize'/>
   110	        <step pretty_name='hand tag' name='tag' by_hand='yes'/>
   111	      </workflow>
   112	      <workflow name='Tokenless hand annotation'>
   113	        <step name='zone'/>
   114	        <step pretty_name='hand tag' name='tag' by_hand='yes'/>
   115	      </workflow>
   116	      <workflow hand_annotation_available_at_end='yes' name='Review/repair'/>
   117	      <workflow hand_annotation_available_at_end='yes' name='Demo'>
   118	        <step name='zone'/>
   119	        <step name='tokenize'/>
   120	        <step name='tag'/>
   121	      </workflow>
   122	      <workflow name='Align'>
   123	        <step name='zone'/>
   124	        <step name='tokenize'/>
   125	        <step name='align'/>
   126	      </workflow>
   127	    </workflows>

The workflows in this task are identical to those in the "Named Entity" task. Because the Carafe tagger only operates on the simple span subset of this (or any) task, the "Demo" workflow will only apply the spanned labels, not the attributes associated with them, and won't apply the spanless labels at all.

   128	    <step_implementations>
   129	      <step name='tokenize' class='MAT.JavaCarafe.CarafeTokenizationStep'/>
   130	      <step name='zone' class='MAT.PluginMgr.WholeZoneStep'/>
   131	      <step name='align' class='MAT.PluginMgr.AlignStep'/>
   132	      <step workflows='Demo' name='tag' class='MAT.JavaCarafe.CarafeTagStep'/>
   133	      <step name='tag' class='MAT.PluginMgr.TagStep'/>
   134	    </step_implementations>
   135	    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
   136	      <build_settings training_method='psa' max_iterations='6'/>
   137	    </model_config>
   138	    <model_config config_name='alt_model_build'
   139	                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
   140	    <default_model>default_enhanced_model</default_model>

The step implementation and model configuration are the same as those in the "Named Entity" task. Because the Carafe tagger only operates on the simple span subset of this (or any) task, the model builder will only train models for the spanned labels, not the attributes associated with them, and won't build a model for the spanless labels at all.

   141	    <workspace>
   142	      <operation name='autotag'>
   143	        <settings steps='tag' workflow='Demo'/>
   144	      </operation>
   145	      <operation name='modelbuild'>
   146	        <settings/>
   147	      </operation>
   148	      <operation name='import'>
   149	        <settings steps='zone,tokenize' workflow='Hand annotation'/>
   150	      </operation>
   151	    </workspace>
   152	  </task>
   153	</tasks>

The workspace configuration is identical to that of the "Named Entity" task (and, of course, the caveats about Carafe for model building and automated tagging apply here as well).

The demo.xml file

Here, we describe in detail the makeup of the demo.xml file for the sample task. We've numbered the lines to indicate our progress through the file.

     1	<demo name="Named Entity Identification">

Each demo has a name, which will be the title of the demo page.

     2	  <description>
     3	<![CDATA[
     4	<p>This demo shows the simple named entity identification capability
     5	  provided in the MAT sample task.
     6	]]>
     7	  </description>

Each demo has a description, which can be arbitrary HTML. In order to force the XML parser to ignore the tag structure and treat the content as an unanalyzed string, we enclose it in an XML CDATA section (<![CDATA[...]]>).
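
The effect of the CDATA section can be checked with any XML parser. In this small sketch, which uses Python's standard xml.etree.ElementTree on an abridged copy of the demo file, the HTML markup arrives as plain text of the <description> element rather than as child elements.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the start of demo.xml, without line numbers.
DEMO = """<demo name="Named Entity Identification">
  <description>
<![CDATA[
<p>This demo shows the simple named entity identification capability
  provided in the MAT sample task.
]]>
  </description>
</demo>"""

root = ET.fromstring(DEMO)
description = root.find("description")

# The <p> tag survives as literal text inside the description...
html_text = description.text
print(html_text.strip())

# ...and the parser created no child elements for it.
print(len(list(description)))
```

Without the CDATA section, the unclosed <p> tag in this description would make the file ill-formed XML.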

     8	  <activity name="Tag" enable_blank_document="yes">
     9	    <description>Automatically locate named entities in the document.</description>
    10	    <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>

Each demo can have a number of activities the user can perform. In most cases, there will be only one (i.e., tag the document), but if the task has been extensively customized, there may be more. The enable_blank_document attribute makes it possible for the user to type in arbitrary text.

Each activity has a description, which the user will see, and settings for MATEngine which dictate how to process the document.

    11	    <sample_document description="Sample news article #1"
    12	                     file_type = "raw"
    13	                     relative_location="resources/data/raw/voa1.txt"/>
    14	    <sample_document description="Sample news article #2"
    15	                     file_type = "raw"
    16	                     relative_location="resources/data/raw/voa2.txt"/>

Each activity can have a number of sample documents. Each sample document has a description, which the user sees in a drop-down menu; a location for the document (which should be a relative pathname within the task directory); and a file type, which indicates whether the document is a raw or mat-json document. If enable_blank_document is set to "yes", the user will see an additional entry for a blank document she can type into.
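
Putting the pieces of an activity together, the following sketch reads an activity's engine settings, its blank-document flag, and its sample documents from the demo file. The function name read_activities and the dictionary keys are our own choices; only the element and attribute names come from the listing.

```python
import xml.etree.ElementTree as ET

# The activity portion of demo.xml from the listings above, without line numbers.
DEMO = """<demo name="Named Entity Identification">
  <activity name="Tag" enable_blank_document="yes">
    <description>Automatically locate named entities in the document.</description>
    <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
    <sample_document description="Sample news article #1"
                     file_type = "raw"
                     relative_location="resources/data/raw/voa1.txt"/>
    <sample_document description="Sample news article #2"
                     file_type = "raw"
                     relative_location="resources/data/raw/voa2.txt"/>
  </activity>
</demo>"""

def read_activities(xml_text):
    """Return one dict per <activity> with its settings and sample documents."""
    root = ET.fromstring(xml_text)
    activities = []
    for act in root.findall("activity"):
        engine = act.find("engine_settings")
        activities.append({
            "name": act.get("name"),
            "blank_ok": act.get("enable_blank_document") == "yes",
            "engine_settings": dict(engine.attrib) if engine is not None else {},
            "documents": [
                (d.get("description"), d.get("file_type"), d.get("relative_location"))
                for d in act.findall("sample_document")
            ],
        })
    return activities

for activity in read_activities(DEMO):
    print(activity["name"], activity["engine_settings"])
    for doc in activity["documents"]:
        print("  ", doc)
```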

    17	  </activity>
    18	</demo>

And now, we're done.