The Sample Task

The sample task can be found in MAT_PKG_HOME/sample/ne. Like all tasks, it has a file named task.xml at its root. The format of this file is described in the task XML documentation. The sample task has no Python or JavaScript customizations, so it has none of the corresponding subdirectories. See "Creating a task" for a description of the subdirectory structure of a task.

In addition to the task.xml file, the sample task contains a demo.xml file which describes a demo of the automated capability in this task. We describe both of those files in detail here.

The task.xml file

Here, we describe in detail the makeup of the task.xml file for the sample task. We've numbered the lines to indicate our progress through the file.

     1	<task name="Named Entity">

The file contains a single task declaration, which must be named.

     2	  <tags inherit_structure="yes">
     3	    <tag name="PERSON" category="content">
     4	      <ui css="background-color: CCFF66" accelerator="P"/><!-- # light green -->
     5	    </tag>
     6	    <tag name="LOCATION" category="content">
     7	      <ui css="background-color: FF99CC" accelerator="L"/><!-- # pink -->
     8	    </tag>
     9	    <tag name="ORGANIZATION" category="content">
    10	      <ui css="background-color: 99CCFF" accelerator="O"/><!-- # light blue -->
    11	    </tag>
    12	  </tags>

The file contains a block of tag declarations. Here, we have inherited the structure tags (i.e., zone and lex) from the root task, and defined our own content tags. Each of the content tags must have a name. The <ui> subelement describes the visual features of the tag. For instance, the PERSON tag will display as light green, and the tagging menu will support the "P" keyboard accelerator for annotating a selected span with the PERSON tag.
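To make the structure of the tag block concrete, here is a minimal sketch that reads it with Python's standard xml.etree.ElementTree module. It is not part of MAT; the function name read_tags is hypothetical, and the XML is copied from the listing above with the listing line numbers and comments removed.

```python
import xml.etree.ElementTree as ET

# Tag block copied from the sample task.xml (listing line numbers removed).
TAGS_XML = """
<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: CCFF66" accelerator="P"/>
  </tag>
  <tag name="LOCATION" category="content">
    <ui css="background-color: FF99CC" accelerator="L"/>
  </tag>
  <tag name="ORGANIZATION" category="content">
    <ui css="background-color: 99CCFF" accelerator="O"/>
  </tag>
</tags>
"""


def read_tags(xml_text):
    """Hypothetical helper: return (name, category, css, accelerator) per tag."""
    root = ET.fromstring(xml_text)
    rows = []
    for tag in root.findall("tag"):
        ui = tag.find("ui")  # the <ui> subelement carries the visual features
        css = ui.get("css") if ui is not None else None
        accel = ui.get("accelerator") if ui is not None else None
        rows.append((tag.get("name"), tag.get("category"), css, accel))
    return rows


tag_rows = read_tags(TAGS_XML)
for row in tag_rows:
    print(row)
```

Each tuple pairs a content tag with its display style and keyboard accelerator, which is the information the annotation UI needs from this block.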

The tag block is obligatory.

    13	  <workflows>
    14	    <workflow name="Hand annotation">
    15	      <step name="zone"/>
    16	      <step name="tokenize"/>
    17	      <step name="tag" pretty_name="hand tag" by_hand="yes"/>
    18	    </workflow>
    19	    <workflow name="Tokenless hand annotation">
    20	      <step name="zone"/>
    21	      <step name="tag" pretty_name="hand tag" by_hand="yes"/>
    22	    </workflow>
    23	    <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
    24	    <workflow name="Demo" hand_annotation_available_at_end="yes">
    25	      <step name="zone"/>
    26	      <step name="tokenize"/>
    27	      <step name="tag"/>
    28	    </workflow>
    29	    <workflow name="Align">
    30	      <step name="zone"/>
    31	      <step name="tokenize"/>
    32	      <step name="align"/>
    33	    </workflow>
    34	  </workflows>

The file contains five workflow definitions.

The first workflow has three steps: zone, tokenize, and tag; the last of these steps is marked as a hand task (i.e., it's not done by an automated process). This workflow allows you to prepare a document for hand tagging, and leaves a step for hand tagging itself.
The second workflow is the same as the first, except it omits tokenization. This is not advisable; we include this workflow for testing purposes, for those situations where no tokenizer is available. Note that if tokenization is omitted, we do not guarantee that the entire MAT suite will work.
The third workflow has no steps, but hand annotation is available as a final option. This workflow allows you to correct already-annotated documents.
The fourth workflow has three steps, like the first, but the final step is an automated step, and is intended for automated processing of documents; it uses a specified model, rather than hand annotation, to tag the document.
The fifth workflow has three steps, but in this case, the workflow is intended for documents which have content tags but nothing else. These documents were most likely prepared by other tools. The third step, instead of doing hand or automated tagging, aligns the content annotations with the token boundaries.
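The workflow descriptions above can be checked mechanically. The following is a sketch only, using Python's standard xml.etree.ElementTree module rather than any MAT API; summarize_workflows is a hypothetical helper name, and the XML is copied from the listing with the line numbers removed.

```python
import xml.etree.ElementTree as ET

# Workflows block copied from the sample task.xml (listing line numbers removed).
WORKFLOWS_XML = """
<workflows>
  <workflow name="Hand annotation">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag" pretty_name="hand tag" by_hand="yes"/>
  </workflow>
  <workflow name="Tokenless hand annotation">
    <step name="zone"/>
    <step name="tag" pretty_name="hand tag" by_hand="yes"/>
  </workflow>
  <workflow name="Review/repair" hand_annotation_available_at_end="yes"/>
  <workflow name="Demo" hand_annotation_available_at_end="yes">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="tag"/>
  </workflow>
  <workflow name="Align">
    <step name="zone"/>
    <step name="tokenize"/>
    <step name="align"/>
  </workflow>
</workflows>
"""


def summarize_workflows(xml_text):
    """Hypothetical helper: map workflow name to its steps and end-of-workflow flag."""
    root = ET.fromstring(xml_text)
    summary = {}
    for wf in root.findall("workflow"):
        # Each step is recorded as (step name, is it a by-hand step?).
        steps = [(s.get("name"), s.get("by_hand") == "yes") for s in wf.findall("step")]
        summary[wf.get("name")] = {
            "steps": steps,
            "hand_at_end": wf.get("hand_annotation_available_at_end") == "yes",
        }
    return summary


workflow_summary = summarize_workflows(WORKFLOWS_XML)
for name, info in workflow_summary.items():
    print(name, info)
```

The output makes the differences among the workflows easy to see: which ones end in a by-hand tag step, which one has no steps at all, and which ones offer hand annotation at the end.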

The implementation of these steps is found immediately below.

The workflows block is obligatory.

    35	  <step_implementations>
    36	    <step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"/>
    37	    <step name="zone" class="MAT.PluginMgr.WholeZoneStep"/>
    38	    <step name="align" class="MAT.PluginMgr.AlignStep"/>
    39	    <step name="tag" tagging_step="yes" workflows="Demo" class="MAT.JavaCarafe.CarafeTagStep"/>
    40	    <!-- for undo -->
    41	    <step name="tag" tagging_step="yes" class="MAT.PluginMgr.TagStep"/>
    42	  </step_implementations>

The file defines implementations for the steps in the workflows. The implementations are essentially mappings from simple names to the Python classes which implement the steps. The classes referenced here are described in the documentation on tasks. Step implementations can be restricted to particular workflows, as the first implementation of the tag step is here (it applies only to the Demo workflow). Step implementations can also be designated as tagging steps, which are the only steps that support the "by_hand" attribute that can be specified in the workflows. If a step is designated as a tagging step and is marked as a by_hand step in a workflow, it will be assigned the PluginMgr.HandAnnotationTagStep automatically.

Every step must have an implementation, which is why there are two implementations of the tag step: the first is restricted to the Demo workflow, and the second serves as the default for all other workflows. Remember, the step names are global to the task, as described in the documentation on tasks, so it's not a good idea for the effect of a step to differ among workflows. Here, the two tag steps differ in the means by which they achieve their effect (one is automated and one is by hand), but otherwise their effect is identical.
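One way to picture how a step name might be mapped to a class is sketched below. This is an illustration, not MAT's actual lookup code: the resolve function, the table layout, and the precedence rule (a workflow-restricted implementation wins over an unrestricted one) are all assumptions based on the description above. The class names are taken from the listing.

```python
# Step implementations from the sample task.xml, as
# (step name, workflows it is restricted to or None, class name).
STEP_IMPLS = [
    ("tokenize", None, "MAT.JavaCarafe.CarafeTokenizationStep"),
    ("zone", None, "MAT.PluginMgr.WholeZoneStep"),
    ("align", None, "MAT.PluginMgr.AlignStep"),
    ("tag", {"Demo"}, "MAT.JavaCarafe.CarafeTagStep"),
    ("tag", None, "MAT.PluginMgr.TagStep"),
]

# Steps declared with tagging_step="yes" in the listing.
TAGGING_STEPS = {"tag"}


def resolve(step, workflow, by_hand=False):
    """Hypothetical lookup: return the class name for a step in a workflow.

    Assumed precedence: by-hand tagging steps first, then an implementation
    restricted to this workflow, then the unrestricted default.
    """
    if by_hand and step in TAGGING_STEPS:
        return "PluginMgr.HandAnnotationTagStep"
    default = None
    for name, workflows, cls in STEP_IMPLS:
        if name != step:
            continue
        if workflows is not None and workflow in workflows:
            return cls
        if workflows is None and default is None:
            default = cls
    if default is None:
        raise KeyError("no implementation for step %r" % step)
    return default


print(resolve("tag", "Demo"))
print(resolve("tag", "Hand annotation", by_hand=True))
print(resolve("zone", "Align"))
```

Under these assumptions, the tag step resolves to the Carafe tagger only in the Demo workflow, and to the hand annotation step wherever the workflow marks it by_hand.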

The step implementations block is optional.

    43	  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    44	    <build_settings training_method="psa" max_iterations="6"/>
    45	  </model_config>
    46	  <model_config config_name="alt_model_build" class="MAT.JavaCarafe.CarafeModelBuilder"/>
    47	  <default_model>default_model</default_model>

The settings for building a model are defined here. We use the Carafe engine, which falls back on its default feature spec when no feature spec file is specified. We use periodic stepsize adjustment, and we assign a default location (a file named "default_model") for models built for this task with MATModelBuilder (see the --save_as_default_model flag). We also have a second, non-default block of settings, named "alt_model_build", which doesn't use periodic stepsize adjustment.
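As an illustration of how the two model_config blocks differ, the sketch below reads them with Python's standard xml.etree.ElementTree module. It is not MAT code; read_model_configs is a hypothetical helper, and the enclosing task element is included only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# Model settings copied from the sample task.xml (listing line numbers removed),
# wrapped in the enclosing <task> element so the fragment parses on its own.
MODEL_XML = """
<task name="Named Entity">
  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    <build_settings training_method="psa" max_iterations="6"/>
  </model_config>
  <model_config config_name="alt_model_build" class="MAT.JavaCarafe.CarafeModelBuilder"/>
  <default_model>default_model</default_model>
</task>
"""


def read_model_configs(xml_text):
    """Hypothetical helper: return ({config name: settings}, default model name).

    The unnamed (default) configuration is stored under the key None.
    """
    root = ET.fromstring(xml_text)
    configs = {}
    for mc in root.findall("model_config"):
        build = mc.find("build_settings")
        configs[mc.get("config_name")] = {
            "class": mc.get("class"),
            "build_settings": dict(build.attrib) if build is not None else {},
        }
    return configs, root.findtext("default_model")


model_configs, default_model = read_model_configs(MODEL_XML)
print(model_configs[None])
print(model_configs["alt_model_build"])
print(default_model)
```

The default configuration carries explicit build settings, while alt_model_build has none, so it relies entirely on the builder's own defaults.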

The model build settings block is optional, as is the default model.

    48	  <workspace>
    49	    <operation name="autotag">
    50	      <settings workflow="Demo" steps="zone,tokenize,tag"/>
    51	    </operation>
    52	    <operation name="modelbuild">
    53	      <settings/>
    54	    </operation>
    55	    <operation name="tagprep">
    56	      <settings workflow="Hand annotation" steps="zone,tokenize"/>
    57	    </operation>
    58	  </workspace>

We define the behavior of the workspace operations for this task here. For a list of predefined folders, see the workspace documentation. Each folder has a set of operations, and each operation has a set of expected settings. In this case, the autotag operation takes settings which are equivalent to the flags to MATEngine; the tagprep operation does the same. So the autotag operation is equivalent to invoking MATEngine on a document using the Demo workflow defined above and performing three steps, while tagprep uses the Hand annotation workflow and applies two steps. The modelbuild operation, on the other hand, specifies no settings at all; everything it needs is inherited from the model build settings block immediately above.
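The operation settings can be read out as plain dictionaries, as in the following sketch. It uses Python's standard xml.etree.ElementTree module and is not part of MAT; read_operations is a hypothetical helper, and splitting the steps attribute on commas is an assumption based on how the values appear in the listing.

```python
import xml.etree.ElementTree as ET

# Workspace block copied from the sample task.xml (listing line numbers removed).
WORKSPACE_XML = """
<workspace>
  <operation name="autotag">
    <settings workflow="Demo" steps="zone,tokenize,tag"/>
  </operation>
  <operation name="modelbuild">
    <settings/>
  </operation>
  <operation name="tagprep">
    <settings workflow="Hand annotation" steps="zone,tokenize"/>
  </operation>
</workspace>
"""


def read_operations(xml_text):
    """Hypothetical helper: map each operation name to its settings."""
    root = ET.fromstring(xml_text)
    operations = {}
    for op in root.findall("operation"):
        settings_elt = op.find("settings")
        settings = dict(settings_elt.attrib) if settings_elt is not None else {}
        if "steps" in settings:
            # Assumption: steps is a comma-separated list of step names.
            settings["steps"] = settings["steps"].split(",")
        operations[op.get("name")] = settings
    return operations


operations = read_operations(WORKSPACE_XML)
for name, settings in operations.items():
    print(name, settings)
```

The empty dictionary for modelbuild reflects the point made above: that operation carries no settings of its own.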

The workspace block is optional.

    60	</task>

And finally, we're done.

The demo.xml file

Here, we describe in detail the makeup of the demo.xml file for the sample task. We've numbered the lines to indicate our progress through the file.

     1	<demo name="Named Entity Identification">

Each demo has a name, which will be the title of the demo page.

     2	  <description>
     3	<![CDATA[
     4	<p>This demo shows the simple named entity identification capability
     5	  provided in the MAT sample task.
     6	]]>
     7	  </description>

Each demo has a description, which can be arbitrary HTML. To force the XML parser to ignore the tag structure and treat the content as an unanalyzed string, we wrap it in an XML <![CDATA[...]]> section.
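The effect of the CDATA wrapper can be seen with any XML parser. The short sketch below uses Python's standard xml.etree.ElementTree module (not MAT itself) on the description element from the listing; the markup inside the wrapper comes back as ordinary text rather than as child elements.

```python
import xml.etree.ElementTree as ET

# Description element copied from the sample demo.xml (listing line numbers removed).
DESCRIPTION_XML = """<description>
<![CDATA[
<p>This demo shows the simple named entity identification capability
  provided in the MAT sample task.
]]>
  </description>"""

description = ET.fromstring(DESCRIPTION_XML)

# The <p> inside the CDATA section is not parsed as an element ...
child_count = len(list(description))
# ... it survives as literal characters in the element's text.
html_text = description.text.strip()

print(child_count)
print(html_text)
```

Without the CDATA wrapper, the unclosed p tag in this description would make the file ill-formed XML, which is why the wrapper matters for arbitrary HTML.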

     8	  <activity name="Tag" enable_blank_document="yes">
     9	    <description>Automatically locate named entities in the document.</description>
    10	    <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>

Each demo can have a number of activities the user can perform. In most cases, there will be only one (i.e., tag the document), but if the task has been extensively customized, there may be more. The enable_blank_document attribute makes it possible for the user to type in arbitrary text.

Each activity has a description, which the user will see, and settings for MATEngine which dictate how to process the document.

    11	    <sample_document description="Sample news article #1"
    12	                     file_type = "raw"
    13	                     relative_location="resources/data/raw/voa1.txt"/>
    14	    <sample_document description="Sample news article #2"
    15	                     file_type = "raw"
    16	                     relative_location="resources/data/raw/voa2.txt"/>

Each activity can have a number of sample documents (one of which might be a blank document if enable_blank_document is used). Each sample document has a description, which the user sees in a drop-down menu; a location for the document, which should be a relative pathname within the task directory; and a file type, which indicates whether the document is a raw or mat-json document.
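The pieces of an activity can be pulled together as in the sketch below, which reads the activity element with Python's standard xml.etree.ElementTree module. It is an illustration rather than MAT code; read_activity is a hypothetical helper, and the XML is the activity from the listings with the line numbers removed and its closing tag supplied.

```python
import xml.etree.ElementTree as ET

# Activity copied from the sample demo.xml (listing line numbers removed).
ACTIVITY_XML = """
<activity name="Tag" enable_blank_document="yes">
  <description>Automatically locate named entities in the document.</description>
  <engine_settings task = "Named Entity" workflow="Demo" steps="zone,tokenize,tag"/>
  <sample_document description="Sample news article #1"
                   file_type = "raw"
                   relative_location="resources/data/raw/voa1.txt"/>
  <sample_document description="Sample news article #2"
                   file_type = "raw"
                   relative_location="resources/data/raw/voa2.txt"/>
</activity>
"""


def read_activity(xml_text):
    """Hypothetical helper: collect an activity's settings and sample documents."""
    activity = ET.fromstring(xml_text)
    documents = [
        (d.get("description"), d.get("file_type"), d.get("relative_location"))
        for d in activity.findall("sample_document")
    ]
    return {
        "name": activity.get("name"),
        "blank_document": activity.get("enable_blank_document") == "yes",
        "description": activity.findtext("description"),
        "engine_settings": dict(activity.find("engine_settings").attrib),
        "documents": documents,
    }


activity_info = read_activity(ACTIVITY_XML)
print(activity_info["name"], activity_info["engine_settings"])
for doc in activity_info["documents"]:
    print(doc)
```

The engine settings name the same workflow and steps as the autotag operation in task.xml, and each sample document tuple gives what the drop-down menu shows, the file type, and the path relative to the task directory.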

    17	  </activity>
    18	</demo>

And now, we're done.