Task XML Use Cases

This document describes use cases for the XML format of the task files (see "Creating a New Task"). The format itself is specified in the reference document.

At the moment, most task XML customizations are quite complex and not yet documented. Here, we focus on the different ways you can define your content annotations.

Defining content annotations

The simplest way to customize the annotations in your task.xml file is to inherit all the structural annotations and add your own content annotations. The roles of the different annotation categories are described elsewhere in the documentation.

When you define your content annotations, you should assign some CSS to distinguish them in the Web UI. Right now, the most appropriate way to do this is to use background colors.

<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: red"/>
  </tag>
</tags>

So here, we've inherited the structure annotations using the inherit_structure attribute, and defined two content annotations, TAG1 and TAG2. We've assigned TAG1 a blue background color, and TAG2 a red background color. Since you're using CSS, you can assign colors using hexadecimal designations as well (or, if you prefer, set a background image, or other wacky things).
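
As a minimal sketch of the hexadecimal form, the two tags above could be written as follows. The specific color values are arbitrary choices for illustration; note the leading #, which CSS requires for hexadecimal colors.

<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: #99CCFF"/><!-- light blue (illustrative value) -->
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: #FF9999"/><!-- light red (illustrative value) -->
  </tag>
</tags>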

One caveat: at the moment, annotation spans are styled on a token-by-token basis. So if, for instance, you want to have a left bracket at the left end of an annotation, and a right bracket at the right end, you can't do that quite yet; you'd end up with each token bracketed.

Defining a single content annotation, partitioned by attribute values

In other situations, you may want to define a single content annotation whose subtypes are distinguished by an attribute value. A common example in language processing is tagging so-called named entities (people, locations, organizations): one widely used scheme assigns a single ENAMEX tag to these entities, and distinguishes among them using the value of the "type" attribute.

MAT supports this through the <attr_set> sub-element of <tag>. Each <attr_set> has a name, by which the alternative is known in the tagging menu in the UI and in the scoring engine. Within each <attr_set> are one or more <attr> elements, each with a name and a value; any annotation which has the appropriate tag, and also the appropriate attributes and values, is considered to be in this attr set. So for the user, the specification below will look exactly like having defined three separate tags: PERSON, LOCATION, ORGANIZATION. Internally, however, the rich annotated document will only have ENAMEX annotations.

<tags inherit_structure="yes">
  <tag name="ENAMEX" category="content">
    <attr_set name="PERSON">
      <attr name="type" value="PERSON"/>
      <ui css="background-color: #CCFF66"/><!-- light green -->
    </attr_set>
    <attr_set name="LOCATION">
      <attr name="type" value="LOCATION"/>
      <ui css="background-color: #FF99CC"/><!-- pink -->
    </attr_set>
    <attr_set name="ORGANIZATION">
      <attr name="type" value="ORGANIZATION"/>
      <ui css="background-color: #99CCFF"/><!-- light blue -->
    </attr_set>
  </tag>
</tags>

Defining keyboard accelerators

The <ui> element also supports keyboard accelerators. These are keys that the user can press when the tagging menu is visible in the UI; pressing one is equivalent to selecting the corresponding menu item. You can add an accelerator using an attribute on the <ui> element:

<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: blue" accelerator="A"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: red" accelerator="B"/>
  </tag>
</tags>

It's probably a good idea to choose the accelerators mnemonically (the first letter of the menu item name is always a good mnemonic, unless of course more than one item starts with the same letter). Be careful, though; MAT doesn't yet ensure that there are no clashes among accelerators.
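
As a sketch of mnemonic accelerators on the ENAMEX scheme described earlier, each alternative below uses the first letter of its menu item name. This assumes that the accelerator attribute behaves the same way on a <ui> element inside an <attr_set> as it does directly under a <tag>; this document does not confirm that, so check it against the reference document before relying on it.

<tags inherit_structure="yes">
  <tag name="ENAMEX" category="content">
    <attr_set name="PERSON">
      <attr name="type" value="PERSON"/>
      <!-- assumption: accelerator is honored inside attr_set -->
      <ui css="background-color: #CCFF66" accelerator="P"/>
    </attr_set>
    <attr_set name="LOCATION">
      <attr name="type" value="LOCATION"/>
      <ui css="background-color: #FF99CC" accelerator="L"/>
    </attr_set>
    <attr_set name="ORGANIZATION">
      <attr name="type" value="ORGANIZATION"/>
      <ui css="background-color: #99CCFF" accelerator="O"/>
    </attr_set>
  </tag>
</tags>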

Changing the annotation foreground font

Sometimes the background color you choose is too dark for the text to be readable, in which case you can use CSS to change the text color:

<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: black; color: white" accelerator="A"/>
  </tag>
</tags>

Remember, the value of the css attribute is really CSS; it's not converted or processed in any way before it's inserted into the CSS rules in the Web UI. The one caveat is that the CSS is applied to each token in the annotated phrase, not to the phrase as a whole.
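
Because the value is passed through as-is, several declarations can be combined, separated by semicolons. The following is only a sketch with arbitrarily chosen properties. Keep in mind that each declaration applies per token, so a property such as a border would be drawn around every token rather than around the phrase.

<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <!-- dark blue background, white bold text; values are illustrative -->
    <ui css="background-color: #000080; color: white; font-weight: bold" accelerator="A"/>
  </tag>
</tags>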

Using cascaded menus for more and less specialized tags

Let's say that you have the following annotations:

<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="MAN" category="content">
    <ui css="background-color: pink"/>
  </tag>
  <tag name="WOMAN" category="content">
    <ui css="background-color: orange"/>
  </tag>
  <tag name="US-LOCATION" category="content">
    <ui css="background-color: gray"/>
  </tag>
  <tag name="FOREIGN-LOCATION" category="content">
    <ui css="background-color: yellow"/>
  </tag>
</tags>

Your annotator is instructed to label people, using PERSON as the annotation if she can't tell which of MAN or WOMAN is applicable. Your preference is to arrange these in a visual hierarchy for the annotator's convenience; you wish to do the same with US-LOCATION and FOREIGN-LOCATION, even though they don't have a common, less specific annotation. Here's what you do:

<tags>
  ...
  <tag_group name="PERSON" children="MAN,WOMAN"/>
  <tag_group name="LOCATION" children="US-LOCATION,FOREIGN-LOCATION"/>
</tags>

The tag group can reference an existing annotation (as in the PERSON case) or create its own group (as in the LOCATION case). The effect of these groups will be to create submenus in the annotation popup in the MAT UI.
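
Put together, the complete <tags> element might look like the sketch below. It assumes that the elided portion of the previous listing holds the five tag definitions from the start of this section, and that the <tag_group> elements follow them.

<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="MAN" category="content">
    <ui css="background-color: pink"/>
  </tag>
  <tag name="WOMAN" category="content">
    <ui css="background-color: orange"/>
  </tag>
  <tag name="US-LOCATION" category="content">
    <ui css="background-color: gray"/>
  </tag>
  <tag name="FOREIGN-LOCATION" category="content">
    <ui css="background-color: yellow"/>
  </tag>
  <!-- PERSON is an existing annotation; LOCATION exists only as a group -->
  <tag_group name="PERSON" children="MAN,WOMAN"/>
  <tag_group name="LOCATION" children="US-LOCATION,FOREIGN-LOCATION"/>
</tags>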

Customizing autotagging

If you use the autotag option in the UI, the UI will use token boundaries as the required boundaries for the phrase matches, if you tokenize the document. If you don't tokenize, autotagging will use whitespace as the delimiter; as a result, if you've tagged the first "George" in the fragment below, the second "George" will not be autotagged, because it's delimited on the right by a comma, not whitespace:

I asked George to join me, but George, being shy, said no.

If you want to modify this behavior, you can use the tokenless_autotag_delimiters attribute of the web_customization element in task.xml:

<task name="...">
  ...
  <web_customization tokenless_autotag_delimiters=","/>
  ...
</task>

The value here is a sequence of characters, any one of which will count as a delimiter.

Note that this technique is fairly blunt; it will add commas to the eligible delimiter set at the start and the end of the candidate match, and does not distinguish among different uses of commas; in order to do something like that, you really need a sophisticated tokenizer.
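
Since the value is a sequence of characters, several punctuation marks can be listed together. A sketch, in which the particular set of characters is an arbitrary choice for illustration:

<task name="...">
  ...
  <!-- comma, period, semicolon, colon, exclamation mark and question mark each count as a delimiter -->
  <web_customization tokenless_autotag_delimiters=",.;:!?"/>
  ...
</task>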

Finally, note that because this is an XML attribute, you have to escape any XML-significant characters. For instance, if you want both single and double quotes to count as delimiters, you'll need to do something like this:

<task name="...">
  ...
  <web_customization tokenless_autotag_delimiters=",'&quot;"/>
  ...
</task>
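
The same rule applies to the other XML-significant characters. As a sketch, the following adds the ampersand and the angle brackets as delimiters, written as the standard XML entities; the choice of characters is only for illustration.

<task name="...">
  ...
  <!-- &amp; is "&", &lt; is "<", &gt; is ">" once the attribute is parsed -->
  <web_customization tokenless_autotag_delimiters=",&amp;&lt;&gt;"/>
  ...
</task>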

Sharing content annotations among multiple tasks

Let's say you have a set of content annotations, and you want to use this set in multiple languages, and these languages have different model build configurations, tag steps, tokenizers, etc. You can certainly define a variety of workflows (e.g., "English annotation", "French annotation") within the same task, but if the languages don't share a text direction (e.g., English vs. Arabic), there's no way to assign the behavior correctly within a single task. In addition, you may simply want to encapsulate the differences more cleanly. The right way to do this is to define multiple tasks within the task.xml file, and set up parent-child relationships so that, e.g., the content annotations are inherited. For instance:

<tasks>
  <task name="Named Entity" visible="no">
    <tags inherit_structure="yes">
      <tag name="PERSON" category="content">
        <ui css="background-color: #CCFF66" accelerator="P"/><!-- light green -->
      </tag>
      <tag name="LOCATION" category="content">
        <ui css="background-color: #FF99CC" accelerator="L"/><!-- pink -->
      </tag>
      <tag name="ORGANIZATION" category="content">
        <ui css="background-color: #99CCFF" accelerator="O"/><!-- light blue -->
      </tag>
    </tags>
    ...
  </task>
  <task name="English Named Entity" parent="Named Entity">
    <tags inherit_structure="yes" inherit_content="yes"/>
    ...
  </task>
  <task name="Arabic Named Entity" parent="Named Entity">
    <tags inherit_structure="yes" inherit_content="yes"/>
    <web_customization text_right_to_left="yes"/>
    ...
  </task>
</tasks>

Just specify the workflows, steps, step implementations and workspaces in the child tasks as you would normally do. You'll probably want to mark the parent as not visible (as shown here), so that it won't appear as an available task in the UI or for the command-line tools.