This document describes use cases for the XML format of the task files
(see "Creating a New Task"). The format itself is specified in the
reference document.
At the moment, most of the task XML customizations are quite
complex, and not yet documented. Here, we focus on the ways that you
can define and customize your content annotations.
The simplest way to customize the annotations in your task.xml file
is to inherit all the structural annotations and add your own content
annotations. The role of the different annotation
categories is described here.
When you define your content annotations, you should assign some CSS
to distinguish them in the Web UI. Right now, the most appropriate way
to do this is to use background colors.
<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: red"/>
  </tag>
</tags>
So here, we've inherited the structural annotations using the
inherit_structure attribute, and defined two content annotations, TAG1
and TAG2. We've assigned TAG1 a blue background color, and TAG2 a red
background color. Since you're using CSS, you can assign colors using
hexadecimal designations as well (or, if you prefer, set a background
image, or other wacky things).
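As a sketch of the hexadecimal form, the same two tags might be written
as follows. The specific color values are arbitrary choices made for
illustration; nothing in MAT requires them.

```xml
<tags inherit_structure="yes">
  <!-- The hex color values below are illustrative assumptions. -->
  <tag name="TAG1" category="content">
    <ui css="background-color: #99CCFF"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: #FF9999"/>
  </tag>
</tags>
```

Note the leading "#" on each value; CSS requires it for hexadecimal
colors.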
One caveat: at the moment, annotation spans are styled on a
token-by-token basis. So if, for instance, you want to have a left
bracket at the left end of an annotation, and a right bracket at the
right end, you can't do that quite yet; you'd end up with each token
bracketed.
In other situations, you may want to define a single content
annotation whose variants are distinguished by an attribute value. A
common example in language processing is tagging for so-called
named entities (people, locations, organizations). One widely used
tagging scheme assigns a single ENAMEX tag to these entities, and
distinguishes among them using the value of the "type" attribute.
There's no problem doing this in MAT; we use the <attr_set>
sub-element of <tag>. Each <attr_set> has a name, by which the
alternative is known in the tagging menu in the UI, and also in the
scoring engine. Within each <attr_set> are one or more <attr>
elements, each of which has a name and a value; any annotation which
has the appropriate tag, and also the appropriate attributes and
values, will be considered to be in this attr set. So
for the user, the specification below will look exactly like having
defined three separate tags: PERSON, LOCATION, ORGANIZATION. However,
internally, the rich annotated document will only have ENAMEX
annotations.
<tags inherit_structure="yes">
  <tag name="ENAMEX" category="content">
    <attr_set name="PERSON">
      <attr name="type" value="PERSON"/>
      <ui css="background-color: #CCFF66"/><!-- light green -->
    </attr_set>
    <attr_set name="LOCATION">
      <attr name="type" value="LOCATION"/>
      <ui css="background-color: #FF99CC"/><!-- pink -->
    </attr_set>
    <attr_set name="ORGANIZATION">
      <attr name="type" value="ORGANIZATION"/>
      <ui css="background-color: #99CCFF"/><!-- light blue -->
    </attr_set>
  </tag>
</tags>
The <ui> element also supports the option of having keyboard
accelerators. These are keys that the user can press when the tagging
menu is visible in the UI, which are equivalent to having selected that
menu item. You can add an accelerator using an attribute on the
<ui> element:
<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: blue" accelerator="A"/>
  </tag>
  <tag name="TAG2" category="content">
    <ui css="background-color: red" accelerator="B"/>
  </tag>
</tags>
It's probably a good idea to choose the accelerators mnemonically
(the first letter of the menu item name is always a good mnemonic,
unless of course more than one item starts with the same letter). Be
careful, though; MAT doesn't yet ensure that there are no clashes among
accelerators.
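To illustrate the clash problem, here is a minimal sketch with two
hypothetical tags, PERSON and PLACE, whose names start with the same
letter. Since MAT won't catch a duplicate, the second tag is given a
different key by hand; the choice of "L" is just one possibility.

```xml
<tags inherit_structure="yes">
  <!-- PERSON and PLACE are hypothetical tag names. Both start with "P",
       so only one of them can use "P" as its accelerator. -->
  <tag name="PERSON" category="content">
    <ui css="background-color: blue" accelerator="P"/>
  </tag>
  <tag name="PLACE" category="content">
    <ui css="background-color: red" accelerator="L"/>
  </tag>
</tags>
```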
Sometimes the color you choose is too dark to see the text, in which
case you can use CSS to change the text color:
<tags inherit_structure="yes">
  <tag name="TAG1" category="content">
    <ui css="background-color: black; color: white" accelerator="A"/>
  </tag>
</tags>
Remember, the value of the css attribute is really CSS; it's not
converted or processed in any way before it's inserted into the CSS
rules in the Web UI. The one caveat is that the CSS is applied to each
token in the annotated phrase, not to the phrase as a whole.
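Because the value is passed through as-is, several CSS declarations can
be combined in one attribute, separated by semicolons. The following is
only a sketch: the tag name and the particular declarations are
assumptions chosen for illustration, and how well a given property
renders will depend on the per-token styling described above.

```xml
<tags inherit_structure="yes">
  <!-- TAG1 and the declarations below are illustrative assumptions.
       Each declaration is applied to every token in the annotated
       phrase, not to the phrase as a whole. -->
  <tag name="TAG1" category="content">
    <ui css="background-color: #333366; color: white; font-weight: bold"/>
  </tag>
</tags>
```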
Let's say that you have the following annotations:
<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="MAN" category="content">
    <ui css="background-color: pink"/>
  </tag>
  <tag name="WOMAN" category="content">
    <ui css="background-color: orange"/>
  </tag>
  <tag name="US-LOCATION" category="content">
    <ui css="background-color: gray"/>
  </tag>
  <tag name="FOREIGN-LOCATION" category="content">
    <ui css="background-color: yellow"/>
  </tag>
</tags>
Your annotator is instructed to label people, using PERSON as the
annotation if she can't tell which of MAN or WOMAN is applicable. Your
preference is to arrange these in a visual hierarchy for the
annotator's convenience; you wish to do the same with US-LOCATION and
FOREIGN-LOCATION, even though they don't have a common, less specific
annotation. Here's what you do:
<tags>
  ...
  <tag_group name="PERSON" children="MAN,WOMAN"/>
  <tag_group name="LOCATION" children="US-LOCATION,FOREIGN-LOCATION"/>
</tags>
The tag group can reference an existing annotation (as in the PERSON
case) or create its own group (as in the LOCATION case). The effect of
these groups will be to create submenus in the annotation popup in the
MAT UI.
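Putting the two pieces together for the PERSON case, a reduced sketch
might look like the following. It assumes that the <tag_group> element
sits alongside the <tag> definitions inside the same <tags> block, as
the fragment above suggests; treat it as an illustration rather than a
verified task file.

```xml
<tags inherit_structure="yes">
  <tag name="PERSON" category="content">
    <ui css="background-color: blue"/>
  </tag>
  <tag name="MAN" category="content">
    <ui css="background-color: pink"/>
  </tag>
  <tag name="WOMAN" category="content">
    <ui css="background-color: orange"/>
  </tag>
  <!-- Assumption: the group is declared after the tags it refers to.
       PERSON already exists as a tag, so the group reuses it and
       places MAN and WOMAN in a submenu beneath it. -->
  <tag_group name="PERSON" children="MAN,WOMAN"/>
</tags>
```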
If you use the autotag option in the UI and you've tokenized the
document, the UI will require phrase matches to fall on token
boundaries. If you don't
tokenize, autotagging will use whitespace as the delimiter; as a
result, if you've tagged the first "George" in the fragment below, the
second "George" will not be autotagged, because it's delimited on the
right by a comma, not whitespace:
I asked George to join me, but George, being shy, said no.
If you want to modify this behavior, you can use the
tokenless_autotag_delimiters property of the web_customization element
in task.xml:
<task name="...">
  ...
  <web_customization tokenless_autotag_delimiters=","/>
  ...
</task>
The value here is a sequence of characters, any one of which will
count as a delimiter.
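For example, a sketch that treats several punctuation characters as
delimiters might look like this. The particular set of characters is an
arbitrary choice for illustration.

```xml
<task name="...">
  ...
  <!-- Comma, period, semicolon and colon each count as a delimiter;
       the set of characters is an illustrative assumption. -->
  <web_customization tokenless_autotag_delimiters=",.;:"/>
  ...
</task>
```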
Note that this technique is fairly blunt; it will add commas to the
eligible delimiter set at the start and the end of the candidate match,
and does not distinguish among different uses of commas; in order to do
something like that, you really need a sophisticated tokenizer.
Finally, note that because this is an XML attribute, you have to
escape any XML-significant characters. For instance, if you want both
single and double quotes to count as delimiters, you'll need to do
something like this:
<task name="...">
  ...
  <web_customization tokenless_autotag_delimiters=",'&quot;"/>
  ...
</task>
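The same rule applies to the other XML-significant characters. As a
sketch, if you wanted an ampersand and a left angle bracket to count as
delimiters along with the comma, the standard XML entities &amp;amp; and
&amp;lt; would be needed; the choice of characters here is purely
illustrative.

```xml
<task name="...">
  ...
  <!-- After XML parsing, the attribute value is the three characters
       comma, ampersand, left angle bracket. -->
  <web_customization tokenless_autotag_delimiters=",&amp;&lt;"/>
  ...
</task>
```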
Let's say you have a set of content annotations, and you want to use
this set in multiple languages, and these languages have different
model build configurations, tag steps, tokenizers, etc. You can
certainly define a variety of workflows (e.g., "English annotation",
"French annotation") within the same task, but if the languages
don't share a text direction (e.g., English vs. Arabic), there's no way
to assign the behavior correctly within a single task. In addition, you
may simply want to encapsulate the differences more cleanly. The right
way to do this is to define multiple tasks within the task.xml file,
and set up parent-child relationships so that, e.g., the content
annotations are inherited. For instance:
<tasks>
  <task name="Named Entity" visible="no">
    <tags inherit_structure="yes">
      <tag name="PERSON" category="content">
        <ui css="background-color: #CCFF66" accelerator="P"/><!-- light green -->
      </tag>
      <tag name="LOCATION" category="content">
        <ui css="background-color: #FF99CC" accelerator="L"/><!-- pink -->
      </tag>
      <tag name="ORGANIZATION" category="content">
        <ui css="background-color: #99CCFF" accelerator="O"/><!-- light blue -->
      </tag>
    </tags>
    ...
  </task>
  <task name="English Named Entity" parent="Named Entity">
    <tags inherit_structure="yes" inherit_content="yes"/>
    ...
  </task>
  <task name="Arabic Named Entity" parent="Named Entity">
    <tags inherit_structure="yes" inherit_content="yes"/>
    <web_customization text_right_to_left="yes"/>
    ...
  </task>
</tasks>
Just specify the workflows, steps, step implementations and
workspaces in the child tasks as you would normally do. You'll probably
want to mark the parent as not visible (as shown here), so that it
won't appear as an available task in the UI or for the command-line
tools.