Annotation set descriptor XML use cases

This document describes use cases for the XML format of the annotation set descriptors in task files (see "Creating a New Task"). The reference document is found here.

At the moment, most task XML customizations are quite complex and not yet documented. Here, we focus on the different ways you can define your content annotations. For examples of how to customize the UI display of your annotations, see here.

Defining content annotations
Defining spanless content annotations
Defining attributes
Defining a single content annotation, partitioned by attribute values
Defining complex annotation-valued attributes
Defining relations among annotations

Defining content annotations

The simplest example of customizing your annotations in your task.xml file is inheriting all your structural annotations and adding your own content annotations. The role of the different annotation categories is described here.

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      <annotation label="TAG1"/>
      <annotation label="TAG2"/>
    </annotation_set_descriptor>
  </annotation_set_descriptors>

So here, we've inherited the structural annotations (through the inherit attribute) and defined two content annotations, TAG1 and TAG2. Both content annotations are spanned annotations, which is the default.

Defining spanless content annotations

Not all content annotations are spanned annotations; some annotations aren't anchored directly to the text. You can find examples of such annotations in Tutorial 8. It's easy to define these annotations:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="SPANLESS1" span="no"/>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The UI effects of defining spanless annotations are described here.

Defining attributes

Annotations, spanned or spanless, can have attributes. These attributes can be strings (the default), floats, integers, Booleans, or other annotations, or sets or lists of these types. Here's how to define a simple string attribute:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <attribute of_annotation="TAG1" name="string_attr"/>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

String attributes can have default values, or choices:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <attribute of_annotation="TAG1" name="string_attr" default="Pronoun">
        <choice>Pronoun</choice>
        <choice>Nominal</choice>
        <choice>Proper name</choice>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

Integer and float attributes can be defined with accepted ranges:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <attribute of_annotation="TAG1" type="int" name="int_attr">
        <range from="10" to="20"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>
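
Float attributes should follow the same pattern. The fragment below is only a sketch and is not taken from the reference document: it assumes that the type value is spelled "float" (by analogy with "int") and that range bounds may be decimal values. The attribute name float_attr is hypothetical.

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <!-- assumption: type="float" is the spelling for float attributes -->
      <attribute of_annotation="TAG1" type="float" name="float_attr">
        <range from="0.0" to="1.0"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>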

Annotation-valued attributes must have label restrictions that specify which types of annotations can fill the attribute value (more examples here):

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <attribute of_annotation="TAG1" type="annotation" name="annot_attr">
        <label_restriction label="TAG2"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

And any of these attributes can be set or list aggregations:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <attribute of_annotation="TAG1" type="annotation" aggregation="set" name="mentions">
        <label_restriction label="TAG2"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>
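
A list aggregation would presumably be declared the same way. This sketch assumes that the aggregation value is spelled "list" (by analogy with "set"); the attribute name aliases is hypothetical, and its type is left at the default (string).

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <!-- assumption: aggregation="list" declares a list of strings -->
      <attribute of_annotation="TAG1" aggregation="list" name="aliases"/>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>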

Defining a single content annotation, partitioned by attribute values

In some situations, you may want to define a single content annotation which has a distinguished attribute value. A common example in language processing arises in tagging so-called named entities (people, locations, organizations). One tagging scheme assigns a single ENAMEX tag to these entities, and distinguishes among them using the value of the "type" attribute. Each label + attribute/value pair is assigned a notional name, for use in the UI, scorer, etc. We call these effective labels.

Effective labels must be defined on choice restrictions of string or integer attributes. If an effective label is declared for one of the choices, there must be a declaration for all of them. In other words, the choices must completely partition the label.

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="ENAMEX"/>
      <attribute of_annotation="ENAMEX" name="type">
        <choice effective_label="PERSON">PER</choice>
        <choice effective_label="ORGANIZATION">ORG</choice>
        <choice effective_label="LOCATION">LOC</choice>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>
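
Because effective labels can also be declared on integer choices, the same pattern should apply to an integer attribute. The following is only a sketch: the SCORE label, the level attribute, and the effective label names are all hypothetical. Note that every choice carries an effective_label, so the choices completely partition the label.

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <!-- hypothetical label, partitioned by an integer choice attribute -->
      <annotation label="SCORE"/>
      <attribute of_annotation="SCORE" type="int" name="level">
        <choice effective_label="LOW_SCORE">0</choice>
        <choice effective_label="HIGH_SCORE">1</choice>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>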

Defining complex annotation-valued attributes

You can define complex restrictions on annotation-valued attributes in a number of ways. These restrictions consist of a label and its attributes; the attributes must be choice attributes (i.e., string or integer attributes with choices defined). The availability of these restrictions is independent of whether an effective label is defined for the attribute.

Here's an example fragment. It starts with the effective label attribute definition from the previous example, and adds a second (nonsensical) integer choice attribute:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="ENAMEX"/>
      <attribute of_annotation="ENAMEX" name="type">
        <choice effective_label="PERSON">PER</choice>
        <choice effective_label="ORGANIZATION">ORG</choice>
        <choice effective_label="LOCATION">LOC</choice>
      </attribute>
      <attribute of_annotation="ENAMEX" type="int" name="size">
        <choice>0</choice>
        <choice>1</choice>
      </attribute>
      <!-- and now, the annotation-valued attribute -->
      <annotation label="LOCATED"/>
      <attribute of_annotation="LOCATED" name="who" type="annotation">
        <label_restriction label="ENAMEX">
          <attributes type="PERSON" size="1"/>
        </label_restriction>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The label restriction itself can refer either to a true or an effective label, and the effective label can be combined with additional attribute restrictions:

      <annotation label="LOCATED"/>
      <attribute of_annotation="LOCATED" name="who" type="annotation">
        <label_restriction label="PERSON">
          <attributes size="1"/>
        </label_restriction>
      </attribute>
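
Since any attribute can be a set or list aggregation, a restriction on an effective label can presumably be used for a set-valued attribute as well. This is a sketch under that assumption, not a documented combination; the attribute name residents is hypothetical.

      <annotation label="LOCATED"/>
      <!-- assumption: aggregation="set" combines with attribute restrictions -->
      <attribute of_annotation="LOCATED" name="residents" type="annotation" aggregation="set">
        <label_restriction label="PERSON">
          <attributes size="1"/>
        </label_restriction>
      </attribute>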

Defining relations among annotations

As you can see from the previous example, you can use annotation-valued attributes and label restrictions to create relations among annotations. This is the only facility that MAT provides for making these connections. We acknowledge that this approach has limitations:

There's no facility for creating "unnamed" spans as attribute fillers (although there are good reasons to want such a capability, for marking head-extent relations, for instance).
There's no explicit facility for making annotation-valued attributes "optional". Leaving an attribute unfilled is the recommended strategy. (Although, again, in the UI, there are good reasons to want to be able to hide unfilled attributes.)
There's no facility for providing any sort of structure for attributes, e.g., a subtype. If the attribute requires structure, it should be implemented as its own relation. (Again, there are good reasons to want to do this otherwise, e.g., complex modifiers like temporal modifiers.)
There's no facility for labeling an annotation type as being a singleton (e.g., for document-level annotations), or as many-to-one or one-to-many (although, again, we recognize that both these features would be valuable).

So, for instance, how might you represent an array of time restrictions (before, after, etc.) on an event? Here are three different strategies.

Strategy 1: a slot for each restriction type

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="TIME"/>
      <!-- this annotation can be spanned or spanless -->
      <annotation label="EVENT"/>
      <!-- if you're anticipating multiple times for a restriction type, 
           make these set aggregations -->
      <attribute of_annotation="EVENT" type="annotation" name="before">
        <label_restriction label="TIME"/>
      </attribute>
      <attribute of_annotation="EVENT" type="annotation" name="after">
        <label_restriction label="TIME"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The obvious problem with this strategy is that you might have many, many temporal relations you care about, and/or you may want to provide attributes for the temporal relations.
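
If you do expect multiple times per restriction type, the comment in the Strategy 1 fragment suggests set aggregations. A sketch of that variant, reusing the aggregation="set" declaration shown earlier for annotation-valued attributes:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="TIME"/>
      <annotation label="EVENT"/>
      <!-- each slot now holds a set of TIME annotations -->
      <attribute of_annotation="EVENT" type="annotation" aggregation="set" name="before">
        <label_restriction label="TIME"/>
      </attribute>
      <attribute of_annotation="EVENT" type="annotation" aggregation="set" name="after">
        <label_restriction label="TIME"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

This addresses multiple fillers per slot, but each new restriction type still needs its own attribute.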

Strategy 2: separate relations for time

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="TIME"/>
      <!-- this annotation can be spanned or spanless -->
      <annotation label="EVENT"/>
      <!-- so can this annotation -->
      <annotation label="BEFORE"/>
      <attribute of_annotation="BEFORE" type="annotation" name="event">
        <label_restriction label="EVENT"/>
      </attribute>
      <attribute of_annotation="BEFORE" type="annotation" name="time">
        <label_restriction label="TIME"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

You could generalize this strategy by having a single temporal relation with an attribute to indicate what kind of temporal relation it is:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="TIME"/>
      <!-- this annotation can be spanned or spanless -->
      <annotation label="EVENT"/>
      <!-- so can this annotation -->
      <annotation label="TEMPORAL"/>
      <attribute of_annotation="TEMPORAL" type="annotation" name="event">
        <label_restriction label="EVENT"/>
      </attribute>
      <attribute of_annotation="TEMPORAL" type="annotation" name="time">
        <label_restriction label="TIME"/>
      </attribute>
      <attribute of_annotation="TEMPORAL" name="type">
        <choice>BEFORE</choice>
        <choice>AFTER</choice>
        ...
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The obvious problem with this strategy is that the temporal relations are separated from the events they modify (because there's no way of showing or representing relations as subordinate attributes).

Strategy 3: subordinate the temporal relation

This strategy is a combination of the first two:

  <annotation_set_descriptors inherit="category:zone,category:token">
    <annotation_set_descriptor name="content" category="content">
      ...
      <annotation label="TIME"/>
      <!-- this annotation can be spanned or spanless -->
      <annotation label="EVENT"/>
      <!-- so can this annotation -->
      <annotation label="TEMPORAL"/>
      <attribute of_annotation="TEMPORAL" type="annotation" name="time">
        <label_restriction label="TIME"/>
      </attribute>
      <attribute of_annotation="TEMPORAL" name="type">
        <choice>BEFORE</choice>
        <choice>AFTER</choice>
        ...
      </attribute>
      <attribute of_annotation="EVENT" type="annotation" aggregation="set" name="temporal">
        <label_restriction label="TEMPORAL"/>
      </attribute>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The distinction here is subtle: instead of TEMPORAL being a two-place relation between an event and a time, it has only one argument, the time, and its relation to the EVENT is represented by its presence in the "temporal" set-aggregation annotation-valued attribute.

The obvious disadvantage to this strategy is that it doesn't correspond trivially to what we'd think of as the "correct event logic". However, given that we're talking about annotations, not objects in a knowledge representation, it might ultimately be the proper compromise.