The MAT toolkit is designed to be loosely coupled, using documents
at rest (rather than data structures in process) as the common data
interface. The default format for rich annotated documents is described
in this document; the full range of available readers and writers is
described elsewhere.
The MAT document format is built on top of JavaScript Object Notation (JSON). It is
simple and lightweight, and unlike XML, is designed for typed data.
This format closely mirrors the structure of the documents themselves,
so it's worth reviewing in any case, but especially if
you want to process MAT-annotated documents outside of the MAT toolkit.
Currently, we provide facilities for rendering and digesting this
format in Python and Java; if you want to manipulate this format in any
other programming language, you'll have to write the renderer/digester
yourself.
JSON is so called because it's a subset of the JavaScript
programming language, and thus exceptionally well-suited for passing
data to and from Web applications like the MAT UI. JSON contains hashes
(curly brackets), lists (square brackets), Unicode strings, integers and
floats, plus the constants null, true, and false. Whitespace is not
significant except within strings. That's it.
It's important to remember that JSON is not a data structure; it's a
string representation of data structures. There are JSON libraries for
reading and writing JSON strings and mapping them to native data
structures. E.g., in Python, hashes are mapped to dictionaries, strings
to strings, lists to lists, null to None, true to True, and false to
False.
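As a small illustration of that mapping, here is a sketch using Python's standard json module. The JSON string is made up for the example and is not a MAT document.

```python
import json

# A made-up JSON string exercising each kind of JSON value.
text = ('{"name": "PERSON", "span": [7, 22], "score": 0.5, '
        '"gender": null, "checked": true, "hidden": false}')

# Decode: JSON string -> native Python data structures.
data = json.loads(text)

print(type(data).__name__)              # dict
print(data["span"])                     # [7, 22]
print(data["gender"] is None)           # True
print(data["checked"], data["hidden"])  # True False

# Encode: native structures -> JSON string, then decode again.
round_trip = json.loads(json.dumps(data))
```

The decoded value is an ordinary dictionary; the JSON string itself is only the serialized form.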
<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 1 }
<aset_list>: [ <aset_entry>* ]
<aset_entry>: { "type": <string>, "attrs": [ <string>* ], "annots": <annot_list> }
<annot_list>: [ [ <int>, <int>, <string>* ]* ]
The value of "signal" is the document contents; once the document
has any annotations at all, the signal should not be changed.
The "version" key is optional in this initial version of the
format. Decoders should assume that the version is 1 if the key is
missing, and should raise an error if the version is later than the
version they're designed to handle. The version number will change as
the document format evolves.
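The version rule above might be sketched in Python as follows. The names SUPPORTED_VERSION and check_version are hypothetical, chosen for illustration; they are not part of the MAT API.

```python
SUPPORTED_VERSION = 1  # highest format version this hypothetical decoder handles


def check_version(doc):
    """Return the format version of a decoded document (a dict).

    A missing "version" key is treated as version 1; a version later
    than SUPPORTED_VERSION raises an error.
    """
    version = doc.get("version", 1)
    if version > SUPPORTED_VERSION:
        raise ValueError(
            "document version %r is later than supported version %r"
            % (version, SUPPORTED_VERSION)
        )
    return version


# A document without a "version" key is assumed to be version 1.
print(check_version({"signal": "", "metadata": {}, "asets": []}))  # 1
```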
The <metadata> is a hash whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document, and to record some display metadata about the various tags (e.g., what color to use for the tag, or whether the annotation label is a content annotation or not). As a rule of thumb, if you modify a document, make sure the metadata is preserved.
The <aset_list> is a sequence of entries, one for each
annotation type. Each <aset_entry> specifies the name of a tag
(e.g., "PERSON"), a list of attributes which can be filled (e.g.
["gender"]), and a list of annotations. Each element of the list of
annotations contains two integers, which are 0-based indexes into the
signal representing the start and end of
the annotation span, respectively, plus attribute values. The
value of "attrs" (the attribute names) and the list of values after
the first two integers in each element in the value of "annots" (the
attribute values) are essentially parallel; the list of attribute values
may be no longer than the list of attribute names, and the two are paired
with each other in order until the values are exhausted, at which point
all remaining attributes should be treated as having the value null.
The reason for allowing annot lists which are shorter than the list of
attribute names is partially for space efficiency, and partially to
support the option of adding a new attribute to an annotation type
without having to go to the trouble of adding a null to every instance of that
annotation type.
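The pairing rule can be sketched in Python as below. The function name expand_annotations and the CITY entry are made up for illustration; the entry follows the <aset_entry> structure as it would look after JSON decoding.

```python
def expand_annotations(aset):
    """Yield (start, end, attribute_dict) for each annotation in one entry."""
    names = aset["attrs"]
    for annot in aset["annots"]:
        start, end, values = annot[0], annot[1], annot[2:]
        if len(values) > len(names):
            raise ValueError("more attribute values than attribute names")
        # Missing trailing values are implicit nulls (None in Python).
        padded = list(values) + [None] * (len(names) - len(values))
        yield start, end, dict(zip(names, padded))


# Hypothetical entry for the signal "Paris and Berlin".
aset = {
    "type": "CITY",
    "attrs": ["country", "size"],
    "annots": [[0, 5, "FR", "large"], [10, 16, "DE"]],
}

rows = list(expand_annotations(aset))
print(rows[0])  # (0, 5, {'country': 'FR', 'size': 'large'})
print(rows[1])  # (10, 16, {'country': 'DE', 'size': None})
```

Padding at read time means that a newly added attribute name needs no change to annotations that already exist.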
Here's a sample document:
{"signal": "I like Michael Jackson and Janet Jackson.",
"asets": [ {"type": "PERSON",
"attrs": ["gender", "number"],
"annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
"metadata": {}
}
In this example, note that the value of the "gender" attribute for
the PERSON annotation spanning "Michael Jackson" is null, and the value of the
"number" attribute is "singular". For the PERSON annotation
spanning "Janet Jackson" only the "gender" attribute is specified (it
is "female"), implying that the "number" attribute for this annotation
is null. This
illustrates how the MAT document format allows the specified list of
attribute values to be shorter than the list of attribute names (with
implicit nulls making up the difference).
To write a document to a MAT JSON document file, convert your
document object to the appropriate data structures in your programming
language, render the structure to JSON, and write the string to a file,
using the UTF-8 character encoding. To read a document, read the
contents of the file using the UTF-8 character encoding, decode the
string into the matching data structures in your programming language,
and convert those data structures into your document object.
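A minimal Python sketch of those two steps follows, assuming the document is already held as a plain dictionary in the structure described above. The conversion to and from your own document object is application-specific and omitted; the function names and the file name are hypothetical.

```python
import json
import os
import tempfile


def write_mat_json(doc, path):
    """Render a document dict to JSON and write it as UTF-8."""
    with open(path, "w", encoding="utf-8") as fp:
        # ensure_ascii=False writes non-ASCII characters directly as UTF-8
        # instead of \uXXXX escapes; both forms are valid JSON.
        json.dump(doc, fp, ensure_ascii=False)


def read_mat_json(path):
    """Read a UTF-8 file and decode its JSON contents into a dict."""
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


doc = {
    "signal": "Zo\u00eb likes Z\u00fcrich.",
    "metadata": {},
    "asets": [{"type": "PERSON", "attrs": [], "annots": [[0, 3]]}],
    "version": 1,
}
path = os.path.join(tempfile.mkdtemp(), "example.json")
write_mat_json(doc, path)
restored = read_mat_json(path)
print(restored == doc)  # True
```

Note that the "metadata" value passes through the round trip untouched.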
The character encoding of a MAT JSON document is always UTF-8.
It's important to remember, always, what the index offsets in
the annotations represent: they are character offsets, independent of
the particular character encoding. (If you don't understand the
distinction, we recommend you read Joel Spolsky's
Unicode primer.) We've chosen UTF-8 as our encoding for MAT JSON
documents because it is flexible enough to encode all Unicode
characters with good efficiency, and it's a proper superset of ASCII.
So if an annotation covers the span from index 7 to index 22, as the
first annotation does in our example above, this means "from the
character at index 7 (where 0 is the first character) up to, but not
including, the character at index 22". It does not mean "from byte 7 of
the document to byte 22 of the document".
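The difference is easy to see in Python 3, where str values are indexed by character and bytes values by byte. The signal below is made up for illustration.

```python
# Made-up signal containing two non-ASCII characters.
signal = "Zo\u00eb likes Z\u00fcrich."
start, end = 10, 16  # character offsets of the second name

# Correct: slice the decoded string by character offsets.
print(signal[start:end])  # Zürich

# Incorrect: slice the UTF-8 bytes with the same numbers.
encoded = signal.encode("utf-8")
wrong = encoded[start:end].decode("utf-8", errors="replace")
print(repr(wrong))  # ' Züri'

# The character count and the byte count differ.
print(len(signal), len(encoded))  # 17 19
```

Each of the two accented characters occupies two bytes in UTF-8, so every byte offset after the first one is shifted relative to the character offset.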
This can lead to tremendous confusion in counting offsets, depending
on how your programming language treats Unicode strings. For instance,
JavaScript has UTF-16 strings, which means that most Unicode characters
take up exactly 2 bytes, and for those characters the 2-byte
numeric value is the same as the Unicode code point for that character.
For characters whose Unicode code point is greater than 65535 (that is,
larger than can be represented in 2 bytes), UTF-16 has a system of
"surrogates", which are pairs of 2-byte sequences reserved
for representing these larger code points. So in some cases, a Unicode
character takes up 2 bytes in UTF-16, and in other cases, it takes up 4
bytes. Now, this case is fairly unusual; most characters in common use
in living human languages are in the 2-byte space, but the Unicode
code point space extends up to U+10FFFF, so there are quite a number of
rarer characters (e.g., music notation, dead languages) which appear as
these "surrogate pairs".
The reason this is relevant to programmers is that these surrogate
pairs count as length 2 in JavaScript; that is, the length property of
a string counts UTF-16 elements. Java is similar, except that it has a
separate set of APIs for counting Unicode characters (e.g., charAt()
vs. codePointAt() in Java 1.5).
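The two ways of counting can be compared from Python 3, whose strings are indexed by code point. This is only a sketch; the helper names are hypothetical and it is not how MAT itself handles these characters.

```python
def utf16_length(text):
    """Number of UTF-16 elements in text (what JavaScript's length reports)."""
    return len(text.encode("utf-16-le")) // 2


def codepoint_to_utf16_offset(text, offset):
    """Convert a code-point offset into the matching UTF-16 element offset."""
    return utf16_length(text[:offset])


# U+1D11E MUSICAL SYMBOL G CLEF lies above the 2-byte space.
signal = "A \U0001D11E clef"

print(len(signal))           # 8 code points
print(utf16_length(signal))  # 9 UTF-16 elements

# The word "clef" starts at code-point offset 4, but at UTF-16 offset 5.
print(signal[4:8])                           # clef
print(codepoint_to_utf16_offset(signal, 4))  # 5
```

A consumer that counts UTF-16 elements would need a conversion of this kind before applying annotation offsets to a signal containing such characters.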
Neither the Python nor the JavaScript implementation in MAT
currently handles characters above the 2-byte space correctly.