MAT JSON Document Format

The MAT toolkit is designed to be loosely coupled, using documents at rest (rather than data structures in process) as the common data interface. The default format for rich annotated documents is described in this document; the full range of available readers and writers is described elsewhere.

JSON

The MAT document format is built on top of the Javascript Object Notation (JSON). It is simple and lightweight, and unlike XML, is designed for typed data. This format closely mirrors the structure of the documents themselves, so it's worth reviewing in any case, but especially if you want to process MAT-annotated documents outside of the MAT toolkit. Currently, we provide facilities for rendering and digesting this format in Python and Java; if you want to manipulate this format in any other programming language, you'll have to write the renderer/digester yourself.

JSON is so called because it's based on a subset of the syntax of the Javascript programming language, and is thus exceptionally well-suited for passing data to and from Web applications like the MAT UI. JSON contains hashes (curly brackets), lists (square brackets), Unicode strings, integers and floats, plus the constants null, true, false. Whitespace is not significant except within strings. That's it.

It's important to remember that JSON is not a data structure; it's a string representation of data structures. There are JSON libraries for reading and writing JSON strings and mapping them to native data structures. E.g., in Python, hashes are mapped to dictionaries, strings to strings, lists to lists, null to None, true to True, false to False.
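As a small illustration of this mapping, here is a sketch using Python's standard json module. The sample string is made up for this example and is not a MAT document.

```python
import json

# A JSON string: this is text, not yet a data structure.
text = '{"name": "MAT", "tags": ["a", "b"], "count": 2, "ratio": 0.5, "extra": null, "ok": true}'

# Decoding maps each JSON value onto a native Python type:
# hash -> dict, list -> list, string -> str, null -> None, true -> True.
data = json.loads(text)

# Encoding goes the other way and produces a new JSON string.
round_trip = json.dumps(data)
```

The exact whitespace and key order of round_trip may differ from text, but decoding it again yields an equal data structure.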

The MAT document format

<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 1 }

<aset_list>: [ <aset_entry>* ]

<aset_entry>: { "type": <string>, "attrs": [ <string>* ], "annots": <annot_list> }

<annot_list>: [ [ <int>, <int>, <string>* ]* ]

The value of "signal" is the document contents; once the document has any annotations at all, the signal should not be changed.

The "version" key is optional in this initial version of the format. Decoders should assume that if the key is missing, the version is 1. Decoders should raise an error if the version is later than the version they're designed to handle. The version number will change as the document format evolves.
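The version rule above might be implemented along the following lines. This is only a sketch: the names SUPPORTED_VERSION and check_version are hypothetical and are not part of the MAT toolkit, and the document is assumed to have already been decoded into a Python dictionary.

```python
# Highest format version this hypothetical decoder understands.
SUPPORTED_VERSION = 1


def check_version(doc):
    """Return the format version of a decoded document.

    A missing "version" key is treated as version 1; a version later
    than SUPPORTED_VERSION is rejected with an error.
    """
    version = doc.get("version", 1)
    if version > SUPPORTED_VERSION:
        raise ValueError("unsupported MAT JSON version: %r" % (version,))
    return version
```

A decoder would call check_version(doc) before looking at any other key, so that documents in a newer format fail early rather than being misread.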

The <metadata> is a hash whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document, and to record some display metadata about the various tags (e.g., what color to use for the tag, or whether the annotation label is a content annotation or not). As a rule of thumb, if you are modifying a document, you should make sure the metadata is preserved.

The <aset_list> is a sequence of entries, one for each annotation type. Each <aset_entry> specifies the name of a tag (e.g., "PERSON"), a list of attributes which can be filled (e.g., ["gender"]), and a list of annotations. Each element of the list of annotations contains two integers, which are 0-based indexes into the signal representing the start and end of the annotation span, respectively, followed by attribute values. The value of "attrs" (the attribute names) and the values after the first two integers in each element of "annots" (the attribute values) are essentially parallel: the list of attribute values may be no longer than the list of attribute names, and names and values are paired in order until the values are exhausted, at which point all remaining attributes should be treated as having the value null. Annot lists may be shorter than the list of attribute names partly for space efficiency, and partly so that a new attribute can be added to an annotation type without having to go to the trouble of adding a null to every instance of that annotation type.
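The pairing of attribute names with attribute values can be sketched as follows. The function name annotation_to_dict is hypothetical, chosen for this illustration; it takes the "attrs" list of an entry and one element of its "annots" list, both already decoded from JSON.

```python
def annotation_to_dict(attr_names, annot):
    """Split one element of "annots" into (start, end, attributes).

    attributes maps every attribute name to its value; names with no
    corresponding value are mapped to None (JSON null).
    """
    start, end = annot[0], annot[1]
    values = list(annot[2:])
    if len(values) > len(attr_names):
        raise ValueError("more attribute values than attribute names")
    # Pad the missing trailing values with None, then pair in order.
    padded = values + [None] * (len(attr_names) - len(values))
    return start, end, dict(zip(attr_names, padded))
```

For example, with attribute names ["gender", "number"], the element [0, 5, "female"] yields the attributes {"gender": "female", "number": None}.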

Here's a sample document:

{"signal": "I like Michael Jackson and Janet Jackson.",
 "asets": [ {"type": "PERSON",
             "attrs": ["gender", "number"],
             "annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
 "metadata": {}
}

In this example, note that the value of the "gender" attribute for the PERSON annotation spanning "Michael Jackson" is null, and the value of the "number" attribute is "singular". For the PERSON annotation spanning "Janet Jackson", only the "gender" attribute is specified (it is "female"), implying that the "number" attribute for this annotation is null. This illustrates how the MAT document format allows the list of attribute values to be shorter than the list of attribute names (with implicit nulls making up the difference).

A note about files, signal offsets, JSON strings, and character encodings

To write a document to a MAT JSON document file, convert your document object to the appropriate data structures in your programming language, render the structure to JSON, and write the string to a file, using the UTF-8 character encoding. To read a document, read the contents of the file using the UTF-8 character encoding, decode the string into the matching data structures in your programming language, and convert those data structures into your document object.
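The file-level half of these steps can be sketched in Python 3 as below. The function names write_mat_json and read_mat_json are hypothetical, and the conversion between your own document object and plain dictionaries and lists is left out, since it depends on your application.

```python
import json


def write_mat_json(doc, path):
    """Render a dict-based document to JSON and write it as UTF-8."""
    # ensure_ascii=False writes non-ASCII characters literally instead
    # of as \uXXXX escapes; both forms are valid JSON.
    text = json.dumps(doc, ensure_ascii=False)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_mat_json(path):
    """Read a UTF-8 file and decode its JSON contents."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read())
```

Passing the encoding explicitly matters: without it, Python falls back to a platform-dependent default, which is not guaranteed to be UTF-8.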

The character encoding of a MAT JSON document is always UTF-8.

It's important to remember, always, what the index offsets in the annotations represent: they are character offsets, independent of the particular character encoding. (If you don't understand the distinction, we recommend you read Joel Spolsky's Unicode primer.) We've chosen UTF-8 as our encoding for MAT JSON documents because it is flexible enough to encode all Unicode characters with good efficiency, and it's a proper superset of ASCII. So if an annotation covers the span from index 7 to index 22, as the first annotation does in our example above, this means "from the 7th character of the document (where 0 is the first) to the 21st character of the document (where 0 is the first)"; that is, the start index is inclusive and the end index is exclusive. It does not mean "from the 7th byte of the document to the 21st byte of the document".
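The difference between character offsets and byte offsets is easy to see in Python 3, where str is indexed by character and bytes by byte. The second string below is made up for this illustration; it contains one non-ASCII character, written as an escape so that its encoding is unambiguous.

```python
# The signal from the sample document; slicing a str uses character
# offsets, with the end index excluded.
signal = "I like Michael Jackson and Janet Jackson."
span = signal[7:22]

# A made-up signal with one non-ASCII character (e with diaeresis),
# which occupies two bytes in UTF-8.
accented = "Zo\u00eb likes Michael Jackson."
char_start = accented.index("Michael")                   # character offset
byte_start = accented.encode("utf-8").index(b"Michael")  # byte offset
```

Here span is "Michael Jackson", while char_start and byte_start differ by one, because the accented character counts as one character but two bytes. Annotation offsets in a MAT document always correspond to char_start, never to byte_start.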

This can lead to tremendous confusion in counting offsets, depending on how your programming language treats Unicode strings. For instance, Javascript has UTF-16 strings, which means that most Unicode characters take up exactly 2 bytes, and for those characters the 2-byte numeric value is the same as the Unicode code point. For characters whose Unicode code point is greater than 65535 (that is, larger than can be represented in 2 bytes), UTF-16 has a system of what they call "surrogates", which are pairs of 2-byte sequences reserved for representing these larger code points. So in most cases, a Unicode character takes up 2 bytes in UTF-16, and in other cases, it takes up 4 bytes. The 4-byte case is comparatively unusual: the characters of most living human languages are in the 2-byte space, but the Unicode code point space extends to U+10FFFF, so there are quite a number of characters (e.g., music notation, dead languages, emoji, and some rarer CJK ideographs) which appear as these "surrogate pairs".

The reason this is relevant to programmers is that these surrogate pairs count as length 2 in Javascript; that is, the String.length property counts UTF-16 code units, not characters. Java is similar, except that it has a separate set of APIs for working with Unicode characters (e.g., charAt() vs. codePointAt(), as of Java 1.5).
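To make the counting difference concrete, here is a sketch in Python 3 (3.3 or later, where strings are indexed by code point). The helper names utf16_length and to_utf16_offset are hypothetical and are not part of MAT; they show one way to translate a character offset into the offset a UTF-16-based language such as Javascript would use.

```python
def utf16_length(s):
    """Number of UTF-16 code units needed to represent s."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(s.encode("utf-16-le")) // 2


def to_utf16_offset(signal, index):
    """Convert a character (code point) offset into a UTF-16 offset."""
    return utf16_length(signal[:index])


# MUSICAL SYMBOL G CLEF (U+1D11E) lies outside the 2-byte space.
clef = "\U0001D11E"
text = "a" + clef + "b"
```

In this sketch, len(clef) is 1 but utf16_length(clef) is 2, so the character "b" sits at offset 2 when counting characters and at offset 3 when counting UTF-16 code units.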

Neither the Python nor the Javascript implementations in MAT currently treat these characters above the 2-byte space correctly.