MAT JSON Document Format

The MAT toolkit is designed to be loosely coupled, using documents at rest (rather than data structures in process) as the common data interface. The default format for rich annotated documents is described in this document; the full range of available readers and writers is described elsewhere.

JSON

The MAT document format is built on top of JavaScript Object Notation (JSON). JSON is simple and lightweight and, unlike XML, is designed for typed data. The MAT format closely mirrors the structure of the documents themselves, so it's worth reviewing in any case, but especially if you want to process MAT-annotated documents outside of the MAT toolkit. Currently, we provide facilities for rendering and digesting this format in Python, JavaScript, and Java; if you want to manipulate this format in any other programming language, you'll have to write the renderer/digester yourself.

JSON is so called because it's based on a subset of the JavaScript programming language, and thus exceptionally well-suited for passing data to and from Web applications like the MAT UI. JSON contains hashes (curly brackets), lists (square brackets), Unicode strings, numbers (integers and floats), plus the constants null, true, and false. Whitespace is not significant except within strings. That's it.

It's important to remember that JSON is not a data structure; it's a string representation of data structures. There are JSON libraries for reading and writing JSON strings and mapping them to native data structures. E.g., in Python, hashes are mapped to dictionaries, strings to strings, lists to lists, null to None, true to True, false to False.
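As a small illustration of that mapping, here is a sketch using Python's standard json module. The sample string is made up for this example and is not a MAT document.

```python
import json

# A JSON string: this is text, not a data structure.
text = '{"name": "PERSON", "spans": [7, 22], "id": null, "checked": true}'

# Decoding maps JSON values to native Python objects.
data = json.loads(text)
print(type(data).__name__)            # dict
print(type(data["spans"]).__name__)   # list
print(data["id"], data["checked"])    # None True

# Encoding turns the native structure back into a JSON string.
round_trip = json.dumps(data)
print(json.loads(round_trip) == data)
```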

The MAT document format

The current version number of the MAT document format is 2. All literals specified below are case-sensitive.

<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 2 }

<metadata>: { ... }

<aset_list>: [ <aset_entry>* ]

<aset_entry>: { "type": <string>, "hasID": <boolean>, "hasSpan": <boolean>,
                "attrs": [ <attr_entry>* ], "annots": <annot_list> }

<attr_entry>: { "name": <string>, "type": <attr_type>, "aggregation": <aggr_type> }

<attr_type>: "string" | "annotation" | "float" | "int" | "boolean"

<aggr_type>: null | "none" | "list" | "set"

<annot_list>: [ <annot_entry>* ]

<annot_entry>: <spanned_annot_entry> | <spanless_annot_entry>

<spanned_annot_entry>: [ <int>, <int>, <id>?, <value>* ]

<spanless_annot_entry>: [ <id>?, <value>* ]

<id>: <string>

<value>: <base_value> | <aggregation_value> 

<base_value>: null | <string> | <id> | <float> | <integer> | <boolean> 

<aggregation_value>: [ <string>* | <id>* | <float>* | <integer>* | <boolean>* ]

The value of "signal" is the document contents; once the document has any annotations at all, the signal should not be changed.

In a version 2 document, the "version" key is obligatory, and its value must be 2. If the "version" key is absent, or its value is 1, the document is in MAT document format version 1. Decoders should raise an error if the version is later than the version they're designed to handle. All version 2 decoders should recognize version 1 as well (see below).
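The version rules can be sketched as follows. This is a minimal illustration and not the MAT decoder itself; the names SUPPORTED_VERSION and document_version are hypothetical, and the function assumes the JSON has already been decoded into a Python dictionary.

```python
SUPPORTED_VERSION = 2  # hypothetical: highest format version this decoder handles


def document_version(doc):
    """Return the MAT JSON format version of an already-decoded document dict."""
    # A missing "version" key means the document is in version 1.
    version = doc.get("version", 1)
    if version > SUPPORTED_VERSION:
        raise ValueError("unsupported MAT JSON version: %r" % (version,))
    return version
```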

The <metadata> is a hash, whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document, whether the document is a reconciliation document or not, and for pairing information in comparison documents. As a rule of thumb, if you modify a document, make sure the metadata is preserved.

The <aset_list> is a sequence of entries, one for each annotation type.

Each <aset_entry> specifies the name of a tag (e.g., "PERSON"), whether or not the annotations in this set have IDs ("hasID"), whether or not the annotations in this set have spans ("hasSpan"), a list of attributes which can be filled, and a list of annotations. If "hasID" is not present, its value is assumed to be false. If "hasSpan" is not present, its value is assumed to be true.

Note: MAT uses the presence or absence of an <aset_entry> as an indication of whether some operation has attempted to add elements of that annotation type to the document. Do not prepopulate the document with <aset_entry>s which have empty <annot_list>s. This infelicity will be removed in a subsequent release.

Each <attr_entry> specifies the name of the attribute (e.g., "gender") and the type of the attribute. The recognized <attr_type>s in version 2 are "string", "int", "float", "boolean", and "annotation". The attribute can also have an aggregation type; the recognized <aggr_type>s in version 2 are "list", "set", and "none" or null. Both the type and aggregation are optional. The default value for the attribute type is "string"; the default aggregation value is null. The legal <value>s for each type and aggregation are:

type "string": null or a JSON string
type "int": null or a JSON number which corresponds to an integer (JSON has no distinction between ints and floats)
type "float": null or a JSON number
type "boolean": null or a JSON boolean (true or false)
type "annotation": null or a JSON string which is the ID of another annotation
aggregation "list": null or a JSON list of values of the appropriate type
aggregation "set": null or a JSON list of values of the appropriate type
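The list above can be turned into a small checker. The following is a sketch only, operating on values already decoded from JSON; the function name is_legal_value is hypothetical. It assumes that a number such as 3.0 counts as an integer (since JSON itself does not distinguish the two), and it does not check that "set" values are free of duplicates or that an annotation ID actually exists in the document.

```python
def is_legal_value(value, attr_type="string", aggregation=None):
    """Check a decoded JSON value against an attribute type and aggregation."""
    if value is None:
        return True  # null is legal for every type and aggregation

    def is_legal_base(v):
        if attr_type in ("string", "annotation"):
            return isinstance(v, str)
        if attr_type == "boolean":
            return isinstance(v, bool)
        # bool is a subclass of int in Python, so rule it out for numbers.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False
        if attr_type == "int":
            # Assumption: 3.0 is accepted as an integer-valued JSON number.
            return isinstance(v, int) or v.is_integer()
        return attr_type == "float"

    if aggregation in ("list", "set"):
        return isinstance(value, list) and all(is_legal_base(v) for v in value)
    return is_legal_base(value)
```

For example, is_legal_value([1.5, 2], "float", "list") is true, while is_legal_value(3.5, "int") is false.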

The form of each element of the list of annotations depends on the values of "hasID" and "hasSpan". If "hasSpan" is true, the first two elements are integers, which are 0-based indices into the signal representing the start and end of the annotation span, respectively. If "hasID" is true, the element immediately after the indices (or the first element, if "hasSpan" is false) is the ID of the annotation, which can be referred to by other annotations if they have an attribute whose type is "annotation".

In each annotation set, the value of "attrs" (the attribute entries) and the list of attribute values in each element of "annots", after "hasSpan" and "hasID" have been accounted for, are essentially parallel; the attribute values may be no longer than the attribute entries, and they are paired with each other until the values are exhausted, at which point all subsequent attribute values should be treated as null. The reason for allowing annot lists which are shorter than the list of attribute names is partially for space efficiency, and partially to support the option of adding a new attribute to an annotation type without having to go to the trouble of adding a null to every instance of that annotation type.
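To make the pairing concrete, here is a sketch of how a decoder might expand the entries of one annotation set. It is not the MAT implementation; the function name decode_annots and the output layout are hypothetical, and the RELATION example at the end is invented for illustration.

```python
def decode_annots(aset_entry):
    """Expand each annot entry of one decoded <aset_entry> into a dict."""
    has_span = aset_entry.get("hasSpan", True)   # default is true
    has_id = aset_entry.get("hasID", False)      # default is false
    names = [attr["name"] for attr in aset_entry.get("attrs", [])]

    decoded = []
    for entry in aset_entry["annots"]:
        rest = list(entry)
        annot = {}
        if has_span:
            annot["start"], annot["end"] = rest[0], rest[1]
            rest = rest[2:]
        if has_id:
            annot["id"] = rest[0]
            rest = rest[1:]
        if len(rest) > len(names):
            raise ValueError("more attribute values than attribute entries")
        # Pad with None so the values line up with the attribute entries.
        rest += [None] * (len(names) - len(rest))
        annot["attrs"] = dict(zip(names, rest))
        decoded.append(annot)
    return decoded


# A made-up spanless annotation set with IDs and an annotation-valued attribute.
relation = {"type": "RELATION", "hasID": True, "hasSpan": False,
            "attrs": [{"name": "arg1", "type": "annotation"}, {"name": "kind"}],
            "annots": [["r1", "p1"]]}
print(decode_annots(relation))
```

In the example, the entry supplies a value only for "arg1", so "kind" comes back as None.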

Here's a sample document:

{"signal": "I like Michael Jackson and Janet Jackson.",
 "version": 2,
 "asets": [ {"type": "PERSON", "hasID": false, "hasSpan": true,
             "attrs": [{"name": "gender", "type": "string"}, 
                       {"name": "number", "type": "string", "aggregation": null}],
             "annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
 "metadata": {}
}

In this example, note that the value of the "gender" attribute for the PERSON annotation spanning "Michael Jackson" is null, and the value of the "number" attribute is "singular". For the PERSON annotation spanning "Janet Jackson" only the "gender" attribute is specified (it is "female"), implying that the "number" attribute for this annotation is null. This illustrates how the MAT document format allows the specified list of attribute values to be shorter than the list of attribute entries (with implicit nulls making up the difference).

Previous versions: version 1

Version 1 is somewhat simpler than version 2; it differs from version 2 in that there are no spanless annotations or annotation-valued attributes. The Python, Javascript and Java mat-json reader/writers all recognize both versions. However, they all produce version 2. The version 1 readers (in MAT 1.3 and previous) will not be able to read version 2. We provide a special mat-json-v1 writer in MAT 2.0 to write version-1-compatible MAT JSON documents (by discarding spanless annotations and attribute values which are annotations).
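The downgrade described above can be sketched roughly as follows, using the version 1 layout specified later in this section. This is not the mat-json-v1 writer; the function name to_version_1 is hypothetical, and the sketch assumes that IDs are simply dropped and that all other attribute values are passed through unchanged, which the real writer may handle differently.

```python
def to_version_1(doc):
    """Return a version-1-style copy of a decoded version 2 document (a sketch)."""
    v1_asets = []
    for aset in doc["asets"]:
        if not aset.get("hasSpan", True):
            continue  # version 1 has no spanless annotations
        attrs = aset.get("attrs", [])
        # Positions of the attributes that are not annotation-valued.
        keep = [i for i, a in enumerate(attrs)
                if a.get("type", "string") != "annotation"]
        skip = 3 if aset.get("hasID", False) else 2  # start, end, optional ID
        annots = []
        for entry in aset["annots"]:
            values = entry[skip:]
            kept = [values[i] if i < len(values) else None for i in keep]
            annots.append([entry[0], entry[1]] + kept)
        v1_asets.append({"type": aset["type"],
                         "attrs": [attrs[i]["name"] for i in keep],
                         "annots": annots})
    return {"signal": doc["signal"], "metadata": doc.get("metadata", {}),
            "asets": v1_asets, "version": 1}
```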

Here is the spec. All literals are case-sensitive.

<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 1 }

<aset_list>: [ <aset_entry>* ]

<aset_entry>: { "type": <string>, "attrs": [ <string>* ], "annots": <annot_list> }

<annot_list>: [ [ <int>, <int>, <string>* ]* ]

The value of "signal" is the document contents; once the document has any annotations at all, the signal should not be changed.

The "version" key is optional, in this initial version of the format. Decoders should assume that if the key is missing, the version is 1. Decoders should raise an error if the version is later than the version they're designed to handle.

The <metadata> is a hash, whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document. As a rule of thumb, if you modify a document, make sure the metadata is preserved.

The <aset_list> is a sequence of entries, one for each annotation type. Each <aset_entry> specifies the name of a tag (e.g., "PERSON"), a list of attributes which can be filled (e.g. ["gender"]), and a list of annotations. Each element of the list of annotations contains two integers, which are 0-based indexes into the signal representing the start and end of the annotation span, respectively, plus attribute values. The value of "attrs" (the attribute names) and the list of values after the first two integers in each element in the value of "annots" (the attribute values) are essentially parallel; the attribute values may be no longer than the attribute names, and they are paired with each other until the values are exhausted, at which point all subsequent attribute values should be treated as null. The reason for allowing annot lists which are shorter than the list of attribute names is partially for space efficiency, and partially to support the option of adding a new attribute to an annotation type without having to go to the trouble of adding a null to every instance of that annotation type.

Here's a sample document:

{"signal": "I like Michael Jackson and Janet Jackson.",
 "asets": [ {"type": "PERSON", 
             "attrs": ["gender", "number"],
             "annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
 "metadata": {}
}
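Because version 1 annotations are always spanned, carry no IDs, and have string-valued attributes, a version 1 annotation set maps onto version 2 quite directly. The following is a sketch of that mapping, not the code MAT's readers actually use; the function name upgrade_aset is hypothetical.

```python
def upgrade_aset(v1_aset):
    """Convert one decoded version 1 <aset_entry> to version 2 form (a sketch)."""
    return {
        "type": v1_aset["type"],
        "hasID": False,    # version 1 annotations carry no IDs
        "hasSpan": True,   # version 1 annotations are always spanned
        # Version 1 attributes are bare names; their values are strings.
        "attrs": [{"name": name, "type": "string", "aggregation": None}
                  for name in v1_aset["attrs"]],
        # With no ID slot, annot entries have the same layout in both versions.
        "annots": [list(entry) for entry in v1_aset["annots"]],
    }
```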

A note about files, signal offsets, JSON strings, and character encodings

To write a document to a MAT JSON document file, convert your document object to the appropriate data structures in your programming language, render the structure to JSON, and write the string to a file, using the UTF-8 character encoding. To read a document, read the contents of the file using the UTF-8 character encoding, decode the string into the matching data structures in your programming language, and convert those data structures into your document object.
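In Python, the file-level part of these steps might look like the sketch below. The function names are hypothetical, and the conversion between your own document object and plain dictionaries and lists is left out.

```python
import json


def write_mat_json(doc, path):
    """Render a decoded document dict to JSON and write it as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes non-ASCII characters directly rather
        # than as \uXXXX escapes; either form is valid JSON.
        json.dump(doc, f, ensure_ascii=False)


def read_mat_json(path):
    """Read a UTF-8 file and decode its JSON into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Passing the encoding explicitly matters because the default encoding of open() can depend on the platform.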

The character encoding of a MAT JSON document is always UTF-8.

It's important to remember, always, what the index offsets in the annotations represent: they are character offsets, independent of the particular character encoding. (If you don't understand the distinction, we recommend you read Joel Spolsky's Unicode primer.) We've chosen UTF-8 as our encoding for MAT JSON documents because it is flexible enough to encode all Unicode characters with good efficiency, and it's a proper superset of ASCII. So if an annotation covers the span from index 7 to index 22, as the first annotation does in our example above, this means the characters at indices 7 through 21 of the document, counting from 0; the end index itself is not included. It does not mean bytes 7 through 21 of the file.
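The difference is easy to see with a signal that contains non-ASCII characters. The signal below is made up for this illustration; the sketch assumes Python 3, where slicing a str works in characters.

```python
# "Zoë met André." written with escapes so the code points are unambiguous.
signal = "Zo\u00eb met Andr\u00e9."
start, end = 8, 13  # character offsets of "André"; the end index is exclusive

# Slicing the string uses character offsets, which is what MAT offsets mean.
print(signal[start:end])

# Slicing the UTF-8 bytes with the same numbers selects a different span,
# because "ë" and "é" each take two bytes in UTF-8.
raw = signal.encode("utf-8")
print(raw[start:end].decode("utf-8", errors="replace"))
```

The first slice is the annotated name; the second is shifted one position to the left, because one two-byte character precedes the span.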

This can lead to tremendous confusion in counting offsets, depending on how your programming language treats Unicode strings. For instance, Javascript has UTF-16 strings, in which most Unicode characters take up exactly 2 bytes, and the 2-byte numeric value is the same as the Unicode code point for that character. For characters whose Unicode code point is greater than 65535 (that is, larger than can be represented in 2 bytes), UTF-16 has a system of "surrogates", which are pairs of 2-byte sequences reserved for representing these larger code points. So in some cases a Unicode character takes up 2 bytes in UTF-16, and in other cases it takes up 4 bytes. The 4-byte case is comparatively uncommon: the characters of most modern writing systems are in the 2-byte space. However, the Unicode code point space extends up to U+10FFFF, so there are quite a number of characters (e.g., music notation, historic scripts, emoji, and rarer Chinese characters) which appear as these "surrogate pairs".

The reason this is relevant to programmers is that these surrogate pairs count as length 2 in Javascript; that is, the String.length property counts UTF-16 elements. Java is similar, except that since Java 1.5 it also has a separate set of APIs which work with Unicode code points (e.g., codePointAt() alongside charAt()).
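If you need to move between the two ways of counting, the conversion can be sketched as below. This is an illustration only and is not part of the MAT libraries; the function names are hypothetical, and the sketch assumes Python 3.3 or later, where a str is indexed by code point.

```python
def utf16_length(text):
    """Number of UTF-16 elements, i.e. what Javascript's String.length reports."""
    return len(text.encode("utf-16-le")) // 2


def to_utf16_offset(text, offset):
    """Convert a code point offset into the matching UTF-16 element offset."""
    return utf16_length(text[:offset])


# U+1D11E MUSICAL SYMBOL G CLEF lies above the 2-byte space.
signal = "\U0001D11E is a G clef"
print(len(signal))                  # counts code points
print(utf16_length(signal))         # counts UTF-16 elements
print(to_utf16_offset(signal, 2))   # offset after the clef and the space
```

Here the UTF-16 length is one greater than the code point length, because the clef is stored as a surrogate pair.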

None of the document object libraries included with MAT (Python, Javascript, Java) currently treat these characters above the 2-byte space correctly.