The MAT toolkit is designed to be loosely coupled, using
documents at rest (rather than data structures in process) as the
common data interface. The default format for rich annotated
documents is described in this document; the full range of
available readers and writers is described elsewhere.
The MAT document format is built on top of the JavaScript Object Notation (JSON). It
is simple and lightweight and, unlike XML, is designed for typed
data. This format closely mirrors the structure of the documents
themselves, so it's worth reviewing in any case, but especially if
you want to process MAT-annotated documents outside of the MAT
toolkit. Currently, we provide facilities for rendering and
digesting this format in Python, JavaScript, and Java; if you want to
manipulate this format in any other programming language, you'll
have to write the renderer/digester yourself.
JSON is so called because it's a subset of the JavaScript
programming language, and thus exceptionally well suited for
passing data to and from Web applications like the MAT UI. JSON
contains hashes (curly brackets), lists (square brackets), Unicode
strings, integers and floats, plus the constants null, true, and false. Whitespace
is not significant except within strings. That's it.
It's important to remember that JSON is not a data structure;
it's a string representation of data structures. There are JSON
libraries for reading and writing JSON strings and mapping them to
native data structures. E.g., in Python, hashes are mapped to
dictionaries, strings to strings, lists to lists, null to None, true to True, and false to False.
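As a minimal illustration of that mapping, here is a sketch using the json module from the Python standard library (Python 3 assumed). The sample string and the variable names are made up for this example and are not part of the MAT format.

```python
import json

# A JSON *string* (made up for illustration); it is not yet a data structure.
text = '{"name": "PERSON", "span": [7, 22], "id": null, "ok": true, "score": 1.5}'

# Decoding maps JSON values onto native Python objects:
# hash -> dict, list -> list, null -> None, true -> True, float -> float.
data = json.loads(text)

# Encoding goes the other way and produces a string again.
round_trip = json.dumps(data)
```

Decoding the re-encoded string gives back an equal dictionary, which is the sense in which JSON is only a representation of the data.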
The current version number of the MAT document format is 2. All
literals specified below are case-sensitive.
<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 2 }
<metadata>: { ... }
<aset_list>: [ <aset_entry>* ]
<aset_entry>: { "type": <string>, "hasID": <boolean>, "hasSpan": <boolean>,
"attrs": [ <attr_entry>* ], "annots": <annot_list> }
<attr_entry>: { "name": <string>, "type": <attr_type>, "aggregation": <aggr_type> }
<attr_type>: "string" | "annotation" | "float" | "int" | "boolean"
<aggr_type>: null | "none" | "list" | "set"
<annot_list>: [ <annot_entry>* ]
<annot_entry>: <spanned_annot_entry> | <spanless_annot_entry>
<spanned_annot_entry>: [ <int>, <int>, <id>?, <value>* ]
<spanless_annot_entry>: [ <id>?, <value>* ]
<id>: <string>
<value>: <base_value> | <aggregation_value>
<base_value>: null | <string> | <id> | <float> | <integer> | <boolean>
<aggregation_value>: [ <string>* | <id>* | <float>* | <integer>* | <boolean>* ]
The value of "signal" is the document contents; once the document
has any annotations at all, the signal should not be changed.
In version 2 documents, the "version" key is obligatory, and its value must be 2.
Decoders should raise an error if the version is later than the
version they're designed to handle. If the "version" key is
absent, or its value is 1, the document is in MAT document
format version 1. All version 2 decoders should recognize version
1 as well (see below).
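The version rules above can be sketched as follows. This is a minimal sketch, not the MAT decoder: the names SUPPORTED_VERSION and document_version are hypothetical, and the function expects a document that has already been decoded from JSON into a Python dictionary.

```python
SUPPORTED_VERSION = 2  # highest format version this hypothetical decoder handles


def document_version(doc):
    """Return the format version of a decoded MAT JSON document."""
    # An absent "version" key means the document is in version 1.
    version = doc.get("version", 1)
    if version > SUPPORTED_VERSION:
        # The document is newer than this decoder understands.
        raise ValueError("unsupported MAT document version: %r" % (version,))
    return version
```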
The <metadata> is a hash, whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document, whether the document is a reconciliation document or not, and for pairing information in comparison documents. As a rule of thumb, if you modify a document, make sure the metadata is preserved.
The <aset_list> is a sequence of entries, one for each
annotation type.
Each <aset_entry> specifies the name of a tag (e.g.,
"PERSON"), whether or not the annotations in this set have IDs
("hasID"), whether or not the annotations in this set have spans
("hasSpan"), a list of attributes which can be filled, and a list
of annotations. If "hasID" is not present, its value is assumed to
be false. If "hasSpan" is not present, its value is assumed to be
true.
Note: MAT uses the
presence or absence of an <aset_entry> as an indication of
whether some operation has attempted to add elements of that
annotation type to the document. Do not prepopulate the document with
<aset_entry>s which have empty <annot_list>s. This
infelicity will be removed in a subsequent release.
Each <attr_entry> specifies the name of the attribute
(e.g., "gender") and the type of the attribute. The recognized
<attr_type>s in version 2 are "string", "int", "float",
"boolean", and "annotation". The attribute can also have an
aggregation type; the recognized <aggr_type>s in version 2
are "list", "set", and "none" or null. Both the type and
aggregation are optional. The default value for the attribute type
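is "string"; the default aggregation value is null. Following the
grammar above, the legal <value> for an attribute with no
aggregation is a <base_value> of the attribute's type (or null);
for a "list" or "set" aggregation, it is an <aggregation_value>
whose elements are all of the attribute's type.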
The form of each element of the list of annotations depends on
the values of "hasID" and "hasSpan". If "hasSpan" is true, the
first two elements are integers, which are 0-based indices into
the signal representing the start and end of the annotation span,
respectively. If "hasID" is true, the element immediately after
the indices (or the first element, if "hasSpan" is false) is the
ID of the annotation, which can be referred to by other
annotations if they have an attribute whose type is "annotation".
In each annotation set, the value of "attrs" (the attribute
entries) and the list of attribute values in each element of
"annots", after "hasSpan" and "hasID" have been accounted for, are
essentially parallel; the attribute values may be no longer than
the attribute entries, and they are paired with each other until
the values are exhausted, at which point all subsequent attribute
values should be treated as null.
The reason for allowing annot lists which are shorter than the
list of attribute names is partially for space efficiency, and
partially to support the option of adding a new attribute to an
annotation type without having to go to the trouble of adding a null to every instance of
that annotation type.
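To make the pairing rules concrete, here is a minimal Python sketch that expands the compact entries of one annotation set. The function name decode_aset and the layout of the returned dictionaries are assumptions made for illustration; they are not the MAT library API. The sketch also assumes that an ID is present in every entry whenever "hasID" is true.

```python
def decode_aset(aset):
    """Expand each compact annotation entry of one <aset_entry>."""
    has_span = aset.get("hasSpan", True)  # default when the key is absent
    has_id = aset.get("hasID", False)     # default when the key is absent
    names = [attr["name"] for attr in aset.get("attrs", [])]
    decoded = []
    for entry in aset["annots"]:
        values = list(entry)
        annot = {"type": aset["type"], "start": None, "end": None, "id": None}
        if has_span:
            # The first two elements are the 0-based start and end indices.
            annot["start"], annot["end"] = values[0], values[1]
            values = values[2:]
        if has_id:
            # Assumption: the ID is always present when "hasID" is true.
            annot["id"] = values[0]
            values = values[1:]
        if len(values) > len(names):
            raise ValueError("more attribute values than attribute entries")
        # Missing trailing values are implicit nulls (None in Python).
        values += [None] * (len(names) - len(values))
        annot["attrs"] = dict(zip(names, values))
        decoded.append(annot)
    return decoded
```

For an entry such as [27, 40, "female"] in a set with attributes "gender" and "number", this yields "female" for "gender" and None for "number".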
Here's a sample document:
{"signal": "I like Michael Jackson and Janet Jackson.",
"version": 2,
"asets": [ {"type": "PERSON", "hasID": false, "hasSpan": true,
"attrs": [{"name": "gender", "type": "string"},
{"name": "number", "type": "string", "aggregation": null}],
"annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
"metadata": {}
}
In this example, note that the value of the "gender" attribute
for the PERSON annotation spanning "Michael Jackson" is null, and the value of
the "number" attribute is "singular". For the PERSON
annotation spanning "Janet Jackson" only the "gender" attribute is
specified (it is "female"), implying that the "number" attribute
for this annotation is null.
This illustrates how the MAT document format allows the specified
list of annotation values to be shorter than the list of
annotation names (with implicit nulls making up the difference).
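Going in the writing direction, an encoder can omit trailing nulls, since a reader will supply them. The following is a minimal sketch for spanned annotations without IDs; the function name encode_annotation and its arguments are hypothetical and not part of the MAT libraries.

```python
def encode_annotation(start, end, attr_names, attr_values):
    """Build a spanned, ID-less annotation entry, dropping trailing nulls.

    attr_names is the ordered list of attribute names for the annotation
    type; attr_values maps some of those names to values.
    """
    # Lay the values out in the same order as the attribute entries.
    values = [attr_values.get(name) for name in attr_names]
    # Trailing None values are implied by the format, so leave them out.
    while values and values[-1] is None:
        values.pop()
    return [start, end] + values


names = ["gender", "number"]
first = encode_annotation(7, 22, names, {"number": "singular"})
second = encode_annotation(27, 40, names, {"gender": "female"})
```

Note that a null which is followed by a real value (as in the first entry) has to be kept, because the pairing with the attribute entries is positional.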
Version 1 is somewhat simpler than version 2; it differs from
version 2 in that there are no spanless annotations or
annotation-valued attributes. The Python, JavaScript and Java
mat-json reader/writers all recognize both versions. However, they
all produce version 2. The version 1 readers (in MAT 1.3 and
previous) will not be able to read version 2. We provide a special
mat-json-v1 writer in MAT 2.0 to write version-1-compatible MAT
JSON documents (by discarding spanless annotations and attribute
values which are annotations).
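The kind of downgrade described here can be sketched in Python as follows. This is an illustration only, not the mat-json-v1 writer itself: the function name to_v1 is hypothetical, dropping annotation IDs is an assumption (the version 1 grammar has no slot for them), and all other attribute values are passed through unchanged.

```python
def to_v1(doc):
    """Return a version-1-style copy of a decoded version 2 document."""
    v1_asets = []
    for aset in doc["asets"]:
        if not aset.get("hasSpan", True):
            continue  # version 1 has no spanless annotations
        attrs = aset.get("attrs", [])
        # Positions of the attributes that are not annotation-valued.
        keep = [i for i, attr in enumerate(attrs)
                if attr.get("type", "string") != "annotation"]
        # Assumption: the ID, if any, is dropped along with the span prefix.
        skip = 3 if aset.get("hasID", False) else 2
        annots = []
        for entry in aset["annots"]:
            values = entry[skip:]
            kept = [values[i] if i < len(values) else None for i in keep]
            while kept and kept[-1] is None:
                kept.pop()  # trailing nulls are implicit
            annots.append(list(entry[:2]) + kept)
        v1_asets.append({"type": aset["type"],
                         "attrs": [attrs[i]["name"] for i in keep],
                         "annots": annots})
    return {"signal": doc["signal"], "metadata": doc.get("metadata", {}),
            "asets": v1_asets, "version": 1}
```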
Here is the spec. All literals are case-sensitive.
<document>: {"signal": <string>, "metadata": <metadata>, "asets": <aset_list>, "version": 1 }
<aset_list>: [ <aset_entry>* ]
<aset_entry>: { "type": <string>, "attrs": [ <string>* ], "annots": <annot_list> }
<annot_list>: [ [ <int>, <int>, <string>* ]* ]
The value of "signal" is the document contents; once the document
has any annotations at all, the signal should not be changed.
The "version" key is optional, in this initial version of the
format. Decoders should assume that if the key is missing, the
version is 1. Decoders should raise an error if the version is
later than the version they're designed to handle.
The <metadata> is a hash, whose contents are application-specific. Currently, we use it to track which steps of workflows have been applied to a document. As a rule of thumb, if you modify a document, make sure the metadata is preserved.
The <aset_list> is a sequence of entries, one for each
annotation type. Each <aset_entry> specifies the name of a
tag (e.g., "PERSON"), a list of attributes which can be filled
(e.g. ["gender"]), and a list of annotations. Each element of the
list of annotations contains two integers, which are 0-based
indexes into the signal representing the start and end of the
annotation span, respectively, plus attribute values. The value of
"attrs" (the attribute names) and the list of values after the
first two integers in each element in the value of "annots" (the
attribute values) are essentially parallel; the attribute values
may be no longer than the attribute names, and they are paired
with each other until the values are exhausted, at which point all
subsequent attribute values should be treated as null. The reason for
allowing annot lists which are shorter than the list of attribute
names is partially for space efficiency, and partially to support
the option of adding a new attribute to an annotation type without
having to go to the trouble of adding a null to every instance of that annotation
type.
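Since a version 1 annotation set is a restricted case of a version 2 set, a decoder can normalize version 1 input to the version 2 layout before processing it. The sketch below shows one way to do that in Python; the function name upgrade_v1 is hypothetical, and treating every version 1 attribute as a "string" attribute is an assumption based on the version 1 grammar.

```python
def upgrade_v1(doc):
    """Return a version 2 view of a decoded version 1 document."""
    asets = []
    for aset in doc.get("asets", []):
        asets.append({
            "type": aset["type"],
            "hasID": False,   # version 1 annotations carry no IDs
            "hasSpan": True,  # version 1 annotations are always spanned
            # Version 1 lists bare attribute names; wrap each in an entry.
            "attrs": [{"name": name, "type": "string"}
                      for name in aset.get("attrs", [])],
            "annots": [list(entry) for entry in aset.get("annots", [])],
        })
    return {"signal": doc["signal"], "metadata": doc.get("metadata", {}),
            "asets": asets, "version": 2}
```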
Here's a sample document:
{"signal": "I like Michael Jackson and Janet Jackson.",
"asets": [ {"type": "PERSON",
"attrs": ["gender", "number"],
"annots": [[7, 22, null, "singular"], [27, 40, "female"]]} ],
"metadata": {}
}
In this example, note that the value of the "gender" attribute
for the PERSON annotation spanning "Michael Jackson" is null, and the value of
the "number" attribute is "singular". For the PERSON
annotation spanning "Janet Jackson" only the "gender" attribute is
specified (it is "female"), implying that the "number" attribute
for this annotation is null.
This illustrates how the MAT document format allows the specified
list of annotation values to be shorter than the list of
annotation names (with implicit nulls making up the difference).
To write a document to a MAT JSON document file, convert your
document object to the appropriate data structures in your
programming language, render the structure to JSON, and write the
string to a file, using the UTF-8 character encoding. To read a
document, read the contents of the file using the UTF-8 character
encoding, decode the string into the matching data structures in
your programming language, and convert those data structures into
your document object.
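The file-level half of these steps can be sketched in Python 3 as follows. The function names write_mat_json and read_mat_json are hypothetical, and the sketch works on plain dictionaries; converting between those dictionaries and your own document object is left out.

```python
import json
import os
import tempfile


def write_mat_json(doc, path):
    """Render a document dictionary to JSON and write it as UTF-8."""
    with open(path, "w", encoding="utf-8") as fp:
        # ensure_ascii=False writes non-ASCII characters directly as UTF-8
        # instead of \uXXXX escapes; both forms decode to the same string.
        json.dump(doc, fp, ensure_ascii=False)


def read_mat_json(path):
    """Read a UTF-8 file and decode its JSON contents."""
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


# Round trip through a temporary file (sample content made up for illustration).
doc = {"signal": "Jos\u00e9 smiled.", "metadata": {}, "asets": [], "version": 2}
path = os.path.join(tempfile.mkdtemp(), "sample.json")
write_mat_json(doc, path)
loaded = read_mat_json(path)
```

Passing the encoding explicitly matters because the default file encoding in Python depends on the platform.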
The character encoding of a MAT JSON document is always UTF-8.
It's important to remember, always, what the index offsets
in the annotations represent: they are character offsets,
independent of the particular character encoding. (If you don't
understand the distinction, we recommend you read Joel
Spolsky's Unicode primer.) We've chosen UTF-8 as our
encoding for MAT JSON documents because it is flexible enough to
encode all Unicode characters with good efficiency, and it's a
proper superset of ASCII. So if an annotation covers the span from
index 7 to index 22, as the first annotation does in our example
above, this means "from the 7th character of the document (where 0
is the first) to the 21st character of the document (where 0 is
the first)". It does not
mean "from the 7th byte of the document to the 21st byte of the
document".
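The difference is easy to see in a small Python 3 sketch, where str values are indexed by character and bytes values by byte. The signal below is made up for illustration; the two accented characters each take two bytes in UTF-8.

```python
# A made-up signal with two non-ASCII characters (e-diaeresis, e-acute).
signal = "Zo\u00eb likes Jos\u00e9."
start, end = 10, 14  # character offsets of the last name in the signal

# Slicing the decoded string uses character offsets, as the format intends.
by_characters = signal[start:end]

# Slicing the UTF-8 bytes with the same numbers selects the wrong region,
# because the earlier two-byte character shifts every later byte position.
encoded = signal.encode("utf-8")
by_bytes = encoded[start:end]
```

Here by_characters is the four-character name, while by_bytes starts one position too early and stops before the final accented letter.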
This can lead to tremendous confusion in counting offsets,
depending on how your programming language treats Unicode strings.
For instance, JavaScript has UTF-16 strings, which means that most
Unicode characters take up exactly 2 bytes, and it just so happens
that the 2-byte numeric value is the same as the Unicode code
point for that character. For characters whose Unicode code point
is greater than 65535 (that is, larger than can be represented in
2 bytes), UTF-16 has a system of what are called "surrogates",
which are pairs of 2-byte sequences reserved for representing
these larger code points. So in some cases, a Unicode character
takes up 2 bytes in UTF-16, and in other cases, it takes up 4
bytes. Now, this case is relatively unusual; the characters of
nearly all living human languages are in the 2-byte space, but the
Unicode code point space extends to U+10FFFF, so there are quite a number
of less common characters (e.g., music notation, dead languages) which
appear as these "surrogate pairs".
The reason this is relevant to programmers is that these
surrogate pairs count as length 2 in JavaScript; that is, the
String.length property counts UTF-16 code units, not characters. Java is similar,
except that it has a separate set of APIs for working with Unicode
code points (e.g., charAt() vs. codePointAt(), available since Java 1.5).
None of the document object libraries included with MAT (Python,
JavaScript, Java) currently treat these characters above the
2-byte space correctly.
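To illustrate the counting difference (this is not a fix for the MAT libraries), the following Python 3 sketch converts a character offset into the corresponding UTF-16 code unit offset, which is what a language that counts UTF-16 code units would use. The function name utf16_offset is hypothetical, and the sketch relies on Python 3 strings being indexed by code point.

```python
def utf16_offset(text, index):
    """Convert a code point offset in text to a UTF-16 code unit offset."""
    # Each code point above U+FFFF needs a surrogate pair, i.e. two
    # 16-bit code units; every other code point needs one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:index])


# U+1D11E MUSICAL SYMBOL G CLEF lies above the 2-byte space.
signal = "\U0001D11E clef"
code_points = len(signal)                       # counts characters
code_units = utf16_offset(signal, len(signal))  # counts UTF-16 code units
```

For this signal the two counts differ by one, so every annotation offset after the clef would be off by one if UTF-16 code units were mistaken for characters.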