Readers and writers

All the MAT tools can be configured to use one of an extensible set of readers and writers. Currently, there are three reader/writer types: raw, mat-json, and xml-inline. There is also a writer-only type, mat-json-v1, and a reader-only type, fake-xml-inline. These types can be passed to tools like MATEngine. You may also find that your task has defined additional readers and writers; consult your task maintainer for details about these.

raw

For reading, a file of this type is treated as all signal. For writing, the signal is extracted from the relevant annotated document. This reader/writer has no additional options. The default encoding for this reader/writer is ASCII.

It is very important that you know what the encoding of your raw document is, and not just for MAT; any tool that reads raw text documents needs to know. If you're not sure, ask the person who provided the documents to you.
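
To illustrate why the encoding matters, here is a minimal sketch in plain Python. It is not part of MAT, and the helper name decode_raw is hypothetical. It decodes the same bytes under three different encoding assumptions: one fails outright, one succeeds, and one silently produces the wrong text.

```python
def decode_raw(data: bytes, encoding: str = "ascii") -> str:
    """Decode raw document bytes; raises UnicodeDecodeError on a mismatch."""
    return data.decode(encoding)

# Bytes as they might sit on disk for a UTF-8 encoded document.
sample = "café".encode("utf-8")

# An ASCII assumption cannot represent the accented character.
try:
    decode_raw(sample)
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The correct encoding recovers the original text.
utf8_text = decode_raw(sample, "utf-8")

# A wrong single-byte encoding raises no error but corrupts the text.
latin1_text = decode_raw(sample, "latin-1")
```

The last case is the dangerous one: nothing fails, but every annotation offset computed against the corrupted text will be wrong for the real document.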

mat-json

This type designates the MAT-specific JSON document format (current version is 2). This reader/writer has no additional options. The only available encoding is UTF-8.

mat-json-v1

This type designates version 1 of the MAT-specific JSON document format. This type is available only as a writer (since mat-json reads both version 1 and version 2). It has no additional options. The only available encoding is UTF-8.

xml-inline

This type designates XML inline data where some of the elements correspond to annotations. Under normal circumstances, all markup which doesn't correspond to a known annotation will be discarded; see the --xml_input_is_overlay flag to remedy that. If there are no XML elements which correspond to known annotations, consider using the raw reader instead. The default encoding is UTF-8.

Note: the input here must really be XML. If the input simply has SGML-like inline markup layered on top of a raw document, you probably want the fake-xml-inline reader.
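
If you are unsure whether a document is really XML, one rough check is to run it through a standard XML parser before choosing a reader. The sketch below uses Python's standard xml.etree.ElementTree module; the function name is_well_formed_xml is hypothetical and is not part of MAT.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text: str) -> bool:
    """Return True if a standard XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Real XML: a single toplevel element and an escaped ampersand.
real_xml = (
    "<doc>The <ORGANIZATION>Smith &amp; Jones Corporation</ORGANIZATION> "
    "has announced its IPO.</doc>"
)

# SGML-like markup over raw text: no toplevel element, bare ampersand.
sgml_ish = (
    "The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> "
    "has announced its IPO."
)

real_ok = is_well_formed_xml(real_xml)
sgml_ok = is_well_formed_xml(sgml_ish)
```

A document that fails this check is a candidate for the fake-xml-inline reader rather than xml-inline.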

This type accepts the following options:

Command line option: --xml_input_is_overlay
XML attribute: xml_input_is_overlay
Value: "yes" (XML)
Description: Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the input XML will be treated as a mix of task-relevant annotations and underlying XML, and the extracted signal will be a well-formed XML file. Ignored if --xml_translate_all is specified.

Command line option: --xml_translate_all
XML attribute: xml_translate_all
Value: "yes" (XML)
Description: Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the task (if provided) will be ignored, and all elements will be converted to annotations. If no task is provided (MATScore, MATReport and MATTransducer can all be used without tasks), the reader will set this flag internally.

Command line option: --signal_is_xml
XML attribute: signal_is_xml
Value: "yes" (XML)
Description: Writer flag. Normally, the XML writer assumes that the underlying signal is not XML. If this flag is present, the underlying signal will be treated as a well-formed XML file when the output file is rendered. If the input file type is also 'xml-inline', use the --xml_input_is_overlay flag to control this setting instead.

Command line option: --xml_output_tag_exclusions <tag,tag,...>
XML attribute: xml_output_tag_exclusions
Value: <tag,tag,...>
Description: Writer flag. A comma-delimited list of annotation labels to exclude from the XML output.

Command line option: --xml_output_exclude_metadata
XML attribute: xml_output_exclude_metadata
Value: "yes" (XML)
Description: Writer flag. Normally, the XML writer saves the document metadata inside an XML comment, so it can be read back in by the XML reader, and also renders the annotation and attribute type information as zero-length XML tags. This flag causes this metadata not to be written.

Every attempt is made to make XML read/write lossless with respect to the underlying document. However, this is not always possible, because MAT documents use standoff annotations, and any crossing dependencies will end up generating malformed XML (e.g., <a>text<b>text</a>text</b>). You can use the --xml_output_tag_exclusions option to discard the offending annotation types.
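
To make the crossing problem concrete, here is a minimal Python sketch that finds pairs of standoff spans which partially overlap and therefore cannot be rendered as nested XML elements. It is an illustration only: the (label, start, end) tuple representation and the function name crossing_pairs are assumptions for this example, not MAT's document API.

```python
from itertools import combinations

def crossing_pairs(spans):
    """Find label pairs whose spans cross (overlap without nesting).

    spans: list of (label, start, end) tuples, with end exclusive.
    """
    # Sort by start offset, longer spans first, so that in each pair
    # the first span starts at or before the second.
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    crossings = []
    for a, b in combinations(ordered, 2):
        # b starts strictly inside a and ends strictly after a ends.
        if a[1] < b[1] < a[2] < b[2]:
            crossings.append((a[0], b[0]))
    return crossings

# Offsets for <a>text<b>text</a>text</b> over the signal "texttexttext".
crossed = crossing_pairs([("a", 0, 8), ("b", 4, 12)])

# Properly nested spans, as in <a>text<b>text</b>text</a>.
nested = crossing_pairs([("a", 0, 12), ("b", 4, 8)])
```

Any label that shows up in a crossing pair is a candidate for the --xml_output_tag_exclusions list.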

When used as a writer, xml-inline will dump the annotation and attribute type information (unless --xml_output_exclude_metadata is used). This type information enables all attribute types to be read correctly when xml-inline is used as a reader, whether or not the same annotation task is used. This includes set and list types and annotation-valued attributes. When xml-inline is used as a reader, it looks for the appropriate representation of these types, and if you've provided a task, you can interpret these values correctly even if the document was not produced with the MAT xml-inline writer. We document these values here for completeness; you're welcome to try writing such a document with another tool and seeing if MAT can read it, but we're not guaranteeing that it will work.

xml-inline in the MAT UI

The xml-inline reader/writer is available as an option in the MAT UI when you load and save documents in file mode.

When you select "xml-inline" as your load option in the MAT UI, the "Load document" dialog looks like this:

[Load dialog]

The checkbox corresponds to the --xml_input_is_overlay option above.

When you select "xml-inline" from the "Save" menu in your document window, you'll see the following popup:

[Save popup]

The "Underlying signal is XML" checkbox corresponds to the --signal_is_xml option; the "Annotation types to exclude" typein window corresponds to the --xml_output_tag_exclusions option; and the "Exclude commented-out MAT metadata" checkbox corresponds to the --xml_output_exclude_metadata option.

fake-xml-inline

We commonly encounter data which is XML-"like", which simply has inline SGML-ish markup in a raw document, like so:

The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO.

In these documents, XML-significant characters "&<>" are not properly escaped, and there is no toplevel XML tag surrounding the entire document. The fake-xml-inline reader searches for patterns of the form <...> and determines whether each "tag" is an opening, closing, or zero-length tag. It translates attribute-value pairs of the SGML-ish opening tags into annotation attribute-value pairs. If it finds an attribute-value string which can't be parsed using XML-ish rules, it will treat the enclosing "tag" as part of the signal. The reader recognizes nested "tag"s correctly. The default encoding for this reader is UTF-8. There is no corresponding writer.
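
The general idea of scanning for tag-like patterns can be sketched with a regular expression. This is not MAT's implementation; the patterns, the function name scan_tags, and the rule for what counts as a parseable attribute (quoted values only) are all assumptions made for illustration.

```python
import re

# A tag-like pattern: optional leading slash, a name, zero or more
# quoted attribute-value pairs, and an optional trailing slash.
TAG = re.compile(r"""
    <(?P<close>/?)
    (?P<name>[A-Za-z_][\w.-]*)
    (?P<attrs>(?:\s+[\w.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)
    \s*(?P<empty>/?)>
""", re.VERBOSE)

ATTR = re.compile(r"""([\w.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def scan_tags(text):
    """Return (kind, name, attrs) for each pattern that parses as a tag.

    Anything that does not match is left alone, i.e. stays in the signal.
    """
    tags = []
    for m in TAG.finditer(text):
        # A closing tag with attributes or a trailing slash is not tag-like.
        if m.group("close") and (m.group("attrs") or m.group("empty")):
            continue
        if m.group("close"):
            kind = "close"
        elif m.group("empty"):
            kind = "empty"
        else:
            kind = "open"
        # Unused alternation groups come back as empty strings.
        attrs = {k: dq or sq for k, dq, sq in ATTR.findall(m.group("attrs"))}
        tags.append((kind, m.group("name"), attrs))
    return tags

text = ('The <ORGANIZATION type="corp">Smith & Jones</ORGANIZATION> '
        'said x < y <br/> <bad attr=oops>.')
found = scan_tags(text)
```

In this sketch, the bare "&", the "<" in "x < y", and the "<bad attr=oops>" pattern with its unquoted value are all left in the signal, while the ORGANIZATION pair and the zero-length br tag are recognized.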

Defining your own reader/writer

If you have an idiosyncratic document format you want to use, it's not too difficult to define your own reader/writer.