Readers and writers

All the MAT tools can be configured to use one of an extensible set of readers and writers. Currently, there are three reader/writer types: raw, mat-json, and xml-inline. There is also a reader-only type, fake-xml-inline. These types can be passed to tools like MATEngine.

raw

For reading, a file of this type is treated as all signal. For writing, the signal is extracted from the relevant annotated document. This reader/writer has no additional options. The default encoding for this reader/writer is ASCII.

mat-json

This type designates the MAT-specific JSON document format. This reader/writer has no additional options. The only available encoding is UTF-8.

xml-inline

This type designates XML inline data where some of the elements correspond to annotations. Under normal circumstances, all markup which doesn't correspond to a known annotation will be discarded; use the --xml_input_is_overlay flag to preserve it. If there are no XML elements which correspond to known annotations, consider using the raw reader instead. The default encoding is UTF-8.

Note: the input here must really be XML. If the input simply has SGML-like inline markup layered on top of a raw document, you probably want the fake-xml-inline reader.
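The default reading behavior described above (keep elements that match known annotations, discard all other markup) can be sketched in Python as follows. This is only an illustration built on the standard library, not MAT's actual reader; the function name xml_inline_to_standoff and the (label, start, end, attributes) tuple layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def xml_inline_to_standoff(xml_text, known_labels):
    """Return (signal, annotations) for a well-formed XML string.

    Hypothetical helper: annotations is a list of
    (label, start, end, attrs) tuples whose offsets index into signal.
    Elements whose tag is not in known_labels contribute their text to
    the signal, but the markup itself is discarded.
    """
    root = ET.fromstring(xml_text)
    pieces = []       # text chunks that make up the signal
    annotations = []
    offset = 0        # length of the signal accumulated so far

    def walk(elem):
        nonlocal offset
        start = offset
        if elem.text:
            pieces.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            # Text following a child belongs to the parent element.
            if child.tail:
                pieces.append(child.tail)
                offset += len(child.tail)
        if elem.tag in known_labels:
            annotations.append((elem.tag, start, offset, dict(elem.attrib)))

    walk(root)
    # Order by start offset, outermost annotation first.
    annotations.sort(key=lambda a: (a[1], -a[2]))
    return "".join(pieces), annotations
```

Because the sketch relies on a real XML parser, it fails on input that is not well-formed XML, which is the situation the note above warns about.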

This type accepts the following options:

Command line option: --xml_input_is_overlay
XML attribute: xml_input_is_overlay
Value: "yes" (XML)
Description: Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the input XML will be treated as a mix of task-relevant annotations and underlying XML, and the extracted signal will be a well-formed XML file.

Command line option: --signal_is_xml
XML attribute: signal_is_xml
Value: "yes" (XML)
Description: Writer flag. Normally, the XML writer assumes that the underlying signal is not XML. If this flag is present, the underlying signal will be treated as a well-formed XML file when the output file is rendered. If the input file type is also 'xml-inline', use the --xml_input_is_overlay flag to control this setting instead.

Command line option: --xml_output_tag_exclusions <tag,tag,...>
XML attribute: xml_output_tag_exclusions
Description: A comma-delimited list of annotation labels to exclude from the XML output. Writer flag.

Command line option: --xml_output_exclude_metadata
XML attribute: xml_output_exclude_metadata
Value: "yes" (XML)
Description: Writer flag. Normally, the XML writer saves the document metadata inside an XML comment, so it can be read back in by the XML reader. This flag causes the metadata not to be written.

Every attempt is made to make XML read/write lossless with respect to the underlying document; for instance, by default the document metadata is safely encoded and written into a distinguished comment during write, so that the reader can reinstantiate the metadata. However, this is not always possible: MAT documents use standoff annotations, and any annotations whose spans cross will generate malformed XML (e.g., <a>text<b>text</a>text</b>). You can use the --xml_output_tag_exclusions option to discard the offending annotation types.
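To see which annotation types would need to be excluded, it can help to check the standoff spans for crossings before writing. The following Python sketch shows one way to do that; it is not part of MAT, and the function name find_crossing_pairs and the (label, start, end) tuple layout are assumptions made for the example.

```python
def find_crossing_pairs(annotations):
    """Return the pairs of annotations whose spans cross.

    Hypothetical helper: annotations is a list of (label, start, end)
    tuples with character offsets. Two spans cross when they overlap
    but neither contains the other, which cannot be rendered as
    well-formed inline XML.
    """
    crossings = []
    # Sort by start offset; for equal starts, put the longer span first.
    ordered = sorted(annotations, key=lambda a: (a[1], -a[2]))
    for i, (label_a, start_a, end_a) in enumerate(ordered):
        for label_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                # Sorted by start, so no later span can overlap this one.
                break
            if start_a < start_b < end_a < end_b:
                crossings.append(
                    ((label_a, start_a, end_a), (label_b, start_b, end_b))
                )
    return crossings
```

For the malformed example above, find_crossing_pairs([("a", 0, 8), ("b", 4, 12)]) reports the single pair (a, b); the label of either member of each reported pair is a candidate for --xml_output_tag_exclusions.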

xml-inline in the MAT UI

The xml-inline reader/writer is available as an option in the MAT UI when you load and save documents in file mode.

When you select "xml-inline" as your load option in the MAT UI, the "Load document" dialog looks like this:

[Load dialog]

The checkbox corresponds to the --xml_input_is_overlay option above.

When you select "xml-inline" from the "Save" menu in your document window, you'll see the following popup:

[Save popup]

The "Underlying signal is XML" checkbox corresponds to the --signal_is_xml option; the "Annotation types to exclude" type-in field corresponds to the --xml_output_tag_exclusions option; and the "Exclude commented-out MAT metadata" checkbox corresponds to the --xml_output_exclude_metadata option.

fake-xml-inline

We commonly encounter data which is XML-like: a raw document with inline SGML-style markup, like so:

The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO.

In these documents, the XML-significant characters "&", "<" and ">" are not properly escaped, and there is no top-level XML tag surrounding the entire document. The fake-xml-inline reader will search for patterns of the form <...>...</...>. It translates attribute-value pairs of the SGML-ish opening tags into annotation attribute-value pairs. The default encoding for this reader is UTF-8. There is no corresponding writer.
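The pattern search described above can be approximated with a regular expression. The Python sketch below is an illustration only, not the reader's actual implementation: the function name read_fake_xml_inline and the (label, start, end, attributes) tuple layout are assumptions, and it handles only non-nested tags with double-quoted attribute values.

```python
import re

# <LABEL attr="value" ...>content</LABEL>, with the same label on both sides.
_TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>(.*?)</\1>', re.DOTALL)
_ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_fake_xml_inline(text):
    """Return (signal, annotations) for SGML-ish inline markup.

    Hypothetical helper: annotations is a list of
    (label, start, end, attrs) tuples whose offsets index into the
    signal, which is the input text with the matched tags removed.
    Unescaped characters such as "&" are left untouched.
    """
    pieces = []
    annotations = []
    offset = 0   # length of the signal built so far
    pos = 0      # current position in the input text
    for match in _TAG_RE.finditer(text):
        before = text[pos:match.start()]
        pieces.append(before)
        offset += len(before)
        label, raw_attrs, content = match.groups()
        attrs = dict(_ATTR_RE.findall(raw_attrs))
        annotations.append((label, offset, offset + len(content), attrs))
        pieces.append(content)
        offset += len(content)
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces), annotations
```

Applied to the example sentence above, this yields the signal "The Smith & Jones Corporation has announced its IPO." and one ORGANIZATION annotation covering "Smith & Jones Corporation". No XML parser is involved, so the bare "&" causes no error.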

Defining your own reader/writer

If you have an idiosyncratic document format you want to use, it's not too difficult to define your own reader/writer.