All the MAT tools are flexibly configured to use one of an
extensible set of readers and writers. Currently, there are three
reader/writer types: raw, mat-json, and xml-inline. There is also a
fake-xml-inline reader. These types can be
passed to tools like MATEngine.
With the raw type, a file being read is treated as all signal. When
writing, the signal is extracted from the relevant annotated document.
The raw reader/writer has no additional options. Its default encoding
is ASCII.
The mat-json type designates the MAT-specific JSON document format.
This reader/writer has no additional options. The only available
encoding is UTF-8.
The xml-inline type designates XML inline data where some of the elements
correspond to annotations. Under normal circumstances, all markup which
doesn't correspond to a known annotation will be discarded; see the
--xml_input_is_overlay flag to remedy that. If there are no XML
elements which correspond to known annotations, consider using the raw
reader instead. The default encoding is UTF-8.
Note: the input here must really be XML. If the input simply
has SGML-like inline markup layered on top of a raw document, you
probably want the fake-xml-inline reader.
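As a rough illustration of the distinction in the note above, the following sketch checks whether a document is well-formed XML before choosing a reader. It is not part of MAT: the function name `suggest_reader` is hypothetical, and it relies only on Python's standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET


def suggest_reader(text):
    """Suggest 'xml-inline' if text parses as XML, else 'fake-xml-inline'.

    Hypothetical helper for illustration only; MAT does not provide it.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return "fake-xml-inline"
    return "xml-inline"


# Well-formed: a single top-level element, with '&' escaped as '&amp;'.
well_formed = (
    "<doc>The <ORGANIZATION>Smith &amp; Jones Corporation</ORGANIZATION> "
    "has announced its IPO.</doc>"
)
# SGML-like: no top-level element, and a bare '&'.
sgml_like = (
    "The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> "
    "has announced its IPO."
)

print(suggest_reader(well_formed))
print(suggest_reader(sgml_like))
```

A document with no markup at all also fails to parse as XML, so a check like this cannot tell you when the raw reader is the better choice.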
The xml-inline type accepts the following options:
Command line option | XML attribute | Value | Description |
---|---|---|---|
--xml_input_is_overlay | xml_input_is_overlay | "yes" (XML) | Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the input XML will be treated as a mix of task-relevant annotations and underlying XML, and the extracted signal will be a well-formed XML file. |
--signal_is_xml | signal_is_xml | "yes" (XML) | Writer flag. Normally, the XML writer assumes that the underlying signal is not XML. If this flag is present, the underlying signal will be treated as a well-formed XML file when the output file is rendered. If the input file type is also 'xml-inline', use the --xml_input_is_overlay flag to control this setting instead. |
--xml_output_tag_exclusions <tag,tag,...> | xml_output_tag_exclusions | A comma-delimited list of annotation labels to exclude from the XML output. | Writer flag. |
--xml_output_exclude_metadata | xml_output_exclude_metadata | "yes" (XML) | Writer flag. Normally, the XML writer saves the document metadata inside an XML comment, so it can be read back in by the XML reader. This flag causes the metadata not to be written. |
Every attempt is made to make XML read/write lossless with respect
to the underlying document; for instance, by default the document
metadata is
safely encoded and dumped into a distinguished comment during write, so
that read can reinstantiate the metadata. However, this is not always
possible, because MAT documents use standoff annotations, and any
crossing dependencies will end up generating malformed XML (e.g.,
<a>text<b>text</a>text</b>). You can use the
--xml_output_tag_exclusions option to discard the offending annotation
types.
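To decide which annotation types to list in --xml_output_tag_exclusions, it can help to find the crossing spans first. The sketch below is a minimal illustration, not MAT code: it assumes annotations are available as hypothetical `(label, start, end)` tuples with an exclusive end offset, which is not MAT's internal representation.

```python
def find_crossing_pairs(annotations):
    """Return pairs of annotations whose spans partially overlap.

    Each annotation is a hypothetical (label, start, end) tuple with an
    exclusive end offset. Nested spans and spans that share a start
    offset are not reported, since they can be rendered as
    well-formed XML.
    """
    crossings = []
    # Sort by start offset, longest span first when starts are equal.
    ordered = sorted(annotations, key=lambda a: (a[1], -a[2]))
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[1] >= first[2]:
                break  # later spans start at or after first's end
            # second starts strictly inside first and ends outside it
            if second[1] > first[1] and second[2] > first[2]:
                crossings.append((first, second))
    return crossings


# Standoff version of <a>text<b>text</a>text</b> over "texttexttext".
spans = [("a", 0, 8), ("b", 4, 12), ("c", 0, 4)]
for first, second in find_crossing_pairs(spans):
    print(first[0], "crosses", second[0])
```

Excluding either label of each reported pair removes that crossing. Which one to drop depends on the annotations you need in the output.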
The xml-inline reader/writer is available as an option in the MAT UI
when you load and save documents in file mode.
When you select "xml-inline" as your load option in the MAT UI, the
"Load document" dialog displays a checkbox that corresponds
to the --xml_input_is_overlay option above.
When you select "xml-inline" from the "Save" menu in your
document window, a popup offers the following controls.
The "Underlying signal is XML" checkbox corresponds to the
--signal_is_xml option; the "Annotation types to exclude" typein window
corresponds to the --xml_output_tag_exclusions option; and the "Exclude
commented-out MAT metadata" checkbox corresponds to the
--xml_output_exclude_metadata option.
We commonly encounter data which is XML-"like", which simply has
inline SGML-ish markup in a raw document, like so:
The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO.
In these documents, XML-significant characters "&<>" are
not properly escaped, and there is no toplevel XML tag surrounding the
entire document. The fake-xml-inline reader will search for patterns of
the form <...>...</...>. It translates attribute-value
pairs of the SGML-ish opening tags into annotation attribute-value
pairs. The default encoding for this reader is UTF-8. There is no
corresponding writer.
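The pattern search described above can be approximated as follows. This is a simplified sketch, not the actual fake-xml-inline implementation: it handles only non-nested elements with double-quoted attribute values, the names `ELEMENT`, `ATTRIBUTE`, and `extract` are hypothetical, and so is the `type` attribute in the sample input.

```python
import re

# <TAG attr="value" ...>content</TAG>; the backreference \1 requires the
# closing tag name to match the opening one.
ELEMENT = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>(.*?)</\1>', re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def extract(text):
    """Return (signal, annotations) for non-nested SGML-ish markup.

    Offsets in the annotations refer to the signal, i.e. the text with
    the tags removed. Characters such as '&' are left untouched.
    """
    parts = []
    annotations = []
    signal_len = 0
    pos = 0
    for match in ELEMENT.finditer(text):
        before = text[pos:match.start()]
        content = match.group(3)
        start = signal_len + len(before)
        annotations.append({
            "label": match.group(1),
            "start": start,
            "end": start + len(content),
            # Tag attribute-value pairs become annotation attributes.
            "attrs": dict(ATTRIBUTE.findall(match.group(2))),
        })
        parts.extend([before, content])
        signal_len = start + len(content)
        pos = match.end()
    parts.append(text[pos:])
    return "".join(parts), annotations


sample = ('The <ORGANIZATION type="company">Smith & Jones Corporation'
          '</ORGANIZATION> has announced its IPO.')
signal, annotations = extract(sample)
print(signal)
print(annotations)
```

Because the unescaped "&" is never parsed as XML, it passes through to the signal unchanged, which is the main reason to prefer this kind of pattern matching over a real XML parser for such data.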
If you have an idiosyncratic document format you want to use, it's
not too difficult to define your own reader/writer.