All the MAT tools are flexibly configured to use one of an
extensible set of readers and writers. Currently, there are three
reader/writer types: raw, mat-json, and xml-inline. There is also
a fake-xml-inline reader. These types can be passed to tools like
MATEngine. You may also find that
your task has defined additional readers and writers; consult your
task maintainer for details about these.
For reading, a file of the raw type is treated as all signal. For
writing, the signal is extracted from the relevant annotated
document. This reader/writer has no additional options. The
default encoding for this reader/writer is ASCII.
It is very important that you know what the encoding of
your raw document is, and not just for MAT; any tool that reads
raw text documents needs to know. If you're not sure, ask the
person who provided the documents to you.
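To illustrate why the encoding matters, here is a minimal Python sketch. It does not use MAT at all; the helper name read_raw and the sample text are assumptions made for the illustration. It shows that a wrong strict encoding fails loudly, while a wrong permissive encoding silently changes the character count, which would shift any character offsets computed against the signal.

```python
import os
import tempfile

def read_raw(path, encoding="ascii"):
    """Read a raw text document with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()

# Write a small UTF-8 document containing one non-ASCII character.
text = "Caf\u00e9 Smith announced its IPO."
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

# Decoding as ASCII fails loudly on the non-ASCII byte sequence.
try:
    read_raw(path)
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the right encoding recovers the original text.
utf8_text = read_raw(path, encoding="utf-8")

# Decoding as Latin-1 "succeeds" but yields one extra character,
# so every offset after that point is off by one.
latin1_text = read_raw(path, encoding="latin-1")

os.remove(path)
```

The Latin-1 case is the dangerous one: nothing raises an error, but the text no longer matches what the document's author intended.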
The mat-json type designates the MAT-specific
JSON document format (the current version is 2). This
reader/writer has no additional options. The only available
encoding is UTF-8.
This type designates version
1 of the MAT-specific JSON document format. This type is
available only as a writer (since mat-json reads both version 1
and version 2). It has no additional options. The only available
encoding is UTF-8.
The xml-inline type designates XML inline data where some of the
elements correspond to annotations. Under normal circumstances, all
markup that doesn't correspond to a known annotation will be
discarded; see the --xml_input_is_overlay flag to change that. If
there are no XML elements that correspond to known annotations,
consider using the raw reader instead. The default encoding is UTF-8.
Note: the input here must really be XML. If the input simply has SGML-like inline markup layered on top of a raw document, you probably want the fake-xml-inline reader.
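One quick way to tell which reader applies is to check whether the input parses as XML. The sketch below uses only Python's standard xml.etree.ElementTree module and is not part of MAT; the function name is_well_formed_xml and both sample strings are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# SGML-like markup over raw text: no toplevel element, unescaped "&".
sgml_like = ("The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> "
             "has announced its IPO.")

# The same content as real XML: one toplevel element, "&" escaped.
real_xml = ("<doc>The <ORGANIZATION>Smith &amp; Jones Corporation"
            "</ORGANIZATION> has announced its IPO.</doc>")

print(is_well_formed_xml(sgml_like))
print(is_well_formed_xml(real_xml))
```

If the check fails on your data, the fake-xml-inline reader is probably the better choice.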
The xml-inline type accepts the following options:
| Command line option | XML attribute | Value | Description |
|---|---|---|---|
| --xml_input_is_overlay | xml_input_is_overlay | "yes" (XML) | Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the input XML will be treated as a mix of task-relevant annotations and underlying XML, and the extracted signal will be a well-formed XML file. Ignored if --xml_translate_all is specified. |
| --xml_translate_all | xml_translate_all | "yes" (XML) | Reader flag. Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the task (if provided) will be ignored, and all elements will be converted to annotations. If no task is provided (MATScore, MATReport and MATTransducer can all be used without tasks), the reader will set this flag internally. |
| --signal_is_xml | signal_is_xml | "yes" (XML) | Writer flag. Normally, the XML writer assumes that the underlying signal is not XML. If this flag is present, the underlying signal will be treated as a well-formed XML file when the output file is rendered. If the input file type is also 'xml-inline', use the --xml_input_is_overlay flag to control this setting instead. |
| --xml_output_tag_exclusions <tag,tag,...> | xml_output_tag_exclusions | A comma-delimited list of annotation labels to exclude from the XML output. | Writer flag. |
| --xml_output_exclude_metadata | xml_output_exclude_metadata | "yes" (XML) | Writer flag. Normally, the XML writer saves the document metadata inside an XML comment, so it can be read back in by the XML reader, and also renders the annotation and attribute type information as zero-length XML tags. This flag causes this metadata not to be written. |
Every attempt is made to make XML read/write lossless with
respect to the underlying document. However, this is not always
possible, because MAT documents use standoff annotations, and any
crossing dependencies will end up generating malformed XML
(e.g., <a>text<b>text</a>text</b>). You
can use the --xml_output_tag_exclusions option to discard the
offending annotation types.
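To see which annotation types would need to be excluded, you can check standoff spans for crossings before writing. The following Python sketch is not MAT code; it assumes annotations are available as simple (label, start, end) tuples with an exclusive end offset, and find_crossing_pairs is a hypothetical helper name.

```python
def find_crossing_pairs(annotations):
    """Return pairs of (label, start, end) spans that cross each other.

    Two spans cross when they overlap but neither contains the other,
    which is exactly the case inline XML cannot represent.
    """
    crossings = []
    for i, (la, sa, ea) in enumerate(annotations):
        for lb, sb, eb in annotations[i + 1:]:
            if sa < sb < ea < eb or sb < sa < eb < ea:
                crossings.append(((la, sa, ea), (lb, sb, eb)))
    return crossings

# The malformed example <a>text<b>text</a>text</b> over the signal
# "texttexttext": a covers [0, 8), b covers [4, 12).
crossing = find_crossing_pairs([("a", 0, 8), ("b", 4, 12)])

# Properly nested spans do not cross: b sits entirely inside a.
nested = find_crossing_pairs([("a", 0, 12), ("b", 4, 8)])

# Labels involved in any crossing are candidates for exclusion.
labels = sorted({span[0] for pair in crossing for span in pair})
print(labels)
```

Excluding either label of a crossing pair is enough to remove that particular crossing, so you rarely need to exclude every label reported.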
When used as a writer, xml-inline will dump the annotation and
attribute type information (unless --xml_output_exclude_metadata
is used). This type information enables all attribute types to be
read correctly when xml-inline is used as a reader, whether or not
the same annotation task is used. This includes set and list types
and annotation-valued attributes. When xml-inline is used as a
reader, it looks for the appropriate representation of these
types, and if you've provided a task, you can interpret these
values correctly even if the document was not produced with the
MAT xml-inline writer. We document these values here for
completeness; you're welcome to try writing such a document with
another tool and seeing if MAT can read it, but we don't
guarantee that it will work.
The xml-inline reader/writer is available as an option in the MAT
UI when you load and save documents in file mode.
When you select "xml-inline" as your load option in the MAT UI,
the "Load document" dialog shows a checkbox that corresponds to
the --xml_input_is_overlay option above.
When you select "xml-inline" from the "Save" menu in your
document window, you'll see a popup with three controls. The
"Underlying signal is XML" checkbox corresponds to the
--signal_is_xml option; the "Annotation types to exclude" type-in
window corresponds to the --xml_output_tag_exclusions option; and
the "Exclude commented-out MAT metadata" checkbox corresponds to
the --xml_output_exclude_metadata option.
We commonly encounter data which is XML-"like", which simply has
inline SGML-ish markup in a raw document, like so:
The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO.
In these documents, XML-significant characters "&<>"
are not properly escaped, and there is no toplevel XML tag
surrounding the entire document. The fake-xml-inline reader
searches for patterns of the form <...> and figures out
whether the "tag" is an opening, closing, or zero-length tag. It
translates attribute-value pairs of the SGML-ish opening tags into
annotation attribute-value pairs. If it finds an attribute-value
string which can't be parsed using XML-ish rules, it will treat
the enclosing "tag" as part of the signal. The reader recognizes
nested "tag"s correctly. The default encoding for this reader is
UTF-8. There is no corresponding writer.
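The general approach described above can be sketched in a few lines of Python. This is an illustration only, not MAT's implementation: the regular expressions, the scan_tags function, and the tuple layout are all assumptions, and the sketch accepts only double-quoted attribute values. It classifies each <...> pattern as open, close, or empty, collects attribute-value pairs, and leaves anything it can't parse in the signal.

```python
import re

# A tag-like pattern: optional "/", a label, quoted attributes, optional "/".
TAG_RE = re.compile(
    r'<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+\s*=\s*"[^"]*")*)\s*(/?)>')
ATTR_RE = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')

def scan_tags(text):
    """Return (signal, tags); each tag is (kind, label, attrs, signal offset)."""
    signal_parts, tags = [], []
    pos = 0      # current position in the input text
    offset = 0   # current position in the extracted signal
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()]
        signal_parts.append(chunk)
        offset += len(chunk)
        pos = m.end()
        closing, label, attr_str, empty = m.groups()
        if closing and (attr_str or empty):
            # Not a plausible closing tag: keep it as part of the signal.
            signal_parts.append(m.group(0))
            offset += len(m.group(0))
            continue
        kind = "close" if closing else "empty" if empty else "open"
        tags.append((kind, label, dict(ATTR_RE.findall(attr_str)), offset))
    signal_parts.append(text[pos:])
    return "".join(signal_parts), tags

sample = ("The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> "
          "has announced its IPO.")
signal, tags = scan_tags(sample)

# An unquoted attribute value can't be parsed, so <d e=f> stays in the signal.
signal2, tags2 = scan_tags('a <b x="1"/> c <d e=f> g')
```

Recording each tag's offset into the extracted signal, rather than into the input, is what lets matching open and close tags be turned into standoff spans afterwards.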
If you have an idiosyncratic document format you want to use,
it's not too difficult to define your own reader/writer.