The XML format for the experiment files (see MATExperimentEngine) is described
in this document. Use cases are described here.
<experiment>
<binding>
<corpora>
<partition>
<size>
<corpus>
<pattern>
<prep>
<model_sets>
<build_settings>
<model_set>
<training_corpus>
<runs>
<run_settings>
<prep_args>
<args>
<run>
<test_corpus>
The toplevel element in the file. Note that the three child elements
are not obligatory; the experiment XML can be used simply to build
corpora, or to build models, without performing any experimental runs,
if, for instance, you want to build a model or corpus to be used in
multiple experiments.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run. |
task | a string | yes | The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<binding> | no | yes | Bindings to be made globally available in the other elements. |
<corpora> | no | yes | The corpora to be used in the experiment. |
<model_sets> | no | yes | The model sets to be used in the experiment. |
<runs> | no | yes | The experimental runs to be used in the experiment. |
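As an illustration of the point that <model_sets> and <runs> may be omitted, the following is a minimal sketch of an experiment file that only builds a corpus. The task name, directory, corpus name and file pattern are hypothetical placeholders, not values taken from any real task.

```xml
<!-- Builds a corpus only: no <model_sets> and no <runs>. -->
<!-- The task name and all paths below are hypothetical placeholders. -->
<experiment task="Named Entity" dir="/tmp/ne_corpus_build">
  <corpora>
    <corpus name="news">
      <!-- An absolute glob pattern, so no pattern directory is needed. -->
      <pattern>/data/news/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```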
This element allows the user to define global bindings which can be
referred to in any other element of the experiment XML file (except the
attributes of the <experiment> element itself, and the
<binding> elements). These bindings can be referred to either in
XML attributes or in text within XML elements. The pattern for each
binding is $(...). The experiment directory, whether provided via the
dir attribute of the <experiment> element or on the command line,
is provided as EXP_DIR; the pattern directory, if provided by the
--pattern_dir command line argument to MATExperimentEngine, is provided
as PATTERN_DIR.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file. |
value | a string | yes | The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded. |
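A short sketch of how bindings might be used, both in an attribute value and in element text. The binding name SOURCE, its value, the task name and the corpus name are hypothetical; EXP_DIR is the binding that the engine supplies for the experiment directory.

```xml
<experiment task="Named Entity">
  <!-- SOURCE is a hypothetical user-defined binding. -->
  <binding name="SOURCE" value="/data/shared"/>
  <!-- EXP_DIR is supplied by the engine; here it is used in an attribute. -->
  <corpora dir="$(EXP_DIR)/my_corpora">
    <corpus name="news">
      <!-- A binding used in element text. -->
      <pattern>$(SOURCE)/news/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```

Because replacement is not recursive, a value such as "$(SOURCE)/news" in a second binding would keep the literal text $(SOURCE) rather than expanding it.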
Describes corpora to be used in the experiment. This element may be
repeated; the
intention is that a single <corpora> element will correspond to a
shared set of preprocessing instructions.
The corpora may be local, in which case a set of patterns should be
provided, or remote, in which case the source_corpus_dir attribute
should be provided. Remote corpora are used directly unless one or more
of the processing tags are specified (<partition>, <prep>).
In this case, the specified processing steps are added or redone
locally, on a separate copy of the corpus. For instance, if the remote
corpus is split into test and train, but not preprocessed, and the
<prep> tag is specified here, the corpus documents will be
preprocessed here, and the remote split will be preserved. If the
remote corpus is preprocessed and split, but the local
<partition> tag specifies that the corpus type is "train", the
remote corpus preprocessing will be preserved, but locally the split
will be ignored. If the remote corpus contains enough patterns for 300
documents, but max_size remotely is 100 and max_size locally is 200,
the local max_size will be used; this is possible because all the
documents are preprocessed by default when a corpus is prepared,
regardless of max_size, and the order of documents (after an initial
randomization) is preserved from remote corpus to local copy.
Note that inside the experiment engine, MAT uses the MAT JSON
document format exclusively. Therefore, if you want to provide
documents which are in a different format which MAT also understands
(e.g., XML inline), you must use the <prep> tag to convert the
documents to MAT JSON format.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<partition> | no | yes | The partition settings for this group of corpora. If omitted, the corpus will not have any partitions. |
<size> | no | no | The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
<corpus> | yes | yes | The individual corpora in this group. |
<prep> | no | no | The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
Specifies a partition of the sister corpora specified
with the <corpus> tag. May be repeated. If this tag is missing,
the corpus has no partitions. The partitions segment the entire corpus,
so the fraction values are normalized to shares of the corpus. If you
want just a 10th of the corpus, for instance, you must divide the
corpus into two partitions at a ratio of 9:1 and ignore the larger
slice.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the partition. |
fraction | a float | yes | The share of each corpus that should be allotted to this partition (a float between 0 and 1). |
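The one-tenth case described above could be written as the following sketch. The partition names, corpus name and file pattern are hypothetical; only the "sample" partition would be referenced later, and the "unused" partition would simply be ignored.

```xml
<corpora>
  <!-- Two partitions at a ratio of 9:1; together they cover the whole corpus. -->
  <partition name="unused" fraction="0.9"/>
  <partition name="sample" fraction="0.1"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```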
Specifies the size properties of the sister corpora specified
with the <corpus> tag. If this tag is present, and the sister
corpus has the source_corpus_dir attribute set, the specified values
will override those in the source corpus.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
max_size | an integer | no | The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents are available. |
truncate_document_list | "yes" | no | If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus. |
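A sketch of a size-limited corpus group, assuming a hypothetical corpus name and file pattern. With truncate_document_list set, the limit of 100 documents is imposed first, so a later corpus that points at this one through source_corpus_dir could not raise max_size above 100.

```xml
<corpora>
  <!-- Drop truncate_document_list to apply the limit last instead,
       which leaves the full document list available for reuse. -->
  <size max_size="100" truncate_document_list="yes"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```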
A corpus is specified either by a set of patterns, or by a reference
to another corpus (via source_corpus_dir). The documents specified by a
set of patterns are randomly reordered before any subsequent processing
is performed (e.g., split, preprocess). If source_corpus_dir is
present, patterns are ignored.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides. |
source_corpus_dir | a string | no | If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name]. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<pattern> | yes | yes | A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits. |
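The two ways of specifying a corpus can be combined in one file, as in this sketch. All names and paths are hypothetical. The second corpus omits <pattern> children on the assumption that they are unnecessary when source_corpus_dir is set, since the description above says they are ignored in that case.

```xml
<experiment task="Named Entity">
  <corpora>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <!-- A local corpus, built from a glob pattern. -->
    <corpus name="base">
      <pattern>/data/news/*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <size max_size="50"/>
    <!-- A derived corpus. The relative path is resolved against the
         experiment directory, and "base" is listed first so that it
         already exists when "small" is created. -->
    <corpus name="small" source_corpus_dir="corpora/base"/>
  </corpora>
</experiment>
```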
This element houses the arguments to the MATEngine
command to use to preprocess the corpora. You might use this command to
take documents which have been deidentified and resynthesize fillers
for the deidentified regions.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
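A sketch of a <prep> element that only converts documents to MAT JSON. The input_file_type value "xml-inline" and the workflow name "Demo" are assumptions made for illustration; substitute the reader name and workflow that your MAT installation and task actually define.

```xml
<corpora>
  <!-- Conversion only: input_file_type and workflow are given,
       while steps and undo_through are deliberately omitted. -->
  <prep input_file_type="xml-inline" workflow="Demo"/>
  <corpus name="legacy">
    <pattern>/data/legacy/*.xml</pattern>
  </corpus>
</corpora>
```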
Each experiment also can contain a number of model sets. A model set
is a sequence of models built out of the same corpus, with successively
larger numbers of training inputs. This iterative capability does not
have to be used, but is available if the user wants to track the change
in performance relative to the number of training documents.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<build_settings> | no | no | The instructions for building the model sets in this bundle. |
<model_set> | yes | yes | A model set. |
In order to run an experiment, you must at the very least have
declared your <model_config> in your task.xml file and
specified a value for the "class" attribute. You can override the
<build_settings>
values here. The training engine you're most likely to use is the Carafe engine.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
training_increment | integer | no | If present, the increment to use when constructing successively larger models in this model set. If absent, a single model will be constructed using all the documents in the training set. |
truncate_to_increment | "yes" | no | If present, and training_increment is present, the model set will truncate its file list to match the training_increment. For example, if there are 176 files, and the training increment is 25, the engine will discard the files above 175 for training purposes. Otherwise, the engine would build a model for the first 175 documents, and then another model for all 176. |
config_name | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name. |
<attr> | a string | no | An attribute value which overrides the attribute values for your chosen training engine. |
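The 176-file example above might be configured as in this sketch. The model set, corpus and partition names are hypothetical. Under these settings the model set would presumably contain models trained on 25, 50, and so on up to 175 documents, with the final document discarded.

```xml
<model_sets>
  <!-- Build successively larger models in steps of 25 documents,
       dropping any remainder that does not fill a whole increment. -->
  <build_settings training_increment="25" truncate_to_increment="yes"/>
  <model_set name="incremental">
    <training_corpus corpus="news" partition="train"/>
  </model_set>
</model_sets>
```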
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<training_corpus> | yes | yes | One or more corpora (and possibly partitions of corpora) which should be used to construct this model set. |
May be repeated. Specifies the training corpora to use in building
this model set. Each corpus is referred to by name, and an optional
partition name.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
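Since <training_corpus> is repeatable, one model set can draw on several corpora. In this sketch the corpus names "news" and "blogs" and the partition name "train" are hypothetical and would have to match <corpus> and <partition> elements declared elsewhere in the file.

```xml
<model_sets>
  <model_set name="combined">
    <!-- Only the "train" partition of the first corpus is used. -->
    <training_corpus corpus="news" partition="train"/>
    <!-- No partition attribute: the entire second corpus is used. -->
    <training_corpus corpus="blogs"/>
  </model_set>
</model_sets>
```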
The experiment also can have a set of runs. The runs in each
<runs> element share a set of run settings. Whenever the
experiment is run, each <run> is scored, whether or not it's been
scored before. This is a convenient way of reviewing the scores after
an experiment is finished.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<run_settings> | yes | no | A container for the arguments to run the processing engine with. |
<run> | yes | yes | An experimental run. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<args> | yes | no | The arguments to the MATEngine to use for these experiment runs. |
<prep_args> | no | no | The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that. |
This element houses the arguments to the MATEngine
command to perform the experiment runs.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
This element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The output_file_type attribute must be specified (you're restricted to mat-json and raw). The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
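A sketch of a <run_settings> element that uses both children. The workflow name "Demo" and the step names are hypothetical placeholders for whatever your task defines. Here <prep_args> undoes one step and keeps the test documents in MAT JSON format instead of converting them to raw documents.

```xml
<run_settings>
  <!-- Arguments for the runs themselves; workflow is required. -->
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Arguments for preparing the test documents; output_file_type
       and workflow are required, and only mat-json or raw is allowed. -->
  <prep_args workflow="Demo" output_file_type="mat-json" undo_through="tag"/>
</run_settings>
```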
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted. |
model | a string | yes | The name of a model to use. This string must match the "name" value of some <model_set> element in the experiment file. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<test_corpus> | yes | yes | One or more test corpora (and possibly partitions of corpora) to use in this run. |
May be repeated. One or more test corpora (and possibly partitions
of corpora) to use in this run.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
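The corpus, partition and model names used in <training_corpus>, <run> and <test_corpus> all resolve against names declared elsewhere in the same file. The following end-to-end sketch shows how they might line up. The task name, workflow, step names, paths and all other names are hypothetical, and because no dir attribute is given, the experiment directory would have to be supplied when the experiment is run.

```xml
<experiment task="Named Entity">
  <corpora>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <corpus name="news">
      <pattern>/data/news/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets>
    <model_set name="news_model">
      <!-- Matches the corpus and partition names declared above. -->
      <training_corpus corpus="news" partition="train"/>
    </model_set>
  </model_sets>
  <runs>
    <run_settings>
      <args workflow="Demo" steps="zone,tokenize,tag"/>
    </run_settings>
    <!-- The model attribute matches the name of the <model_set>. -->
    <run name="news_run" model="news_model">
      <test_corpus corpus="news" partition="test"/>
    </run>
  </runs>
</experiment>
```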