This document describes the XML format for the experiment files (see
MATExperimentEngine). Use cases are described separately.
<experiment>
  <binding>
  <corpora>
    <partition>
    <fixed_partition>
    <size>
    <corpus>
      <pattern>
    <prep>
  <model_sets>
    <build_settings>
      <iterator>
    <corpus_settings>
      <iterator>
    <model_set>
      <training_corpus>
  <runs>
    <run_settings>
      <prep_args>
      <args>
      <iterator>
    <run>
      <test_corpus>
The <experiment> element is the toplevel element in the file. Note that
none of its child elements is obligatory; the experiment XML can be used
simply to build corpora, or to build models, without performing any
experimental runs. This is useful if, for instance, you want to build a
model or corpus to be used in multiple experiments.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
dir |
a pathname |
no |
The directory in which the
experiment will be conducted. If the directory does not exist, it will
be created. If not specified, the directory must be provided when the
experiment is run. |
task |
a string |
yes |
The name of a task, as would be
passed to the --task argument of MATEngine. This setting is used to
establish the task for the corpus preparation and for the experiment
runs, and also to establish the set of available tags for the training. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<binding> |
no |
yes |
Bindings to be made globally
available in the other elements. |
<corpora> |
no |
yes |
The corpora to be used in the
experiment. |
<model_sets> |
no |
yes |
The model sets to be used in the
experiment. |
<runs> |
no |
yes |
The experimental runs to be used
in the experiment. |
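As a rough illustration of the overall shape, the following sketch shows an experiment file with one child element of each kind. The task name, directory, binding, file pattern, workflow name, and all corpus, model set, and run names are hypothetical placeholders, not values from a real task.

```xml
<!-- Hypothetical skeleton; "My Task", /tmp/exp and "Demo" are placeholders. -->
<experiment task="My Task" dir="/tmp/exp">
  <!-- Optional: global bindings, referenced elsewhere as $(NAME). -->
  <binding name="DATA" value="/data/my_task"/>
  <!-- Optional: the corpora to build or reuse. -->
  <corpora>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <corpus name="main">
      <pattern>$(DATA)/*.json</pattern>
    </corpus>
  </corpora>
  <!-- Optional: the model sets to build. -->
  <model_sets>
    <model_set name="main_model">
      <training_corpus corpus="main" partition="train"/>
    </model_set>
  </model_sets>
  <!-- Optional: the experimental runs. -->
  <runs>
    <run_settings>
      <args workflow="Demo"/>
    </run_settings>
    <run name="main_run" model="main_model">
      <test_corpus corpus="main" partition="test"/>
    </run>
  </runs>
</experiment>
```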
The <binding> element allows the user to define global bindings which can be
referred to in any other element of the experiment XML file (except the
attributes of the <experiment> element itself, and the
<binding> elements). These bindings can be referred to either in
XML attributes or in text within XML elements. The pattern for each
binding is $(...). The experiment directory, whether provided via the
dir attribute of the <experiment> element or on the command line,
is provided as EXP_DIR; the pattern directory, if provided by the
--pattern_dir command line argument to MATExperimentEngine, is provided
as PATTERN_DIR.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The binding to be replaced. The
engine will look for $(<name>) anywhere in the attribute values
or text in the experiment XML file. |
value |
a string |
yes |
The value to replace
$(<name>) with. This replacement is not recursive; that is, you
should not include any $(<name>) substrings in your value, unless
you want them to be included literally, because they will not be
expanded. |
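A minimal sketch of how bindings might be used, assuming a hypothetical task name and data location. DATA is a user-defined binding invented for this example; EXP_DIR is the binding the engine provides for the experiment directory.

```xml
<experiment task="My Task">
  <!-- DATA is a hypothetical user-defined binding. Its value is used
       literally: any $(...) inside a value would not be expanded. -->
  <binding name="DATA" value="/data/my_task"/>
  <!-- Bindings can appear in attribute values... -->
  <corpora dir="$(EXP_DIR)/my_corpora">
    <corpus name="main">
      <!-- ...and in element text. This becomes /data/my_task/*.json -->
      <pattern>$(DATA)/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```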
The <corpora> element describes corpora to be used in the experiment.
This element may be repeated; the
intention is that a single <corpora> element will correspond to a
shared set of preprocessing instructions.
The corpora may be local, in which case a set of patterns should be
provided, or remote, in which case the source_corpus_dir attribute
should be provided. Remote corpora are used directly unless one or more
of the processing tags are specified (<partition>, <prep>).
In this case, the specified processing steps are added or redone
locally, on a separate copy of the corpus. For instance, if the remote
corpus is split into test and train, but not preprocessed, and the
<prep> tag is specified here, the corpus documents will be
preprocessed here, and the remote split will be preserved. If the
remote corpus is preprocessed and split, but the local
<partition> tag specifies that the corpus type is "train", the
remote corpus preprocessing will be preserved, but locally the split
will be ignored. If the remote corpus contains enough patterns for 300
documents, but max_size remotely is 100 and max_size locally is 200,
the local max_size will be used; this is possible because all the
documents are preprocessed by default when a corpus is prepared,
regardless of max_size, and the order of documents (after an initial
randomization) is preserved from remote corpus to local copy.
Note that inside the experiment engine, MAT uses the MAT JSON
document format exclusively. Therefore, if you want to provide
documents which are in a different format which MAT also understands
(e.g., XML inline), you must use the <prep> tag to convert the
documents to MAT JSON format.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
dir |
a pathname |
no |
The directory where the corpora
are found, or should be built. If the directory does not exist, it will
be created. The default value for this attribute is a subdirectory
named "corpora" in the experiment directory. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<partition> |
no |
yes |
The proportional partition
settings for this
group of corpora. If neither this nor <fixed_partition> is
present, the corpus will not have any partitions. You cannot mix fixed
and proportional partitions within a given <corpora> element. |
<fixed_partition> |
no |
yes |
The fixed partition settings for
this
group of corpora. If neither this nor <partition> is present, the
corpus will not have any partitions. You cannot mix fixed and
proportional partitions within a given <corpora> element. |
<size> |
no |
no |
The size settings for this group
of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
<corpus> |
yes |
yes |
The individual corpora in this
group. |
<prep> |
no |
no |
The arguments to the MATEngine
command to use to preprocess the corpora. For instance, this command
might take documents which have been deidentified and resynthesize fillers
for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
The <partition> element specifies a proportional partition of the sister
corpora specified with the <corpus> tag. It may be repeated. If there are
no instances of this element or <fixed_partition>,
the corpus has no partitions. The proportional partitions segment the
entire corpus,
so the fraction values are normalized to shares of the corpus. If you
want just a 10th of the corpus, for instance, you must divide the
corpus into two partitions at a ratio of 9:1 and ignore the larger
slice.
Each <corpora> element may have either proportional or fixed
partitions, but not both.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The name of the partition. |
fraction |
a float |
yes |
The share of each
corpus that should be allotted to this partition (a float between 0 and
1). |
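A minimal sketch of a proportional split, assuming a hypothetical corpus name and file pattern. The two fractions together cover the whole corpus at a ratio of 9:1.

```xml
<corpora>
  <!-- Shares of each corpus in this group: 90% train, 10% test. -->
  <partition name="train" fraction="0.9"/>
  <partition name="test" fraction="0.1"/>
  <corpus name="main">
    <!-- Hypothetical location of MAT JSON documents. -->
    <pattern>/data/my_task/*.json</pattern>
  </corpus>
</corpora>
```

To work with just a tenth of the corpus, refer only to the smaller partition in the rest of the experiment and leave the larger one unused.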
The <fixed_partition> element specifies a fixed partition of the sister
corpora specified with the <corpus> tag. It may be repeated. If there are
no instances of this element or <partition>,
the corpus has no partitions. Unlike proportional partitions, these
partitions do not segment the entire corpus. If you
want a fixed partition to encompass "everything else", use the special
"remainder" value as described below.
Each <corpora> element may have either proportional or fixed
partitions, but not both.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The name of the partition. |
size |
an integer or the string
"remainder" |
yes |
The portion of each
corpus that should be allotted to this partition (either an integer
number of documents or the string "remainder"). The "remainder" value
can only be used once in a <corpora> element. |
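A minimal sketch of fixed partitions, assuming a hypothetical corpus name and file pattern. One partition takes a fixed number of documents and the other takes everything else.

```xml
<corpora>
  <!-- Exactly 100 documents of each corpus go to "test"... -->
  <fixed_partition name="test" size="100"/>
  <!-- ...and all remaining documents go to "train". The "remainder"
       value may appear only once per corpora element. -->
  <fixed_partition name="train" size="remainder"/>
  <corpus name="main">
    <!-- Hypothetical location of MAT JSON documents. -->
    <pattern>/data/my_task/*.json</pattern>
  </corpus>
</corpora>
```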
The <size> element specifies the size properties of the sister corpora specified
with the <corpus> tag. If this tag is present, and the sister
corpus has the source_corpus_dir attribute set, the specified values
will override those in the source corpus.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
max_size |
an integer |
no |
The maximum number of
documents
in each corpus.
If specified, each corpus will not exceed this number. This limit is
applied last, so the corpus can be reused with a greater max_size
specified if the requisite number of documents are available. |
truncate_document_list |
"yes" |
no |
If present and max_size is also
present, the max_size limit
will be imposed first, rather than last. The consequence of this is
that no more than max_size documents will be available to remote
accesses of this corpus. |
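The two size behaviors described above might be written as follows. This is a sketch only; the task name, corpus names, and file pattern are hypothetical.

```xml
<experiment task="My Task">
  <!-- Default behavior: max_size is applied last, so this corpus can be
       reused later with a larger max_size if enough documents exist. -->
  <corpora>
    <size max_size="100"/>
    <corpus name="reusable">
      <pattern>/data/my_task/*.json</pattern>
    </corpus>
  </corpora>
  <!-- max_size is imposed first: remote accesses of this corpus will
       have no more than 100 documents available. -->
  <corpora>
    <size max_size="100" truncate_document_list="yes"/>
    <corpus name="capped">
      <pattern>/data/my_task/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```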
A <corpus> element specifies a corpus either by a set of patterns, or by a reference
to another corpus (via source_corpus_dir). The documents specified by a
set of patterns are randomly reordered before any subsequent processing
is performed (e.g., split, preprocess). If source_corpus_dir is
present, patterns are ignored.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The name of the corpus, for
subsequent reference in the remainder of the experiment. It is also
used as the name of the subdirectory in which this corpus is built, if
the corpus is either local (i.e., it has patterns), or remote with
processing overrides. |
source_corpus_dir |
a string |
no |
If present, a pathname of
an existing corpus directory. If the path is not an absolute path, the
experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name]. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<pattern> |
yes |
yes |
A glob-style pattern of files to
use to construct this corpus. "Glob" style is the UNIX shell file
pattern matching; e.g., "*" matches everything. (This is in contrast to
standard regular expressions.) If this path pattern isn't an absolute
path, the --pattern_dir option of MATExperimentEngine
must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits. |
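The sketch below shows both ways of specifying a corpus: one built from absolute glob patterns, and one that reuses the first through source_corpus_dir and adds a local split. The task name, corpus names, and paths are hypothetical. The source path assumes the first <corpora> element keeps its default directory ("corpora" in the experiment directory).

```xml
<experiment task="My Task">
  <corpora>
    <!-- A local corpus built from two absolute glob patterns. -->
    <corpus name="base">
      <pattern>/data/my_task/batch1/*.json</pattern>
      <pattern>/data/my_task/batch2/*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <!-- Reuses "base", which is listed (and so built) earlier in this
         file; no patterns are needed here. -->
    <corpus name="base_split"
            source_corpus_dir="$(EXP_DIR)/corpora/base"/>
  </corpora>
</experiment>
```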
The <prep> element houses the arguments to the MATEngine
command to use to preprocess the corpora. You might use this command to
take documents which have been deidentified and resynthesize fillers
for the deidentified regions.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
<attr> |
a string |
no |
An attribute-value pair which
corresponds to a command-line option to MATEngine. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
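A minimal sketch of a format-converting <prep>, with no steps or undo_through values. The file type name "xml-inline" and the workflow name "Demo" are assumptions made for this example; check the file types and workflows your MAT installation and task actually define.

```xml
<corpora>
  <!-- Assumed names: "xml-inline" (input file type) and "Demo"
       (workflow). output_file_type, input_dir, output_dir and task
       are supplied by the experiment engine and are omitted here. -->
  <prep input_file_type="xml-inline" workflow="Demo"/>
  <corpus name="converted">
    <!-- Hypothetical location of inline XML documents. -->
    <pattern>/data/my_task/*.xml</pattern>
  </corpus>
</corpora>
```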
Each experiment can also contain a number of model sets, declared inside
<model_sets> elements. A model set
is a sequence of models built out of the same corpus, with successively
larger numbers of training inputs. This iterative capability does not
have to be used, but is available if the user wants to track the change
in performance relative to the number of training documents.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
dir |
a pathname |
no |
If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<build_settings> |
no |
no |
The instructions for building
the model sets in this bundle. |
<model_set> |
yes |
yes |
A model set. |
In order to run an experiment, you must either (a) have
declared your <model_config> in your task.xml file and
specified a value for the "class" attribute there, in which case you
can override its settings in <build_settings> here, or (b) specify a
model_class directly in <build_settings>, in which case
the only settings will be the ones you specify explicitly here. The
training engine you're most likely to use is the Carafe engine.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
config_name |
a string |
no |
By default, the settings here
will override the attribute values for the default model build settings
in task.xml. If this attribute is present, the experiment engine will
look for the model build settings with the specified config_name.
Either this attribute or model_class can be provided, but not both. |
model_class |
a string |
no |
By default, the settings here
will override the attribute values for the default model build settings
in task.xml. If instead you have your own model builder class (this is
highly unlikely, but possible; such a class would be a subclass of
MAT.ModelBuilder.ModelBuilder, and you'd have to work out how to
customize it), name it here, and the task will not be consulted at
all. Either this attribute or config_name can be provided, but not both. |
<attr> |
a string |
no |
An attribute-value pair which
overrides the attribute values for your chosen training engine (if
config_name is provided) or specifies the attribute values (if
model_class is provided). |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<iterator> |
no |
yes |
An iterator which can be applied
to create a sequence of models for these model sets. |
The <iterator> element inside <build_settings> makes it possible to
iterate through a set of values for the model builder. For
instance, the default Carafe engine
allows you to customize the degree of L1 regularization (see the Carafe
documentation for details). You might want to build a series of models
exploring the effects of L1 regularization values from 0.0 through 2.0,
at increments of .2. Or, you might want to vary the number of training
iterations the engine performs from 10 to 150 at increments of 10.
You can specify multiple iterators, and you'll get the cross-product
of the settings. The iterator mechanism is flexible enough that you can
build iterators which depend on the last model built, or iterators which
specify their own model builder class (if, for example, they need to do
some extensive computation on the training corpus before they train).
Corpus setting iterators and build setting iterators are both
applied
to the model, and the cross-product of the possible values is used. The
corpus setting iterators are applied first.
The build settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
type |
a string |
yes |
Either one of the two predefined
iterator types "value" or "increment", or the name of an iterator class
defined in your task's Python library. |
<attr> |
a string |
no |
An attribute-value pair which
configures the given iterator. The available attributes and values for
the two predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
attribute |
a string |
yes |
The name of the training engine
attribute you're iterating over. This attribute would be one that could
be set in the <build_settings> element above or in the
<model_config> element in task.xml. |
values |
a string |
yes |
A comma-delimited set of values
to iterate over. |
value_type |
one of "float", "str", "int" |
no |
The type of the values, either
strings, integers, or floats. Default is "str" (string). |
Here are the attributes and values available for the "increment"
iterator:
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
attribute |
a string |
yes |
The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
start_val |
an integer or float |
yes |
The initial value for the
incrementer. |
end_val |
an integer or float |
yes |
The final value for the
incrementer. |
increment |
an integer or float |
yes |
The increment to add to the value on
each iteration. |
force_last |
"yes" |
no |
If present, force the last value
to be processed, even if it's not exactly an increment. For instance,
if you're incrementing model iterations from 20 to 150 by increments of
20, the last value processed will be 140, unless you provide this
setting. This setting is also useful with float values, due to the way
programming languages like Python deal with floats; asking to increment
from .1 to .5 by .1 may or may not give you exactly .5 as the final
value, so you might want to use this setting to force whatever value it
is (e.g., .500000000001) to be processed. |
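A sketch of the two predefined build setting iterators used together. The training engine attribute names "max_iterations" and "feature_set", like the corpus and model set names, are hypothetical; substitute attributes that your training engine actually accepts. The sketch also assumes the task declares a default <model_config> in task.xml.

```xml
<model_sets>
  <build_settings>
    <!-- Hypothetical attribute "max_iterations": 20, 40, ... 140, and
         then 150 as well because force_last is set. -->
    <iterator type="increment" attribute="max_iterations"
              start_val="20" end_val="150" increment="20"
              force_last="yes"/>
    <!-- Hypothetical attribute "feature_set" with two string values. -->
    <iterator type="value" attribute="feature_set"
              values="basic,extended" value_type="str"/>
  </build_settings>
  <model_set name="sweep">
    <training_corpus corpus="main" partition="train"/>
  </model_set>
</model_sets>
```

Because multiple iterators yield the cross-product of their settings, this configuration would be expected to build one model for each pairing of an iteration count with a feature set.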
In addition to the model builder itself, you can configure
properties of the training corpus, using the <corpus_settings> element.
At the moment, the only
property of the training corpus you can configure is its size, and this
is mostly in service of the iterator over corpus size. When you
specify a size, the training corpus is truncated to the specified
number of documents.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
size |
an integer |
no |
By default, the corpus size is
defined in the <corpora> element. However, you can further
specify the size here, if you want the corpus to be even smaller than
what's specified in <corpora>, or if your training corpus is a
union of a number of different corpora (see the <training_corpus>
element below). |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<iterator> |
no |
yes |
An iterator which can be applied
to create a sequence of corpora for these model sets. |
The <iterator> element inside <corpus_settings> makes it possible to
iterate through a set of values for the training corpus.
Right now, the only available iterator is the "corpus_size" iterator
(although you can also define your own if you need to).
Corpus setting iterators and build setting iterators are both
applied to the model, and the cross-product of the possible values is
used. The corpus setting iterators are applied first.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
type |
a string |
yes |
Either the predefined iterator
type "corpus_size", or the
name of an iterator class defined in your task's Python library. |
<attr> |
a string |
no |
An attribute-value pair which
configures the given iterator. The available attributes and values for the
"corpus_size" iterator type are listed immediately below. |
Here are the attributes and values available for the "corpus_size"
iterator:
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
start_val |
an integer |
no |
The initial value for the corpus
size. Defaults to the increment. |
end_val |
an integer |
no |
The final value for the corpus
size. Defaults to the size of the corpus. |
increment |
an integer |
yes |
The corpus size increment for
each iteration. |
force_last |
"yes" |
no |
If present, force the last value
to be processed, even if it's not exactly an increment. So if the
corpus has 176 documents in it, and you've specified an increment of
20, the last corpus size that will be processed is 160, unless this
option is specified. |
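A minimal sketch of a corpus size iterator, which is how a model set comes to contain models trained on successively larger inputs. The corpus, partition, and model set names are hypothetical. The placement of <corpus_settings> directly under <model_sets> follows the element outline at the top of this document.

```xml
<model_sets>
  <corpus_settings>
    <!-- start_val defaults to the increment and end_val to the corpus
         size. With 176 training documents, the sizes would be
         20, 40, ... 160, and then 176 because force_last is set. -->
    <iterator type="corpus_size" increment="20" force_last="yes"/>
  </corpus_settings>
  <model_set name="learning_curve">
    <training_corpus corpus="main" partition="train"/>
  </model_set>
</model_sets>
```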
The <model_set> element describes a single model set.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The name of this model set, for
subsequent reference in this experiment. It is also used as the name of
the subdirectory in which this model set is built. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<training_corpus> |
yes |
yes |
One or more corpora (and
possibly partitions of corpora) which should be used to construct this
model set. |
The <training_corpus> element specifies the training corpora to use in
building this model set. It may be repeated. Each corpus is referred to by
name, with an optional partition name.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
corpus |
a string |
yes |
The name of the training corpus
to use. This name must match the "name" attribute of some
<corpus> element in the experiment file. |
partition |
a string |
no |
If present, the name of a
partition in the specified corpus, which must match the "name"
attribute of some <partition> element in the corpus. If not
present, the entire corpus will be used. |
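A sketch of a model set trained on the union of two corpora. The corpus names "news" and "blogs", the partition name "train", and the directory are hypothetical; each corpus name must match a <corpus> element elsewhere in the same experiment file.

```xml
<model_sets dir="$(EXP_DIR)/my_models">
  <!-- Built in the subdirectory "combined" of the directory above. -->
  <model_set name="combined">
    <!-- Only the "train" partition of each corpus is used. -->
    <training_corpus corpus="news" partition="train"/>
    <training_corpus corpus="blogs" partition="train"/>
  </model_set>
</model_sets>
```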
The experiment can also have a set of runs. The runs in each
<runs> element share a set of run settings. Whenever the
experiment is run, each <run> is scored, whether or not it's been
scored before, so rerunning a finished experiment is a convenient way
of reviewing its scores.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
dir |
a pathname |
no |
If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<run_settings> |
yes |
no |
A container for the arguments to
run the processing engine with. |
<run> |
yes |
yes |
An experimental run. |
The <run_settings> element contains the settings shared by the runs in
its <runs> element.
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<args> |
yes |
no |
The arguments to the MATEngine to use for these experiment runs. |
<prep_args> |
no |
no |
The arguments to the MATEngine
to use to prepare the annotated documents for the experiment runs. By
default, the documents are converted to raw documents, but if instead
you want to just undo a step and leave them as MAT JSON documents, you
can use this element to achieve that. |
The <args> element houses the arguments to the MATEngine
command to perform the experiment runs.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
<attr> |
a string |
no |
An attribute-value pair which
corresponds to a command-line option to MATEngine. The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
The <prep_args> element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
<attr> |
a string |
no |
An attribute-value pair which
corresponds to a command-line option to MATEngine. The output_file_type attribute must be specified (you're restricted to mat-json and raw). The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
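A sketch of run settings that undo a step instead of converting the test documents to raw text. The workflow name "Demo" and the step names "zone" and "tag" are hypothetical, as are the run, model, and corpus names; use the workflow and steps defined by your own task.

```xml
<runs>
  <run_settings>
    <!-- Hypothetical workflow and steps for the runs themselves. -->
    <args workflow="Demo" steps="zone,tag"/>
    <!-- Keep the test documents as MAT JSON and undo the hypothetical
         "tag" step, rather than converting them to raw documents. -->
    <prep_args output_file_type="mat-json" workflow="Demo"
               undo_through="tag"/>
  </run_settings>
  <run name="main_run" model="main_model">
    <test_corpus corpus="main" partition="test"/>
  </run>
</runs>
```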
The <iterator> element inside <run_settings> makes it possible to
iterate through a set of values for the run. For instance, the
default Carafe engine
allows you to customize the recall/precision bias (see the Carafe
documentation for details). You might want to perform a series of runs
exploring the effects of recall/precision bias values from -2.0 through
2.0, at increments of .5.
You can specify multiple iterators, and you'll get the cross-product
of the settings. The iterator mechanism is flexible enough that you can
build your own iterators if you need them.
The run settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
type |
a string |
yes |
Either
one of the two predefined iterator types "value" or "increment", or the
name of an iterator class defined in your task's Python library. |
<attr> |
a string |
no |
An attribute-value pair which
configures the given iterator. The available attributes and values for the two
predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
attribute |
a string |
yes |
The name of the training engine attribute you're iterating over. This
attribute would be one that could be set in the <run_settings>
element above or in the <run_settings> element for workflow
steps in task.xml. |
values |
a string |
yes |
A comma-delimited set of values
to iterate over. |
value_type |
one of "float", "str", "int" |
no |
The type of the values, either
strings, integers, or floats. Default is "str" (string). |
Here are the attributes and values available for the "increment"
iterator:
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
attribute |
a string |
yes |
The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element for workflow steps in task.xml. |
start_val |
an integer or float |
yes |
The initial value for the
incrementer. |
end_val |
an integer or float |
yes |
The final value for the
incrementer. |
increment |
an integer or float |
yes |
The increment to add to the value on
each iteration. |
force_last |
"yes" |
no |
If present, force the last value
to be processed, even if it's not exactly an increment. For instance,
if you're incrementing recall/precision bias from -2.0 to 2.2 by
increments of .5, the last value processed will be 2.0, unless you
provide this setting. This setting is also useful with float values even
if it appears that the endpoints match the increment precisely, due to the
way programming languages like Python deal with floats; asking to increment
from .1 to .5 by .1 may or may not give you exactly .5 as the final
value, so you might want to use this setting to force whatever value it
is (e.g., .500000000001) to be processed. |
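A sketch of an increment iterator over a run-time setting. The attribute name "bias" is a hypothetical stand-in for whatever option your tagging step uses to control recall/precision bias, and the workflow and step names are likewise placeholders.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tag"/>
  <!-- Hypothetical attribute "bias", stepped from -2.0 to 2.0 by 0.5.
       force_last guards against float rounding at the final value. -->
  <iterator type="increment" attribute="bias"
            start_val="-2.0" end_val="2.0" increment="0.5"
            force_last="yes"/>
</run_settings>
```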
The <run> element describes a single experimental run.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
name |
a string |
yes |
The name of this experimental
run. This is used as the name of the subdirectory in which this run is
conducted. |
model |
a string |
yes |
The name of a model to use. This
string must match the "name" value of some <model_set> element in
the experiment file. |
Element |
Obligatory? |
Repeatable? |
Description |
---|---|---|---|
<test_corpus> |
yes |
yes |
One or more test corpora (and
possibly partitions of corpora) to use in this run. |
The <test_corpus> element specifies a test corpus (and possibly a
partition of that corpus) to use in this run. It may be repeated.
Attribute |
Value |
Obligatory? |
Description |
---|---|---|---|
corpus |
a string |
yes |
The name of the test corpus to
use. This string must match the "name" value of some <corpus>
element in the experiment file. |
partition |
a string |
no |
If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
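A sketch of two runs that share one set of run settings and one model but differ in their test corpora. All names here (the workflow "Demo", the model set "combined", the corpora "news" and "blogs", the partition "test", and the directory) are hypothetical and must match elements defined elsewhere in the same experiment file.

```xml
<runs dir="$(EXP_DIR)/my_runs">
  <run_settings>
    <args workflow="Demo"/>
  </run_settings>
  <!-- Each run is conducted in a subdirectory named after it. -->
  <run name="news_only" model="combined">
    <test_corpus corpus="news" partition="test"/>
  </run>
  <run name="news_and_blogs" model="combined">
    <test_corpus corpus="news" partition="test"/>
    <test_corpus corpus="blogs" partition="test"/>
  </run>
</runs>
```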