Experiment XML Reference

The XML format for the experiment files (see MATExperimentEngine) is described in this document. Use cases are described here.

Element hierarchy

    <experiment>
       <binding>
       <corpora>
          <partition>
          <fixed_partition>
          <size>
          <corpus>
             <pattern>
          <prep>
       <model_sets>
          <build_settings>
             <iterator>
          <corpus_settings>
             <iterator>
          <model_set>
             <training_corpus>
       <runs>
          <run_settings>
             <prep_args>
             <args>
             <iterator>
          <run>
             <test_corpus>

<experiment>

The top-level element in the file. None of the child elements is obligatory; the experiment XML can be used simply to build corpora, or to build models, without performing any experimental runs. This is useful if, for instance, you want to build a model or corpus to be used in multiple experiments.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| dir | a pathname | no | The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run. |
| task | a string | yes | The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <binding> | no | yes | Bindings to be made globally available in the other elements. |
| <corpora> | no | yes | The corpora to be used in the experiment. |
| <model_sets> | no | yes | The model sets to be used in the experiment. |
| <runs> | no | yes | The experimental runs to be used in the experiment. |
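
As a rough illustration of how these children fit together, the following skeleton shows an experiment file with one element of each kind. Every concrete value in it (task name, directory, corpus, model, and run names, file pattern, workflow and step names) is a hypothetical placeholder, not a value from any real task. The relative pattern assumes that --pattern_dir is supplied when the experiment is run.

```xml
<experiment task="Sample Task" dir="/tmp/sample_exp">
  <!-- All names, paths, and workflow settings here are illustrative assumptions. -->
  <corpora>
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="sample">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets>
    <model_set name="sample_model">
      <training_corpus corpus="sample" partition="train"/>
    </model_set>
  </model_sets>
  <runs>
    <run_settings>
      <args workflow="Sample Workflow" steps="tag"/>
    </run_settings>
    <run name="sample_run" model="sample_model">
      <test_corpus corpus="sample" partition="test"/>
    </run>
  </runs>
</experiment>
```

Dropping the <runs> element (or both <runs> and <model_sets>) from this sketch would leave a file that only builds models or corpora.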

<binding> (of <experiment>)

This element allows the user to define global bindings which can be referred to in any other element of the experiment XML file (except the attributes of the <experiment> element itself, and the <binding> elements). These bindings can be referred to either in XML attributes or in text within XML elements. The pattern for each binding is $(...). The experiment directory, whether provided via the dir attribute of the <experiment> element or on the command line, is provided as EXP_DIR; the pattern directory, if provided by the --pattern_dir command line argument to MATExperimentEngine, is provided as PATTERN_DIR.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file. |
| value | a string | yes | The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded. |
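
A minimal sketch of a binding in use follows. The binding name DATA_ROOT, its value, and the task and corpus names are made up for illustration; only EXP_DIR and PATTERN_DIR are supplied by the engine.

```xml
<experiment task="Sample Task">
  <!-- DATA_ROOT and its value are hypothetical. -->
  <binding name="DATA_ROOT" value="/data/sample"/>
  <corpora dir="$(EXP_DIR)/my_corpora">
    <corpus name="sample">
      <!-- The engine replaces $(DATA_ROOT), giving /data/sample/docs/*.json -->
      <pattern>$(DATA_ROOT)/docs/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```

Because replacement is not recursive, a value such as "$(DATA_ROOT)/docs" in a second binding would be kept literally rather than expanded.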

<corpora> (of <experiment>)

Describes corpora to be used in the experiment. This element may be repeated; the intention is that a single <corpora> element will correspond to a shared set of preprocessing instructions.

The corpora may be local, in which case a set of patterns should be provided, or remote, in which case the source_corpus_dir attribute should be provided. Remote corpora are used directly unless one or more of the processing tags (<partition>, <prep>) are specified. In that case, the specified processing steps are added or redone locally, on a separate copy of the corpus.

For instance, if the remote corpus is split into test and train, but not preprocessed, and the <prep> tag is specified here, the corpus documents will be preprocessed here, and the remote split will be preserved. If the remote corpus is preprocessed and split, but the local <partition> tag specifies that the corpus type is "train", the remote corpus preprocessing will be preserved, but locally the split will be ignored. If the remote corpus contains enough patterns for 300 documents, but max_size remotely is 100 and max_size locally is 200, the local max_size will be used. This is possible because all the documents are preprocessed by default when a corpus is prepared, regardless of max_size, and the order of documents (after an initial randomization) is preserved from remote corpus to local copy.

Note that inside the experiment engine, MAT uses the MAT JSON document format exclusively. Therefore, if you want to provide documents in a different format which MAT also understands (e.g., XML inline), you must use the <prep> tag to convert the documents to MAT JSON format.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| dir | a pathname | no | The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <partition> | no | yes | The proportional partition settings for this group of corpora. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions within a given <corpora> element. |
| <fixed_partition> | no | yes | The fixed partition settings for this group of corpora. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions within a given <corpora> element. |
| <size> | no | no | The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
| <corpus> | yes | yes | The individual corpora in this group. |
| <prep> | no | no | The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, and task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |

<partition> (of <corpora>)

Specifies a proportional partition of the sister corpora specified with the <corpus> tag. May be repeated. If there are no instances of this element or <fixed_partition>, the corpus has no partitions. The proportional partitions segment the entire corpus, so the fraction values are normalized to shares of the corpus. If you want just a 10th of the corpus, for instance, you must divide the corpus into two partitions at a ratio of 9:1 and ignore the larger slice.

Each <corpora> element may have either proportional or fixed partitions, but not both.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The name of the partition. |
| fraction | a float | yes | The share of each corpus that should be allotted to this partition (a float between 0 and 1). |
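
The 9:1 case described above might be written as in the following sketch. The partition and corpus names and the file pattern are hypothetical; later elements would simply refer to the "tenth" partition and leave "rest" unused.

```xml
<corpora>
  <!-- Hypothetical names: a 9:1 split in which only "tenth" is used later. -->
  <partition name="rest" fraction=".9"/>
  <partition name="tenth" fraction=".1"/>
  <corpus name="sample">
    <pattern>*.json</pattern>
  </corpus>
</corpora>
```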

<fixed_partition> (of <corpora>)

Specifies a fixed partition of the sister corpora specified with the <corpus> tag. May be repeated. If there are no instances of this element or <partition>, the corpus has no partitions. These partitions do not segment the entire corpus. If you want a fixed partition to encompass "everything else", use the special "remainder" value as described below.

Each <corpora> element may have either proportional or fixed partitions, but not both.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The name of the partition. |
| size | an integer or the string "remainder" | yes | The portion of each corpus that should be allotted to this partition (either an integer number of documents or the string "remainder"). The "remainder" value can only be used once in a <corpora> element. |
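
As a sketch under assumed names, the following group reserves a fixed number of documents for testing and development and gives everything else to training. The partition names, sizes, corpus name, and pattern are illustrative only.

```xml
<corpora>
  <!-- Hypothetical names and sizes; "remainder" appears exactly once. -->
  <fixed_partition name="test" size="50"/>
  <fixed_partition name="dev" size="25"/>
  <fixed_partition name="train" size="remainder"/>
  <corpus name="sample">
    <pattern>*.json</pattern>
  </corpus>
</corpora>
```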

<size> (of <corpora>)

Specifies the size properties of the sister corpora specified with the <corpus> tag. If this tag is present, and the sister corpus has the source_corpus_dir attribute set, the specified values will override those in the source corpus.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| max_size | an integer | no | The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents are available. |
| truncate_document_list | "yes" | no | If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus. |
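
For illustration, a size-limited group might look like the sketch below (corpus name and pattern are hypothetical). With truncate_document_list set, a later corpus that points at this one via source_corpus_dir could not raise max_size above 100, since only 100 documents would be available to it.

```xml
<corpora>
  <!-- Hypothetical corpus; the limit is imposed first because of truncate_document_list. -->
  <size max_size="100" truncate_document_list="yes"/>
  <corpus name="small_sample">
    <pattern>*.json</pattern>
  </corpus>
</corpora>
```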

<corpus> (of <corpora>)

A corpus is specified either by a set of patterns, or by a reference to another corpus (via source_corpus_dir). The documents specified by a set of patterns are randomly reordered before any subsequent processing is performed (e.g., split, preprocess). If source_corpus_dir is present, patterns are ignored.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides. |
| source_corpus_dir | a string | no | If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name]. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <pattern> | yes | yes | A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits. |
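
The following sketch shows one corpus built from patterns and a second corpus in the same file that reuses it through source_corpus_dir. All names are hypothetical. It assumes the first <corpora> element keeps the default "corpora" directory, so the relative path resolves to [experiment_dir]/corpora/base, and it assumes <pattern> children can be left out when source_corpus_dir is given, since they would be ignored anyway.

```xml
<experiment task="Sample Task">
  <corpora>
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="base">
      <!-- Relative pattern: requires --pattern_dir on the command line. -->
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <!-- Listed after "base", so "base" already exists when this corpus is built. -->
    <size max_size="50"/>
    <corpus name="base_small" source_corpus_dir="corpora/base"/>
  </corpora>
</experiment>
```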

<prep> (of <corpora>)

This element houses the arguments to the MATEngine command to use to preprocess the corpora. You might use this command to take documents which have been deidentified and resynthesize fillers for the deidentified regions.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| <attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. |

The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, and task. The attributes should also not provide any other arguments which would further specify the input or output files.

If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through.
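
A conversion-only <prep> element might look like the sketch below. The file type value "xml-inline" is assumed to be the name MATEngine uses for the XML inline format, and the workflow, corpus name, and pattern are hypothetical. No steps or undo_through attribute is given, so the documents are only converted.

```xml
<corpora>
  <!-- Assumed file type name and hypothetical workflow; conversion to MAT JSON only. -->
  <prep input_file_type="xml-inline" workflow="Sample Workflow"/>
  <corpus name="converted">
    <pattern>*.xml</pattern>
  </corpus>
</corpora>
```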

<model_sets> (of <experiment>)

Each experiment also can contain a number of model sets. A model set is a sequence of models built out of the same corpus, with successively larger numbers of training inputs. This iterative capability does not have to be used, but is available if the user wants to track the change in performance relative to the number of training documents.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <build_settings> | no | no | The instructions for building the model sets in this bundle. |
| <model_set> | yes | yes | A model set. |

<build_settings> (of <model_sets>)

In order to run an experiment, you must either (a) have declared your <model_config> in your task.xml file and specified a value for the "class" attribute there, in which case you can override the <build_settings> values here, or (b) specify a model_class directly here, in which case the only settings will be the ones you specify explicitly here. The training engine you're most likely to use is the Carafe engine.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| config_name | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name. Either this attribute or model_class can be provided, but not both. |
| model_class | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If you have your own model builder object (this is highly unlikely; such an object would be a subclass of MAT.ModelBuilder.ModelBuilder, and you'd have to work out how to customize it, but it is possible), the task will not be consulted at all. Either this attribute or config_name can be provided, but not both. |
| <attr> | a string | no | An attribute-value pair which overrides the attribute values for your chosen training engine (if config_name is provided) or specifies the attribute values (if model_class is provided). |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <iterator> | no | yes | An iterator which can be applied to create a sequence of models for these model sets. |
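
As a sketch of case (a) above, the following overrides a single training setting while leaving the rest of the task's default model build settings in place. The attribute name max_iterations is an assumed example of a training engine setting, and the model set and corpus names are hypothetical.

```xml
<model_sets>
  <!-- max_iterations is an assumed setting name; other settings come from task.xml. -->
  <build_settings max_iterations="50"/>
  <model_set name="sample_model">
    <training_corpus corpus="sample" partition="train"/>
  </model_set>
</model_sets>
```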

<iterator> (of <build_settings>)

It is possible to iterate through a set of values for the model builder using an iterator. For instance, the default Carafe engine allows you to customize the degree of L1 regularization (see the Carafe documentation for details). You might want to build a series of models exploring the effects of L1 regularization values from 0.0 through 2.0, at increments of .2. Or, you might want to vary the number of training iterations the engine performs from 10 to 150 at increments of 10.

You can specify multiple iterators, and you'll get the cross-product of the settings. The iterator mechanism is flexible enough that you can build iterators which depend on the last model built, or iterators which specify their own model builder class (if, for example, they need to do some extensive computation on the training corpus before they train).

Corpus setting iterators and build setting iterators are both applied to the model, and the cross-product of the possible values is used. The corpus setting iterators are applied first.

The build settings support two built-in iterators, which can be configured using the <iterator> element.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
| <attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |

Here are the attributes and values available for the "value" iterator:

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
| values | a string | yes | A comma-delimited set of values to iterate over. |
| value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |

Here are the attributes and values available for the "increment" iterator:

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
| start_val | an integer or float | yes | The initial value for the incrementer. |
| end_val | an integer or float | yes | The final value for the incrementer. |
| increment | an integer or float | yes | The increment to add to the value on each iteration. |
| force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing model iterations from 20 to 150 by increments of 20, the last value processed will be 140, unless you provide this setting. This setting is also useful with float values, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
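
The two built-in iterator types can be combined, as in the sketch below. The attribute names max_iterations and l1_c are assumed stand-ins for whatever your training engine calls its iteration count and L1 regularization settings. The increment iterator should step through 10, 20, and so on up to 150, and the value iterator through three values, so the cross-product should come to 45 models per model set.

```xml
<build_settings>
  <!-- Assumed attribute names; 15 iteration counts crossed with 3 regularization values. -->
  <iterator type="increment" attribute="max_iterations"
            start_val="10" end_val="150" increment="10"/>
  <iterator type="value" attribute="l1_c"
            values="0.1,0.5,1.0" value_type="float"/>
</build_settings>
```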

<corpus_settings> (of <model_sets>)

In addition to the model builder itself, you can configure properties of the training corpus as well. At the moment, the only property of the training corpus you can configure is its size, and this is mostly in service of the iterator over corpus size. When you specify the size of a corpus, the training corpus is truncated to the specified length.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| size | an integer | no | By default, the corpus size is defined in the <corpora> element. However, you can further specify the size here, if you want the corpus to be even smaller than what's specified in <corpora>, or if your training corpus is a union of a number of different corpora (see the <training_corpus> element below). |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <iterator> | no | yes | An iterator which can be applied to create a sequence of corpora for these model sets. |

<iterator> (of <corpus_settings>)

It is possible to iterate through a set of values for the model corpus using an iterator. Right now, the only available iterator is the "corpus_size" iterator (although you can also define your own if you need to).

Corpus setting iterators and build setting iterators are both applied to the model, and the cross-product of the possible values is used. The corpus setting iterators are applied first.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| type | a string | yes | Either the predefined iterator type "corpus_size", or the name of an iterator class defined in your task's Python library. |
| <attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the "corpus_size" iterator type are listed immediately below. |

Here are the attributes and values available for the "corpus_size" iterator:

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| start_val | an integer | no | The initial value for the corpus size. Defaults to the increment. |
| end_val | an integer | no | The final value for the corpus size. Defaults to the size of the corpus. |
| increment | an integer | yes | The corpus size increment for each iteration. |
| force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. So if the corpus has 176 documents in it, and you've specified an increment of 20, the last corpus size that will be processed is 160, unless this option is specified. |
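
A learning-curve setup along these lines might look like the following sketch (model set and corpus names are hypothetical). With the defaults for start_val and end_val and a 176-document training corpus, it should build models at sizes 20, 40, and so on up to 160, followed by one final model for the last value because force_last is set.

```xml
<model_sets>
  <corpus_settings>
    <!-- start_val defaults to the increment; end_val defaults to the corpus size. -->
    <iterator type="corpus_size" increment="20" force_last="yes"/>
  </corpus_settings>
  <model_set name="curve_model">
    <training_corpus corpus="sample" partition="train"/>
  </model_set>
</model_sets>
```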

<model_set> (of <model_sets>)

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <training_corpus> | yes | yes | One or more corpora (and possibly partitions of corpora) which should be used to construct this model set. |

<training_corpus> (of <model_set>)

May be repeated. Specifies the training corpora to use in building this model set. Each corpus is referred to by name, and an optional partition name.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| corpus | a string | yes | The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file. |
| partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
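
As an illustration with made-up corpus names, the model set below trains on the union of one partition of a corpus and the whole of another, with the combined training corpus capped by <corpus_settings>.

```xml
<model_sets>
  <!-- Hypothetical names; the union of the two corpora is truncated to 200 documents. -->
  <corpus_settings size="200"/>
  <model_set name="combined_model">
    <training_corpus corpus="news" partition="train"/>
    <training_corpus corpus="blogs"/>
  </model_set>
</model_sets>
```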

<runs> (of <experiment>)

The experiment also can have a set of runs. The runs in each <runs> element share a set of run settings. Whenever the experiment is run, each <run> is scored, whether or not it's been scored before. This is a convenient way of reviewing the scores after an experiment is finished.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| dir | a pathname | no | If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <run_settings> | yes | no | A container for the arguments to run the processing engine with. |
| <run> | yes | yes | An experimental run. |

<run_settings> (of <runs>)

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <args> | yes | no | The arguments to the MATEngine to use for these experiment runs. |
| <prep_args> | no | no | The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that. |

<args> (of <run_settings>)

This element houses the arguments to the MATEngine command to perform the experiment runs.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| <attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. |

The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, and task. The attributes should also not provide any other arguments which would further specify the input or output files.

<prep_args> (of <run_settings>)

This element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| <attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. |

The output_file_type attribute must be specified (you're restricted to mat-json and raw).

The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, and task. The attributes should also not provide any other arguments which would further specify the input or output files.
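
A <run_settings> element using both children might be sketched as follows. The workflow name and the step name "tag" are hypothetical and depend on your task. Here the test documents are kept as MAT JSON with the assumed "tag" step undone, and the runs then redo that step.

```xml
<run_settings>
  <!-- Hypothetical workflow and step names. -->
  <prep_args output_file_type="mat-json" workflow="Sample Workflow" undo_through="tag"/>
  <args workflow="Sample Workflow" steps="tag"/>
</run_settings>
```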

<iterator> (of <run_settings>)

It is possible to iterate through a set of values for the run using an iterator. For instance, the default Carafe engine allows you to customize the recall/precision bias (see the Carafe documentation for details). You might want to perform a series of runs exploring the effects of recall/precision bias values from -2.0 through 2.0, at increments of .5.

You can specify multiple iterators, and you'll get the cross-product of the settings. The iterator mechanism is flexible enough that you can build your own iterators if you need them.

The run settings support two built-in iterators, which can be configured using the <iterator> element.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
| <attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |

Here are the attributes and values available for the "value" iterator:

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element for workflow steps in task.xml. |
| values | a string | yes | A comma-delimited set of values to iterate over. |
| value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |

Here are the attributes and values available for the "increment" iterator:

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element for workflow steps in task.xml. |
| start_val | an integer or float | yes | The initial value for the incrementer. |
| end_val | an integer or float | yes | The final value for the incrementer. |
| increment | an integer or float | yes | The increment to add to the value on each iteration. |
| force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing recall/precision bias from -2.0 to 2.2 by increments of .5, the last value processed will be 2.0, unless you provide this setting. This setting is also useful with float values even if it appears that the endpoints match the increment precisely, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
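
The bias sweep described above could be sketched like this. The attribute name prior_adjust is an assumption about how the recall/precision bias setting is named, and the workflow and step names are hypothetical. force_last is included as a guard against floating-point drift at the upper endpoint.

```xml
<run_settings>
  <args workflow="Sample Workflow" steps="tag"/>
  <!-- prior_adjust is an assumed attribute name for the recall/precision bias. -->
  <iterator type="increment" attribute="prior_adjust"
            start_val="-2.0" end_val="2.0" increment=".5" force_last="yes"/>
</run_settings>
```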

<run> (of <runs>)

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| name | a string | yes | The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted. |
| model | a string | yes | The name of a model to use. This string must match the "name" value of some <model_set> element in the experiment file. |

Children

| Element | Obligatory? | Repeatable? | Description |
|---|---|---|---|
| <test_corpus> | yes | yes | One or more test corpora (and possibly partitions of corpora) to use in this run. |

<test_corpus> (of <run>)

May be repeated. One or more test corpora (and possibly partitions of corpora) to use in this run.

Attributes

| Attribute | Value | Obligatory? | Description |
|---|---|---|---|
| corpus | a string | yes | The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file. |
| partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
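
To round out the reference, here is a sketch of a <runs> element with two runs that share one set of run settings. The run, model, and corpus names, the custom directory, and the workflow and step names are all hypothetical. The second run is tested against two corpora, one of them unpartitioned.

```xml
<runs dir="$(EXP_DIR)/my_runs">
  <run_settings>
    <args workflow="Sample Workflow" steps="tag"/>
  </run_settings>
  <!-- Each model value must match the name of a <model_set> in the same file. -->
  <run name="in_domain" model="sample_model">
    <test_corpus corpus="sample" partition="test"/>
  </run>
  <run name="mixed" model="sample_model">
    <test_corpus corpus="sample" partition="test"/>
    <test_corpus corpus="blogs"/>
  </run>
</runs>
```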