Experiment XML Reference

The XML format for the experiment files (see MATExperimentEngine) is described in this document. Use cases are described here.

Element hierarchy

<experiment>

The top-level element in the file. Note that the <corpora>, <model_sets>, and <runs> child elements are not obligatory; the experiment XML can be used simply to build corpora or models, without performing any experimental runs, if, for instance, you want to build a model or corpus to be used in multiple experiments.

Attributes

Attribute	Value	Obligatory?	Description
dir	a pathname	no	The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run.
task	a string	yes	The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training.

Children

Element	Obligatory?	Repeatable?	Description
<binding>	no	yes	Bindings to be made globally available in the other elements.
<corpora>	no	yes	The corpora to be used in the experiment.
<model_sets>	no	yes	The model sets to be used in the experiment.
<runs>	no	yes	The experimental runs to be used in the experiment.

<binding> (of <experiment>)

This element allows the user to define global bindings which can be referred to in any other element of the experiment XML file (except the attributes of the <experiment> element itself, and the <binding> elements). These bindings can be referred to either in XML attributes or in text within XML elements. The pattern for each binding is $(...). The experiment directory, whether provided via the dir attribute of the <experiment> element or on the command line, is provided as EXP_DIR; the pattern directory, if provided by the --pattern_dir command line argument to MATExperimentEngine, is provided as PATTERN_DIR.

Attributes

Attribute	Value	Obligatory?	Description
name	a string	yes	The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file.
value	a string	yes	The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded.
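
As an illustration of how a binding might be declared and then referenced, here is a minimal sketch. The task name, the binding name CORPUS_ROOT, the corpus name, and the paths are hypothetical; only the element and attribute names come from this reference.

```xml
<!-- Sketch only: "My Task", CORPUS_ROOT, "main", and the paths are hypothetical. -->
<experiment task="My Task">
  <!-- Bindings cannot be referenced in <experiment> attributes or in other
       <binding> elements, and values are not expanded recursively. -->
  <binding name="CORPUS_ROOT" value="prepared_corpora"/>
  <!-- Reference in an attribute value; EXP_DIR is supplied by the engine. -->
  <corpora dir="$(EXP_DIR)/$(CORPUS_ROOT)">
    <corpus name="main">
      <!-- Reference in element text; PATTERN_DIR is only available when
           the pattern directory is given on the command line. -->
      <pattern>$(PATTERN_DIR)/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```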

<corpora> (of <experiment>)

Describes corpora to be used in the experiment. This element may be repeated; the intention is that a single <corpora> element will correspond to a shared set of preprocessing instructions.

The corpora may be local, in which case a set of patterns should be provided, or remote, in which case the source_corpus_dir attribute should be provided. Remote corpora are used directly unless one or more of the processing tags (<partition>, <prep>) are specified. In that case, the specified processing steps are added or redone locally, on a separate copy of the corpus.

For instance, if the remote corpus is split into test and train, but not preprocessed, and the <prep> tag is specified here, the corpus documents will be preprocessed here, and the remote split will be preserved. If the remote corpus is preprocessed and split, but the local <partition> tag specifies that the corpus type is "train", the remote corpus preprocessing will be preserved, but locally the split will be ignored. If the remote corpus contains enough patterns for 300 documents, but max_size remotely is 100 and max_size locally is 200, the local max_size will be used; this is possible because all the documents are preprocessed by default when a corpus is prepared, regardless of max_size, and the order of documents (after an initial randomization) is preserved from remote corpus to local copy.

Note that inside the experiment engine, MAT uses the MAT JSON document format exclusively. Therefore, if you want to provide documents in a different format which MAT also understands (e.g., XML inline), you must use the <prep> tag to convert the documents to MAT JSON format.

Attributes

Attribute	Value	Obligatory?	Description
dir	a pathname	no	The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory.

Children

Element	Obligatory?	Repeatable?	Description
<partition>	no	yes	The partition settings for this group of corpora. If omitted, the corpus will not have any partitions.
<size>	no	no	The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established).
<corpus>	yes	yes	The individual corpora in this group.
<prep>	no	no	The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but in another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through.

<partition> (of <corpora>)

Specifies a partition of the sister corpora specified with the <corpus> tag. May be repeated. If this tag is missing, the corpus has no partitions. The partitions segment the entire corpus, so the fraction values are normalized to shares of the corpus. If you want just a 10th of the corpus, for instance, you must divide the corpus into two partitions at a ratio of 9:1 and ignore the larger slice.

Attributes

Attribute	Value	Obligatory?	Description
name	a string	yes	The name of the partition.
fraction	a float	yes	The share of each corpus that should be allotted to this partition (a float between 0 and 1).
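
The 9:1 case described above might be written as follows. This is a sketch; the partition names, corpus name, and pattern are hypothetical.

```xml
<!-- Sketch only: "rest", "tenth", "main", and the pattern are hypothetical. -->
<corpora>
  <!-- The two partitions together cover the whole corpus at a 9:1 ratio. -->
  <partition name="rest" fraction="0.9"/>
  <partition name="tenth" fraction="0.1"/>
  <corpus name="main">
    <pattern>*.json</pattern>
  </corpus>
</corpora>
```

To use only a tenth of the corpus, later elements would refer to the "tenth" partition and simply never mention "rest".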

<size> (of <corpora>)

Specifies the size properties of the sister corpora specified with the <corpus> tag. If this tag is present, and the sister corpus has the source_corpus_dir attribute set, the specified values will override those in the source corpus.

Attributes

Attribute	Value	Obligatory?	Description
max_size	an integer	no	The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents is available.
truncate_document_list	"yes"	no	If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus.
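
A size restriction might look like the following sketch, in which the corpus name, pattern, and limit are hypothetical.

```xml
<!-- Sketch only: the corpus name, pattern, and the limit of 200 are hypothetical. -->
<corpora>
  <!-- Without truncate_document_list, the limit is applied last, so a later
       reuse of this corpus could ask for a larger max_size. With it, no more
       than 200 documents are ever available to remote accesses. -->
  <size max_size="200" truncate_document_list="yes"/>
  <corpus name="capped">
    <pattern>*.json</pattern>
  </corpus>
</corpora>
```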

<corpus> (of <corpora>)

A corpus is specified either by a set of patterns, or by a reference to another corpus (via source_corpus_dir). The documents specified by a set of patterns are randomly reordered before any subsequent processing is performed (e.g., split, preprocess). If source_corpus_dir is present, patterns are ignored.

Attributes

Attribute	Value	Obligatory?	Description
name	a string	yes	The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides.
source_corpus_dir	a string	no	If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name].

Children

Element	Obligatory?	Repeatable?	Description
<pattern>	yes	yes	A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits.
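
The following sketch shows one corpus built from patterns and a second corpus in the same file that reuses it through source_corpus_dir. All names and values are hypothetical, and the sketch assumes that a <corpus> with source_corpus_dir may omit <pattern> children, since they are ignored in that case.

```xml
<!-- Sketch only: the task, corpus names, pattern, and sizes are hypothetical. -->
<experiment task="My Task">
  <corpora>
    <corpus name="base">
      <!-- Relative pattern: its location must be supplied with --pattern_dir. -->
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <!-- Local size settings override those of the source corpus. -->
    <size max_size="100"/>
    <!-- No dir attribute on the first <corpora>, so "base" is built in
         [experiment_dir]/corpora/base; a relative source_corpus_dir has the
         experiment directory prepended. -->
    <corpus name="base_small" source_corpus_dir="corpora/base"/>
  </corpora>
</experiment>
```

Because corpora are created in file order, "base" exists by the time "base_small" refers to it.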

<prep> (of <corpora>)

This element houses the arguments to the MATEngine command to use to preprocess the corpora. You might use this command to take documents which have been deidentified and resynthesize fillers for the deidentified regions.

Attributes

Attribute	Value	Obligatory?	Description
<attr>	a string	no	An attribute-value pair which corresponds to a command-line option to MATEngine. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but in another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through.
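
A format-conversion <prep> might be sketched as below. The workflow name "Demo", the corpus name, and the pattern are hypothetical, and "xml-inline" is assumed here to be the file type identifier for XML inline documents; check the MATEngine documentation for the exact value.

```xml
<!-- Sketch only: "Demo", "converted", and the pattern are hypothetical;
     "xml-inline" is an assumed file type identifier. -->
<corpora>
  <corpus name="converted">
    <pattern>*.xml</pattern>
  </corpus>
  <!-- Only input_file_type and workflow are given. steps and undo_through
       are omitted, so the documents are just converted to MAT JSON.
       output_file_type, input_dir, output_dir, and task are supplied by
       the experiment engine and must not appear here. -->
  <prep input_file_type="xml-inline" workflow="Demo"/>
</corpora>
```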

<model_sets> (of <experiment>)

Each experiment also can contain a number of model sets. A model set is a sequence of models built out of the same corpus, with successively larger numbers of training inputs. This iterative capability does not have to be used, but is available if the user wants to track the change in performance relative to the number of training documents.

Attributes

Attribute	Value	Obligatory?	Description
dir	a pathname	no	If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory.

Children

Element	Obligatory?	Repeatable?	Description
<build_settings>	no	no	The instructions for building the model sets in this bundle.
<model_set>	yes	yes	A model set.

<build_settings> (of <model_sets>)

In order to run an experiment, you must at the very least have declared your <model_config> in your task.xml file and specified a value for the "class" attribute. You can override the <build_settings> values here. The training engine you're most likely to use is the Carafe engine.

Attributes

Attribute	Value	Obligatory?	Description
training_increment	integer	no	If present, the increment to use when constructing successively larger models in this model set. If absent, a single model will be constructed using all the documents in the training set.
truncate_to_increment	"yes"	no	If present, and training_increment is present, the model set will truncate its file list to match the training_increment. For example, if there are 176 files, and the training increment is 25, the engine will discard the files above 175 for training purposes. Otherwise, the engine would build a model for the first 175 documents, and then another model for all 176.
config_name	a string	no	By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name.
<attr>	a string	no	An attribute value which overrides the attribute values for your chosen training engine.
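
The incremental build described above might be configured as in this sketch. The model set, corpus, and partition names are hypothetical.

```xml
<!-- Sketch only: "ne_models", "main", and "train" are hypothetical names. -->
<model_sets>
  <!-- With 176 training documents, this would presumably yield models for
       25, 50, ... 175 documents. Dropping truncate_to_increment would add
       one more model built from all 176. -->
  <build_settings training_increment="25" truncate_to_increment="yes"/>
  <model_set name="ne_models">
    <training_corpus corpus="main" partition="train"/>
  </model_set>
</model_sets>
```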

<model_set> (of <model_sets>)

Attributes

Attribute	Value	Obligatory?	Description
name	a string	yes	The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built.

Children

Element	Obligatory?	Repeatable?	Description
<training_corpus>	yes	yes	One or more corpora (and possibly partitions of corpora) which should be used to construct this model set.

<training_corpus> (of <model_set>)

May be repeated. Specifies the training corpora to use in building this model set. Each corpus is referred to by name, and an optional partition name.

Attributes

Attribute	Value	Obligatory?	Description
corpus	a string	yes	The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file.
partition	a string	no	If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used.
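
A model set that draws on more than one corpus might be sketched like this. The corpus, partition, and model set names are hypothetical and would have to match <corpus> and <partition> elements declared elsewhere in the experiment file.

```xml
<!-- Sketch only: all names are hypothetical. -->
<model_sets>
  <model_set name="combined">
    <!-- Only the "train" partition of the first corpus is used. -->
    <training_corpus corpus="news" partition="train"/>
    <!-- No partition attribute: the entire second corpus is used. -->
    <training_corpus corpus="blogs"/>
  </model_set>
</model_sets>
```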

<runs> (of <experiment>)

The experiment also can have a set of runs. The runs in each <runs> element share a set of run settings. Whenever the experiment is run, each <run> is scored, whether or not it's been scored before. This is a convenient way of reviewing the scores after an experiment is finished.

Attributes

Attribute	Value	Obligatory?	Description
dir	a pathname	no	If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory.

Children

Element	Obligatory?	Repeatable?	Description
<run_settings>	yes	no	A container for the arguments to run the processing engine with.
<run>	yes	yes	An experimental run.

<run_settings> (of <runs>)

Children

Element	Obligatory?	Repeatable?	Description
<args>	yes	no	The arguments to the MATEngine to use for these experiment runs.
<prep_args>	no	no	The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.

<args> (of <run_settings>)

This element houses the arguments to the MATEngine command to perform the experiment runs.

Attributes

Attribute	Value	Obligatory?	Description
<attr>	a string	no	An attribute-value pair which corresponds to a command-line option to MATEngine. The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files.

<prep_args> (of <run_settings>)

This element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.

Attributes

Attribute	Value	Obligatory?	Description
<attr>	a string	no	An attribute-value pair which corresponds to a command-line option to MATEngine. The output_file_type attribute must be specified (you're restricted to mat-json and raw). The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files.
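
The two settings elements might be combined as in the sketch below. The workflow name and the steps and undo_through values are hypothetical; they depend entirely on the workflows and steps your task defines.

```xml
<!-- Sketch only: "Demo" and the step name "tag" are hypothetical. -->
<run_settings>
  <!-- Options for the run itself; workflow is required. -->
  <args workflow="Demo" steps="tag"/>
  <!-- Instead of converting the annotated documents to raw, undo one step
       and keep them as MAT JSON. output_file_type and workflow are required. -->
  <prep_args workflow="Demo" undo_through="tag" output_file_type="mat-json"/>
</run_settings>
```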

<run> (of <runs>)

Attributes

Attribute	Value	Obligatory?	Description
name	a string	yes	The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted.
model	a string	yes	The name of a model to use. This string must match the "name" value of some <model_set> element in the experiment file.

Children

Element	Obligatory?	Repeatable?	Description
<test_corpus>	yes	yes	One or more test corpora (and possibly partitions of corpora) to use in this run.

<test_corpus> (of <run>)

May be repeated. One or more test corpora (and possibly partitions of corpora) to use in this run.

Attributes

Attribute	Value	Obligatory?	Description
corpus	a string	yes	The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file.
partition	a string	no	If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used.
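
Putting the elements together, a complete experiment file might look roughly like the sketch below. Every name and value in it (task, workflow, step, corpus, partition, model set, and run names, and the pattern) is a hypothetical placeholder; only the element and attribute names are taken from this reference.

```xml
<!-- Sketch only: all names and values are hypothetical placeholders. -->
<experiment task="My Task">
  <corpora>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <corpus name="main">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets>
    <model_set name="main_model">
      <training_corpus corpus="main" partition="train"/>
    </model_set>
  </model_sets>
  <runs>
    <run_settings>
      <args workflow="Demo" steps="tag"/>
    </run_settings>
    <!-- "model" must match a <model_set> name; "corpus" and "partition"
         must match the <corpus> and <partition> names declared above. -->
    <run name="main_run" model="main_model">
      <test_corpus corpus="main" partition="test"/>
    </run>
  </runs>
</experiment>
```

Since the <experiment> element here has no dir attribute, the experiment directory would have to be provided when the experiment is run, and because the pattern is relative, so would the pattern directory.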