This document describes the XML format for the experiment files (see
MATExperimentEngine). Use cases are described separately.
- <experiment>
  - <binding>
  - <corpora>
    - <partition>
    - <fixed_partition>
    - <size>
    - <corpus>
      - <pattern>
    - <prep>
  - <workspace_corpora>
    - <workspace_corpus>
      - <partition>
      - <fixed_partition>
      - <size>
  - <model_sets>
    - <build_settings>
      - <iterator>
    - <corpus_settings>
      - <iterator>
    - <model_set>
      - <training_corpus>
  - <runs>
    - <run_settings>
      - <prep_args>
      - <score_args>
      - <args>
      - <iterator>
    - <run>
      - <test_corpus>
The <experiment> element is the toplevel element in the file. None
of its five child elements is obligatory; the experiment XML can be
used simply to build corpora, or to build models, without performing
any experimental runs, if, for instance, you want to build a model
or corpus to be used in multiple experiments.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run. |
task | a string | yes | The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<binding> | no | yes | Bindings to be made globally available in the other elements. |
<corpora> | no | yes | The corpora to be used in the experiment. |
<workspace_corpora> | no | yes | The corpora to be used in the experiment that will be drawn from workspaces. |
<model_sets> | no | yes | The model sets to be used in the experiment. |
<runs> | no | yes | The experimental runs to be used in the experiment. |
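As a rough sketch of how the toplevel element is used, an experiment file that only prepares a corpus might look like the following. The task name, the directory and the corpus name are placeholders chosen for illustration; they are not values that MAT requires.

```xml
<!-- Hypothetical task name and directory; substitute your own. -->
<experiment task="Named Entity" dir="/tmp/my_experiment">
  <!-- Only <corpora> is present; model sets and runs are optional. -->
  <corpora>
    <corpus name="all_docs">
      <!-- A relative pattern; the pattern directory must then be
           supplied when the engine is invoked. -->
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```

Because none of the child elements is obligatory, a file like this could be used to build a corpus once and share it across several later experiments.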
The <binding> element allows the user to define global bindings which
can be referred to in any other element of the experiment XML file
(except the attributes of the <experiment> element itself,
and the <binding> elements). These bindings can be referred
to either in XML attributes or in text within XML elements. A
binding is referenced with the pattern $(<name>). Two bindings are
supplied by the engine: the experiment directory, whether provided
via the dir attribute of the <experiment> element or on the command
line, is available as EXP_DIR; the pattern directory, if provided by
the --pattern_dir command line argument to MATExperimentEngine,
is available as PATTERN_DIR.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file. |
value | a string | yes | The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded. |
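A minimal sketch of a user-defined binding alongside the engine-supplied EXP_DIR binding follows. The binding name DATA_ROOT, the paths and the task name are hypothetical.

```xml
<experiment task="Named Entity">
  <!-- DATA_ROOT is a made-up binding name for this illustration. -->
  <binding name="DATA_ROOT" value="/data/shared"/>
  <!-- Bindings are expanded in attribute values... -->
  <corpora dir="$(EXP_DIR)/prepared_corpora">
    <corpus name="news">
      <!-- ...and in the text inside elements. -->
      <pattern>$(DATA_ROOT)/news/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```

Since replacement is not recursive, the value of DATA_ROOT is written out literally rather than being composed from another binding.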
The <corpora> element describes corpora to be used in the experiment.
This element may be repeated; the intention is that a single
<corpora> element will correspond to a shared set of preprocessing
instructions.
The corpora may be local, in which case a set of patterns should
be provided, or remote, in which case the source_corpus_dir
attribute should be provided. Remote corpora are used directly
unless one or more of the processing tags (<partition>, <prep>)
are specified. In that case, the specified
processing steps are added or redone locally, on a separate copy
of the corpus. For instance, if the remote corpus is split into
test and train, but not preprocessed, and the <prep> tag is
specified here, the corpus documents will be preprocessed here,
and the remote split will be preserved. If the remote corpus is
preprocessed and split, but the local <partition> tag
specifies that the corpus type is "train", the remote corpus
preprocessing will be preserved, but locally the split will be
ignored. If the remote corpus contains enough patterns for 300
documents, but max_size remotely is 100 and max_size locally is
200, the local max_size will be used; this is possible because all
the documents are preprocessed by default when a corpus is
prepared, regardless of max_size, and the order of documents
(after an initial randomization) is preserved from remote corpus
to local copy.
Note that inside the experiment engine, MAT uses the MAT JSON
document format exclusively. Therefore, if you want to provide
documents which are in a different format which MAT also
understands (e.g., XML inline), you must use the <prep> tag
to convert the documents to MAT JSON format.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<partition> | no | yes | The proportional partition settings for this group of corpora. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions within a given <corpora> element. |
<fixed_partition> | no | yes | The fixed partition settings for this group of corpora. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions within a given <corpora> element. |
<size> | no | no | The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
<corpus> | yes | yes | The individual corpora in this group. |
<prep> | no | no | The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
The <partition> element specifies a proportional partition of the
sister corpora specified with the <corpus> tag. May be repeated. If
there are no instances of this element or <fixed_partition>, the
corpus has no partitions. The proportional partitions segment the
entire corpus, so the fraction values are normalized to shares of
the corpus. If you want just a tenth of the corpus, for instance,
you must divide the corpus into two partitions at a ratio of 9:1
and ignore the larger slice.
Each <corpora> element may have either proportional or
fixed partitions, but not both.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the partition. |
fraction | a float | yes | The share of each corpus that should be allotted to this partition (a float between 0 and 1). |
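To illustrate proportional partitions, here is a small sketch of a <corpora> element with an 80/20 split. The partition names, corpus name and pattern are placeholders.

```xml
<corpora>
  <!-- The two fractions together cover the whole corpus. -->
  <partition name="train" fraction=".8"/>
  <partition name="test" fraction=".2"/>
  <corpus name="news">
    <pattern>news/*.json</pattern>
  </corpus>
</corpora>
```

To use only a tenth of the corpus, the same shape would apply with fractions of .9 and .1, with later elements referring only to the smaller partition.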
The <fixed_partition> element specifies a fixed partition of the
sister corpora specified with the <corpus> tag. May be repeated. If
there are no instances of this element or <partition>, the corpus
has no partitions. Unlike proportional partitions, fixed partitions
do not segment the entire corpus. If you want a fixed partition to
encompass "everything else", use the special "remainder" value as
described below.
Each <corpora> element may have either proportional or
fixed partitions, but not both.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the partition. |
size | an integer or the string "remainder" | yes | The portion of each corpus that should be allotted to this partition (either an integer number of documents or the string "remainder"). The "remainder" value can only be used once in a <corpora> element. |
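A sketch of fixed partitions, assuming a corpus with more than 50 documents; the names and pattern are hypothetical.

```xml
<corpora>
  <!-- Exactly 50 documents are set aside for testing. -->
  <fixed_partition name="test" size="50"/>
  <!-- "remainder" may appear only once per <corpora> element. -->
  <fixed_partition name="train" size="remainder"/>
  <corpus name="news">
    <pattern>news/*.json</pattern>
  </corpus>
</corpora>
```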
The <size> element specifies the size properties of the sister
corpora specified with the <corpus> tag. If this tag is present, and
the sister corpus has the source_corpus_dir attribute set, the
specified values will override those in the source corpus.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
max_size | an integer | no | The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents are available. |
truncate_document_list | "yes" | no | If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus. |
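The size settings might be used as in the following sketch; the limit of 200, the corpus name and the pattern are arbitrary illustrative values.

```xml
<corpora>
  <!-- Cap each corpus in this group at 200 documents. -->
  <size max_size="200"/>
  <corpus name="news">
    <pattern>news/*.json</pattern>
  </corpus>
</corpora>
```

Adding truncate_document_list="yes" to the <size> element would impose the limit first, so that a later experiment referring to this corpus could not raise max_size above 200.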
A corpus, declared with the <corpus> element, is specified either by
a set of patterns, or by a reference to another corpus (via
source_corpus_dir). The documents specified by a set of patterns are
randomly reordered before any subsequent processing is performed
(e.g., split, preprocess). If source_corpus_dir is present, patterns
are ignored.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides. |
source_corpus_dir | a string | no | If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name]. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<pattern> | yes | yes | A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits. |
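The sketch below shows one way a corpus could refer to another corpus in the same experiment file through source_corpus_dir. All names are hypothetical, and the sketch assumes that the <pattern> children may be left out when source_corpus_dir is given, since they are ignored in that case.

```xml
<experiment task="Named Entity">
  <!-- Built first: an unpartitioned corpus in the default
       "corpora" subdirectory of the experiment directory. -->
  <corpora>
    <corpus name="raw_news">
      <pattern>news/*.json</pattern>
    </corpus>
  </corpora>
  <!-- Built second: a partitioned corpus derived from raw_news. -->
  <corpora>
    <partition name="train" fraction=".9"/>
    <partition name="test" fraction=".1"/>
    <!-- A relative path; the experiment directory is prepended. -->
    <corpus name="split_news" source_corpus_dir="corpora/raw_news"/>
  </corpora>
</experiment>
```

The ordering matters here: corpora are created in the order they are listed, so raw_news exists by the time split_news refers to it.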
The <prep> element houses the arguments to the MATEngine command to
use to preprocess the corpora. You might use this command to take
documents which have been deidentified and resynthesize fillers for
the deidentified regions.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
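A conversion-only <prep> specification might be sketched as follows. The file type name "xml-inline" and the workflow name "Demo" are assumptions made for illustration; use the file type and workflow names that your MAT installation and task actually define.

```xml
<corpora>
  <corpus name="legacy_docs">
    <pattern>legacy/*.xml</pattern>
  </corpus>
  <!-- Assumed names: "xml-inline" and "Demo". There are no steps or
       undo_through attributes, so the documents are only converted
       to MAT JSON. -->
  <prep input_file_type="xml-inline" workflow="Demo"/>
</corpora>
```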
In order to run experiments against workspaces, the
<workspace_corpora> element allows you to specify a
collection of corpora to be drawn from a workspace, restricted by
various dimensions of the workspace contents. Each
<workspace_corpora> element establishes a context for
a workspace, by describing a subset of eligible workspace
files, and within that context you can define
<workspace_corpus> elements which are ultimately transformed
into the same objects which the <corpus> elements correspond
to.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
workspace_dir | a pathname | yes | The directory root of the workspace. |
document_statuses | a string | no | If present, a comma-separated list of document statuses. The default is "partially corrected,partially gold,gold,reconciled". Any document which doesn't have one of these statuses in the workspace will be excluded. |
users | a string | no | If present, a comma-separated list of workspace users. Any document which is assigned to a user which isn't one of these users will be excluded. |
include_unassigned | "no" | no | If present, documents which are not assigned to any workspace user will be excluded. |
basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document which is not in one of these basename sets in the given workspace will be excluded. |
basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document whose basename does not match one of the patterns will be excluded. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<workspace_corpus> | yes | yes | An actual corpus which will be created in this workspace context. |
Each <workspace_corpus> element defines a set of files,
similar to a <corpus> element. Because its common context is
a workspace, we've chosen to allow the partition and size
information to be specified independently for each of these
elements, rather than establishing them in the common workspace
context (as <corpora> does for <corpus>). Within the
workspace context, you can further restrict each corpus in the
same way as you restrict the context itself, and also identify a
special, unique corpus as containing the remainder of the files in
the workspace context.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built. |
document_statuses | a string | no | If present, a comma-separated list of document statuses. Any document in the workspace context which doesn't have one of these statuses will be excluded. |
users | a string | no | If present, a comma-separated list of workspace users. Any document in the workspace context which is assigned to a user which isn't one of these users will be excluded. |
include_unassigned | "no" | no | If present, documents in the workspace context which are not assigned to any workspace user will be excluded. |
basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document in the workspace context which is not in one of these basename sets will be excluded. |
basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document in the workspace context whose basename does not match one of the patterns will be excluded. |
use_remainder | "yes" | no | If present, the corpus consists of all those documents in the workspace context which are not included in any sibling <workspace_corpus>. This attribute-value pair must occur without any of the other qualifying attributes. |
The child elements of <workspace_corpus> have the same semantics as
the identically named elements in the scope of the <corpora> element.
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<partition> | no | yes | The proportional partition settings for this workspace corpus. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. |
<fixed_partition> | no | yes | The fixed partition settings for this workspace corpus. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions within a given <workspace_corpus> element. |
<size> | no | no | The size settings for this workspace corpus. |
For <partition>, <fixed_partition> and <size> within
<workspace_corpus>, see the corresponding elements of <corpora>
for details.
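Putting the workspace elements together, a sketch might look like this. The workspace path, user name and corpus names are hypothetical.

```xml
<!-- The context: only gold and reconciled documents are eligible. -->
<workspace_corpora workspace_dir="/data/workspaces/ne_workspace"
                   document_statuses="gold,reconciled">
  <!-- Documents in the context assigned to one particular user,
       with their own 80/20 split. -->
  <workspace_corpus name="ws_user1" users="user1">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
  </workspace_corpus>
  <!-- Everything in the context not claimed by a sibling corpus;
       use_remainder appears without other qualifying attributes. -->
  <workspace_corpus name="ws_rest" use_remainder="yes"/>
</workspace_corpora>
```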
Each experiment can also contain a number of model sets, declared in
<model_sets> elements. A model set is a sequence of models built out
of the same corpus, with successively larger numbers of training
inputs. This iterative capability does not have to be used, but is
available if the user wants to track the change in performance
relative to the number of training documents.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<build_settings> | no | no | The instructions for building the model sets in this bundle. |
<model_set> | yes | yes | A model set. |
The <build_settings> element controls how the models are built. In
order to run an experiment, you must either (a) have declared
your <model_config> in your task.xml file and specified a
value for the "class" attribute there, in which case you can
override its values in <build_settings>, or (b) specify a
model_class directly in <build_settings>, in which case the only
settings will be the ones you specify explicitly. The training
engine you're most likely to use is the Carafe engine.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
config_name | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name. Either this attribute or model_class can be provided, but not both. |
model_class | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If you have your own model builder object (this is highly unlikely; such an object would be a subclass of MAT.ModelBuilder.ModelBuilder, and you'd have to work out how to customize it, but it is possible), the task will not be consulted at all. Either this attribute or config_name can be provided, but not both. |
<attr> | a string | no | An attribute-value pair which overrides the attribute values for your chosen training engine (if config_name is provided) or specifies the attribute values (if model_class is provided). |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<iterator> | no | yes | An iterator which can be applied to create a sequence of models for these model sets. |
It is possible to iterate through a set of values for the model
builder using an <iterator> element within <build_settings>.
For instance, the default Carafe
engine allows you to customize the degree of L1
regularization (see the Carafe documentation for details). You
might want to build a series of models exploring the effects of L1
regularization values from 0.0 through 2.0, at increments of .2.
Or, you might want to vary the number of training iterations the
engine performs from 10 to 150 at increments of 10.
You can specify multiple iterators, and you'll get the
cross-product of the settings. The iterator mechanism is flexible
enough that you can build
iterators which depend on the last model built, or iterators
which specify their own model builder class (if, for example, they
need to do some extensive computation on the training corpus
before they train).
Corpus setting iterators and build setting iterators are both
applied to the model, and the cross-product of the possible values
is used. The corpus setting iterators are applied first.
The build settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
values | a string | yes | A comma-delimited set of values to iterate over. |
value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |
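A "value" iterator in the build settings might be sketched as follows. The attribute name "max_iterations" is a hypothetical stand-in for whatever setting your training engine actually exposes, and the model set and corpus names are placeholders.

```xml
<model_sets>
  <build_settings>
    <!-- "max_iterations" is an assumed attribute name. One model is
         built for each of the three listed values. -->
    <iterator type="value" attribute="max_iterations"
              values="10,50,100" value_type="int"/>
  </build_settings>
  <model_set name="news_model">
    <training_corpus corpus="news" partition="train"/>
  </model_set>
</model_sets>
```

Specifying value_type="int" matters here because the default type is "str", which would pass the values to the engine as strings.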
Here are the attributes and values available for the "increment"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
start_val | an integer or float | yes | The initial value for the incrementer. |
end_val | an integer or float | yes | The final value for the incrementer. |
increment | an integer or float | yes | The increment to add to the value on each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing model iterations from 20 to 150 by increments of 20, the last value processed will be 140, unless you provide this setting. This setting is also useful with float values, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
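The force_last example from the table can be written out as the sketch below. As before, "max_iterations" is an assumed attribute name, not one confirmed by this document.

```xml
<build_settings>
  <!-- Steps through 20, 40, ..., 140. Because force_last is set,
       the end value 150 is also processed even though it is not an
       exact multiple of the increment. -->
  <iterator type="increment" attribute="max_iterations"
            start_val="20" end_val="150" increment="20"
            force_last="yes"/>
</build_settings>
```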
In addition to the model builder itself, you can configure
properties of the training corpus as well, using the
<corpus_settings> element. At the moment, the only
property of the training corpus you can configure is its size, and
this is mostly in service of the iterator over corpus size.
When you specify the size of a corpus, the engine truncates the
training corpus to the specified length.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
size | an integer | no | By default, the corpus size is defined in the <corpora> element. However, you can further specify the size here, if you want the corpus to be even smaller than what's specified in <corpora>, or if your training corpus is a union of a number of different corpora (see the <training_corpus> element below). |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<iterator> | no | yes | An iterator which can be applied to create a sequence of corpora for these model sets. |
It is possible to iterate through a set of values for the model
corpus using an <iterator> element within <corpus_settings>.
Right now, the only available iterator is the "corpus_size"
iterator (although you can also define
your own if you need to).
Corpus setting iterators and build setting iterators are both
applied to the model, and the cross-product of the possible values
is used. The corpus setting iterators are applied first.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either the predefined iterator type "corpus_size", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the "corpus_size" iterator type are listed immediately below. |
Here are the attributes and values available for the
"corpus_size" iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
start_val | an integer | no | The initial value for the corpus size. Defaults to the increment. |
end_val | an integer | no | The final value for the corpus size. Defaults to the size of the corpus. |
increment | an integer | yes | The corpus size increment for each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. So if the corpus has 176 documents in it, and you've specified an increment of 20, the last corpus size that will be processed is 160, unless this option is specified. |
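A sketch of the "corpus_size" iterator, following the 176-document example in the table. The model set and corpus names are placeholders.

```xml
<model_sets>
  <corpus_settings>
    <!-- start_val defaults to the increment and end_val to the
         corpus size, so a 176-document training corpus yields sizes
         20, 40, ..., 160, plus 176 because force_last is set. -->
    <iterator type="corpus_size" increment="20" force_last="yes"/>
  </corpus_settings>
  <model_set name="learning_curve">
    <training_corpus corpus="news" partition="train"/>
  </model_set>
</model_sets>
```

This is the usual arrangement for tracking how performance changes with the number of training documents.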
The <model_set> element defines a single model set.

Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<training_corpus> | yes | yes | One or more corpora (and possibly partitions of corpora) which should be used to construct this model set. |
The <training_corpus> element specifies a training corpus to use in
building this model set. May be repeated. Each corpus is referred to
by name, and an optional partition name.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
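A model set trained on the union of two corpora could be sketched like this. The corpus names "news" and "blogs" and the partition name "train" are hypothetical and would have to match corpora declared elsewhere in the same experiment file.

```xml
<model_sets>
  <model_set name="combined_model">
    <!-- Only the "train" partition of the news corpus... -->
    <training_corpus corpus="news" partition="train"/>
    <!-- ...plus the whole of the blogs corpus. -->
    <training_corpus corpus="blogs"/>
  </model_set>
</model_sets>
```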
The experiment can also have a set of runs, declared in <runs>
elements. The runs in each <runs> element share a set of run
settings. Whenever the experiment is run, each <run> is scored,
whether or not it's been scored before. This is a convenient way of
reviewing the scores after an experiment is finished.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<run_settings> | yes | no | A container for the arguments to run the processing engine with. |
<run> | yes | yes | An experimental run. |
The <run_settings> element is a container for the arguments to run
the processing engine with.

Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<args> | yes | no | The arguments to the MATEngine to use for these experiment runs. |
<prep_args> | no | no | The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that. |
<score_args> | no | no | Flags which control the behavior of the scorer. |
The <args> element houses the arguments to the MATEngine command to
perform the experiment runs.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
The <prep_args> element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The output_file_type attribute must be specified (you're restricted to mat-json and raw). The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
The flags in the <score_args> element control the behavior of the scorer. They are not yet processed generically the way <prep_args> and <args> are; for the moment, each one is individually defined and handled.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
gold_only | "yes" | no | Equivalent to the --ref_gold_only option of MATScore. If this attribute is provided, only gold or reconciled segments in the reference will be used for scoring comparison. This is particularly useful when defining experiment files to be used with workspaces. |
similarity_profile | a string | no | Equivalent to the --similarity_profile option of MATScore. |
score_profile | a string | no | Equivalent to the --score_profile option of MATScore. |
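The three argument elements might be combined as in the following sketch. The workflow name "Demo" and the step names "zone", "tokenize" and "tag" are assumptions made for illustration; the real names depend on the task you are using.

```xml
<run_settings>
  <!-- Keep the test documents as MAT JSON and undo an (assumed)
       "tag" step instead of converting them to raw documents. -->
  <prep_args output_file_type="mat-json" workflow="Demo"
             undo_through="tag"/>
  <!-- Score only against gold or reconciled reference segments. -->
  <score_args gold_only="yes"/>
  <!-- The workflow attribute is required; input and output file
       arguments are filled in by the experiment engine. -->
  <args workflow="Demo" steps="zone,tokenize,tag"/>
</run_settings>
```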
It is possible to iterate through a set of values for the run
using an <iterator> element within <run_settings>. For
instance, the default Carafe engine
allows you to customize the recall/precision bias (see the Carafe
documentation for details). You might want to build a series of
models exploring the effects of recall/precision bias values from
-2.0 through 2.0, at increments of .5.
You can specify multiple iterators, and you'll get the
cross-product of the settings. The iterator mechanism is flexible
enough that you can build
your own iterators if you need them.
The run settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element for workflow steps in task.xml. |
values | a string | yes | A comma-delimited set of values to iterate over. |
value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |
Here are the attributes and values available for the "increment"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element for workflow steps in task.xml. |
start_val | an integer or float | yes | The initial value for the incrementer. |
end_val | an integer or float | yes | The final value for the incrementer. |
increment | an integer or float | yes | The increment to add to the value on each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing recall/precision bias from -2.0 to 2.2 by increments of .5, the last value processed will be 2.0, unless you provide this setting. This setting is also useful with float values even if it appears that the endpoints match the increment precisely, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
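The recall/precision bias sweep described earlier might be sketched as follows. The attribute name "prior_adjust", the workflow name and the step names are assumptions; check your engine's documentation for the actual option name.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- "prior_adjust" is an assumed name for the bias setting.
       force_last guards against float rounding at the endpoint. -->
  <iterator type="increment" attribute="prior_adjust"
            start_val="-2.0" end_val="2.0" increment=".5"
            force_last="yes"/>
</run_settings>
```

Adding a second <iterator> element would produce the cross-product of the two sets of values.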
The <run> element defines an experimental run.

Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted. |
model | a string | yes | The name of a model to use. This string must match the "name" value of some <model_set> element in the experiment file. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<test_corpus> | yes | yes | One or more test corpora (and possibly partitions of corpora) to use in this run. |
The <test_corpus> element specifies a test corpus (and possibly a
partition of that corpus) to use in this run. May be repeated.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
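To tie the elements together, here is a sketch of a small end-to-end experiment file: one corpus split into training and test partitions, one model set, and one run. The task name, workflow name, step names and all corpus, model and run names are assumptions chosen for illustration.

```xml
<experiment task="Named Entity">
  <corpora>
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="news">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets>
    <model_set name="news_model">
      <!-- Train on the "train" partition only. -->
      <training_corpus corpus="news" partition="train"/>
    </model_set>
  </model_sets>
  <runs>
    <run_settings>
      <!-- Assumed workflow and step names. -->
      <args workflow="Demo" steps="zone,tokenize,tag"/>
    </run_settings>
    <!-- "model" must match the name of a model set above. -->
    <run name="news_run" model="news_model">
      <test_corpus corpus="news" partition="test"/>
    </run>
  </runs>
</experiment>
```

Each name attribute doubles as a cross-reference: the run refers to the model set by name, and both the model set and the run refer to the corpus and its partitions by name.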