Use cases for the XML format for the experiment files (see MATExperimentEngine) are described here; the exhaustive description of the format is in the experiment XML reference document.
In all the examples below, we're going to use the sample "Named
Entity" task.
The simplest possible experiment involves a single corpus, a single model, and a single run. Assume you have a set of completed documents in /documents/newswire/*.json.
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>
This experiment takes a single set of documents, and designates
80% of the set for training and the remaining 20% for test. It then
generates a single model from the training documents, and executes a
single run using this model against the test documents.
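You run this experiment the same way as any other: save the XML to a file and point MATExperimentEngine at it, along with a directory for the experiment's output. For example (the file and directory names here are illustrative):

% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /experiments/exp1 /experiments/xml/exp.xml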
If all your documents have the ".json" extension, and you want to reuse this experiment XML file, just change the <pattern> element entry to a relative pathname and use the --pattern_dir argument when you call MATExperimentEngine:
...
  <corpus name="test">
    <pattern>*.json</pattern>
  </corpus>
...
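You'd then name the directory containing the documents when you invoke the engine. For example (again, the directory names are illustrative):

% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /experiments/exp1 --pattern_dir /documents/newswire /experiments/xml/exp.xml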
Let's say you've set aside a test corpus which you want to hold
constant across a set of experiments, in
/documents/newswire-test/*.json. You can use an experiment XML file
such as this one:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <corpus name="train_nw">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
    <corpus name="test_nw">
      <pattern>/documents/newswire-test/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="train">
      <training_corpus corpus="train_nw"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="train">
      <test_corpus corpus="test_nw"/>
    </run>
  </runs>
</experiment>
Here, we have two separate corpora, which are not split; one is used
as a training corpus, and the other as a testing corpus. We
generate one model, and one run.
Let's say you have two corpora, and you want to split each of them
4-to-1, and use the larger slice of each of them, together, to build a
single model, and test against the smaller slice of each of them, in a
single run:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="nw1">
      <pattern>/documents/newswire-1/*.json</pattern>
    </corpus>
    <corpus name="nw2">
      <pattern>/documents/newswire-2/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="train">
      <training_corpus corpus="nw1" partition="train"/>
      <training_corpus corpus="nw2" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="train">
      <test_corpus corpus="nw1" partition="test"/>
      <test_corpus corpus="nw2" partition="test"/>
    </run>
  </runs>
</experiment>
Sometimes, you want to run the model against the corpus that produced it. In the earlier use case with the held-out test corpus (the one with the train_nw and test_nw corpora), you can modify the <runs> as follows:
...
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="train">
      <test_corpus corpus="train_nw"/>
    </run>
  </runs>
...
Suppose you want to know how the performance of your model changes as the amount of training data grows. To answer this question, we add a <build_settings> element to <model_sets>, as follows:
...
  <model_sets dir="model_sets">
    <build_settings training_increment="50"/>
    <model_set name="test">
      <training_corpus corpus="test"/>
    </model_set>
  </model_sets>
...
In this case, we're telling the experiment engine to build a model
at 50-document increments. So if the corpus contains 150 documents, the
experiment engine will build three models, and produce one set of three
runs.
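That is, one model is built from 50 of the training documents, one from 100, and one from all 150; each of the three models is then tested in its own run.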
If your corpus has more than 100 documents, but less than 150, the
above values for <build_settings> will still build three models.
If you don't want a model built for the remainder, use the
"truncate_to_increment" attribute:
...
  <model_sets dir="model_sets">
    <build_settings training_increment="50" truncate_to_increment="yes"/>
    <model_set name="test">
      <training_corpus corpus="test"/>
    </model_set>
  </model_sets>
...
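So, for example, with these settings and a 130-document training corpus, models are built at 50 and 100 documents and the 30-document remainder is discarded; without "truncate_to_increment", a third model would be built from all 130 documents.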
Let's say you have two sets of completed documents: a set of
newswire documents, in /documents/newswire/*.json, and a set of chat
transcripts, in /documents/chat/*.json. Both these document sets are
tagged with the same tag set. If you want to know how a model built
against each will work on the other, here's an experiment XML file that
accomplishes that:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="newswire">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
    <corpus name="chat">
      <pattern>/documents/chat/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="newswire">
      <training_corpus corpus="newswire" partition="train"/>
    </model_set>
    <model_set name="chat">
      <training_corpus corpus="chat" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="nw_train_nw_test" model="newswire">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
    <run name="nw_train_chat_test" model="newswire">
      <test_corpus corpus="chat" partition="test"/>
    </run>
    <run name="chat_train_chat_test" model="chat">
      <test_corpus corpus="chat" partition="test"/>
    </run>
    <run name="chat_train_nw_test" model="chat">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
  </runs>
</experiment>
This experiment XML file will split each corpus 80%/20%, and build
two models, one from each corpus. Finally, it performs a four-way
comparison between the models and the test subsets of the corpora.
Let's say that you have a Carafe lexicon directory, as described in the documentation for MATModelBuilder. You want to know whether using this lexicon results in a better model. Here's an experiment XML file which accomplishes that:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="newswire">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="newswire">
      <training_corpus corpus="newswire" partition="train"/>
    </model_set>
  </model_sets>
  <model_sets dir="model_sets">
    <build_settings lexicon_dir="/documents/newswire_lexicon/"/>
    <model_set name="newswire_w_lex">
      <training_corpus corpus="newswire" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="w_lex" model="newswire_w_lex">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
    <run name="wo_lex" model="newswire">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
  </runs>
</experiment>
In this case, there are two different <model_sets> elements,
because the build settings for the enclosed models differ. We have one
corpus, two models, and two runs.
You can further specify any of the advanced settings for the
trainer, if you know what you're doing. See MATModelBuilder for whatever
documentation is available.
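For example, if your trainer accepts a max_iterations setting (an illustrative name; check the MATModelBuilder documentation for the options your trainer actually supports), it would presumably appear as an additional attribute on <build_settings>, following the same pattern as lexicon_dir above:

...
  <model_sets dir="model_sets">
    <build_settings lexicon_dir="/documents/newswire_lexicon/" max_iterations="100"/>
    <model_set name="newswire_w_lex">
      <training_corpus corpus="newswire" partition="train"/>
    </model_set>
  </model_sets>
...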
The Carafe tagger has the option of biasing precision and recall differently during automated tagging, using the --prior_adjust flag. If you want to compare two decoding strategies, one which biases heavily toward recall and one which biases heavily toward precision, you might do this:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="newswire">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="newswire">
      <training_corpus corpus="newswire" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo" prior_adjust="-3.0"/>
    </run_settings>
    <run name="recall_bias" model="newswire">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
  </runs>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo" prior_adjust="3.0"/>
    </run_settings>
    <run name="precision_bias" model="newswire">
      <test_corpus corpus="newswire" partition="test"/>
    </run>
  </runs>
</experiment>
In this case we have two different <runs> elements, because
the run settings differ for the two runs. So we end up with one corpus,
one model, and two runs.
Sometimes, you may need to do some preprocessing of a corpus. Let's assume that your completed documents are inline XML documents whose annotations overlay the signal, rather than MAT JSON documents, and that they need to be zoned, tokenized, and aligned before the experiment can use them. To do this during the experiment, you'd use the <prep> element:
...
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <prep steps="zone,tokenize,align" workflow="Preprocess" input_file_type="xml-inline" xml_input_is_overlay="yes"/>
    <corpus name="test">
      <pattern>/documents/newswire/*.xml</pattern>
    </corpus>
  </corpora>
...
Let's say, for instance, that you're working with the MUC (Message
Understanding Conference) corpus, and you're not tagging the header
portions of the documents. Under normal circumstances, when it prepares
an experiment run, the experiment engine converts the test documents to
raw text, and processes them starting from raw text. However, in this
case, you can't actually recreate the zoning with your own zoner; you
need the zoning as it was provided in the MUC documents. In this
situation, you can use the <prep_args> element in the <run>
element to specify a set of parameters to MATEngine to modify the
default test document preparation:
...
  <runs dir="runs">
    <run_settings>
      <prep_args output_file_type="mat-json" undo_through="tag" workflow="Demo"/>
      <args steps="tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
...
Here, instead of undoing all steps by using an output_file_type of
"raw" (which is the default), we undo the "tag" step and use MAT JSON
documents as the inputs to the run; we see that the <args> for
the run only does the "tag" step.
Sometimes, you might want to prepare a corpus ahead of time, with a
fixed partition, a fixed prep phase, or the like. You can use the
experiment engine to create a corpus alone, and then refer to that
corpus elsewhere.
For instance, you might prepare the corpus in the previous use case
with nothing in the <experiment> element except the
<corpora>:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <prep steps="zone,tokenize,align" workflow="Preprocess" input_file_type="mat-json"/>
    <corpus name="test">
      <pattern>/documents/newswire/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
Assume we save this XML file to /experiments/xml/corpus.xml, and
output the experiment into /experiments/corpus1:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /experiments/corpus1 /experiments/xml/corpus.xml
The corpus will be in the "corpora" subdirectory, in the
subdirectory named "test" (the name of the corpus).
Now, let's refer to it in a different experiment XML file:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <corpus name="local_test" source_corpus_dir="/experiments/corpus1/corpora/test"/>
  </corpora>
  ...
</experiment>
Instead of including a <pattern> element, we use the
"source_corpus_dir" attribute. The corpus referred to can itself have a
"source_corpus_dir" attribute (i.e., you can chain them). Local
<prep> or <partition> elements can augment or override
remote elements; the combinations are complex, and you can find more
documentation on them in the experiment
XML reference.
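For instance, a local partition can be layered over a remote corpus like this (a sketch only; whether the local elements replace or combine with any partitions the remote corpus already has is governed by the rules in the reference):

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".9"/>
    <partition name="test" fraction=".1"/>
    <corpus name="local_test" source_corpus_dir="/experiments/corpus1/corpora/test"/>
  </corpora>
  ...
</experiment>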