Tutorial 7: The Experiment Engine

At any point, you might want to know how you're doing.

These and other questions can be answered easily by the experiment engine, MATExperimentEngine. The power of the experiment engine lies largely in its rich XML configuration. In this tutorial, we'll learn how to use the experiment engine to answer one of the questions above, and you can examine the other documentation to see how you might answer other questions, as illustrated in the use cases.

We'll use the same simple named entity task that comes with MAT, and we'll assume that the task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.

Step 1: Review your XML file for question 1

This step is fairly easy, because the XML file to answer the first question is included as part of the distribution. The XML file is found in MAT_PKG_HOME/sample/ne/test/exp/exp.xml, and it looks like this:

<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>

This is one of the simplest complete experiment XML files you can create. As with all experiment XML files, it describes three types of entities: corpora, model sets, and runs.

So this experiment takes a single set of documents, and designates 80% of the set for training and the remaining 20% for test. It then generates a single model from the training documents, and executes a single run using this model against the test documents.
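
To make this structure concrete, here is a minimal Python sketch (not part of MAT) that parses the sample file with the standard library's xml.etree.ElementTree and lists the partitions, corpora, model sets and runs it declares. The function name summarize_experiment is hypothetical, and the sketch only reads the XML; it makes no claim about how the engine itself interprets the file.

```python
import xml.etree.ElementTree as ET

# The sample experiment file from Step 1, inlined so the sketch is self-contained.
EXP_XML = """
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>
"""


def summarize_experiment(xml_text):
    """Hypothetical helper: describe the three entity types in an experiment file."""
    root = ET.fromstring(xml_text)
    return {
        "task": root.get("task"),
        # Partition name -> fraction of the corpus assigned to it.
        "partitions": {p.get("name"): float(p.get("fraction"))
                       for p in root.findall("corpora/partition")},
        # Corpus name -> list of file patterns.
        "corpora": {c.get("name"): [p.text for p in c.findall("pattern")]
                    for c in root.findall("corpora/corpus")},
        # Model set name -> (corpus, partition) pairs used for training.
        "model_sets": {m.get("name"): [(t.get("corpus"), t.get("partition"))
                                       for t in m.findall("training_corpus")]
                       for m in root.findall("model_sets/model_set")},
        # Run name -> model used and (corpus, partition) pairs tested against.
        "runs": {r.get("name"): {"model": r.get("model"),
                                 "test": [(t.get("corpus"), t.get("partition"))
                                          for t in r.findall("test_corpus")]}
                 for r in root.findall("runs/run")},
    }


summary = summarize_experiment(EXP_XML)
for key, value in summary.items():
    print(key, "->", value)
```

Reading the summary from top to bottom mirrors the description above: one corpus split into two partitions, one model set trained on the "train" partition, and one run tested on the "test" partition.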

Step 2: Run the experiment

This step is performed on the command line. Try it:

Unix:

% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/exp \
--pattern_dir $PWD/sample/ne/resources/data/json sample/ne/test/exp/exp.xml

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\exp ^
--pattern_dir %CD%\sample\ne\resources\data\json sample\ne\test\exp\exp.xml

The --exp_dir argument is the directory where the corpora, models and runs will be computed (and stored, if necessary), and where the results will be found. The --pattern_dir argument is the directory in which to look for the files referred to in the <pattern> elements in the experiment XML file; the patterns are Unix "glob" patterns, the standard file patterns familiar to any user of the Unix shell. The final argument is the experiment XML file itself.
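
If glob patterns are unfamiliar, the following Python sketch may help. It uses the standard library's fnmatch module, which applies roughly the same wildcard rules to a single file name as the shell does. The file names and the helper matching_names are made up for illustration; this is not the engine's own matching code.

```python
from fnmatch import fnmatch


def matching_names(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]


# Hypothetical directory contents, for illustration only.
names = ["doc1.txt.json", "doc2.txt.json", "README.txt", "notes.json.bak"]

# "*" matches any run of characters, so "*.json" selects names ending in ".json".
print(matching_names(names, "*.json"))
```

Under these rules, "notes.json.bak" is not selected, because the pattern must match the whole name and not just part of it.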

The engine will create the directory, copy the experiment XML file into it for archive purposes, and then run the experiment as described in step 1.

Step 3: Review the results

Look in the experiment directory.

Unix:

% ls /tmp/exp

Windows native:

> dir %TMP%\exp

allbytag.csv corpora model_sets
allbytoken.csv exp.xml runs

The corpora, model_sets and runs subdirectories are as specified in the experiment XML file above (that's what the "dir" attribute does). What you'll be most interested in are the files allbytag.csv and allbytoken.csv. These files contain the tag-level and token-level scoring results for all the runs. The format and interpretation of these results are described in the documentation for MATScore, except that the initial columns are different; you can find a description of the differences in the documentation for MATExperimentEngine.
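
If you'd rather inspect these files programmatically than in a spreadsheet, a minimal Python sketch along the following lines should be enough to get started. It assumes only that the files are ordinary comma-separated files with a header row; it does not assume any particular column names (see the MATScore and MATExperimentEngine documentation for those). The helper load_scores is hypothetical, and the path is the Unix one used in this tutorial.

```python
import csv
import os


def load_scores(csv_path):
    """Read a scoring CSV and return (column_names, rows as dictionaries)."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames
    return columns, rows


# Path from this tutorial's Unix example; adjust it for your platform.
path = "/tmp/exp/allbytag.csv"
if os.path.exists(path):
    columns, rows = load_scores(path)
    print("columns:", columns)
    print("rows:", len(rows))
else:
    print("no results at", path, "(run the experiment first)")
```

The same function can be pointed at allbytoken.csv to load the token-level results.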

Under /tmp/exp/runs, you'll see a directory for each named run (in this case, only "test"):

Unix:

% ls /tmp/exp/runs/test

Windows native:

> dir %TMP%\exp\runs\test

8 bytag.csv details.csv run_input
_done bytoken.csv properties.txt

The important elements here are the individual scoring files bytag.csv and bytoken.csv, which are (approximately) the subset of the corresponding overall scoring files which is relevant to this run. Of greater interest is details.csv, which is the detail spreadsheet for this run. These detail spreadsheets are not aggregated at the top level because they contain an entry for each tag, and the volume of data would likely be too great.
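
For an experiment with many runs, it can be handy to check which of these per-run files are present. The Python sketch below walks the runs subdirectory and reports, for each run, which of the three scoring files named above it contains. It assumes the runs directory is called "runs", as set by the "dir" attribute in this tutorial's XML file; the helper find_run_outputs is hypothetical and is not a MAT utility.

```python
import os

# The per-run scoring files described in this tutorial.
RUN_FILES = ("bytag.csv", "bytoken.csv", "details.csv")


def find_run_outputs(exp_dir, runs_subdir="runs"):
    """Map each run name to the scoring files found in its directory."""
    runs_dir = os.path.join(exp_dir, runs_subdir)
    found = {}
    if not os.path.isdir(runs_dir):
        return found
    for run_name in sorted(os.listdir(runs_dir)):
        run_dir = os.path.join(runs_dir, run_name)
        if not os.path.isdir(run_dir):
            continue
        found[run_name] = [name for name in RUN_FILES
                           if os.path.isfile(os.path.join(run_dir, name))]
    return found


# Experiment directory from this tutorial's Unix example.
for run_name, files in find_run_outputs("/tmp/exp").items():
    print(run_name, "->", files)
```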

For more details about the structure of the experiment output directory, see MATExperimentEngine. For detailed examples for the other questions posed above, see the experiment XML documentation.

Step 4: Run an experiment against a workspace

Workspaces are just folders of files. If you've done Tutorial 6, and you kept your workspace around, you can run a simple experiment against that workspace using the following experiment XML file:

<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>/tmp/ne_workspace/folders/completed/*</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>

This experiment XML file refers directly to the contents of the "completed" folder in your workspace, so you can omit the --pattern_dir argument when you run MATExperimentEngine. If you save this file to ws.xml in your temp directory, you can run the experiment as follows:

Unix:

% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/ws_exp /tmp/ws.xml

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\ws_exp %TMP%\ws.xml
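
Because the pattern in this file is an absolute path, you can check what it will pick up before launching the experiment. The Python sketch below expands the pattern with the standard library's glob module. It assumes the workspace lives at /tmp/ne_workspace, as in the XML above, and that the engine expands the pattern the way an ordinary glob does; the helper matched_files is hypothetical.

```python
import glob


def matched_files(pattern):
    """Return the sorted list of paths matched by a glob pattern."""
    return sorted(glob.glob(pattern))


# The pattern used in the workspace experiment file above.
pattern = "/tmp/ne_workspace/folders/completed/*"
files = matched_files(pattern)
print(len(files), "file(s) match", pattern)
```

If no files match, the "completed" folder is empty or the workspace is somewhere else, and there is nothing for the experiment to partition.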

Step 5: Clean up (optional)

Remove your experiment directories:

Unix:

% rm -rf /tmp/ws_exp /tmp/exp

Windows native:

> rd /s /q %TMP%\ws_exp
> rd /s /q %TMP%\exp

If you're not planning on doing any other tutorials, remove the workspace:

Unix:

% rm -rf /tmp/ne_workspace

Windows native:

> rd /s /q %TMP%\ne_workspace

If you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.

This concludes Tutorial 7.