At any point, you might want to know how you're doing: for instance, how good is the model you've trained so far?
These and other questions can be answered easily by the experiment
engine, MATExperimentEngine.
The power of the experiment engine lies largely in its rich XML configuration. In this
tutorial, we'll learn how to use the experiment engine to answer one of
the questions above; the use cases illustrated in the experiment XML
documentation show how you might answer the others.
We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line. Because this tutorial involves the command line, make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
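In particular, the commands below assume that the MAT_PKG_HOME environment variable points at your MAT installation. If you want to double-check before you start (this is plain shell usage, nothing MAT-specific), echo it:
Unix:
% echo $MAT_PKG_HOME
Windows native:
> echo %MAT_PKG_HOME%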
This step is fairly easy, because the XML file to answer the first
question is included as part of the distribution. The XML file is found
in MAT_PKG_HOME/sample/ne/test/exp/exp.xml, and it looks like this:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>
This is one of the simplest complete experiment XML files you can
create. As with all experiment XML files, it describes three types of
entities: corpora, model sets, and runs.
So this experiment takes a single set of documents, and designates
80% of the set for training and the remaining 20% for test. It then
generates a single model from the training documents, and executes a
single run using this model against the test documents.
This operation is a command-line operation. Try it:
Unix:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/exp \
--pattern_dir $PWD/sample/ne/resources/data/json sample/ne/test/exp/exp.xml
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\exp --pattern_dir %CD%\sample\ne\resources\data\json sample\ne\test\exp\exp.xml
The --exp_dir is the directory where the corpora, models and runs
will be computed (and stored, if necessary), and where the results will
be found. The --pattern_dir is the directory in which to look for the
files referred to in the <pattern> elements in the experiment XML
file; the patterns are Unix "glob" patterns, the standard wildcard
file patterns that should be familiar to any user of the Unix
shell. The final argument is the experiment XML file itself.
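Since the <pattern> in our experiment file is *.json, you can preview exactly which files the engine will partition by expanding the same glob yourself; this is just an ordinary directory listing, nothing MAT-specific:
Unix:
% ls $PWD/sample/ne/resources/data/json/*.json
Windows native:
> dir %CD%\sample\ne\resources\data\json\*.json
However many documents this lists, roughly 80% of them will end up in the train partition and the remaining 20% in the test partition.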
The engine will create the experiment directory, copy the experiment XML file
into it for archival purposes, and then run the experiment as described
in step 1.
Look in the experiment directory.
Unix:
% ls /tmp/exp
Windows native:
> dir %TMP%\exp
allbytag.csv corpora model_sets
allbytoken.csv exp.xml runs
The corpora, model_sets and runs subdirectories are as specified in
the experiment XML file above (that's what the "dir" attribute does).
What you'll be
most interested in are the files allbytag.csv and allbytoken.csv. These
files contain the tag-level and token-level scoring results for all the
runs. The format and interpretation of these results are described in the
documentation for MATScore, except that
the initial columns are different; the differences are described
in the documentation for MATExperimentEngine.
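If you just want to confirm that scores were produced, you can peek at the first few rows of one of these spreadsheets (head is a standard Unix utility; on Windows, more will page through the file). The column layout itself is documented in MATScore and MATExperimentEngine, not here:
Unix:
% head -5 /tmp/exp/allbytag.csv
Windows native:
> more %TMP%\exp\allbytag.csv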
Under /tmp/exp/runs, you'll see a directory for each named run (in
this case, only "test"):
Unix:
% ls /tmp/exp/runs/test
Windows native:
> dir %TMP%\exp\runs\test
8 bytag.csv details.csv run_input
_done bytoken.csv properties.txt
The important elements here are the individual scoring files
bytag.csv and bytoken.csv, which are (approximately) the subset of the
corresponding overall scoring files that is relevant to this run. Of
greater interest is details.csv, the detail spreadsheet for
this run. These detail spreadsheets are not aggregated at the top level
because they contain an entry for each tag, and the volume of data
would likely be too great.
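To get a sense of how much larger the per-run detail spreadsheet is than the aggregated summaries, you can compare line counts; this is plain line counting with standard OS tools, and the exact numbers will depend on your corpus:
Unix:
% wc -l /tmp/exp/runs/test/details.csv /tmp/exp/runs/test/bytag.csv
Windows native:
> find /c /v "" %TMP%\exp\runs\test\details.csv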
For more details about the structure of the experiment output
directory, see MATExperimentEngine.
For detailed examples addressing the other questions posed above, see the experiment XML documentation.
Under the hood, workspaces are just directories of files, so the
experiment engine can refer to their contents directly. If you've done
Tutorial 6, and you kept your workspace around, you can run a simple
experiment against that workspace using the following experiment XML file:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>/tmp/ne_workspace/folders/completed/*</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>
This experiment file refers directly to the contents of the
"completed" folder in your workspace, so you can omit the --pattern_dir
argument when you run MATExperimentEngine. (If you're on Windows, adjust
the <pattern> element to point at the actual location of your workspace.)
If you save this file to ws.xml in your temp directory, you can run the
experiment as follows:
Unix:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/ws_exp /tmp/ws.xml
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\ws_exp %TMP%\ws.xml
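When the run completes, the results for this experiment appear under the new experiment directory, just as they did for the first experiment; you can list it the same way:
Unix:
% ls /tmp/ws_exp
Windows native:
> dir %TMP%\ws_exp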
Now, clean up after yourself. First, remove the experiment directories:
Unix:
% rm -rf /tmp/ws_exp /tmp/exp
Windows native:
> rd /s /q %TMP%\ws_exp
> rd /s /q %TMP%\exp
If you're not planning on doing any other tutorials, also remove the workspace:
Unix:
% rm -rf /tmp/ne_workspace
Windows native:
> rd /s /q %TMP%\ne_workspace
If you don't
want the "Named Entity" task hanging around, remove it as shown in the
final step of Tutorial 1.