Experiment Engine

Description

The experiment engine runs the experiment described in an experiment XML file. The experiment XML file contains three types of information: the corpora, the model sets to build, and the experiment runs.

The experiment engine runs this experiment in a directory which is provided to it either via the XML file (the "dir" attribute of the <experiment> element) or on the command line (the --exp_dir option). The experiment engine prepares the corpora, builds the models, and performs the experiment runs in this directory. The experiment XML file is copied into the experiment directory, if a file with the same name is not already present.

By default, the engine can continue an experiment which is halted in the middle. Each corpus, model set and run stores its metadata in a file called "properties.txt" in its specified directory, and keeps track of whether it's been completed or not. If the engine fails in the middle, it will not redo work it knows has been completed. The --force argument overrides this default behavior, and ought to force a full rerun of the experiment; however, the interactions among the components are extremely complex, and --force often fails. If you want to rerun an experiment, the safest thing to do is use a different experiment directory.
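Because each corpus, model set and run keeps a "properties.txt" file in its own directory, one quick way to see which components the engine has already created is to list the directories that contain such a file. The following Python sketch does only that. It does not read the file, since its internal format is not described here, and the function name find_component_dirs is hypothetical rather than part of MAT.

```python
import os

def find_component_dirs(exp_dir):
    """Return the sorted paths of all directories under exp_dir that
    contain a properties.txt file (one per corpus, model set, or run)."""
    found = []
    for dirpath, dirnames, filenames in os.walk(exp_dir):
        if "properties.txt" in filenames:
            found.append(dirpath)
    return sorted(found)

# Example use (the path is illustrative):
#   for d in find_component_dirs("/documents/exp_runs/run1"):
#       print(d)
```

A directory showing up in this list only means the component has metadata on disk; whether the engine considers it completed is recorded inside the file.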

There is one exception to this generalization. If there are experimental runs present, the engine will always score them, even if it's scored them before. So an easy way to review the scores for an experiment is just to run the engine again.

Experiment output directory structure

The names of the toplevel directories might vary (or there may be more such directories), if you've provided a non-default value for "dir" to any of the elements in the XML file.

Engine output

When the engine completes its experiment runs, it produces a pair of toplevel spreadsheets, allbytag.csv and allbytoken.csv. These files contain the tag-level and token-level scoring results for all the runs. The format and interpretation of these results are described in the documentation for MATScore, except that the initial columns are different:

run
The name of the run in the experiment XML file.
train corpora
The names of the training corpora (and their partitions, if appropriate) for this run. Comma-separated.
test corpora
The names of the test corpora (and their partitions, if appropriate) for this run. Comma-separated.
tag
The label being scored, as described in MATScore.
train docs
The number of documents in the training corpus for this run.
train toks
The total number of tokens in the documents in the training corpus for this run.
train items
The total number of training "items" for this label in the documents in the training corpus for this run. For tag-level scoring, this is the number of annotations; for token-level scoring, this is the number of tokens in those annotations.
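Since the toplevel spreadsheets hold the rows for every run, a common first step is to split them by the "run" column. Here is a minimal Python sketch of that step. It assumes that the first row of the file is a header row and that the header for the first column is literally "run", as in the list above; check your own output before relying on this. The function name rows_by_run is hypothetical.

```python
import csv
from collections import defaultdict

def rows_by_run(csv_path):
    """Group the rows of a toplevel score spreadsheet by the 'run' column.

    Returns a dict mapping each run name to a list of row dicts,
    in the order the rows appear in the file."""
    grouped = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            grouped[row["run"]].append(row)
    return dict(grouped)

# Example use (the path is illustrative):
#   for run, rows in rows_by_run("allbytag.csv").items():
#       print(run, len(rows))
```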

In the directory for each run, you'll find the individual scoring files bytag.csv and bytoken.csv, which are (approximately) the subset of the corresponding overall scoring files which is relevant to this run. Of greater interest is details.csv, which is the detail spreadsheet for this run. These detail spreadsheets are not aggregated at the top level because they contain an entry for each tag, and the volume of data would likely be too great.

The command-line options --no_csv_formulas and --oo_separator control the details of how these spreadsheets are generated.
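If you post-process the score files outside a spreadsheet application, it can be useful to check whether a file was generated with formulas or with actual values. The Python sketch below assumes that formula cells begin with "=", which is the usual spreadsheet convention; the exact formulas the engine writes are not described here. The function name has_formula_cells is hypothetical.

```python
import csv

def has_formula_cells(csv_path):
    """Return True if any cell in the CSV file starts with '=',
    the usual marker for a spreadsheet formula."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if any(cell.startswith("=") for cell in row):
                return True
    return False

# Example use (the path is illustrative):
#   if has_formula_cells("allbytag.csv"):
#       print("rerun with --no_csv_formulas to get actual values")
```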

Usage

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd

Usage: MATExperimentEngine [options] <xml_file>

<xml_file>: An experiment XML file

Options

--exp_dir <dir>
Optionally, the directory the experiment will be run in. This directory may also be provided in the experiment XML file (if both are provided, the command-line setting is ignored). The directory will be created if it doesn't yet exist.
--pattern_dir <dir>
Optionally, this path is the prefix used for relative directory paths in file patterns in the <pattern> element in the corpora in the experiment XML file. Otherwise, these patterns must be absolute pathnames.
--debug
Run in debug mode. If an error is hit, the final cumulative score accumulation will be skipped.
--no_csv_formulas
By default, the experiment engine produces CSV score files with spreadsheet equations for computed values. If this flag is present, the CSV score files will contain actual values instead.
--oo_separator
By default, the experiment engine uses Excel-style formula separators in the spreadsheet equations in its CSV score files. If this flag is present, it uses OpenOffice formula separators instead. (The two formats are incompatible: the formulas in a given file will be recognized by Excel or by OpenOffice, but not by both.)
--dont_compute_confidence
By default, the experiment engine computes confidence measures when it runs the scorer. This process can be time consuming. Disable it with this flag.
--dont_rescore
By default, the experiment engine rescores completed runs when it's restarted. Use this flag to disable this feature. This should only be used for debugging purposes, because the scores from the completed runs won't be accumulated in this mode.
--subprocess_debug <i>
Set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled.
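If you launch the engine from a script, the options above can be assembled into an argument list. The following Python sketch covers a few of them for the Unix executable. The function name build_engine_command and its keyword arguments are hypothetical; only the flag names come from the list above.

```python
import os

def build_engine_command(xml_file, exp_dir=None, pattern_dir=None,
                         no_csv_formulas=False, oo_separator=False,
                         mat_pkg_home=None):
    """Assemble the argument list for a Unix MATExperimentEngine call.

    mat_pkg_home defaults to the MAT_PKG_HOME environment variable."""
    home = mat_pkg_home or os.environ["MAT_PKG_HOME"]
    cmd = [os.path.join(home, "bin", "MATExperimentEngine")]
    if exp_dir:
        cmd += ["--exp_dir", exp_dir]
    if pattern_dir:
        cmd += ["--pattern_dir", pattern_dir]
    if no_csv_formulas:
        cmd.append("--no_csv_formulas")
    if oo_separator:
        cmd.append("--oo_separator")
    # The experiment XML file is the final positional argument.
    cmd.append(xml_file)
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library, which avoids shell quoting problems with paths that contain spaces.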

Advanced options

These options are more complicated, and not as well supported. Use them at your own risk.

--force
If present, forces the engine to reprocess the entire experiment, including work it has already marked as completed.
--batch_test_runs
By default, test runs are performed as soon as the relevant model is available. This flag postpones all test runs until after all models are constructed.
--mark_done
This flag is intended for the exceptional situation where you've interrupted an experiment before it's completed, and you just want to rerun the scoring for what's already done. This flag will force the engine to mark all corpora, models and runs as completed. The effect is that from this point on, the engine will only report scores for this experiment.

Examples

For examples of the experiment XML files themselves, see the documentation for the experiment XML format.

Example 1

Let's say your experiment XML file /documents/exp_files/exp.xml contains a value for the "dir" attribute of the <experiment> element, and all the paths in the <pattern> elements are absolute. Then your invocation is simple:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd c:\documents\exp_files\exp.xml

Example 2

Let's say that your experiment XML file does not contain a value for the "dir" attribute, and you want to create an experiment run in /documents/exp_runs/run1:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
/documents/exp_files/exp.xml


Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
c:\documents\exp_files\exp.xml

Example 3

Let's say you have the same situation as in example 2, but you don't want spreadsheet formulas in your output, because you're feeding the data to a statistical package like R instead of to Excel:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--no_csv_formulas /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--no_csv_formulas c:\documents\exp_files\exp.xml

Example 4

Let's say that you have the same situation as in example 2, and you want to view the results in a spreadsheet, but you can't afford Excel, and you're using OpenOffice instead:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--oo_separator /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--oo_separator c:\documents\exp_files\exp.xml

Example 5

Let's say you're in the same situation as in example 2, but you have relative pathnames in <pattern> elements in your XML file, and all those pathnames are relative to /documents/completed:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--pattern_dir /documents/completed /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--pattern_dir c:\documents\completed c:\documents\exp_files\exp.xml