Experiment Engine

Description

The experiment engine runs the experiment described in an experiment XML file. This file contains three types of information:

descriptions of corpora
descriptions of model sets
descriptions of experiment runs

The experiment engine runs this experiment in a directory which is provided to it either via the XML file (the "dir" attribute of the <experiment> element) or on the command line (the --exp_dir option). The experiment engine prepares the corpora, builds the models, and performs the experiment runs in this directory. The experiment XML file is copied into the experiment directory unless a file with the same name is already present.

By default, the engine can resume an experiment that was halted partway through. Each corpus, model set and run stores its metadata in a file called "properties.txt" in its specified directory, and records whether it has been completed. If the engine fails partway through, it will not redo work it knows has been completed. The --force argument is intended to override this default behavior and force a full rerun of the experiment; however, the interactions among the components are extremely complex, and --force often fails. If you want to rerun an experiment, the safest thing to do is use a different experiment directory.

There is one exception to this generalization. If there are experimental runs present, the engine will always score them, even if it's scored them before. So an easy way to review the scores for an experiment is just to run the engine again.
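
The advice above about rerunning in a different experiment directory can be automated with a small helper. This is only a sketch: the function name next_run_dir and the run<n> naming scheme are assumptions made for illustration (the scheme mirrors the /documents/exp_runs/run1 path used in the examples), and nothing here is part of the engine itself.

```python
from pathlib import Path


def next_run_dir(base, prefix="run"):
    """Return the first base/<prefix><n> (n = 1, 2, ...) that does not exist yet.

    Hypothetical helper: the engine does not require any naming scheme.
    """
    base = Path(base)
    n = 1
    while (base / f"{prefix}{n}").exists():
        n += 1
    return base / f"{prefix}{n}"
```

The returned path can be passed as the value of --exp_dir; the engine creates the directory if it doesn't yet exist.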

Experiment output directory structure

If you've provided a non-default value for "dir" to any of the elements in the XML file, the names of the toplevel directories may vary, or there may be more such directories.

<xml_file>.xml (file) - the experiment XML file that populated this directory
allbytag.csv, allbytoken.csv - aggregate spreadsheets for the experiment runs
corpora (dir) - the subdirectory containing the prepared corpora
    <corpus_name> (dir) - one directory for each corpus <corpus_name>, as named in the experiment XML file
        properties.txt - the properties of the corpus, as defined in the XML file
        prepared_files.txt - the list of absolute pathnames of the documents in this corpus
        file_seed.txt - the list of absolute pathnames that this corpus began with (which may differ from prepared_files.txt if any preprocessing was required)
        file_partition.txt - if the corpus has a partition, this file contains one line for each file in file_seed.txt; each line consists of the name of the partition the file is assigned to, a comma, and the pathname of the file
        preprocessed (dir) - if any <prep> is specified for the corpus, this directory contains the input and the output of the preprocessing step
            in (dir) - the document inputs to the preprocessing step
            out (dir) - the document outputs of the preprocessing step
model_sets (dir) - the directory containing the model sets
    <model_set_name> (dir) - one directory for each model set <model_set_name>, as named in the experiment XML file
        properties.txt - the properties of the model set, as defined in the XML file
        <n> (dir) - one directory per training increment, where <n> is the number of documents which were used to create the model
            model - the constructed model
runs (dir) - the directory containing the runs
    <run_name> (dir) - one directory for each run <run_name>, as named in the experiment XML file
        bytag.csv, bytoken.csv, details.csv - spreadsheets for this run
        properties.txt - the properties of the run, as defined in the XML file
        run_input (dir) - the versions of the documents in the test corpus which serve as input to the run. Typically, these will be raw documents, unless the run has a prep phase.
        <n> (dir) - one directory per training increment of the model for this run, where <n> is the number of documents which were used to create the model
            hyp (dir) - the hypotheses (i.e., the documents created by the model), which will be compared to the actual "gold-standard" values in the test corpus
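
As an illustration of the file_partition.txt layout described above, the following sketch groups the pathnames of a corpus by partition. The function name read_partitions is hypothetical, and the sketch assumes UTF-8 text and partition names that contain no commas; neither assumption is confirmed by this documentation.

```python
from collections import defaultdict


def read_partitions(partition_file):
    """Map each partition name to the list of pathnames assigned to it.

    Assumes each non-empty line has the form "<partition>,<pathname>".
    """
    partitions = defaultdict(list)
    with open(partition_file, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # Split on the first comma only, so commas inside the pathname survive.
            name, path = line.split(",", 1)
            partitions[name].append(path)
    return dict(partitions)
```

With the default directory names, the file for a corpus would be found at <exp_dir>/corpora/<corpus_name>/file_partition.txt.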

Engine output

When the engine completes its experiment runs, it produces a pair of toplevel spreadsheets, allbytag.csv and allbytoken.csv. These files contain the tag-level and token-level scoring results for all the runs. The format and interpretation of these results are described in the documentation for MATScore, except that the initial columns are different:

run	The name of the run in the experiment XML file.
train corpora	The names of the training corpora (and their partitions, if appropriate) for this run. Comma-separated.
test corpora	The names of the test corpora (and their partitions, if appropriate) for this run. Comma-separated.
tag	The label being scored, as described in MATScore.
train docs	The number of documents in the training corpus for this run.
train toks	The total number of tokens in the documents in the training corpus for this run.
train items	The total number of training "items" for this label in the documents in the training corpus for this run. For tag-level scoring, this is the number of annotations; for token-level scoring, this is the number of tokens in those annotations.
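
If you want to process the aggregate spreadsheets outside a spreadsheet program, something like the following sketch may be a starting point. It assumes that the first row of the file is a header row using the column names listed above verbatim (in particular "run"), and that the file was produced with --no_csv_formulas so that cells hold plain values; the function name rows_by_run is hypothetical.

```python
import csv
from collections import defaultdict


def rows_by_run(csv_path):
    """Group the rows of an aggregate score file by the value of the "run" column.

    Each row is returned as a dict keyed by the header names, with string values.
    """
    grouped = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            grouped[row["run"]].append(row)
    return dict(grouped)
```

For example, rows_by_run("allbytag.csv") would return one list of rows per run name, from which the "tag" and "train docs" cells can be read by key.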

In the directory for each run, you'll find the individual scoring files bytag.csv and bytoken.csv, each of which is (approximately) the subset of the corresponding overall scoring file that is relevant to this run. Of greater interest is details.csv, which is the detail spreadsheet for this run. These detail spreadsheets are not aggregated at the top level because they contain an entry for each tag, and the volume of data would likely be too great.
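
Because the detail spreadsheets are not aggregated, you may want to collect them yourself. The sketch below locates the per-run files under an experiment directory. It assumes the default toplevel directory name "runs" (a non-default "dir" value in the XML file would change this), and the function name run_score_files is hypothetical.

```python
from pathlib import Path


def run_score_files(exp_dir, filename="details.csv"):
    """Return {run_name: path} for each run directory that contains `filename`.

    Assumes the default layout <exp_dir>/runs/<run_name>/<filename>.
    """
    runs_dir = Path(exp_dir) / "runs"
    found = {}
    if runs_dir.is_dir():
        for run_dir in sorted(runs_dir.iterdir()):
            candidate = run_dir / filename
            if candidate.is_file():
                found[run_dir.name] = candidate
    return found
```

Passing filename="bytag.csv" or filename="bytoken.csv" locates the other per-run spreadsheets in the same way.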

The command-line options --no_csv_formulas and --oo_separator control the details of how these spreadsheets are generated.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd

Usage: MATExperimentEngine [options] <xml_file>

<xml_file>: An experiment XML file

Options

--exp_dir <dir>	Optionally, the directory the experiment will be run in. This directory may also be provided in the experiment XML file (if both are provided, the command-line setting is ignored). The directory will be created if it doesn't yet exist.
--pattern_dir <dir>	Optionally, this path is the prefix used for relative directory paths in file patterns in the <pattern> element in the corpora in the experiment XML file. Otherwise, these patterns must be absolute pathnames.
--debug	Run in debug mode. If an error is hit, the final cumulative score accumulation will be skipped.
--no_csv_formulas	By default, the experiment engine produces CSV score files with spreadsheet equations for computed values. If this flag is present, the CSV score files will contain actual values instead.
--oo_separator	By default, the experiment engine uses Excel-style formula separators in the spreadsheet equations in its CSV score files. If this flag is present, the scorer will use OpenOffice formula separators instead. (The two formula formats are incompatible: the formulas in a given file will be recognized by either Excel or OpenOffice, but not both.)
--dont_compute_confidence	By default, the experiment engine computes confidence measures when it runs the scorer. This process can be time consuming. Disable it with this flag.
--dont_rescore	By default, the experiment engine rescores completed runs when it's restarted. Use this flag to disable this feature. This should only be used for debugging purposes, because the scores from the completed runs won't be accumulated in this mode.
--subprocess_debug <i>	Set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity.
--subprocess_statistics	Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled.
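
If you launch the engine from a script, the options above can be assembled programmatically. The following is a minimal sketch covering only a few of the flags listed here; the function name build_engine_args and its keyword names are hypothetical, and the executable path must be supplied separately.

```python
def build_engine_args(xml_file, exp_dir=None, pattern_dir=None,
                      no_csv_formulas=False, oo_separator=False):
    """Assemble an argument list for MATExperimentEngine (executable not included).

    Only flags documented in the options table are emitted.
    """
    args = []
    if exp_dir is not None:
        args += ["--exp_dir", str(exp_dir)]
    if pattern_dir is not None:
        args += ["--pattern_dir", str(pattern_dir)]
    if no_csv_formulas:
        args.append("--no_csv_formulas")
    if oo_separator:
        args.append("--oo_separator")
    # The experiment XML file is the single positional argument and comes last.
    args.append(str(xml_file))
    return args
```

The result can be appended to the path of the MATExperimentEngine executable and passed to subprocess.run as a list, which avoids shell quoting issues on both Unix and Windows.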

Advanced options

These options are more complicated, and not as well supported. Use them at your own risk.

--force	If present, forces the experiment file to be reprocessed from scratch.
--batch_test_runs	By default, test runs are performed as soon as the relevant model is available. This flag postpones all test runs until after all models are constructed.
--mark_done	This flag is intended for the exceptional situation where you've interrupted an experiment before it's completed, and you just want to rerun the scoring for what's already done. This flag will force the engine to mark all corpora, models and runs as completed. The effect is that from this point on, the engine will only report scores for this experiment.

Examples

For examples of the experiment XML files themselves, see the documentation for the experiment XML format.

Example 1

Let's say your experiment XML file /documents/exp_files/exp.xml contains a value for the "dir" attribute of the <experiment> element, and all the paths in the <pattern> elements are absolute. Then your invocation is simple:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd c:\documents\exp_files\exp.xml

Example 2

Let's say that your experiment XML file does not contain a value for the "dir" attribute, and you want to create an experiment run in /documents/exp_runs/run1:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
/documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
c:\documents\exp_files\exp.xml

Example 3

Let's say you have the same situation as in example 2, but you don't want spreadsheet formulas in your output, because you're feeding the data to a statistical package like R instead of to Excel:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--no_csv_formulas /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--no_csv_formulas c:\documents\exp_files\exp.xml

Example 4

Let's say that you have the same situation as in example 2, and you want to view the results in a spreadsheet, but you can't afford Excel, and you're using OpenOffice instead:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--oo_separator /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--oo_separator c:\documents\exp_files\exp.xml

Example 5

Let's say you're in the same situation as in example 2, but you have relative pathnames in <pattern> elements in your XML file, and all of those pathnames are relative to /documents/completed:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--pattern_dir /documents/completed /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--pattern_dir c:\documents\completed c:\documents\exp_files\exp.xml
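
To make the role of --pattern_dir in example 5 concrete, the sketch below shows one way a relative file pattern could be combined with a prefix directory and expanded. It is an illustration only: it assumes that patterns are glob-style and that the prefix is simply joined onto relative patterns, which this documentation does not spell out, and the function name expand_pattern is hypothetical.

```python
import glob
import os


def expand_pattern(pattern, pattern_dir=None):
    """Expand a file pattern, joining relative patterns onto pattern_dir first.

    Absolute patterns are expanded as-is; a relative pattern without a
    prefix is rejected, mirroring the requirement stated for --pattern_dir.
    """
    if not os.path.isabs(pattern):
        if pattern_dir is None:
            raise ValueError("relative pattern requires a pattern_dir prefix")
        pattern = os.path.join(pattern_dir, pattern)
    return sorted(glob.glob(pattern))
```

Under these assumptions, a relative pattern such as "*.txt" with the prefix /documents/completed would match the same files as the absolute pattern /documents/completed/*.txt.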