The experiment engine runs the experiment described in an experiment XML file. The experiment XML
file describes three types of information: corpora, model sets, and runs.
The experiment engine runs this experiment in a directory which is
provided to it either via the XML file (the "dir" attribute of the
<experiment> element) or on the command line (the --exp_dir
option). The experiment engine prepares the corpora, builds the models,
and performs the experiment runs in this directory. The experiment XML
file is copied into the experiment directory, if a file with the same
name is not already present.
By default, the engine can continue an experiment which is halted in
the middle. Each corpus, model set and run stores its metadata in a
file called "properties.txt" in its specified directory, and keeps
track of whether
it's been completed or not. If the engine fails in the middle, it will
not redo work it knows has been completed. The --force argument
overrides this default behavior, and ought to force a full rerun of the
experiment; however, the interactions among the components are
extremely complex, and --force often fails. If you want to rerun an
experiment, the safest thing to do is use a different experiment
directory.
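Before deciding whether to resume an experiment or start over in a new directory, it can help to see which components have recorded metadata. The following is a minimal Python sketch that lists every "properties.txt" file under an experiment directory. The function name find_properties_files is hypothetical, and the sketch deliberately makes no assumption about the contents of these files, which are not described here.

```python
import os


def find_properties_files(exp_dir):
    """Return the sorted paths of every properties.txt found under exp_dir."""
    found = []
    # Walk the whole tree: corpora, model sets and runs each keep their
    # own properties.txt in their own directory.
    for dirpath, _dirnames, filenames in os.walk(exp_dir):
        if "properties.txt" in filenames:
            found.append(os.path.join(dirpath, "properties.txt"))
    return sorted(found)
```

For example, find_properties_files("/documents/exp_runs/run1") would return one path per component that has written its metadata so far.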
There is one exception to this generalization. If there are
experimental runs present, the engine will always score them, even if
it's scored them before. So an easy way to review the scores for an
experiment is just to run the engine again.
The names of the toplevel directories might vary (or there may be
more such directories), if you've provided a non-default value for
"dir" to any of the elements in the XML file.
When the engine completes its experiment runs, it produces a pair of toplevel spreadsheets, allbytag.csv and allbytoken.csv. These files contain the tag-level and token-level scoring results for all the runs. The format and interpretation of these results are described in the documentation for MATScore, except that the initial columns are different:
run family | The name of the run in the experiment XML file. |
run | The actual directory of the run (which may be affected by whatever iterators have applied) |
model family | The name of the model set in the experiment XML file. |
model | The actual directory of the model (which may be affected by whatever iterators have applied) |
train corpora | The name of the training corpora (and their partitions, if appropriate) for this run. Comma-separated. |
test corpora | The name of the test corpora (and their partitions, if appropriate) for this run. Comma-separated. |
tag | The label being scored, as described in MATScore. |
train docs | The number of documents in the training corpus for this run. |
train toks | The total number of tokens in the documents in the training corpus for this run. |
train items | The total number of training "items" for this label in the documents in the training corpus for this run. For tag-level scoring, this is the number of annotations; for token-level scoring, this is the number of tokens in those annotations. |
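As a rough illustration of how the initial columns can be used, the following Python sketch groups the rows of one of the toplevel spreadsheets by the value in a chosen column. It assumes that the first row of the CSV file is a header row whose labels match the column names listed above (for example "run" or "tag"); that is an assumption, so check your own output first. The function name group_rows_by_column is hypothetical. Note that computed cells will contain spreadsheet formulas rather than values unless the engine was run with --no_csv_formulas.

```python
import csv
from collections import defaultdict


def group_rows_by_column(csv_path, column="run"):
    """Group the rows of a score spreadsheet by the value found in `column`.

    Assumes (not confirmed by the documentation) that the first row is a
    header row and that `column` is one of its labels.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # Fail early with a clear message if the assumed header is absent.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError("no column named %r in %s" % (column, csv_path))
        for row in reader:
            groups[row[column]].append(row)
    return dict(groups)
```

Calling group_rows_by_column("allbytag.csv", column="run family") would, under the same assumption, give one list of rows per run named in the experiment XML file.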
In the directory for each run, you'll find the individual scoring
files bytag.csv and bytoken.csv, which are (approximately) the subset
of the corresponding overall scoring files which is relevant to this
run. Of
greater interest is details.csv, which is the detail spreadsheet for
this run. These detail spreadsheets are not aggregated at the top level
because they contain an entry for each tag, and the volume of data
would likely be too great.
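If you do want to look across the detail spreadsheets of several runs, one option is to stream them rather than load them all at once. The Python sketch below walks an experiment directory and yields the rows of every details.csv it finds, one row at a time. The function name iter_detail_rows is hypothetical, and the sketch assumes only that the files are ordinary CSV; each file's first row is passed through unchanged, since the column layout is described in the MATScore documentation rather than here.

```python
import csv
from pathlib import Path


def iter_detail_rows(exp_dir):
    """Yield (run_directory, row) pairs from every details.csv under exp_dir.

    Rows are produced lazily, so the full set of detail spreadsheets never
    has to be held in memory at the same time.
    """
    for details_path in sorted(Path(exp_dir).rglob("details.csv")):
        with details_path.open(newline="") as f:
            for row in csv.reader(f):
                # Tag each row with the directory it came from, so rows
                # from different runs can be told apart downstream.
                yield str(details_path.parent), row
```

Because this is a generator, a caller can filter or count rows in a simple for loop without building one large aggregate file.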
The command-line options --no_csv_formulas and --oo_separator
control the details of how these spreadsheets are generated.
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd
Usage: MATExperimentEngine [options] <xml_file>
<xml_file>: An experiment XML file
--exp_dir <dir> | Optionally, the directory the experiment will be run in. This directory may also be provided in the experiment XML file (if both are provided, the command-line setting is ignored). The directory will be created if it doesn't yet exist. |
--pattern_dir <dir> | Optionally, this path is the prefix used for relative directory paths in file patterns in the <pattern> element in the corpora in the experiment XML file. Otherwise, these patterns must be absolute pathnames. |
--debug | Run in debug mode. If an error is hit, the final cumulative score accumulation will be skipped. |
--no_csv_formulas | By default, the experiment engine produces CSV score files with spreadsheet equations for computed values. If this flag is present, the CSV score files will contain actual values instead. |
--oo_separator | By default, the experiment engine uses Excel-style formula separators in its spreadsheet equations in its CSV score files. If this flag is also present, the scorer will use OpenOffice formula separators. (The formula formats are incompatible, and the formulas will be recognized in either Excel or OpenOffice, but not both.) |
--dont_compute_confidence | By default, the experiment engine computes confidence measures when it runs the scorer. This process can be time consuming. Disable it with this flag. |
--dont_rescore | By default, the experiment engine rescores completed runs when it's restarted. Use this flag to disable this feature. This should only be used for debugging purposes, because the scores from the completed runs won't be accumulated in this mode. |
--subprocess_debug <i> | Set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity. |
--subprocess_statistics | Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled. |
--preserve_tempfiles | Preserve the temporary files created by the model builder, as a debugging aid. |
These options are more complicated, and not as well supported. Use
them at your own risk.
--force | If present, forces the experiment file to be reprocessed. |
--batch_test_runs | By default, test runs are performed as soon as the relevant model is available. This flag postpones all test runs until after all models are constructed. |
--mark_done | This flag is intended for the exceptional situation where you've interrupted an experiment before it's completed, and you just want to rerun the scoring for what's already done. This flag will force the engine to mark all corpora, models and runs as completed. The effect is that from this point on, the engine will only report scores for this experiment. |
For examples of the experiment XML files themselves, look here.
Let's say your experiment XML file /documents/exp_files/exp.xml
contains a value for the "dir" attribute of the <experiment>
element, and all the paths in the <pattern> elements are
absolute. Then your invocation is simple:
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine /documents/exp_files/exp.xml
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd c:\documents\exp_files\exp.xml
Let's say that your experiment XML file does not contain a value for
the "dir" attribute, and you want to create an experiment run in
/documents/exp_runs/run1:
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
/documents/exp_files/exp.xml
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
c:\documents\exp_files\exp.xml
Let's say you have the same situation as in example 2, but you don't
want spreadsheet formulas in your output, because you're feeding the
data to a statistical package like R instead of to Excel:
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--no_csv_formulas /documents/exp_files/exp.xml
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--no_csv_formulas c:\documents\exp_files\exp.xml
Let's say that you have the same situation as in example 2, and you
want to view the results in a spreadsheet, but you can't afford Excel,
and you're using OpenOffice instead:
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--oo_separator /documents/exp_files/exp.xml
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--oo_separator c:\documents\exp_files\exp.xml
Let's say you're in the same situation as in example 2, but you have
relative pathnames in <pattern> elements in your XML file, and
all the document paths are relative to /documents/completed:
Unix:
% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--pattern_dir /documents/completed /documents/exp_files/exp.xml
Windows native:
> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--pattern_dir c:\documents\completed c:\documents\exp_files\exp.xml
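If you launch the engine from a script, the invocations above can also be assembled programmatically. The following Python sketch builds the argument list for the Unix form of the command, using only options described in this document. The function name build_engine_command and its parameter names are hypothetical, and the sketch assumes MAT_PKG_HOME is either passed in or set in the environment.

```python
import os


def build_engine_command(xml_file, exp_dir=None, pattern_dir=None,
                         no_csv_formulas=False, oo_separator=False,
                         mat_pkg_home=None):
    """Return the argument list for a MATExperimentEngine invocation."""
    # Fall back to the MAT_PKG_HOME environment variable when no explicit
    # installation directory is given (raises KeyError if it is unset).
    home = mat_pkg_home or os.environ["MAT_PKG_HOME"]
    cmd = [os.path.join(home, "bin", "MATExperimentEngine")]
    if exp_dir is not None:
        cmd += ["--exp_dir", exp_dir]
    if pattern_dir is not None:
        cmd += ["--pattern_dir", pattern_dir]
    if no_csv_formulas:
        cmd.append("--no_csv_formulas")
    if oo_separator:
        cmd.append("--oo_separator")
    # The experiment XML file is the single positional argument.
    cmd.append(xml_file)
    return cmd
```

The resulting list can be handed to subprocess.run(cmd). On Windows the executable name would be MATExperimentEngine.cmd instead. How the engine reports failure through its exit status is not stated here, so check the output rather than relying on the return code alone.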