Experiment output directory structure

The experiment engine records its output in a directory, which is specified when you call the experiment engine. Here, we describe the structure of the output directory.

The output directory has three toplevel subdirectories, one for the corpora ("corpora", by default), one for the models that are constructed ("model_sets", by default), and one for the test runs that are executed ("runs", by default). The names of the toplevel subdirectories might vary (or there may be more such directories), if you've provided a non-default value for "dir" to any of the elements in the XML file. The <format> value in the names of the CSV files are determined by the --csv_formula_output command-line option of the experiment engine.

<xml_file>.xml (file) - the experiment XML file that populated this directory
allbytag_<format>.csv, allbytoken_<format>.csv - aggregate spreadsheets for the experiment runs
corpora (dir) - the subdirectory containing the prepared corpus

<corpus_name> (dir) - one directory for each corpus <corpus_name>, as named in the experiment XML file

properties.txt - the properties of the corpus, as defined in the XML file
prepared_files.txt - the list of absolute pathnames of the documents in this corpus
file_seed.txt - the list of absolute pathnames that this corpus began with (which may differ from prepared_files.txt if any preprocessing was required)
file_partition.txt - if the corpus has a partition, this file consists of one line for each file in file_seed.txt, prepended with the name of the partition it's assigned to, separated by a comma.
preprocessed (dir) - if any <prep> is specified for the corpus, this directory contains the input and the output of the preprocessing step

in (dir) - the document inputs to the preprocessing step
out (dir) - the document outputs of the preprocessing step

model_sets (dir) - the directory containing the model sets

<model_set_name>_properties.txt - a properties file which stores the settings for the "family" of model sets represented by <model_set_name> (as named in the experiment XML file)
<model_set_name>_training_set.txt - a file listing the documents comprising the training set, including the corpus and partition from which the documents are drawn
<model_set_name>... (dir) - one directory for each actual model built. The model set <model_set_name>, as named in the experiment XML file, is the prefix; the remainder of the directory name encodes the iterators (if any) which applied to specify the settings for this model

properties.txt - the properties of the model, as defined in the XML file and the iterators
model - the constructed model

runs (dir) - the directory containing the runs

<run_name>_properties.txt - a properties file which stores the settings for the "family" of runs represented by <run_name> (as named in the experiment XML file)
<run_name>... (dir) - one directory for each actual run performed. The run <run_name>, as named in the experiment XML file, is the prefix; the remainder of the directory name encodes the iterators (if any) which applied to specify the settings for this run

<model_set_name>... (dir) - a subdirectory corresponding to the model which is being applied in this run (identical to the name of the directory for the actual model built, under model_sets)

bytag_<format>.csv, bytoken_<format>.csv, details.csv - spreadsheets for this run
properties.txt - the properties of the run, as defined in the XML file
run_input (dir) - the versions of the documents in the test corpus which serve as input to the run. Typically, these will be raw documents, unless the run has a prep phase.

hyp (dir) - the hypotheses (i.e., the documents created by the model), which will be compared to the actual "gold-standard" values in the test corpus

When the engine completes its experiment runs, it produces a pair of toplevel spreadsheets, allbytag_<format>.csv and allbytoken_<format>.csv. These files contain the tag-level and token-level scoring results for all the runs. The format and interpretation of these results is found in the documentation for MATScore, except that the initial columns are different:

run family	The name of the run in the experiment XML file.
run	The actual directory of the run (which may be affected by whatever iterators have applied)
model family	The name of the model set in the experiment XML file.
model	The actual directory of the model (which may be affected by whatever iterators have applied)
train corpora	The name of the training corpora (and their partitions, if appropriate) for this run. Comma-separated.
test corpora	The name of the test corpora (and their partitions, if appropriate) for this run. Comma-separated.
tag	The label being scored, as described in MATScore.
train docs	The number of documents in the training corpus for this run.
train toks	The total number of tokens in the documents in the training corpus for this run.
train items	The total number of training "items" for this label in the documents in the training corpus for this run. For tag-level scoring, this is the number of annotations; for token-level scoring, this is the number of tokens in those annotations.

In the directory for each run, you'll find the individual scoring files bytag_<format>.csv and bytoken_<format>.csv, which are (approximately) the subset of the corresponding overall scoring files which is relevant to this run. Of greater interest is details.csv, which is the detail spreadsheet for this run. These detail spreadsheets are not aggregated at the top level because they contain an entry for each tag, and the volume of data would likely be too great.