Model Builder

Description

The model builder constructs a model according to the configuration provided in the specified task and on the command line. This model can be used by MATEngine to automatically tag documents.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd

Usage: MATModelBuilder [task option] [config name option] [options]

Basic options

--task <task>
Name of the task to use. Must be the first argument, if present. Obligatory if the system knows of more than one task. The system will provide a list of known tasks.
--config_name <name>
Name of the model build config to use. Must be the first argument after --task, if present. Optional. Default model build config will be used if no config is specified.
--subprocess_debug <i>
When --subprocess_statistics is enabled, set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled.
--preserve_tempfiles
Preserve the temporary files created by the model builder, as a debugging aid.
--input_dir <dir>
A directory, all of whose files will be used in the model construction. Can be repeated. May be specified with --input_files.
--input_files <pat>
A glob-style pattern describing full pathnames to use in the model construction. May be specified with --input_dir. Can be repeated. (If you're not familiar with Unix, glob patterns are file name patterns recognized by Unix shells. Consult your favorite Unix documentation for details.)
--file_type <t>
The file type of the input. One of the available readers. The "raw" reader is not permitted. The "mat-json" reader is the default.
--encoding <encoding>
The encoding of the input. The default is the appropriate default for the file type.
--model_file <file>
Location to save the created model. The directory must already exist. Obligatory if --save_as_default_model isn't specified.
--save_as_default_model
If the task.xml file for the task specifies the <default_model> element, save the model in the specified location, possibly overwriting any existing model.
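
Because --model_file requires its directory to exist already, it can help to create that directory before invoking the model builder. The following is a minimal sketch for a Unix shell; the scratch location and the "models" subdirectory are hypothetical stand-ins for your own paths, and the final line only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical output location under a scratch directory; substitute your own path.
scratch=$(mktemp -d)
model_file="$scratch/models/task_model"

# --model_file requires the containing directory to exist, so create it first.
model_dir=$(dirname "$model_file")
mkdir -p "$model_dir"

# Print the command that would be run (the input path is a placeholder).
echo '$MAT_PKG_HOME/bin/MATModelBuilder --input_dir /path/to/my/docs --model_file' "$model_file"
```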

Other options

The reader named in the --file_type option may introduce additional options, which are described in the documentation for that reader. These additional options must follow the --file_type option.

The particular training engine defined for the task in your task.xml file will make available other command-line options. The command-line options for the Carafe engine are described in the Carafe engine documentation. The examples below assume that you're using the Carafe engine.

Examples

Example 1

Let's say that you have several annotated documents in /path/to/my/docs, and there are no other files in that directory. Further, you have only one task, the task has no default model, and you have a default <model_config> in your task.xml file which contains appropriate settings for the engine, feature set and PSA training. The following command would write your model to the file named "task_model" in the current directory:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --input_dir /path/to/my/docs --model_file $PWD/task_model

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --input_dir c:\path\to\my\docs --model_file %CD%\task_model

To make use of this model, you could pass it to MATEngine as the value of the --tagger_model flag.

Example 2

Let's say you have multiple tasks, and the one you want to use is "Named Entity". Your documents are in the same place, but there are other files there too; fortunately, all the documents you want to use end with '.json'. In addition, your documents contain many unusual person names, but you have a list of the names you're looking for. You've prepared a directory /path/to/my/lexicon containing a single file named NAMES, which lists each of the tokens of interest, like so:

Urbatz
Yuguwima
Florshin
Batywan

The task you're using has a default model. The following command would save your model as the default:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/*.json' \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\*.json" ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
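
The lexicon directory used in this example can be prepared with a few shell commands. This is a sketch only: the scratch directory stands in for /path/to/my/lexicon, and the one-token-per-line layout simply mirrors the listing above; any further requirements on the file (such as encoding) are not covered here.

```shell
#!/bin/sh
# Hypothetical scratch location standing in for /path/to/my/lexicon.
lexicon_dir=$(mktemp -d)/lexicon
mkdir -p "$lexicon_dir"

# Write one token per line into a single file named NAMES.
printf '%s\n' Urbatz Yuguwima Florshin Batywan > "$lexicon_dir/NAMES"

# Show the resulting file.
cat "$lexicon_dir/NAMES"
```

The resulting directory is what you would pass as the value of --lexicon_dir.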

Example 3

Let's say we're in the same situation as example 2, except that we only want to build a model out of the files 100.json through 199.json in /path/to/my/docs, together with all the files in /path/to/my/other/docs.

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].json" ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
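
Before building a model, it can be useful to check which files a pattern such as 1[0-9][0-9].json selects. The sketch below lets a Unix shell expand the pattern in a scratch directory filled with hypothetical empty files. It assumes that the model builder's matching agrees with the shell's for this kind of pattern, which is not verified here.

```shell
#!/bin/sh
# Scratch directory standing in for /path/to/my/docs (hypothetical).
docs=$(mktemp -d)

# Create empty files on both sides of the intended range.
for n in 99 100 150 199 200; do
    : > "$docs/$n.json"
done

# Left unquoted, the pattern is expanded by the shell itself, in sorted order.
matched=$(cd "$docs" && echo 1[0-9][0-9].json)
echo "$matched"   # prints: 100.json 150.json 199.json
```

Note that in the commands above the pattern is quoted, so that it reaches the model builder unexpanded.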

Example 4

Let's say we're in the same situation as example 3, except the documents are XML inline documents with the ".xml" suffix:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model

Example 5

Let's say we're in the same situation as example 4, but we have a non-default model configuration that we want to use:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--config_name 'alt_config' \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model

Windows native:

> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--config_name "alt_config" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model