The model builder constructs a model according to the configuration provided in the specified task and on the command line. This model can be used by MATEngine to automatically tag documents.
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd
Usage: MATModelBuilder [task option] [config name option] [options]
--task <task> | Name of the task to use. Must be the first argument, if present. Obligatory if the system knows of more than one task. The system will provide a list of known tasks. |
--config_name <name> | Name of the model build config to use. Must be the first argument after --task, if present. Optional. The default model build config will be used if no config is specified. |
--subprocess_debug <i> | When --subprocess_statistics is enabled, set the subprocess debug level to the value provided, overriding the global setting. 0 disables, 2 shows all subprocess activity. |
--subprocess_statistics | Enable subprocess statistics (memory/time), if the capability is available and it isn't globally enabled. |
--preserve_tempfiles | Preserve the temporary files created by the model builder, as a debugging aid. |
--input_dir <dir> | A directory, all of whose files will be used in the model construction. Can be repeated. May be specified with --input_files. |
--input_files <pat> | A glob-style pattern describing full pathnames to use in the model construction. May be specified with --input_dir. Can be repeated. (If you're not familiar with Unix, glob patterns are file name patterns recognized by Unix shells. Consult your favorite Unix documentation for details.) |
--file_type <t> | The file type of the input. One of the available readers. The "raw" reader is not permitted. The "mat-json" reader is the default. |
--encoding <encoding> | The encoding of the input. The default is the appropriate default for the file type. |
--model_file <file> | Location to save the created model. The directory must already exist. Obligatory if --save_as_default_model isn't specified. |
--save_as_default_model | If the task.xml file for the task specifies the <default_model> element, save the model in the specified location, possibly overwriting any existing model. |
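The ordering rules for --task and --config_name described above are easy to get wrong in a wrapper script. The following is a minimal sketch, not something MAT provides: the helper name mmb_args is hypothetical, and it only prints the arguments in the required order without invoking MATModelBuilder.

```shell
# Hypothetical helper (not part of MAT): print MATModelBuilder arguments,
# one per line, in the order the option descriptions require.
# Usage: mmb_args TASK CONFIG [other options...]
# Pass an empty string for TASK or CONFIG to leave that option out.
mmb_args() (
  task=$1
  config=$2
  shift 2
  # --task must be the first argument, if present.
  if [ -n "$task" ]; then
    printf '%s\n' --task "$task"
  fi
  # --config_name must be the first argument after --task, if present.
  if [ -n "$config" ]; then
    printf '%s\n' --config_name "$config"
  fi
  # All remaining options are passed through unchanged.
  for arg in "$@"; do
    printf '%s\n' "$arg"
  done
)

mmb_args "Named Entity" alt_config --input_dir /path/to/my/docs --save_as_default_model
```

Printing one argument per line makes it easy to confirm that a task name containing a space, such as "Named Entity", stays a single argument.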
The reader referenced in the --file_type option may introduce
additional options, which are described in the documentation for that
reader. These additional options must follow the --file_type option.
The particular training engine defined for the task in your task.xml
file will make available other command-line options. The command-line
options for the Carafe engine are described in the Carafe engine
documentation. The examples below assume that you're using the Carafe
engine.
Let's say that you have several annotated documents in
/path/to/my/docs, and there are no other files in that directory.
Further, you have only one task, the task has no default model, and you
have a default <model_config> in your task.xml file which contains
appropriate settings for the engine, feature set and PSA training. The
following command would write your model to the file named "task_model"
in the current directory:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --input_dir /path/to/my/docs --model_file $PWD/task_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --input_dir c:\path\to\my\docs --model_file %CD%\task_model
To make use of this model, you could pass it to MATEngine as the
value of the --tagger_model flag.
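If you write the model somewhere other than the current directory, recall that --model_file requires the target directory to exist already. A wrapper script can check this before starting a training run. The sketch below is only an illustration: the helper name model_dir_exists and the models/ subdirectory are hypothetical.

```shell
# Hypothetical helper (not part of MAT): succeed only if the directory
# that would hold MODEL_FILE already exists.
# Usage: model_dir_exists MODEL_FILE
model_dir_exists() (
  dir=$(dirname "$1")
  [ -d "$dir" ]
)

# Example path; the models/ subdirectory is made up for this sketch.
target="$PWD/models/task_model"
if model_dir_exists "$target"; then
  echo "ok: $(dirname "$target") exists"
else
  echo "missing directory: $(dirname "$target")" >&2
fi
```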
Let's say you have multiple tasks, and the one you want to use is
"Named Entity". Your documents are in the same place, but there are
other documents there too; fortunately, all the documents you want to
use end with '.json'. In addition, your documents have lots of really
odd person names in them, but you conveniently have a list of the names
you're looking for, and you've prepared a directory /path/to/my/lexicon
which contains a single file named NAMES which contains each of the
tokens of interest, like so:
Urbatz
Yuguwima
Florshin
Batywan
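A lexicon directory like this one can be prepared with a few shell commands. The following is a minimal sketch: the helper name make_lexicon and the use of a temporary directory are assumptions for illustration, and the one-token-per-line layout is taken from the listing above.

```shell
# Hypothetical helper (not part of MAT): write one token per line to
# DIR/NAMES, creating DIR if needed.
# Usage: make_lexicon DIR TOKEN...
make_lexicon() (
  dir=$1
  shift
  mkdir -p "$dir" || exit 1
  printf '%s\n' "$@" > "$dir/NAMES"
)

# Build the lexicon in a scratch location and show the result.
lexdir=$(mktemp -d)/lexicon
make_lexicon "$lexdir" Urbatz Yuguwima Florshin Batywan
cat "$lexdir/NAMES"
```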
The task you're using has a default model. The following command
would save your model as the default:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/*.json' \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\*.json" ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
Let's say we're in the same situation as example 2, except you only
want to build a model out of the files 100.json through 199.json, as
well as the files in /path/to/my/other/docs.
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].json" ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
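To see which files a pattern such as 1[0-9][0-9].json selects before handing it to --input_files, you can let the shell expand it against a scratch directory. This is a sketch under the assumption that the model builder's glob matching agrees with the shell's for this pattern; the file names are made up for the demonstration.

```shell
# Scratch directory with made-up file names around the 100-199 range.
demo=$(mktemp -d)
for n in 99 100 150 199 200 1000; do
  : > "$demo/$n.json"
done

# The pattern is left unquoted here so the shell expands it itself. On the
# MATModelBuilder command line it is quoted, so the shell passes it through
# untouched.
matches=$(cd "$demo" && printf '%s\n' 1[0-9][0-9].json)
echo "$matches"
```

Only names with exactly three digits starting with 1 are selected, so 99.json, 200.json and 1000.json are left out.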
Let's say we're in the same situation as example 3, except the
documents are XML inline documents with the ".xml" suffix:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
Let's say we're in the same situation as example 4, but we have a
non-default model configuration that we want to use:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--config_name 'alt_config' \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--config_name "alt_config" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model