The model builder constructs a model according to the
configuration provided in the specified task and on the command
line. This model can be used by MATEngine
to automatically tag documents. Note that if you're using the
Carafe engine provided with MAT, the resulting model is trained
only to find simple spanned annotations in documents (spanless
annotations are not trained for, and neither are any attributes
beyond those associated with the effective label).
Note that you should never use MATModelBuilder to save models into a workspace; use MATWorkspaceEngine instead.
Note: if you create a model using this tool, and you want to do
autotagging in file mode in the MAT UI,
you must restart the MAT Web server. Otherwise, the UI will
not be able to access the newest model.
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd
Usage: MATModelBuilder [basic options] [other options]
--task <task> | Name of the task to use. Must be the first argument, if present. Obligatory if the system knows of more than one task. The system will provide a list of known tasks as part of its help string.
--config_name <name> | Name of the model build config to use. Must be the first argument after --task, if present. Optional. The default model build config will be used if no config is specified.
--input_dir <dir> | A directory, all of whose files will be used in the model construction. Can be repeated. May be specified with --input_files.
--input_files <pat> | A glob-style pattern describing full pathnames to use in the model construction. May be specified with --input_dir. Can be repeated. (If you're not familiar with Unix, glob patterns are file name patterns recognized by Unix shells. Consult your favorite Unix documentation for details.)
--file_type <t> | The file type of the input. One of the available readers. The "raw" reader is not permitted. The "mat-json" reader is the default.
--encoding <encoding> | The encoding of the input. The default is the appropriate default for the file type.
--model_file <file> | Location to save the created model. The directory must already exist. Obligatory if --save_as_default_model isn't specified.
--save_as_default_model | If the task.xml file for the task specifies the <default_model> element, save the model in the specified location, possibly overwriting any existing model.
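The ordering constraints in the option descriptions above (--task first, --config_name immediately after it, everything else later) are easy to get wrong in wrapper scripts. The following is a minimal sketch, not part of MAT, of a shell function that assembles a command line in that order. The function name mat_builder_args and the TASK and CONFIG variables are hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAT): print a MATModelBuilder command line
# with --task first and --config_name second, as the option descriptions require.
# TASK and CONFIG are variable names chosen for this sketch; both are optional.
mat_builder_args() {
    cmd='MATModelBuilder'
    if [ -n "${TASK:-}" ]; then
        cmd="$cmd --task '$TASK'"
    fi
    if [ -n "${CONFIG:-}" ]; then
        cmd="$cmd --config_name '$CONFIG'"
    fi
    # Remaining options (input, file type, output) follow in the order given.
    # The quoting here is for display only; it is not a general shell escaper.
    for arg in "$@"; do
        cmd="$cmd $arg"
    done
    printf '%s\n' "$cmd"
}

TASK='Named Entity'
CONFIG='alt_config'
mat_builder_args --input_dir /path/to/my/docs --save_as_default_model
```

Leaving TASK or CONFIG unset simply omits the corresponding option, which matches the case where only one task is known or the default model build config is wanted.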
MATModelBuilder also makes the common options available.
The reader referenced in the --file_type option may introduce
additional options, which are described in the documentation for
that reader. These additional options must follow the --file_type
option.
The particular training engine defined for the task in your
task.xml file will make available other command-line options. The
command-line options for the Carafe engine are described in the
Carafe engine documentation. The examples below assume that you're
using the Carafe engine.
Example 1:
Let's say that you have several annotated documents in
/path/to/my/docs, and there are no other files in that directory.
Further, you have only one task, the task has no default model,
and you have a default <model_config> in your task.xml file
which contains appropriate settings for the engine, feature set
and PSA training. The following command would write your model to
the file named "task_model" in the current directory:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --input_dir /path/to/my/docs --model_file $PWD/task_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --input_dir c:\path\to\my\docs --model_file %CD%\task_model
To make use of this model, you could pass it to MATEngine as the
value of the --tagger_model flag.
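Because the directory given to --model_file must already exist, a script that writes models somewhere other than the current directory may want to create that directory first. The following is a minimal sketch under stated assumptions: the output location under a temporary directory is hypothetical, and the MATModelBuilder call is guarded so that the snippet only prints the command when MAT is not installed.

```shell
#!/bin/sh
# Hypothetical output location; substitute your own path.
model_dir="${TMPDIR:-/tmp}/mat_models_example"
model_file="$model_dir/task_model"

# The directory for --model_file must already exist, so create it up front.
mkdir -p "$model_dir"

builder="${MAT_PKG_HOME:-}/bin/MATModelBuilder"
if [ -x "$builder" ]; then
    "$builder" --input_dir /path/to/my/docs --model_file "$model_file"
else
    # MAT is not available here; show what would have been run.
    echo "MATModelBuilder not found; would run:"
    echo "$builder --input_dir /path/to/my/docs --model_file $model_file"
fi
```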
Example 2:
Let's say you have multiple tasks, and the one you want to use is
"Named Entity". Your documents are in the same place, but there
are other documents there too; fortunately, all the documents you
want to use end with '.json'. In addition, your documents contain
many unusual person names, but you have a list of the names you're
looking for, and you've prepared a directory /path/to/my/lexicon
containing a single file named NAMES, which lists each of the
tokens of interest, like so:
Urbatz
Yuguwima
Florshin
Batywan
The task you're using has a default model. The following command
would save your model as the default:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/*.json' \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\*.json" ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
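The lexicon directory used in this example could be prepared along the following lines. This is a sketch only: it writes to a temporary directory rather than /path/to/my/lexicon, and it assumes one token per line, as in the listing above.

```shell
#!/bin/sh
# Stand-in for /path/to/my/lexicon; a temporary directory keeps the sketch harmless.
lexicon_dir=$(mktemp -d)

# One token per line, in a single file named NAMES.
cat > "$lexicon_dir/NAMES" <<'EOF'
Urbatz
Yuguwima
Florshin
Batywan
EOF

# Count the entries as a quick sanity check.
entry_count=$(wc -l < "$lexicon_dir/NAMES" | tr -d ' ')
echo "$entry_count entries in $lexicon_dir/NAMES"
```

The resulting directory is what you would then pass as the value of --lexicon_dir.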
Example 3:
Let's say you're in the same situation as example 2, except you
only want to build a model out of the files 100.json through
199.json, as well as the files in /path/to/my/other/docs.
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].json" ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
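Before training, it can be worth checking which files a pattern such as 1[0-9][0-9].json actually selects. The sketch below builds a scratch directory with a few hypothetical file names and lists the matches; in practice you would point the pattern at your own document directory. Note that the shell expands the unquoted pattern here, whereas in the commands above the pattern is quoted so that MATModelBuilder receives it intact.

```shell
#!/bin/sh
# Scratch directory with hypothetical document names.
docs=$(mktemp -d)
touch "$docs/99.json" "$docs/100.json" "$docs/150.json" \
      "$docs/199.json" "$docs/200.json" "$docs/1000.json"

# The pattern matches a 1 followed by exactly two digits, then ".json".
matched=0
for f in "$docs"/1[0-9][0-9].json; do
    # If nothing matches, the loop sees the literal pattern; skip it.
    [ -e "$f" ] || continue
    echo "would train on: $f"
    matched=$((matched + 1))
done
echo "$matched file(s) matched"
```

Files such as 99.json, 200.json and 1000.json fall outside the pattern, so only names in the range 100.json through 199.json are selected.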
Example 4:
Let's say you're in the same situation as example 3, except the
documents are XML inline documents with the ".xml" suffix:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model
Example 5:
Let's say you're in the same situation as example 4, but you have a
non-default model configuration that you want to use:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task "Named Entity" \
--config_name 'alt_config' \
--input_files '/path/to/my/docs/1[0-9][0-9].xml' \
--file_type xml-inline \
--input_dir /path/to/my/other/docs \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
Windows native:
> %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "Named Entity" ^
--config_name "alt_config" ^
--input_files "c:\path\to\my\docs\1[0-9][0-9].xml" ^
--file_type xml-inline ^
--input_dir c:\path\to\my\other\docs ^
--lexicon_dir c:\path\to\my\lexicon\ --save_as_default_model