While it's possible to use your own training and tagging engine with MAT, MAT provides Carafe, a CRF-based sequence tagger, as its default training and tagging engine. The flags for configuring Carafe are available in a number of locations in MAT. Carafe also provides an English tokenizer.

The Carafe tokenizer step is MAT.JavaCarafe.CarafeTokenizationStep.
| Command line option | XML attribute | Value | Description |
|---|---|---|---|
| --heap_size <s> | heap_size | a heap size for the Java VM | The Carafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --stack_size <s> | stack_size | a stack size for the Java VM | The Carafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --handle_tags | handle_tags | "yes" | If present, treat the signal as XML and tokenize XML elements and entities as single tokens. |
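For orientation, here's a minimal sketch of how these settings might appear in a task.xml file. Only the attribute names (heap_size, stack_size, handle_tags) and the <java_subprocess_parameters> element come from the documentation above; the <step> element, its name attribute, and the values shown are illustrative assumptions that will depend on how your task is defined.

```xml
<!-- Hypothetical task.xml fragment; element names other than
     java_subprocess_parameters are illustrative assumptions. -->
<java_subprocess_parameters heap_size="2G" stack_size="512k"/>

<!-- A tokenization step using the Carafe tokenizer. handle_tags="yes"
     treats the signal as XML, so elements and entities become single tokens.
     heap_size here would override the java_subprocess_parameters value. -->
<step name="tokenize" class="MAT.JavaCarafe.CarafeTokenizationStep"
      handle_tags="yes" heap_size="2G"/>
```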
The Carafe tagging engine step is MAT.JavaCarafe.CarafeTagStep.
| Command line option | XML attribute | Value | Description |
|---|---|---|---|
| --tagger_local | tagger_local | "yes" (XML) | By default, the MAT engine will contact the MAT Web server to tag a document, because the Web server has the capability of starting up and monitoring a long-living tagger task. This is beneficial because the Carafe tagger, like many model-based taggers, has a fairly expensive startup cost. To block the engine from contacting the Web server, and force it to start up and shut down the tagger on its own, specify tagger_local="yes". |
| --tagger_model <model> | tagger_model | a string, a filename of a Carafe model | If the task does not have a default model, the user must specify the location of the tagger model. |
| --prior_adjust | prior_adjust | a float | The Carafe tagger can be biased toward recall or toward precision. This setting biases the Carafe tagger to favor precision (positive values) or recall (negative values). Default is -1.0 (slight recall bias). Practical range of values is usually +-6.0. |
| --heap_size <s> | heap_size | a heap size for the Java VM | The Carafe tagger is a Java application, and the default heap size may not be adequate for your model. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --stack_size <s> | stack_size | a stack size for the Java VM | The Carafe tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --tagging_pre_models <s> | tagging_pre_models | a string | If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers. This is an advanced feature that normal users will not be using. |
| --add_tokens_internally | add_tokens_internally | "yes" | If present, Carafe will use its internal tokenizer to tokenize the document before tagging. If your workflow doesn't tokenize the document, you must provide this flag, or Carafe will have no tokens to base its tagging on. We strongly recommend that you tokenize your documents separately; you should not use this flag. |
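As a sketch, a tagging step configured with these attributes might look like the following in task.xml. The <step> element, its name attribute, and the model path are hypothetical; only tagger_local, tagger_model, prior_adjust, and heap_size are the documented attributes.

```xml
<!-- Hypothetical task.xml fragment; values are illustrative only. -->
<step name="tag" class="MAT.JavaCarafe.CarafeTagStep"
      tagger_local="yes"
      tagger_model="/path/to/models/my_task_model"
      prior_adjust="-3.0"
      heap_size="2G"/>
```

With tagger_local="yes", the engine starts and stops the tagger itself rather than asking the Web server to manage it, so you pay the tagger's startup cost on every run; the negative prior_adjust value biases the tagger somewhat further toward recall.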
The Carafe training engine class is
MAT.JavaCarafe.CarafeModelBuilder.
There is only one setting here that you should change on any regular
basis:
| Command line option | XML attribute | Value | Description |
|---|---|---|---|
| --lexicon_dir <dir> | lexicon_dir | a pathname | If present, the name of a directory which contains a Carafe training lexicon. This pathname should be an absolute pathname, and should have a trailing slash. The content of the directory should be a set of files, each of which contains a sequence of tokens, one per line. The name of the file will be used as a training feature for the token. You can use this feature, for instance, to provide implicit part-of-speech information (e.g., create a file named ADJ which contains a sequence of words that are adjectives) or name information (e.g., create a file named NAME which contains a sequence of tokens which can occur in proper names). On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task. |
| --heap_size <s> | heap_size | a heap size for the Java VM | The Carafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --stack_size <s> | stack_size | a stack size for the Java VM | The Carafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
| --parallel | parallel | "yes" | If present, parallelizes the feature expectation computation, which reduces the clock time of model building when multiple CPUs are available. |
| --cpus | cpus | an integer | If --parallel is used, by default 3/4 of the available CPUs will be used. If you want to control the absolute number of CPUs, use this flag. |
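To make the lexicon convention concrete, here is a hedged sketch of a lexicon directory and a <build_settings> element that points to it. The directory path, file names, and numeric values are hypothetical; only lexicon_dir, parallel, cpus, and heap_size are the documented attributes.

```xml
<!-- Hypothetical lexicon directory (absolute path, trailing slash):
       /data/my_task/lexicon/ADJ    one adjective per line
       /data/my_task/lexicon/NAME   one proper-name token per line
     Each file name (ADJ, NAME) becomes a training feature for its tokens. -->
<build_settings lexicon_dir="/data/my_task/lexicon/"
                parallel="yes"
                cpus="4"
                heap_size="2G"/>
```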
The options in this section are documented here for completeness. If you're not familiar with the Carafe training engine and its implementation, the chances are that you'll never use any of these values. If you want to use them, someone who is knowledgeable about Carafe should set these values for you in task.xml, and unless you really know what you're doing, you should not override them on the command line.
Carafe provides the option of using non-standard training methods. One of these methods is called periodic stepsize adjustment (PSA). When used correctly, PSA is significantly faster than the normal training mechanism; however, it sometimes performs less well, in circumstances that are not yet well understood. You might prefer it if you're doing comparative analysis of multiple models, or if you're just starting off with a rough-and-ready system and don't yet need to optimize for accuracy. The --max_iterations flag governs the number of training cycles; more is not necessarily better, because the engine may overfit to the data.
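As a minimal sketch, selecting PSA might look like this in the <build_settings> for a model config (described further below); the values shown simply make the documented PSA default explicit and are not a recommendation.

```xml
<!-- Hypothetical: use PSA training with its documented default iteration count.
     Setting training_method="" would revert to the standard method. -->
<build_settings training_method="psa" max_iterations="10"/>
```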
The description of the --feature_spec flag below refers to Carafe feature spec files. Instructions for creating these files can be found in the Carafe documentation, which is located here if you're viewing this documentation via a Web server, or in src/jcarafe.../resources if you've received MAT as a tarball. Similarly, if you want details on the --gaussian_prior, --no_begin, --l1, and --l1_c flags, see the jCarafe documentation.
The documentation for --tags and --pre_models refers to an advanced
feature of Carafe where it can use tagging models to generate input
features for multi-stage tagging. We will not discuss this advanced
capability of Carafe any further.
| Command line option | XML attribute | Value | Description |
|---|---|---|---|
| --feature_spec <file> | feature_spec | a filename | Name of the file that contains the Carafe feature specification. A simple default specification will be used if none is provided. An example can be found in resources/default.fspec in your Java Carafe directory. If the filename is not an absolute filename, it will be interpreted relative to the directory of the task which is being trained for. (This is because this option is more likely to be provided in your task.xml file than on the command line.) On the command line, optional if feature_spec is set in the <build_settings> for the relevant model config in the task.xml file for the task. |
| --training_method | training_method | "psa" | If present, specifies a training method other than the standard method. Currently, the only recognized value is psa. The psa method is noticeably faster, but may result in somewhat poorer results. You can use a value of '' to override a previously specified training method (e.g., a default method in your task). |
| --max_iterations <num> | max_iterations | an integer | Number of iterations for the training mechanism to use. Current defaults are 200 for standard training, 10 for PSA training. On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task. |
| --tags | tags | a string | If present, a comma-separated list of tags to pass to the training engine instead of the full tag set for the task (used to create per-tag pre-tagging models for multi-stage training and tagging). |
| --pre_models | pre_models | a string | If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers. |
| --gaussian_prior | gaussian_prior | a float | A positive float, default is 10.0. See the jCarafe docs for details. |
| --no_begin | no_begin | "yes" | Don't introduce begin states during training. Useful if you're certain that you won't have any adjacent spans with the same label. See the jCarafe documentation for more details. |
| --l1 | l1 | "yes" | Use L1 regularization for PSA training. See the jCarafe docs for details. |
| --l1_c | l1_c | a float | Change the penalty factor for the L1 regularizer. See the jCarafe docs for details. |
| --add_tokens_internally | add_tokens_internally | "yes" | If present, Carafe will use its internal tokenizer to tokenize the document before training. If your workflow doesn't tokenize the document, you must provide this flag, or Carafe will have no tokens to base its training on. We strongly recommend that you tokenize your documents separately; you should not use this flag. |
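Putting several of these advanced options together, a <build_settings> element prepared by someone familiar with Carafe might look roughly like this. The feature spec file name and all numeric values are placeholders; only the attribute names come from the table above.

```xml
<!-- Hypothetical advanced configuration; values are illustrative only.
     feature_spec is a relative filename, so it is resolved against the
     directory of the task being trained. -->
<build_settings feature_spec="my_task.fspec"
                training_method="psa"
                max_iterations="10"
                gaussian_prior="10.0"
                l1="yes"
                l1_c="0.1"/>
```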