If you've received a previous version of MAT, this page contains
instructions on how to upgrade to the new version.
It is now possible to run MAT in Windows without Cygwin installed.
Unlike previous versions, there is a single distribution bundle for
MAT 1.2 for all supported platforms. For compatibility with Windows,
this bundle is now a zip file.
If you use mat_controller.sh or mat_controller.bat under Windows,
you'll find that we're using a new tabbed terminal tool, which has the
advantage of not requiring Cygwin.
If you're using mat_controller.sh under MacOS X, and you intend to
install 10.6, note that the previous version of Terminator.app, which
supports the tabbed terminal behavior in mat_controller.sh, will not
work in 10.6; you must install the newer version provided with MAT 1.2.
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations. There
are a number of important changes that are required as a result. Among
other things, the Java tokenizer produces slightly different token
boundaries than the original OCaml tokenizer. This is problematic
because the entire basis of most annotation systems, including MAT, is
the subdivision into words (tokens). In order to have optimal
performance, the tokenization of documents which are to be
automatically tagged should match the tokenization of the documents
which were used to create the tagger model. This means that in order to
migrate from version 1.1 to version 1.2, among other things, you must
retokenize your documents and update any references to the OCaml
tokenizer.
First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back up
your data before you run this utility.
Next, if you refer to a tokenization step implementation in your task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTokenizationStep to
MAT.JavaCarafe.CarafeTokenizationStep. You may also need to specify the
heap_size attribute on the relevant tokenization <step> in any
workflow, if it turns out that the default Java heap size isn't
large enough for your purposes (this attribute can also be specified on
the command line; see the Carafe engine
documentation).
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger
have been replaced by the Java reimplementations. There are a number of
important changes that are required as a result. Among other things,
the model format for the Java engine is completely different from the
model format for the original OCaml trainer/tagger. This means that you must
rebuild all your models, and update any references to the OCaml
trainer/tagger.
First, retokenize your documents using MATRetokenize, as described
above, and update your tokenization steps.
Next, update your tagger and trainer settings in task.xml according to
the documentation provided for the Carafe engine.
Next, if you refer to a tagging step in your task.xml
file, you must change all occurrences of MAT.PluginMgr.CarafeTagStep to
MAT.JavaCarafe.CarafeTagStep. You may also need to specify the
heap_size attribute on the relevant tag <step> in any
workflow,
if it turns out that the default Java heap size isn't large enough for
your purposes (this attribute can also be specified on the command
line; see the Carafe engine
documentation). Similarly, if you have a <model_build_settings>
entry, you must change all occurrences of
MAT.CarafeModelBuilder.CarafeModelBuilder to
MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the heap_size
attribute as well. (Note below that you must also change the syntax of
<model_build_settings>.)
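The three class renames described so far can be applied mechanically. The following is a minimal sketch, not part of MAT: the function name rename_carafe_classes is hypothetical, and it assumes the old class names appear in your task.xml only as references you want to update. Back up task.xml before overwriting it.

```python
# Old OCaml-era class names mapped to their Java replacements,
# exactly as listed in the text above.
CLASS_RENAMES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def rename_carafe_classes(task_xml_text):
    """Return the task.xml text with every old class name replaced.

    Hypothetical helper: plain string replacement, no XML parsing.
    """
    for old, new in CLASS_RENAMES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text
```

This only renames classes; the heap_size attribute and the syntax change to <model_build_settings> still have to be handled separately.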
Note that for the tagger, the prior_adjst attribute has been renamed
to prior_adjust. For the trainer, the engine and feature_set attributes
have been eliminated; there's now a new feature_spec attribute which
refers to a file in which you can describe your feature set, if you
don't want to use the default feature set. Also, the psa_iterations
flag has been removed, because the Carafe trainer now offers more
options;
psa_iterations="6"
becomes
training_method="psa" max_iterations="6"
Because PSA no longer requires random segments, the
no_random_psa_segments flag has been removed.
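The attribute changes above can be summarized as a mapping over attribute dictionaries. This is an illustrative sketch only: the function names are hypothetical, and it assumes you have already read the attributes of your tagger and trainer settings into plain dictionaries. Note that feature_set is simply dropped, since the feature_spec file that replaces it has to be written by hand.

```python
def migrate_trainer_attrs(attrs):
    """Return a copy of version 1.1 trainer attributes rewritten for 1.2."""
    new = dict(attrs)
    # psa_iterations="N" becomes training_method="psa" max_iterations="N".
    if "psa_iterations" in new:
        new["max_iterations"] = new.pop("psa_iterations")
        new["training_method"] = "psa"
    # These attributes no longer exist in 1.2. feature_set has no automatic
    # replacement: feature_spec points to a file you must create yourself.
    for removed in ("engine", "feature_set", "no_random_psa_segments"):
        new.pop(removed, None)
    return new


def migrate_tagger_attrs(attrs):
    """Return a copy of version 1.1 tagger attributes rewritten for 1.2."""
    new = dict(attrs)
    # prior_adjst was renamed to prior_adjust.
    if "prior_adjst" in new:
        new["prior_adjust"] = new.pop("prior_adjst")
    return new
```

For example, passing {"engine": "anonTrain.native", "feature_set": "ANON-1", "psa_iterations": "6"} to migrate_trainer_attrs yields a dictionary containing only training_method="psa" and max_iterations="6".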
Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.
In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.
In order to support a more flexible way of specifying partitions in
experiments, we've changed the way partitions are specified in the
experiment XML files. We compare the relevant files below:
Version 1.1:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>
Version 1.2:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>
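Note the following changes: the single <partition> element with split_fraction and ctype attributes is replaced by one named <partition> element per partition, each with a fraction attribute; the corpus attribute of <model_set> is replaced by a <training_corpus> child element which names a corpus and a partition; and the corpus attribute of <run> is replaced by a <test_corpus> child element which likewise names a corpus and a partition.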
In order to clarify how task settings are handled in MAT, a number
of changes have been made to the task.xml
file syntax.
First, the <step> element of <step_implementations> no
longer accepts arbitrary attributes. If you made use of this feature to
pass settings to the initialization methods of workflow steps, you must
now use the <create_settings> child element. We doubt that anyone
has made use of this feature.
Second, the <step> element of <workflow> no longer
accepts arbitrary attributes. If you make use of this feature to pass
settings to workflow steps, you must now use the
<create_settings>, <ui_settings>, or <run_settings>
child elements. The most likely situation where this might arise is in
passing defaults to the run methods of steps. For instance, if you used
this feature to increase the Java heap size for Java Carafe, your
task.xml file would have to be revised as follows.
Version 1.1:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...
Version 1.2:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
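The heap_size rewrite shown above can be sketched with Python's standard ElementTree module. This is a hypothetical helper, not a MAT tool, and it makes one loud assumption: every attribute other than name on a workflow <step> is a run-time setting and belongs in <run_settings>. If some of your attributes are really creation or UI settings, move those into <create_settings> or <ui_settings> by hand.

```python
import xml.etree.ElementTree as ET


def move_step_attrs_to_run_settings(workflows_xml):
    """Move extra attributes of workflow steps into <run_settings> children.

    Takes the text of a <workflows> element and returns the rewritten text.
    Assumption: all attributes except "name" are run settings.
    """
    root = ET.fromstring(workflows_xml)
    for workflow in root.iter("workflow"):
        for step in workflow.findall("step"):
            extra = {k: v for k, v in step.attrib.items() if k != "name"}
            if not extra:
                continue
            # Remove the extra attributes from the step itself...
            for key in extra:
                del step.attrib[key]
            # ...and carry them on a new <run_settings> child instead.
            ET.SubElement(step, "run_settings", extra)
    return ET.tostring(root, encoding="unicode")
```

ElementTree does not preserve the original indentation or comments, so treat the output as a guide for editing task.xml rather than as a drop-in replacement.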
Third, the way settings are specified for model configurations has
changed. The name and class for the configuration are now separated
from the settings which are passed to the model builder, as follows.
Version 1.1:
...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6"/>
...
Version 1.2:
...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
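The same restructuring can be sketched in Python with the standard ElementTree module. The helper name to_model_config is hypothetical, and the sketch assumes the old element carries only the class attribute plus builder settings, as in the example above; any other identifying attribute would need to be kept on <model_config> by hand.

```python
import xml.etree.ElementTree as ET


def to_model_config(old_xml):
    """Rewrite a <model_build_settings> element as a <model_config> element.

    The class attribute stays on <model_config>; every other attribute is
    moved to a <build_settings> child. Raises KeyError if class is missing.
    """
    old = ET.fromstring(old_xml)
    attrs = dict(old.attrib)
    config = ET.Element("model_config", {"class": attrs.pop("class")})
    ET.SubElement(config, "build_settings", attrs)
    return ET.tostring(config, encoding="unicode")
```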
Finally, the <workflow> element no longer accepts arbitrary
settings; these settings must be passed using the <ui_settings>
child element. No task appears to use this option yet, so this
shouldn't affect anyone.
In order to support a more flexible way of invoking the MAT engine
in experiments, the way the configuration of experiments is cached has
changed in version 1.1. What this means is that you will not be able to
invoke MATExperimentEngine on experiment directories created using
version 1.0 to regenerate the experiment scores.
In order to support a more flexible way of invoking the MAT engine
in experiments, we've changed the way corpus preprocessing and test run
processing are specified. In version 1.0, the MAT engine was called as
a command-line tool, and the options were specified as a command line;
in version 1.1, the options are specified as XML attribute-value pairs.
We compare the relevant experiment XML blocks below:
Version 1.0:
<corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>
Version 1.1:
<corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
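The conversion from a version 1.0 command line to version 1.1 attribute-value pairs is mechanical, and can be sketched as follows. The function options_to_element is a hypothetical helper, and it assumes every option has the form "--name value"; options without a value are rejected rather than guessed at.

```python
import shlex
from xml.sax.saxutils import quoteattr


def options_to_element(tag, option_string):
    """Turn "--name value" pairs into an empty XML element with attributes."""
    # shlex.split honors shell-style quoting, e.g. 'zone,tokenize,align'.
    tokens = shlex.split(option_string)
    attrs = []
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if not name.startswith("--") or i + 1 >= len(tokens):
            raise ValueError("expected '--option value' pairs, got %r" % name)
        # quoteattr escapes the value and wraps it in quotes.
        attrs.append("%s=%s" % (name[2:], quoteattr(tokens[i + 1])))
        i += 2
    return "<%s %s/>" % (tag, " ".join(attrs))
```

Calling options_to_element("prep", "--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'") produces the <prep> element shown in the version 1.1 example.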
Version 1.1 adds the ability to define different training engines.
Because of this change, if you've defined your own task and you
specified model build settings in your task.xml file, you must add a
class attribute to the model_build_settings element. This attribute is
not optional, and there is no default. If you're using the default
Carafe engine, the value you should use for this attribute is
MAT.CarafeModelBuilder.CarafeModelBuilder, as in the following example:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
Version 1.1 adds the ability to import MAT JSON documents into your
workspaces which haven't yet been processed (as well as other
annotation formats, like XML inline). Because of this change, if you
have a workspace, you must add a directory to it. This directory is
expected by the MAT workspace engine. For each workspace directory, do
this:
% mkdir <workspace_dir>/folders/rich_incoming
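If you have many workspaces, the same step can be scripted. This is a small sketch using only the Python standard library; add_rich_incoming is a hypothetical name, and the sketch assumes each workspace already contains a folders directory, just as the mkdir command above does.

```python
import os


def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace; return the new paths."""
    created = []
    for workspace in workspace_dirs:
        path = os.path.join(workspace, "folders", "rich_incoming")
        # Skip workspaces which already have the directory.
        if not os.path.isdir(path):
            # os.mkdir, like plain mkdir, fails if "folders" is missing.
            os.mkdir(path)
            created.append(path)
    return created
```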
In version 1.1, it's possible to have multiple model build
configurations in your task.xml file. In order to ensure that the
correct configuration adds the appropriate command line options to the
MATModelBuilder executable, it was necessary to introduce a new
restriction on the --task option for MATModelBuilder: if it
appears, it must now be the first command-line option. In other words,
the following will now raise an error:
% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
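If you generate MATModelBuilder command lines from a script, the new ordering rule can be enforced before the command is run. The helper below is a hypothetical sketch which operates on an argument list (everything after the executable name) and assumes --task is followed by exactly one value.

```python
def task_option_first(args):
    """Return a copy of args with "--task VALUE" moved to the front."""
    args = list(args)
    if "--task" in args:
        i = args.index("--task")
        # Take the option together with its value.
        pair = args[i:i + 2]
        del args[i:i + 2]
        args = pair + args
    return args
```

Applied to the failing example above, this yields an argument list which begins with --task "Named Entity", followed by the remaining options in their original order.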
In version 1.0, the default model was defined within the model build
settings. In version 1.1, because of the presence of multiple model
build configurations, we've moved the specification of the default
model into a separate element in task.xml.
Version 1.0:
<model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>
Version 1.1:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>