If you've received a previous version of MAT, this page contains
instructions on how to upgrade to the new version.
We now provide a separate document on Web server security as it
pertains to workspace access. There are a number of new options to
MATWeb to support improved security. The
most visible effect is that you can restrict access to workspaces
from the MAT UI by using the --workspace_container_directory
option when you start up the MAT Web server.
In version 1.2, the text_right_to_left attribute lived on the
workflow element in task.xml; we anticipated that different
workflows might be used for different languages within the same
task. Since then, we've realized that the task is going to be the
appropriate level of encapsulation for language differences for
the foreseeable future. Furthermore, the previous implementation of
right-to-left encoding did not work appropriately with workspaces.
Accordingly, we've moved this attribute to the web_customization
element, and it is now global to tasks.
The experiment engine has now been extended with general-purpose
iterators for sets of values and for value increments. So it's now
possible, for instance, to vary the number of model iterations
from 20 to 100 by increments of 10 without having to write a
separate model set specification for each possible value. These
iterators can be combined, in which case you'll get the
cross-product of the possible value settings, or you can define
your own iterators to get more sophisticated behavior (e.g.,
iterating over pairs of attribute-value sets). For the user, this
means that a couple of attributes have been removed from the
experiment engine, and a new set of elements and attributes has
been added.
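To make the cross-product behavior concrete, here is a small Python sketch. It is not the experiment engine's code, and the use_lexicon setting is a hypothetical name chosen for illustration; it only shows how an increment iterator (20 to 100 by 10) and a set-of-values iterator multiply out into individual settings.

```python
from itertools import product

def cross_product(**iterators):
    """Yield one settings dict per combination of the iterator values."""
    names = list(iterators)
    for combo in product(*(iterators[name] for name in names)):
        yield dict(zip(names, combo))

# max_iterations mirrors the example in the text; use_lexicon is a
# hypothetical setting, used only to show a second iterator.
settings = list(cross_product(max_iterations=range(20, 101, 10),
                              use_lexicon=["yes", "no"]))
print(len(settings))  # 9 increment values x 2 set values = 18
```

Each dict in settings corresponds to one model that would otherwise have needed its own model set specification.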
In version 1.2, all you could iterate on was corpus size. The
mechanism for this iteration has now changed. In version 1.2, this
is what you'd do:
[...]
<model_sets dir="model_sets">
<build_settings training_increment="4"
truncate_to_increment="yes"/>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
In version 1.3, it looks like this instead:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
You can see that the size processing has been removed from the
<build_settings> element and added to a new <corpus_settings>
element, which contains an instance of the new <iterator>
element to specify the type of the iteration. See the documentation and examples for the
experiment engine for more details. Note that in version 1.2, you
had to specify explicitly that the iteration ends on an increment
exactly; in 1.3 this is the default, and to force the final corpus
size to be used, you'll need the force_last attribute:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4" force_last="yes"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
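The difference between the default behavior and force_last can be sketched as follows. This is our reading of the description above, not the engine's implementation: by default the sizes stop at the last exact multiple of the increment, and with force_last the full corpus size is added as a final step.

```python
def corpus_sizes(total, increment, force_last=False):
    """Return the training corpus sizes visited by a corpus_size iteration.

    Assumption: sizes are multiples of the increment, and force_last
    appends the full corpus size when it is not already a multiple.
    """
    sizes = list(range(increment, total + 1, increment))
    if force_last and (not sizes or sizes[-1] != total):
        sizes.append(total)
    return sizes

print(corpus_sizes(10, 4))                   # [4, 8]
print(corpus_sizes(10, 4, force_last=True))  # [4, 8, 10]
```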
The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.
In order to support the iterators in the experiment engine, we've
reorganized the structure of the experiment directory somewhat.
See the documentation on MATExperimentEngine
for details.
It is now possible to run MAT in Windows without Cygwin
installed.
Unlike previous versions, there is a single distribution bundle
for MAT 1.2 for all supported platforms. For compatibility with
Windows, this bundle is now a zip file.
If you use mat_controller.sh or mat_controller.bat under Windows,
you'll find that there's a new tabbed terminal tool we're using,
which has the advantage of not requiring Cygwin.
If you're using mat_controller.sh under MacOS X, and you intend
to install 10.6, note that the previous version of Terminator.app,
which supports the tabbed terminal behavior in mat_controller.sh,
will not work in 10.6; you must install the newer version provided
with MAT 1.2.
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the Java tokenizer produces slightly
different token boundaries than the original OCaml tokenizer. This
is problematic because the entire basis of most annotation
systems, including MAT, is the subdivision into words (tokens). In
order to have optimal performance, the tokenization of documents
which are to be automatically tagged should match the tokenization
of the documents which were used to create the tagger model. This
means that in order to migrate from version 1.1 to version 1.2,
among other things, you must retokenize your documents and update
any references to the OCaml tokenizer.
First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back
up your data before you run this utility.
Next, if you refer to a tokenization step implementation in your
task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTokenizationStep to
MAT.JavaCarafe.CarafeTokenizationStep. You may also need to
specify the heap_size attribute on the relevant tokenization
<step> in any workflow, if it turns out that the
default Java heap size isn't large enough for your purposes (this
attribute can also be specified on the command line; see the Carafe engine documentation).
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the model format for the Java engine
is completely different from the model format for the original
OCaml engine. This means that you must rebuild all your models,
and update any references to the OCaml trainer/tagger.
First, retokenize your documents using MATRetokenize, as
described above, and update your tokenization steps.
Next, update your tagger and trainer settings in task.xml
according to the documentation provided for the Carafe engine.
Next, if you refer to a tagging step in your task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTagStep to
MAT.JavaCarafe.CarafeTagStep. You may also need to specify the
heap_size attribute on the relevant tag <step> in any
workflow, if it turns out that the default Java heap size
isn't large enough for your purposes (this attribute can also be
specified on the command line; see the Carafe engine documentation).
Similarly, if you have a <model_build_settings> entry, you
must change all occurrences of
MAT.CarafeModelBuilder.CarafeModelBuilder to
MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the
heap_size attribute as well. (Note below that you must also change
the syntax of <model_build_settings>.)
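If you have many task.xml files, the three class renames described above can be applied mechanically. The following Python sketch does a plain text substitution, so comments and formatting in the file are preserved; it does not add heap_size or change any other syntax, and the file path in the usage note is a placeholder.

```python
# Old class name -> new class name, as listed in the text above.
RENAMES = {
    "MAT.PluginMgr.CarafeTokenizationStep":
        "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep":
        "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder":
        "MAT.JavaCarafe.CarafeModelBuilder",
}

def rename_classes(task_xml_text):
    """Return the task.xml text with the OCaml-era class names replaced."""
    for old, new in RENAMES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text
```

To use it, read the file into a string, call rename_classes, and write the result back (after backing up the original). Running it twice is harmless, since none of the new names contains an old one.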
Note that for the tagger, the prior_adjst attribute has been
renamed to prior_adjust. For the trainer, the engine attribute has
been eliminated, and the feature_set attribute as well; there's
now a new feature_spec attribute which refers to a file in which
you can describe your feature set, if you don't want to use the
default feature set. Also, the psa_iterations flag has been
removed, due to more numerous options in the Carafe trainer;
psa_iterations="6"
becomes
training_method="psa" max_iterations="6"
Because PSA no longer requires random segments, the
no_random_psa_segments flag has been removed.
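The attribute changes above can be summarized as a pair of small Python functions operating on attribute dictionaries. This is only a sketch of the renames and removals listed in the text; in particular, it cannot produce a feature_spec value for you, because that attribute refers to a file you would have to write yourself.

```python
def migrate_trainer_settings(attrs):
    """Return a copy of a trainer attribute dict using the new names."""
    new = dict(attrs)
    # These attributes no longer exist. feature_spec replaces feature_set,
    # but it names a file, so it cannot be derived automatically here.
    for removed in ("engine", "feature_set", "no_random_psa_segments"):
        new.pop(removed, None)
    if "psa_iterations" in new:
        new["training_method"] = "psa"
        new["max_iterations"] = new.pop("psa_iterations")
    return new

def migrate_tagger_settings(attrs):
    """Return a copy of a tagger attribute dict with prior_adjst renamed."""
    new = dict(attrs)
    if "prior_adjst" in new:
        new["prior_adjust"] = new.pop("prior_adjst")
    return new
```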
Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.
In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.
In order to support a more flexible way of specifying partitions
in experiments, we've changed the way partitions are specified in
the experiment XML files. We compare the relevant files below:
Version 1.1:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>
Version 1.2:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>
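Note the following changes: the single <partition> element, with its split_fraction and ctype attributes, is replaced by one named <partition> element per partition, each with a fraction attribute; <model_set> no longer has a corpus attribute, and instead names its corpus and partition in a <training_corpus> child element; and <run> likewise names its corpus and partition in a <test_corpus> child element.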
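The differences between the two versions of the file can also be expressed as a transformation. The Python sketch below rewrites a version 1.1 experiment element into the version 1.2 layout shown above. It is inferred from this one pair of examples, and it assumes that split_fraction was the test fraction and that the partitions should be named train and test; check the result against the MATExperimentEngine documentation before relying on it.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

def migrate_experiment(root):
    """Rewrite a 1.1-style <experiment> element in place to the 1.2 layout."""
    corpora = root.find("corpora")
    old = corpora.find("partition")
    if old is not None and "split_fraction" in old.attrib:
        # Assumption: split_fraction is the share held out for testing.
        test_fraction = Decimal(old.get("split_fraction"))
        index = list(corpora).index(old)
        corpora.remove(old)
        new_parts = [("train", 1 - test_fraction), ("test", test_fraction)]
        for offset, (name, fraction) in enumerate(new_parts):
            part = ET.Element("partition", name=name, fraction=str(fraction))
            corpora.insert(index + offset, part)
    # The corpus attribute becomes a child element naming a partition.
    for model_set in list(root.iter("model_set")):
        corpus = model_set.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(model_set, "training_corpus",
                          corpus=corpus, partition="train")
    for run in list(root.iter("run")):
        corpus = run.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(run, "test_corpus", corpus=corpus, partition="test")
    return root
```

Fractions are written in normalized form (for example 0.8 rather than .8), which denotes the same value.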
In order to clarify how task settings are handled in MAT, a
number of changes have been made to the task.xml
file syntax.
First, the <step> element of <step_implementations>
no longer accepts arbitrary attributes. If you made use of this
feature to pass settings to the initialization methods of workflow
steps, you must now use the <create_settings> child element.
We doubt that anyone has made use of this feature.
Second, the <step> element of <workflow> no longer
accepts arbitrary attributes. If you make use of this feature to
pass settings to workflow steps, you must now use the
<create_settings>, <ui_settings>, or
<run_settings> child elements. The most likely situation
where this might arise is in passing defaults to the run methods
of steps. For instance, if you used this feature to increase the
Java heap size for Java Carafe, your task.xml file would have to
be revised as follows.
Version 1.1:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...
Version 1.2:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
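For task.xml files with many workflows, the change shown above can be scripted. This Python sketch moves every attribute other than name from a workflow step into a run_settings child. That is an assumption that suits the heap_size case; attributes that were really meant for create_settings or ui_settings would need to be sorted by hand.

```python
import xml.etree.ElementTree as ET

def move_step_attrs_to_run_settings(workflows):
    """Move every attribute except name from <step> into <run_settings>.

    Pass the <workflows> element only, so that <step> elements under
    <step_implementations> are left alone.
    """
    for step in list(workflows.iter("step")):
        extras = {k: v for k, v in step.attrib.items() if k != "name"}
        if not extras:
            continue
        for key in extras:
            del step.attrib[key]
        ET.SubElement(step, "run_settings", extras)
    return workflows
```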
Third, the way settings are specified for model configurations
has changed. The name and class for the configuration are now
separated from the settings which are passed to the model builder,
as follows.
Version 1.1:
...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6"/>
...
Version 1.2:
...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
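The same change can be sketched in Python: the class (and the name, if there is one) stays on the outer element, and everything else moves to a build_settings child. This follows the example above and nothing more; it has not been checked against other model builder settings.

```python
import xml.etree.ElementTree as ET

def to_model_config(old):
    """Build a 1.2 <model_config> from a 1.1 <model_build_settings>."""
    attrs = dict(old.attrib)
    config = ET.Element("model_config")
    # name and class identify the configuration; the rest are settings.
    for key in ("name", "class"):
        if key in attrs:
            config.set(key, attrs.pop(key))
    ET.SubElement(config, "build_settings", attrs)
    return config
```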
Finally, the <workflow> element no longer accepts arbitrary
settings; these settings must be passed using the
<ui_settings> child element. No task appears to use this
option yet, so this shouldn't affect anyone.
In order to support a more flexible way of invoking the MAT
engine in experiments, the way the configuration of experiments is
cached has changed in version 1.1. What this means is that you
will not be able to invoke MATExperimentEngine on experiment
directories created using version 1.0 to regenerate the experiment
scores.
In order to support a more flexible way of invoking the MAT
engine in experiments, we've changed the way corpus preprocessing
and test run processing are specified. In version 1.0, the MAT
engine was called as a command-line tool, and the options were
specified as a command line; in version 1.1, the options are
specified as XML attribute-value pairs. We compare the relevant
experiment XML blocks below:
Version 1.0:
<corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>
Version 1.1:
<corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
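Converting an old command line into attribute-value pairs is mostly mechanical. The Python sketch below uses shlex to respect shell quoting, and assumes that every option takes exactly one value, which holds for the examples above but may not hold for every option you have used.

```python
import shlex

def options_to_attributes(command_line):
    """Turn '--name value' pairs into a dict of XML attribute values."""
    tokens = shlex.split(command_line)
    if len(tokens) % 2:
        raise ValueError("expected option/value pairs: %r" % command_line)
    attrs = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError("expected an option, got %r" % flag)
        attrs[flag[2:]] = value
    return attrs

print(options_to_attributes("--steps zone,tokenize,tag --workflow Demo"))
# {'steps': 'zone,tokenize,tag', 'workflow': 'Demo'}
```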
Version 1.1 adds the ability to define different training
engines. Because of this change, if you've defined your own task
and you specified model build settings in your task.xml file, you
must add a class attribute to the model_build_settings element.
This attribute is not optional, and there is no default. If you're
using the default Carafe engine, the value you should use for this
attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the
following example:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
Version 1.1 adds the ability to import MAT JSON documents into
your workspaces which haven't yet been processed (as well as other
annotation formats, like XML inline). Because of this change, if
you have a workspace, you must add a directory to it. This
directory is expected by the MAT workspace engine. For each
workspace directory, do this:
% mkdir <workspace_dir>/folders/rich_incoming
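If you have several workspaces, or are working on Windows without a Unix shell, the same step can be done from Python. This sketch deliberately does not create the folders directory itself, so a path that isn't a workspace fails rather than being silently populated.

```python
from pathlib import Path

def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace that lacks it.

    Returns the list of directories that were actually created.
    """
    created = []
    for workspace in workspace_dirs:
        target = Path(workspace) / "folders" / "rich_incoming"
        if not target.exists():
            # Raises FileNotFoundError if <workspace>/folders is missing.
            target.mkdir()
            created.append(target)
    return created
```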
In version 1.1, it's possible to have multiple model build
configurations in your task.xml file. In order to ensure that the
correct configuration adds the appropriate command line options to
the MATModelBuilder executable, it was necessary to introduce a
new restriction on the --task option for MATModelBuilder: if
it appears, it must now be the first command-line option. In other
words, the following will now raise an error:
% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
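If you call MATModelBuilder from your own wrapper scripts, one way to guard against this error is to move the --task option to the front of the argument list before invoking the tool. The helper below is a sketch of that idea in Python; it is not part of MAT.

```python
def task_first(args):
    """Return args with '--task VALUE' moved to the front, if present."""
    args = list(args)
    if "--task" not in args:
        return args
    i = args.index("--task")
    # Take the option and its value together, then keep the rest in order.
    return args[i:i + 2] + args[:i] + args[i + 2:]

print(task_first(["--input_dir", "/path/to/my/other/docs",
                  "--task", "Named Entity"]))
# ['--task', 'Named Entity', '--input_dir', '/path/to/my/other/docs']
```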
In version 1.0, the default model was defined within the model
build settings. In version 1.1, because of the presence of
multiple model build configurations, we've separated the
specification of the default model in task.xml.
Version 1.0:
<model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>
Version 1.1:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>
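Both version 1.1 changes to model_build_settings (the required class attribute and the separate default_model element) can be applied together. The Python sketch below assumes you are using the default Carafe engine, and leaves it to the caller to insert the returned element after the settings element, as in the example above.

```python
import xml.etree.ElementTree as ET

DEFAULT_CLASS = "MAT.CarafeModelBuilder.CarafeModelBuilder"

def split_default_model(settings):
    """Update a 1.0 <model_build_settings> element in place.

    Adds the class attribute if it is missing and removes default_model.
    Returns a new <default_model> element, or None if none was set.
    """
    settings.attrib.setdefault("class", DEFAULT_CLASS)
    name = settings.attrib.pop("default_model", None)
    if name is None:
        return None
    default = ET.Element("default_model")
    default.text = name
    return default
```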