If you have a previous version of MAT, this page explains
how to upgrade to the new version.
Support for Cygwin has been removed because Python in Cygwin
does not support sqlite, which the MAT workspaces in 2.0
require. If you were running MAT under Cygwin, migrate to native Windows.
Because MAT now explicitly defines the annotations and
well-formedness conditions for attributes separately from its
display information, the task.xml file has been reorganized. You
can use the MATUpdateTaskXML
tool to update your task.xml file automatically.
The version of jCarafe which is delivered with MAT 2.0 is
0.9.8.5.b-06, which has a different model structure than the
version delivered with 1.3. You must rebuild all your models,
either using MATModelBuilder
(in file mode) or the "modelbuild" operation of MATWorkspaceEngine (in
workspace mode).
The 1.3 UI used a desktop-in-a-browser metaphor, which raised a
number of issues, including poor use of screen real estate. In
2.0, we've completely reorganized the UI, and changed the URL.
In previous releases, you had no way to pass command-line
options to the MATWeb server running under the tabbed terminal.
As the command-line options to MATWeb grew in number and
importance, this limitation became a problem. As a
result, we've now reorganized the tabbed terminal startup so that
it's part of MATWeb. The mat_controller.sh application is gone.
The Windows mat_controller.bat script is still present, but it
simply invokes MATWeb with the --spawn_tabbed_terminal option.
We have completely reorganized the internal structure of workspaces for 2.0. These new
workspaces are more powerful and impose fewer requirements on the
user. Your MAT 1.3 workspaces cannot be used with MAT 2.0 without
modification. We've provided an upgrade
tool which will allow you to convert your MAT 1.3 workspaces
to MAT 2.0.
The new workspaces feature many fewer folders; a SQLite database
which manages the document state information; real transaction and
file locking; document assignment, potentially to multiple
annotators; extensive logging capabilities; and infrastructure for
future capabilities like reconciliation and complex reconciliation
workflows, prioritization queues, and segment-by-segment
annotation.
As a result of this change, it's no longer possible to run an
experiment against a workspace by pointing to, e.g., the
"completed" folder. So as part of this change, there's now special
support for running experiments against workspaces, both from MATWorkspaceEngine and MATExperimentEngine.
MATScore and MATExperimentEngine have
long supported writing one of three CSV file formats (Excel
formulas, OpenOffice formulas, and no formulas). In 2.0, you can
now write multiple formats in the same run, and the name of each
CSV file clearly indicates the formula type. As a result, the
--no_csv_formulas and --oo_separator command-line options have
been removed, and replaced with --csv_formula_output.
Because the scorer now provides mismatch details for all
conditions, the flag that requests them has been renamed to
--tag_output_mismatch_details.
Due to enhancements to the scorer, some of the columns in the
output spreadsheets have been renamed or moved, and others have a
slightly different interpretation. See the scorer documentation for full details.
In previous releases, we deprecated, but retained, the "operate"
operation in MATWorkspaceEngine. This operation has finally been
removed in 2.0. If you had still been doing something like this:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory operate core modelbuild
you should now do this:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory modelbuild core
See the workspace
documentation for more details.
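If you have scripts that still build the old command line, the rewrite can be sketched as follows. This is not part of MAT; the helper name drop_operate is hypothetical, and it assumes that only the four-argument form shown above (workspace, "operate", folder, operation) needs converting.

```python
def drop_operate(args):
    """Rewrite [workspace, "operate", folder, operation] to the 2.0 order.

    Assumption: only the four-argument form from the example is handled;
    any other argument list is returned unchanged.
    """
    if len(args) == 4 and args[1] == "operate":
        workspace, _, folder, operation = args
        return [workspace, operation, folder]
    return list(args)
```

For the example above, drop_operate(["mydirectory", "operate", "core", "modelbuild"]) yields the argument order used in the 2.0 command.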
In version 1.2, the text_right_to_left attribute lived on the
workflow element in task.xml; we anticipated that different
workflows might be used for different languages within the same
task. Since then, we've realized that the task is going to be the
appropriate level of encapsulation for language differences for
the foreseeable future. Furthermore, the current implementation of
right-to-left encoding did not work appropriately with workspaces.
Accordingly, we've moved this attribute to the web_customization
element, and it is now global to tasks.
The experiment engine has now been extended with general-purpose
iterators for sets of values and for value increments. So it's now
possible, for instance, to vary the number of model iterations
from 20 to 100 by increments of 10 without having to write a
separate model set specification for each possible value. These
iterators can be combined, in which case you'll get the
cross-product of the possible value settings, or you can define
your own iterators to get more sophisticated behavior (e.g.,
iterating over pairs of attribute-value sets). For the user, this
means that a couple of attributes have been removed from the
experiment engine, and a new set of elements and attributes has
been added.
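To make the idea of increments and cross-products concrete, here is a small standalone Python sketch. It does not use the experiment engine; it only enumerates the value combinations that two combined iterators would cover. The second training method name ("other") is a placeholder, and increment_values is a hypothetical helper.

```python
from itertools import product


def increment_values(start, end, step):
    """Values from start to end (inclusive), advancing by step."""
    return list(range(start, end + 1, step))


# Vary the number of model iterations from 20 to 100 by increments of 10.
max_iterations = increment_values(20, 100, 10)

# A second iterator over an explicit set of values ("other" is a placeholder).
training_methods = ["psa", "other"]

# Combining the two iterators gives every pairing of their values.
combinations = list(product(training_methods, max_iterations))
```

With nine iteration counts and two methods, the cross-product contains eighteen settings, each of which would otherwise need its own model set specification.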
In version 1.2, all you could iterate on was corpus size. The
mechanism for this iteration has now changed. In version 1.2, this
is what you'd do:
[...]
<model_sets dir="model_sets">
<build_settings training_increment="4"
truncate_to_increment="yes"/>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
In version 1.3, it looks like this instead:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
You can see that the size processing has been removed from the
<build_settings> and added to a new <corpus_settings>
element, which contains an instance of the new <iterator>
element to specify the type of the iteration. See the documentation and examples for the
experiment engine for more details. Note that in version 1.2, you
had to specify explicitly that the iteration ends on an increment
exactly; in 1.3 this is the default, and to force the final corpus
size to be used, you'll need the force_last attribute:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4" force_last="yes"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
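The effect of force_last can be illustrated with a short sketch. This is one reading of the description above, not the engine's actual code: it assumes the iterator yields multiples of the increment and that force_last appends the full corpus size when it is not itself a multiple. The function name corpus_sizes and the total parameter are hypothetical.

```python
def corpus_sizes(total, increment, force_last=False):
    """Corpus sizes an increment-based iteration would visit.

    Assumption: sizes are multiples of `increment` up to `total`; with
    force_last, the full corpus size is added if it is not already present.
    """
    sizes = list(range(increment, total + 1, increment))
    if force_last and (not sizes or sizes[-1] != total):
        sizes.append(total)
    return sizes
```

Under these assumptions, a corpus of 10 documents with an increment of 4 gives sizes 4 and 8 by default, and 4, 8 and 10 with force_last.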
The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.
In order to support the iterators in the experiment engine, we've
reorganized the structure of the experiment directory somewhat.
See the documentation on MATExperimentEngine
for details.
It is now possible to run MAT in Windows without Cygwin
installed.
Unlike previous versions, there is a single distribution bundle
for MAT 1.2 for all supported platforms. For compatibility with
Windows, this bundle is now a zip file.
If you use mat_controller.sh or mat_controller.bat under Windows,
you'll find that there's a new tabbed terminal tool we're using,
which has the advantage of not requiring Cygwin.
If you're using mat_controller.sh under MacOS X, and you intend
to install 10.6, note that the previous version of Terminator.app,
which supports the tabbed terminal behavior in mat_controller.sh,
will not work in 10.6; you must install the newer version provided
with MAT 1.2.
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the Java tokenizer produces slightly
different token boundaries than the original OCaml tokenizer. This
is problematic because the entire basis of most annotation
systems, including MAT, is the subdivision into words (tokens). In
order to have optimal performance, the tokenization of documents
which are to be automatically tagged should match the tokenization
of the documents which were used to create the tagger model. This
means that in order to migrate from version 1.1 to version 1.2,
among other things, you must retokenize your documents and update
any references to the OCaml tokenizer.
First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back
up your data before you run this utility.
Next, if you refer to a tokenization step implementation in your
task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTokenizationStep to
MAT.JavaCarafe.CarafeTokenizationStep. You may also need to
specify the heap_size attribute on the relevant tokenization
<step> in any workflow, if it turns out that the
default Java heap size isn't large enough for your purposes (this
attribute can also be specified on the command line; see the Carafe engine documentation).
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the model format for the Java engine
is completely different from the model format for the original
OCaml trainer/tagger. This means that you must rebuild all your models,
and update any references to the OCaml trainer/tagger.
First, retokenize your documents using MATRetokenize, as
described above, and update your tokenization steps.
Next, update your tagger and trainer settings in task.xml
according to the documentation provided for the Carafe engine.
Next, if you refer to a tagging step in your task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTagStep to
MAT.JavaCarafe.CarafeTagStep. You may also need to specify the
heap_size attribute on the relevant tag <step> in any
workflow, if it turns out that the default Java heap size
isn't large enough for your purposes (this attribute can also be
specified on the command line; see the Carafe engine documentation).
Similarly, if you have a <model_build_settings> entry, you
must change all occurrences of
MAT.CarafeModelBuilder.CarafeModelBuilder to
MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the
heap_size attribute as well. (Note below that you must also change
the syntax of <model_build_settings>.)
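The three class renames described above can be applied with a plain text substitution. The sketch below is not a MAT tool; rename_classes is a hypothetical helper that operates on the text of task.xml, so it makes no assumption about which element or attribute holds the class name. Back up task.xml before writing the result over it.

```python
# Old OCaml-era class names mapped to their Java replacements.
RENAMES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def rename_classes(task_xml_text):
    """Return the task.xml text with every old class name replaced."""
    for old, new in RENAMES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text
```

Any heap_size settings still have to be added by hand, since the right value depends on your data.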
Note that for the tagger, the prior_adjst attribute has been
renamed to prior_adjust. For the trainer, the engine attribute has
been eliminated, and the feature_set attribute as well; there's
now a new feature_spec attribute which refers to a file in which
you can describe your feature set, if you don't want to use the
default feature set. Also, the psa_iterations flag has been
removed, due to more numerous options in the Carafe trainer;
psa_iterations="6"
becomes
training_method="psa" max_iterations="6"
Because PSA no longer requires random segments, the
no_random_psa_segments flag has been removed.
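The attribute changes above can be summarized as a mapping over the old settings. The following sketch is illustrative only; the function names are hypothetical, and it works on plain dictionaries of attribute names and values rather than on task.xml itself. Removed attributes are reported rather than converted, because a feature_set value has no automatic equivalent as a feature_spec file.

```python
REMOVED_TRAINER_ATTRS = ("engine", "feature_set", "no_random_psa_segments")


def migrate_trainer_settings(attrs):
    """Return (new_attrs, dropped_names) for a dict of trainer settings."""
    new = dict(attrs)
    dropped = []
    for name in REMOVED_TRAINER_ATTRS:
        if name in new:
            del new[name]
            dropped.append(name)
    # psa_iterations="N" becomes training_method="psa" max_iterations="N".
    if "psa_iterations" in new:
        new["training_method"] = "psa"
        new["max_iterations"] = new.pop("psa_iterations")
    return new, dropped


def migrate_tagger_settings(attrs):
    """Return tagger settings with prior_adjst renamed to prior_adjust."""
    new = dict(attrs)
    if "prior_adjst" in new:
        new["prior_adjust"] = new.pop("prior_adjst")
    return new
```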
Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.
In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.
In order to support a more flexible way of specifying partitions
in experiments, we've changed the way partitions are specified in
the experiment XML files. We compare the relevant files below:
Version 1.1:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>
Version 1.2:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>
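Note the following changes: the single split partition is replaced by
named partitions, each with its own fraction; the model set names its
training corpus and partition in a <training_corpus> child element
rather than in a corpus attribute; and the run names its test corpus
and partition in a <test_corpus> child element rather than in a corpus
attribute.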
In order to clarify how task settings are handled in MAT, a
number of changes have been made to the task.xml
file syntax.
First, the <step> element of <step_implementations>
no longer accepts arbitrary attributes. If you made use of this
feature to pass settings to the initialization methods of workflow
steps, you must now use the <create_settings> child element.
We doubt that anyone has made use of this feature.
Second, the <step> element of <workflow> no longer
accepts arbitrary attributes. If you make use of this feature to
pass settings to workflow steps, you must now use the
<create_settings>, <ui_settings>, or
<run_settings> child elements. The most likely situation
where this might arise is in passing defaults to the run methods
of steps. For instance, if you used this feature to increase the
Java heap size for Java Carafe, your task.xml file would have to
be revised as follows.
Version 1.1:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...
Version 1.2:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
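For task.xml files with many workflow steps, the move shown above can be sketched with Python's xml.etree.ElementTree. This is not a MAT utility. The function name is hypothetical, and it assumes that every extra attribute belongs in <run_settings>; attributes that should go to <create_settings> or <ui_settings> would need to be sorted by hand.

```python
import xml.etree.ElementTree as ET


def move_step_attrs_to_run_settings(workflows):
    """Move every non-name attribute of each <step> into a <run_settings> child.

    Assumption: all extra attributes are run settings.
    """
    for step in list(workflows.iter("step")):
        extras = {k: v for k, v in step.attrib.items() if k != "name"}
        if not extras:
            continue
        for key in extras:
            del step.attrib[key]
        ET.SubElement(step, "run_settings", extras)
    return workflows
```

Applied to the Version 1.1 fragment above, the "tag" step keeps only its name and gains a <run_settings heap_size="2G"/> child, while the other steps are left untouched.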
Third, the way settings are specified for model configurations
has changed. The name and class for the configuration are now
separated from the settings which are passed to the model builder,
as follows.
Version 1.1:
...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6"/>
...
Version 1.2:
...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
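The same restructuring can be expressed as a small transformation. The sketch below is illustrative and not part of MAT; to_model_config is a hypothetical name. It keeps name and class on <model_config>, as described above, and moves every other attribute to a <build_settings> child.

```python
import xml.etree.ElementTree as ET

CONFIG_ATTRS = ("name", "class")


def to_model_config(old):
    """Build a <model_config> element from an old <model_build_settings>."""
    config_attrs = {k: v for k, v in old.attrib.items() if k in CONFIG_ATTRS}
    build_attrs = {k: v for k, v in old.attrib.items() if k not in CONFIG_ATTRS}
    config = ET.Element("model_config", config_attrs)
    ET.SubElement(config, "build_settings", build_attrs)
    return config
```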
Finally, the <workflow> element no longer accepts arbitrary
settings; these settings must be passed using the
<ui_settings> child element. No task appears to use this
option yet, so this shouldn't affect anyone.
In order to support a more flexible way of invoking the MAT
engine in experiments, the way the configuration of experiments is
cached has changed in version 1.1. What this means is that you
will not be able to invoke MATExperimentEngine on experiment
directories created using version 1.0 to regenerate the experiment
scores.
In order to support a more flexible way of invoking the MAT
engine in experiments, we've changed the way corpus preprocessing
and test run processing are specified. In version 1.0, the MAT
engine was called as a command-line tool, and the options were
specified as a command line; in version 1.1, the options are
specified as XML attribute-value pairs. We compare the relevant
experiment XML blocks below:
Version 1.0:
<corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>
Version 1.1:
<corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
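If you have many experiment files to convert, the translation from a command line to attribute-value pairs can be sketched in Python. This helper is hypothetical and handles only options of the form "--name value", which is all the examples above use; an option without a value is reported as an error rather than guessed at.

```python
import shlex


def options_to_attributes(command_line):
    """Turn "--name value" pairs into a dict of XML attribute values."""
    tokens = shlex.split(command_line)  # honors the shell-style quoting
    attrs = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("--") or i + 1 >= len(tokens):
            raise ValueError("expected '--name value' at: %r" % token)
        attrs[token[2:]] = tokens[i + 1]
        i += 2
    return attrs
```

The resulting dictionary holds the attribute names and values to place on the <prep> or <args> element.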
Version 1.1 adds the ability to define different training
engines. Because of this change, if you've defined your own task
and you specified model build settings in your task.xml file, you
must add a class attribute to the model_build_settings element.
This attribute is not optional, and there is no default. If you're
using the default Carafe engine, the value you should use for this
attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the
following example:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
Version 1.1 adds the ability to import MAT JSON documents into
your workspaces which haven't yet been processed (as well as other
annotation formats, like XML inline). Because of this change, if
you have a workspace, you must add a directory to it. This
directory is expected by the MAT workspace engine. For each
workspace directory, do this:
% mkdir <workspace_dir>/folders/rich_incoming
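If you have several workspaces, the same step can be scripted. The sketch below is a convenience, not a MAT tool; add_rich_incoming is a hypothetical name, and it assumes that a directory counts as a workspace only if it already has a "folders" subdirectory.

```python
from pathlib import Path


def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace that lacks it.

    Returns the list of directories that were created. Directories
    without a "folders" subdirectory are skipped (assumed not to be
    workspaces).
    """
    created = []
    for ws in workspace_dirs:
        folders = Path(ws) / "folders"
        target = folders / "rich_incoming"
        if folders.is_dir() and not target.exists():
            target.mkdir()
            created.append(target)
    return created
```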
In version 1.1, it's possible to have multiple model build
configurations in your task.xml file. In order to ensure that the
correct configuration adds the appropriate command line options to
the MATModelBuilder executable, it was necessary to introduce a
new restriction on the --task option for MATModelBuilder: if
it appears, it must now be the first command-line option. In other
words, the following will now raise an error:
% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
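The fix is to move --task and its value to the front of the option list. For scripts that assemble MATModelBuilder arguments programmatically, the reordering can be sketched as below; task_option_first is a hypothetical helper and leaves all other options in their original order.

```python
def task_option_first(args):
    """Return args with "--task" and its value moved to the front."""
    args = list(args)
    if "--task" not in args:
        return args
    i = args.index("--task")
    task_pair = args[i:i + 2]  # the option and its value
    return task_pair + args[:i] + args[i + 2:]
```

Applied to the options in the failing example, this produces a list that begins with --task "Named Entity", followed by the remaining options unchanged.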
In version 1.0, the default model was defined within the model
build settings. In version 1.1, because of the presence of
multiple model build configurations, we've separated the
specification of the default model in task.xml.
Version 1.0:
<model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>
Version 1.1:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>
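The two task.xml changes for model build settings in version 1.1 (the required class attribute and the separate default model) can be sketched together. This is an illustration, not a MAT tool; split_default_model is a hypothetical name, and the class value it fills in is only correct if you are using the default Carafe engine.

```python
import xml.etree.ElementTree as ET

# Assumption: the task uses the default Carafe engine.
DEFAULT_CLASS = "MAT.CarafeModelBuilder.CarafeModelBuilder"


def split_default_model(old):
    """Return (model_build_settings, default_model or None) in 1.1 form."""
    attrs = dict(old.attrib)
    default_name = attrs.pop("default_model", None)
    attrs.setdefault("class", DEFAULT_CLASS)
    settings = ET.Element("model_build_settings", attrs)
    default = None
    if default_name is not None:
        default = ET.Element("default_model")
        default.text = default_name
    return settings, default
```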