Upgrade Notes

If you've received a previous version of MAT, this page contains instructions on how to upgrade to the new version.

Upgrading from version 1.1 to version 1.2

New native Windows port

It is now possible to run MAT in Windows without Cygwin installed.

Single distribution bundle for all platforms

Unlike previous versions, MAT 1.2 has a single distribution bundle for all supported platforms. For compatibility with Windows, this bundle is now a zip file.

New tabbed terminal for Windows

If you use mat_controller.sh or mat_controller.bat under Windows, you'll find that there's a new tabbed terminal tool we're using, which has the advantage of not requiring Cygwin.

New version of Terminator.app for MacOS X 10.6

If you're using mat_controller.sh under MacOS X, and you intend to install 10.6, note that the previous version of Terminator.app, which supports the tabbed terminal behavior in mat_controller.sh, will not work in 10.6; you must install the newer version provided with MAT 1.2.

Tokenizer has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by Java reimplementations, and a number of important changes are required as a result. The Java tokenizer produces slightly different token boundaries than the original OCaml tokenizer. This is a problem because most annotation systems, including MAT, are built on the subdivision of text into words (tokens). For optimal performance, the tokenization of documents which are to be automatically tagged should match the tokenization of the documents which were used to create the tagger model. To migrate from version 1.1 to version 1.2, therefore, you must retokenize your documents and update any references to the OCaml tokenizer.

First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back up your data before you run this utility.

Next, if you refer to a tokenization step implementation in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTokenizationStep to MAT.JavaCarafe.CarafeTokenizationStep. You may also need to specify the heap_size attribute on the relevant tokenization <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation).

Trainer/tagger has changed

First, retokenize your documents using MATRetokenize, as described above, and update your tokenization steps.

Next, update your tagger and trainer settings in task.xml according to the documentation provided for the Carafe engine.

Next, if you refer to a tagging step in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTagStep to MAT.JavaCarafe.CarafeTagStep. You may also need to specify the heap_size attribute on the relevant tag <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation). Similarly, if you have a <model_build_settings> entry, you must change all occurrences of MAT.CarafeModelBuilder.CarafeModelBuilder to MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the heap_size attribute as well. (Note below that you must also change the syntax of <model_build_settings>.)
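If you have several task.xml files to update, the three class renames described above can be applied mechanically. The following Python sketch is only an illustration and is not part of MAT: the function name rename_carafe_classes is hypothetical, and it does a plain text substitution, so it knows nothing about which attribute holds the class name in your file.

```python
# Old (version 1.1) class names mapped to their version 1.2 replacements,
# exactly as listed in these upgrade notes.
CLASS_RENAMES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def rename_carafe_classes(task_xml_text):
    """Return the text of a task.xml file with the old class names replaced.

    Hypothetical helper: a plain string substitution, nothing more.
    """
    for old_name, new_name in CLASS_RENAMES.items():
        task_xml_text = task_xml_text.replace(old_name, new_name)
    return task_xml_text
```

Because none of the new names contains an old name, running the function twice is harmless. It does not add heap_size or change the syntax of <model_build_settings>; those steps remain manual. Back up your task.xml before writing the result over it.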

Note that for the tagger, the prior_adjst attribute has been renamed to prior_adjust. For the trainer, the engine and feature_set attributes have been eliminated; if you don't want to use the default feature set, there's a new feature_spec attribute which refers to a file in which you can describe your feature set. Also, the psa_iterations flag has been removed, because the Carafe trainer now offers more options:

psa_iterations="6"

becomes

training_method="psa" max_iterations="6"

Because PSA no longer requires random segments, the no_random_psa_segments flag has been removed.
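The attribute changes above can be summarized as a mapping from old attributes to new ones. The Python sketch below is an illustration under our own assumptions, not a MAT utility: the function names are hypothetical, and the attributes are represented as plain dictionaries rather than read from task.xml.

```python
# Trainer attributes which no longer exist in version 1.2.
REMOVED_TRAINER_ATTRS = ("engine", "feature_set", "no_random_psa_segments")


def migrate_trainer_attributes(old_attrs):
    """Map version 1.1 trainer attributes to version 1.2 (hypothetical helper)."""
    new_attrs = {}
    for name, value in old_attrs.items():
        if name == "psa_iterations":
            # psa_iterations="6" becomes training_method="psa" max_iterations="6".
            new_attrs["training_method"] = "psa"
            new_attrs["max_iterations"] = value
        elif name in REMOVED_TRAINER_ATTRS:
            # Dropped. A non-default feature set has to be described by hand
            # in a file named by the new feature_spec attribute.
            continue
        else:
            new_attrs[name] = value
    return new_attrs


def migrate_tagger_attributes(old_attrs):
    """Rename prior_adjst to prior_adjust; leave everything else untouched."""
    return {("prior_adjust" if name == "prior_adjst" else name): value
            for name, value in old_attrs.items()}
```

Note that the sketch silently drops feature_set. If you relied on a non-default feature set, you still need to write a feature specification file and add the feature_spec attribute yourself.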

Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.

Internals of experiment directories have changed

In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of specifying partitions in experiments, we've changed the way partitions are specified in the experiment XML files. We compare the relevant files below:

Version 1.1:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition split_fraction=".2" ctype="split"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test" corpus="test"/>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test" corpus="test"/>
  </runs>
</experiment>

Version 1.2:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>

Note the following changes:

The <partition> element now explicitly specifies named partitions and their fractions. You are no longer restricted to designating a corpus exclusively as test, exclusively as train, or as a single split.
The remainder of the attributes of the <partition> element have been moved to a new <size> element (not exemplified here).
Multiple corpora, and partitions, can be associated with a single <model_set> or <run>.
The "corpus" attributes of the <model_set> and <run> elements are no longer recognized. The <training_corpus> and <test_corpus> child elements replace them.
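The last change in this list is easy to apply with a script. The Python sketch below is an illustration only, using the standard xml.etree.ElementTree module; the function name and its two partition parameters are our own assumptions. It moves each "corpus" attribute into the corresponding child element, and it assumes that your model sets train on a partition named "train" and your runs test on a partition named "test", as in the example above.

```python
import xml.etree.ElementTree as ET


def move_corpus_attributes(root, train_partition="train", test_partition="test"):
    """Replace corpus attributes on <model_set> and <run> with child elements.

    Hypothetical helper. The partition names are assumptions; pass the
    names you actually declare in your <partition> elements.
    """
    targets = (("model_set", "training_corpus", train_partition),
               ("run", "test_corpus", test_partition))
    for parent_tag, child_tag, partition in targets:
        # list() so that the tree is not modified while it is being walked.
        for element in list(root.iter(parent_tag)):
            corpus = element.attrib.pop("corpus", None)
            if corpus is not None:
                ET.SubElement(element, child_tag,
                              corpus=corpus, partition=partition)
    return root
```

The sketch does not touch the <partition> element itself. You still need to rewrite it by hand as named partitions with fractions, since the right names and fractions depend on your experiment.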

Settings in task.xml have changed

In order to clarify how task settings are handled in MAT, a number of changes have been made to the task.xml file syntax.

First, the <step> element of <step_implementations> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to the initialization methods of workflow steps, you must now use the <create_settings> child element. We doubt that anyone has made use of this feature.

Second, the <step> element of <workflow> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to workflow steps, you must now use the <create_settings>, <ui_settings>, or <run_settings> child elements. The most likely situation where this might arise is in passing defaults to the run methods of steps. For instance, if you used this feature to increase the Java heap size for Java Carafe, your task.xml file would have to be revised as follows.

Version 1.1:

  ...
  <workflows>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag" heap_size="2G"/>
    </workflow>
    ...
  </workflows>
  ...

Version 1.2:

  ...
  <workflows>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag">
        <run_settings heap_size="2G"/>
      </step>
    </workflow>
    ...
  </workflows>
  ...
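If many of your workflow steps carry extra attributes, the revision shown above can be scripted. The following Python sketch is an illustration, not a MAT tool. The function name is hypothetical, and it makes one strong assumption: every attribute other than name is a run setting. If some of your attributes are really creation or UI settings, you must move them to <create_settings> or <ui_settings> by hand.

```python
import xml.etree.ElementTree as ET


def move_step_attributes_to_run_settings(root):
    """Move extra attributes on workflow steps into <run_settings> children.

    Hypothetical helper. Assumes every attribute except "name" belongs
    in <run_settings>.
    """
    # Only steps inside <workflow> are handled; steps inside
    # <step_implementations> need <create_settings> instead.
    for workflow in list(root.iter("workflow")):
        for step in workflow.findall("step"):
            extras = {key: value for key, value in step.attrib.items()
                      if key != "name"}
            if not extras:
                continue
            for key in extras:
                del step.attrib[key]
            run_settings = step.find("run_settings")
            if run_settings is None:
                run_settings = ET.SubElement(step, "run_settings")
            run_settings.attrib.update(extras)
    return root
```

Applied to the version 1.1 fragment above, this leaves the zone and tokenize steps alone and gives the tag step a <run_settings heap_size="2G"/> child.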

Third, the way settings are specified for model configurations has changed. The name and class for the configuration are now separated from the settings which are passed to the model builder, as follows.

Version 1.1:

  ...
  <model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
                        training_method="psa" max_iterations="6"/>
  ...

Version 1.2:

  ...
  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    <build_settings training_method="psa" max_iterations="6"/>
  </model_config>
  ...
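The conversion above amounts to splitting one element's attributes into two groups. The Python sketch below illustrates this under our own assumptions: the function name and the config_attrs parameter are hypothetical. Only the class attribute is known from the example to stay on <model_config>; if your configuration also carries a name attribute, add its attribute name to config_attrs.

```python
import xml.etree.ElementTree as ET


def convert_model_build_settings(old, config_attrs=("class",)):
    """Build a 1.2 <model_config> element from a 1.1 <model_build_settings>.

    Hypothetical helper. Attributes listed in config_attrs stay on
    <model_config>; all others move to the <build_settings> child.
    """
    config = ET.Element("model_config")
    settings = ET.SubElement(config, "build_settings")
    for name, value in old.attrib.items():
        target = config if name in config_attrs else settings
        target.set(name, value)
    return config
```

The function returns a new element and leaves the old one in place, so the caller is responsible for swapping the two in the parent element.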

Finally, the <workflow> element no longer accepts arbitrary settings; these settings must be passed using the <ui_settings> child element. No task appears to use this option yet, so this shouldn't affect anyone.

Upgrading from version 1.0 to version 1.1

Internals of experiment directories have changed

In order to support a more flexible way of invoking the MAT engine in experiments, the way the configuration of experiments is cached has changed in version 1.1. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.0 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of invoking the MAT engine in experiments, we've changed the way corpus preprocessing and test run processing are specified. In version 1.0, the MAT engine was called as a command-line tool, and the options were specified as a command line; in version 1.1, the options are specified as XML attribute-value pairs. We compare the relevant experiment XML blocks below:

Version 1.0:

  <corpora dir="corpora">
    <prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
    [...]
  </corpora>

  <runs dir="runs">
    <run_settings>
      <args>--steps zone,tokenize,tag --workflow Demo</args>
    </run_settings>
    [...]
  </runs>

Version 1.1:

  <corpora dir="corpora">
    <prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
    [...]
  </corpora>

  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    [...]
  </runs>
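If you have many experiment files, the old command-line strings can be turned into attribute-value pairs with a few lines of Python. This is a sketch under a stated assumption, not a MAT utility: it assumes every option in the string is a --name followed by exactly one value, which holds for the examples above but may not hold for every option you used. The function name is hypothetical.

```python
import shlex


def options_to_attributes(command_line):
    """Parse "--name value" pairs from a 1.0 <prep> or <args> string.

    Hypothetical helper. Assumes each option takes exactly one value.
    """
    # shlex.split honors shell quoting, e.g. 'zone,tokenize,align'.
    tokens = shlex.split(command_line)
    if len(tokens) % 2 != 0:
        raise ValueError("expected --name value pairs: %r" % command_line)
    attributes = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError("not an option name: %r" % flag)
        attributes[flag[2:]] = value
    return attributes
```

For the <prep> string in the version 1.0 example, the result contains input_file_type, workflow and steps, which are the attributes shown in the version 1.1 example.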

New training engine configuration in task.xml

Version 1.1 adds the ability to define different training engines. Because of this change, if you've defined your own task and you specified model build settings in your task.xml file, you must add a class attribute to the model_build_settings element. This attribute is not optional, and there is no default. If you're using the default Carafe engine, the value you should use for this attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the following example:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
                        engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6"/>

New folder in workspaces

Version 1.1 adds the ability to import into your workspaces MAT JSON documents which haven't yet been processed (as well as other annotation formats, like XML inline). Because of this change, if you have a workspace, you must add a directory to it. This directory is expected by the MAT workspace engine. For each workspace directory, do this:

% mkdir <workspace_dir>/folders/rich_incoming
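If you maintain many workspaces, the same step can be scripted. The Python sketch below is an illustration only; the function name is hypothetical, and it assumes each workspace already has a folders subdirectory, as the command above implies.

```python
from pathlib import Path


def add_rich_incoming_folder(workspace_dirs):
    """Create folders/rich_incoming in each workspace; return the new paths.

    Hypothetical helper, equivalent to running the mkdir command above
    once per workspace.
    """
    created = []
    for workspace_dir in workspace_dirs:
        target = Path(workspace_dir) / "folders" / "rich_incoming"
        if target.is_dir():
            continue  # This workspace has already been upgraded.
        # No parents=True: a missing folders/ directory suggests that
        # workspace_dir is not a MAT workspace, so let mkdir() raise.
        target.mkdir()
        created.append(target)
    return created
```

Running it a second time on the same workspaces creates nothing and returns an empty list.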

New command line option restriction for MATModelBuilder

In version 1.1, it's possible to have multiple model build configurations in your task.xml file. In order to ensure that the correct configuration adds the appropriate command line options to the MATModelBuilder executable, it was necessary to introduce a new restriction on the --task option for MATModelBuilder: if it appears, it must now be the first command-line option. In other words, the following will now raise an error:

% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
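The invocation above fails only because --task is not the first option. If you build MATModelBuilder command lines from scripts, a small helper can enforce the new ordering. The Python sketch below is an illustration: the function name is hypothetical, and it assumes the option is written as two separate arguments (--task followed by its value), as in the example.

```python
def put_task_option_first(args):
    """Return a copy of args with --task and its value moved to the front.

    Hypothetical helper. Assumes the "--task VALUE" form with the value
    as a separate argument.
    """
    args = list(args)
    if "--task" not in args:
        return args
    index = args.index("--task")
    if index + 1 >= len(args):
        raise ValueError("--task requires a value")
    task_pair = args[index:index + 2]
    # The relative order of all other options is preserved.
    return task_pair + args[:index] + args[index + 2:]
```

For the failing example, the result begins with --task and "Named Entity", followed by the remaining options in their original order.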

Change to how the default model is specified in task.xml

In version 1.0, the default model was defined within the model build settings. In version 1.1, because multiple model build configurations can now be present, we've separated the specification of the default model in task.xml.

Version 1.0:

  <model_build_settings engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6" default_model="default_model"/>

Version 1.1:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
                        engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6"/>
  <default_model>default_model</default_model>
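The two task.xml changes for model building in version 1.1 (the required class attribute and the separate <default_model> element) can be applied together. The Python sketch below is an illustration, not a MAT utility. The function name is hypothetical; it expects to be given whichever element directly contains <model_build_settings> in your file, and it assumes you use the default Carafe engine unless you pass a different class.

```python
import xml.etree.ElementTree as ET

# Class name for the default Carafe engine in version 1.1.
CARAFE_BUILDER = "MAT.CarafeModelBuilder.CarafeModelBuilder"


def split_default_model(parent, builder_class=CARAFE_BUILDER):
    """Upgrade a 1.0 <model_build_settings> child of parent to 1.1 syntax.

    Hypothetical helper. Adds the class attribute if it is missing and
    moves default_model into a sibling <default_model> element.
    """
    settings = parent.find("model_build_settings")
    if settings is None:
        return parent
    settings.attrib.setdefault("class", builder_class)
    default_model = settings.attrib.pop("default_model", None)
    if default_model is not None:
        element = ET.Element("default_model")
        element.text = default_model
        # Place the new element right after <model_build_settings>.
        index = list(parent).index(settings)
        parent.insert(index + 1, element)
    return parent
```

Only the first <model_build_settings> child is handled, which matches version 1.0 files, where a single build configuration carried the default model.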