Upgrade Notes

If you've received a previous version of MAT, this page contains instructions on how to upgrade to the new version.

Upgrading from version 1.3 to version 2.0

New features

All non-automated components now handle annotation attributes: Attributes can be strings, ints, floats, booleans, or other annotations, or set or list aggregations of these types. Strings and ints support choice lists; annotation attributes support type restrictions; ints and floats support range restrictions. All attributes other than annotations support default values.
All non-automated components now handle spanless annotations: These annotations have no direct anchor in the text, and can be used to model relations, coreference entities, and other elements.
Scorer now handles document-internal overlaps: In version 1.3, MATScore did not deal with documents which contained internal content annotation overlaps. In 2.0, we've implemented a sophisticated annotation matching algorithm which addresses this issue.
Scorer is now customizable in task.xml: You can declare which dimensions of annotations will be compared when annotations are compared; how those dimensions will be compared; and what the weight of each dimension is.
New transducer tool now available: In previous versions, if you simply wanted to convert documents from one format to another, you'd use MATEngine. The problem with MATEngine is that it requires a task; it only succeeds if all documents can be processed; and it requires a workflow. There's now a new tool, MATTransducer, which addresses all these issues. The transducer also supports a new XML-driven annotation conversion language.
All tools now support better temporary file management: In 2.0, every command-line tool which invokes a subprocess (e.g., the Carafe tagger) now takes the --tmpdir_root and --preserve_tempfiles options, which give you better control over debugging and the placement of temporary files created during processing.
MAT JSON format has been expanded: To support its expanded annotation and attribute model, MAT 2.0 now uses version 2 of the MAT JSON document format by default. All readers recognize previous versions of the format as well. We've also introduced a new writer, mat-json-v1, to allow users to save documents in the format used by MAT 1.3.
New span reconciliation capability: In MAT 2.0, we've deployed a tool for reconciliation of simple span annotations. This tool will be replaced by a general-purpose reconciliation tool in the next version of MAT.
New standalone document viewer and annotation tool: You can now embed a standalone version of the MAT document viewing component in your own Web application. This viewer can be enabled for hand annotation, and it also supports document comparison.
New capabilities in MATReport: Expanded support for annotation attributes, including the ability to generate per-label expanded report spreadsheets.
Hand annotation now supports adding overlapping annotations: This corrects an enormous deficiency in previous versions of MAT. During annotation, overlapping annotations are stacked vertically, ensuring that they're visible.

No Cygwin support

Support for Cygwin has been removed, because Python in Cygwin does not support sqlite, and sqlite is required for the MAT workspaces in 2.0. If you've been running MAT under Cygwin, migrate to the native Windows port.

Python 2.6 or later required

MAT 2.0 makes extensive use of JSON and sqlite, which are best supported in Python 2.6 or 2.7. It also relies on Python's "with" statement, which is enabled by default starting in 2.6.

Task.xml schema has changed

Because MAT now explicitly defines the annotations and well-formedness conditions for attributes separately from its display information, the task.xml file has been reorganized. You can use the MATUpdateTaskXML tool to update your task.xml file automatically.

The <tags> element has been replaced by the <annotation_set_descriptors> and <annotation_display> elements. These new elements are quite different from the old ones. If you're receiving MAT as a zip file distribution with tasks included, your tasks have already been updated.
Because the UI has been completely redesigned, the <web_customization> element no longer accepts the default_tag_window_position and default_tag_window_size attributes.
The tagging_step attribute of <step_implementations> is no longer accepted (or needed).
Because of changes in the implementation of workspaces, the tagprep operation has been replaced by the import operation, and the list of steps required for the tag operation is now only "tag".

All models must be rebuilt (new version of jCarafe)

The version of jCarafe which is delivered with MAT 2.0 is 0.9.8.5.b-06, which has a different model structure than the version delivered with 1.3. You must rebuild all your models, either using MATModelBuilder (in file mode) or the "modelbuild" operation of MATWorkspaceEngine (in workspace mode).

UI has been completely reorganized, with a new URL

The 1.3 UI used a desktop-in-a-browser metaphor, which raised a number of issues, including poor use of screen real estate. In 2.0, we've completely reorganized the UI, and changed the URL.

mat_controller.sh is replaced by the --spawn_tabbed_terminal option of MATWeb

In previous releases, you really didn't have the option to pass any command-line options to the MATWeb server running under the tabbed terminal. As the command-line options to MATWeb expanded, and became more important, this turned out to be a bad idea. As a result, we've now reorganized the tabbed terminal startup so that it's part of MATWeb. The mat_controller.sh application is gone. The Windows mat_controller.bat script is still present, but it simply invokes MATWeb with the --spawn_tabbed_terminal option.

Workspaces have been completely reorganized

We have completely reorganized the internal structure of workspaces for 2.0. These new workspaces are more powerful and impose fewer requirements on the user. Your MAT 1.3 workspaces cannot be used with MAT 2.0 without modification. We've provided an upgrade tool which will allow you to convert your MAT 1.3 workspaces to MAT 2.0.

The new workspaces feature many fewer folders; a SQLite database which manages the document state information; real transaction and file locking; document assignment, potentially to multiple annotators; extensive logging capabilities; and infrastructure for future capabilities like reconciliation and complex reconciliation workflows, prioritization queues, and segment-by-segment annotation.

As a result of this change, it's no longer possible to run an experiment against a workspace by pointing to, e.g., the "completed" folder. So as part of this change, there's now special support for running experiments against workspaces, both from MATWorkspaceEngine and MATExperimentEngine.

Scorer output ranges have changed

In version 1.3, recall, precision and f-measure were all scaled from 0 to 100. In 2.0, they're scaled from 0 to 1.
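
If you have scripts which compare new scores against numbers saved from 1.3 runs, you'll need to put them on the same scale first. Here's a minimal sketch; the function name and the sample values are made up for illustration, and nothing here is part of MAT itself.

```python
def rescale_legacy_score(value):
    """Convert a MAT 1.3 score (0 to 100) to the MAT 2.0 scale (0 to 1)."""
    if not 0.0 <= value <= 100.0:
        raise ValueError("expected a 1.3-style score between 0 and 100")
    return value / 100.0


# Illustrative values only; these are not real scorer output.
legacy = {"precision": 87.5, "recall": 80.0}
rescaled = dict((name, rescale_legacy_score(v)) for name, v in legacy.items())
```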

CSV spreadsheet management in MATScore and MATExperimentEngine has changed

MATScore and MATExperimentEngine have long supported writing one of three CSV file formats (Excel formulas, OpenOffice formulas, and no formulas). In 2.0, you can now write multiple formats in the same run, and the name of each CSV file clearly indicates the formula type. As a result, the --no_csv_formulas and --oo_separator command-line options have been removed, and replaced with --csv_formula_output.

MATScore --tag_span_details renamed

Because the scorer now provides mismatch details for all conditions, this flag has been renamed to --tag_output_mismatch_details.

MATScore spreadsheet output has changed

Due to enhancements to the scorer, some of the columns in the output spreadsheets have been renamed or moved, and others have a slightly different interpretation. See the scorer documentation for full details.

Command-line options to MATWorkspaceEngine have changed

In previous releases, we deprecated, but retained, the "operate" operation in MATWorkspaceEngine. This operation has finally been removed in 2.0. If you had still been doing something like this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory operate core modelbuild

you should now do this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory modelbuild core

See the workspace documentation for more details.
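
If you have wrapper scripts which build MATWorkspaceEngine argument lists, the rewrite shown above can be sketched as follows. This assumes that every old invocation has the shape "<workspace> operate <folder> <operation>", as in the example; the helper name is hypothetical, and you should check the workspace documentation before relying on it for other operations.

```python
def drop_operate(args):
    """Rewrite [workspace, 'operate', folder, operation, ...] as
    [workspace, operation, folder, ...]; leave anything else untouched."""
    if len(args) >= 4 and args[1] == "operate":
        workspace, _, folder, operation = args[:4]
        return [workspace, operation, folder] + list(args[4:])
    return list(args)


new_args = drop_operate(["mydirectory", "operate", "core", "modelbuild"])
```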

Upgrading from version 1.2 to version 1.3

Web server security has been improved

We now provide a separate document on Web server security as it pertains to workspace access. There are a number of new options to MATWeb to support improved security. The most visible effect is that you can restrict access to workspaces from the MAT UI by using the --workspace_container_directory option when you start up the MAT Web server.

Attribute to control right-to-left text display has moved

In version 1.2, the text_right_to_left attribute lived on the workflow element in task.xml; we anticipated that different workflows might be used for different languages within the same task. Since then, we've realized that the task is going to be the appropriate level of encapsulation for language differences for the foreseeable future. Furthermore, the current implementation of right-to-left encoding did not work appropriately with workspaces. Accordingly, we've moved this attribute to the web_customization element, and it is now global to tasks.

Corpus size iteration has changed

The experiment engine has now been extended with general-purpose iterators for sets of values and for value increments. So it's now possible, for instance, to vary the number of model iterations from 20 to 100 by increments of 10 without having to write a separate model set specification for each possible value. These iterators can be combined, in which case you'll get the cross-product of the possible value settings, or you can define your own iterators to get more sophisticated behavior (e.g., iterating over pairs of attribute-value sets). For the user, this means that a couple of attributes have been removed from the experiment engine, and a new set of elements and attributes has been added.
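
To make the cross-product behavior concrete, here's a rough Python sketch of what combining an increment iterator with a value-set iterator amounts to. It illustrates the idea only and is not the experiment engine's code; the second training method name is a placeholder we made up.

```python
from itertools import product


def increment_values(start, end, step):
    """All values from start to end (inclusive) in steps of step."""
    return list(range(start, end + 1, step))


# Vary the number of model iterations from 20 to 100 by increments of 10.
max_iterations = increment_values(20, 100, 10)

# A second, set-valued iterator; "hypothetical_method" is a placeholder.
training_methods = ["psa", "hypothetical_method"]

# Combining the two iterators yields one setting per pair of values.
settings = [dict(max_iterations=m, training_method=t)
            for m, t in product(max_iterations, training_methods)]
```

Nine iteration counts combined with two methods produce eighteen model settings, which is why combined iterators can make an experiment grow quickly.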

In version 1.2, all you could iterate on was corpus size. The mechanism for this iteration has now changed. In version 1.2, this is what you'd do:

  [...]
  <model_sets dir="model_sets">
    <build_settings training_increment="4" 
                    truncate_to_increment="yes"/>
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  [...]

In version 1.3, it looks like this instead:

  [...]
  <model_sets dir="model_sets">
    <corpus_settings>
      <iterator type="corpus_size" increment="4"/>
    </corpus_settings>
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  [...]

You can see that the size processing has been removed from the <build_settings> and added to a new <corpus_settings> element, which contains an instance of the new <iterator> element to specify the type of the iteration. See the documentation and examples for the experiment engine for more details. Note that in version 1.2, you had to specify explicitly that the iteration ends on an increment exactly; in 1.3 this is the default, and to force the final corpus size to be used, you'll need the force_last attribute:

  [...]
  <model_sets dir="model_sets">
    <corpus_settings>
      <iterator type="corpus_size" increment="4" force_last="yes"/>
    </corpus_settings>
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  [...]
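
Our reading of the difference between the default behavior and force_last can be sketched like this. The function is hypothetical and only models the sequence of corpus sizes described above; it is not taken from the experiment engine.

```python
def corpus_sizes(total, increment, force_last=False):
    """Training corpus sizes for a corpus of `total` documents.

    By default the sequence stops at the last exact multiple of
    `increment`; with force_last, the full corpus size is appended
    when it isn't already the final value.
    """
    sizes = list(range(increment, total + 1, increment))
    if force_last and (not sizes or sizes[-1] != total):
        sizes.append(total)
    return sizes


default_sizes = corpus_sizes(10, 4)                  # [4, 8]
forced_sizes = corpus_sizes(10, 4, force_last=True)  # [4, 8, 10]
```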

Experiment spreadsheet columns have been expanded

The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.

Experiment directory structure has changed

In order to support the iterators in the experiment engine, we've reorganized the structure of the experiment directory somewhat. See the documentation on MATExperimentEngine for details.

Upgrading from version 1.1 to version 1.2

New native Windows port

It is now possible to run MAT in Windows without Cygwin installed.

Single distribution bundle for all platforms

Unlike previous versions, there is a single distribution bundle for MAT 1.2 for all supported platforms. For compatibility with Windows, this bundle is now a zip file.

New tabbed terminal for Windows

If you use mat_controller.sh or mat_controller.bat under Windows, you'll find that there's a new tabbed terminal tool we're using, which has the advantage of not requiring Cygwin.

New version of Terminator.app for MacOS X 10.6

If you're using mat_controller.sh under MacOS X, and you intend to install 10.6, note that the previous version of Terminator.app, which supports the tabbed terminal behavior in mat_controller.sh, will not work in 10.6; you must install the newer version provided with MAT 1.2.

Tokenizer has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by the Java reimplementations. There are a number of important changes that are required as a result. Among other things, the Java tokenizer produces slightly different token boundaries than the original OCaml tokenizer. This is problematic because the entire basis of most annotation systems, including MAT, is the subdivision into words (tokens). In order to have optimal performance, the tokenization of documents which are to be automatically tagged should match the tokenization of the documents which were used to create the tagger model. This means that in order to migrate from version 1.1 to version 1.2, among other things, you must retokenize your documents and update any references to the OCaml tokenizer.

First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back up your data before you run this utility.

Next, if you refer to a tokenization step implementation in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTokenizationStep to MAT.JavaCarafe.CarafeTokenizationStep. You may also need to specify the heap_size attribute on the relevant tokenization <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation).

Trainer/tagger has changed

First, retokenize your documents using MATRetokenize, as described above, and update your tokenization steps.

Next, update your tagger and trainer settings in task.xml according to the documentation provided for the Carafe engine.

Next, if you refer to a tagging step in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTagStep to MAT.JavaCarafe.CarafeTagStep. You may also need to specify the heap_size attribute on the relevant tag <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation). Similarly, if you have a <model_build_settings> entry, you must change all occurrences of MAT.CarafeModelBuilder.CarafeModelBuilder to MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the heap_size attribute as well. (Note below that you must also change the syntax of <model_build_settings>.)
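
If you have many task.xml files to update, the three class renames above can be applied with a simple text substitution. This is a sketch, assuming the old class names appear only where you want them replaced; the function name is ours, and you should back up your files first.

```python
# Old class name -> new class name, as listed in the text above.
RENAMED_CLASSES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def rename_classes(task_xml_text):
    """Return the task.xml text with the 1.1 class names replaced."""
    for old, new in RENAMED_CLASSES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text
```

You would read each task.xml file, pass its contents through rename_classes, and write the result back. The heap_size attribute and the other settings changes still need to be handled by hand.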

Note that for the tagger, the prior_adjst attribute has been renamed to prior_adjust. For the trainer, the engine attribute has been eliminated, and the feature_set attribute as well; there's now a new feature_spec attribute which refers to a file in which you can describe your feature set, if you don't want to use the default feature set. Also, the psa_iterations flag has been removed, due to more numerous options in the Carafe trainer:

psa_iterations="6"

becomes

training_method="psa" max_iterations="6"

Because PSA no longer requires random segments, the no_random_psa_segments flag has been removed.
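
The trainer attribute changes can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under our own assumptions: it simply drops feature_set, which means you'd get the default feature set unless you add a feature_spec yourself, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def update_trainer_attributes(element):
    """Apply the 1.2 trainer attribute changes to one settings element, in place."""
    iterations = element.attrib.pop("psa_iterations", None)
    if iterations is not None:
        # psa_iterations="N" becomes training_method="psa" max_iterations="N".
        element.set("training_method", "psa")
        element.set("max_iterations", iterations)
    # These attributes are no longer accepted by the trainer.
    for removed in ("engine", "feature_set", "no_random_psa_segments"):
        element.attrib.pop(removed, None)
    return element


old = ET.fromstring('<model_build_settings engine="anonTrain.native" '
                    'feature_set="ANON-1" psa_iterations="6"/>')
new = update_trainer_attributes(old)
```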

Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.

Internals of experiment directories have changed

In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of specifying partitions in experiments, we've changed the way partitions are specified in the experiment XML files. We compare the relevant files below:

Version 1.1:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition split_fraction=".2" ctype="split"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test" corpus="test"/>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test" corpus="test"/>
  </runs>
</experiment>

Version 1.2:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>

Note the following changes:

The <partition> element now explicitly specifies named partitions and their fractions. You are no longer restricted to designating a corpus exclusively as test, exclusively as train, or as a single split.
The remainder of the attributes of the <partition> element have been moved to a new <size> element (not exemplified here).
Multiple corpora, and partitions, can be associated with a single <model_set> or <run>.
The "corpus" attributes of the <model_set> and <run> elements are no longer recognized. The <training_corpus> and <test_corpus> child elements replace them.
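
The last change in this list can be sketched as a small conversion using Python's standard xml.etree.ElementTree module. It assumes that your partitions are named "train" and "test", as in the 1.2 example above, and it does not convert the <partition> element itself; the function names are ours.

```python
import xml.etree.ElementTree as ET


def move_corpus_attribute(parent, child_tag, partition):
    """Replace parent's corpus attribute with a child element naming the corpus."""
    corpus = parent.attrib.pop("corpus", None)
    if corpus is not None:
        ET.SubElement(parent, child_tag, corpus=corpus, partition=partition)


def convert_experiment(root):
    """Convert <model_set> and <run> elements from 1.1 to 1.2 style, in place."""
    for model_set in list(root.iter("model_set")):
        move_corpus_attribute(model_set, "training_corpus", "train")
    for run in list(root.iter("run")):
        move_corpus_attribute(run, "test_corpus", "test")
    return root
```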

Settings in task.xml have changed

In order to clarify how task settings are handled in MAT, a number of changes have been made to the task.xml file syntax.

First, the <step> element of <step_implementations> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to the initialization methods of workflow steps, you must now use the <create_settings> child element. We doubt that anyone has made use of this feature.

Second, the <step> element of <workflow> no longer accepts arbitrary attributes. If you make use of this feature to pass settings to workflow steps, you must now use the <create_settings>, <ui_settings>, or <run_settings> child elements. The most likely situation where this might arise is in passing defaults to the run methods of steps. For instance, if you used this feature to increase the Java heap size for Java Carafe, your task.xml file would have to be revised as follows.

Version 1.1:

  ...
  <workflows>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag" heap_size="2G"/>
    </workflow>
    ...
  </workflows>
  ...

Version 1.2:

  ...
  <workflows>
    <workflow name="Demo" hand_annotation_available_at_end="yes">
      <step name="zone"/>
      <step name="tokenize"/>
      <step name="tag">
        <run_settings heap_size="2G"/>
      </step>
    </workflow>
    ...
  </workflows>
  ...
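
The same revision can be sketched programmatically with Python's standard xml.etree.ElementTree module. This sketch assumes that "name" is the only attribute a workflow <step> keeps and that every other attribute belongs in <run_settings>; if some of your attributes are really creation or UI settings, you'd move them to <create_settings> or <ui_settings> instead. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def move_step_settings(workflow, settings_tag="run_settings"):
    """Move extra attributes of each <step> into a settings child element, in place."""
    for step in workflow.findall("step"):
        extras = dict((k, v) for k, v in step.attrib.items() if k != "name")
        if extras:
            for key in extras:
                del step.attrib[key]
            ET.SubElement(step, settings_tag, extras)
    return workflow
```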

Third, the way settings are specified for model configurations has changed. The name and class for the configuration are now separated from the settings which are passed to the model builder, as follows.

Version 1.1:

  ...
  <model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
                        training_method="psa" max_iterations="6"/>
  ...

Version 1.2:

  ...
  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    <build_settings training_method="psa" max_iterations="6"/>
  </model_config>
  ...

Finally, the <workflow> element no longer accepts arbitrary settings; these settings must be passed using the <ui_settings> child element. No task appears to use this option yet, so this shouldn't affect anyone.

Upgrading from version 1.0 to version 1.1

Internals of experiment directories have changed

In order to support a more flexible way of invoking the MAT engine in experiments, the way the configuration of experiments is cached has changed in version 1.1. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.0 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of invoking the MAT engine in experiments, we've changed the way corpus preprocessing and test run processing are specified. In version 1.0, the MAT engine was called as a command-line tool, and the options were specified as a command line; in version 1.1, the options are specified as XML attribute-value pairs. We compare the relevant experiment XML blocks below:

Version 1.0:

  <corpora dir="corpora">
    <prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
    [...]
  </corpora>

  <runs dir="runs">
    <run_settings>
      <args>--steps zone,tokenize,tag --workflow Demo</args>
    </run_settings>
    [...]
  </runs>

Version 1.1:

  <corpora dir="corpora">
    <prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
    [...]
  </corpora>

  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    [...]
  </runs>
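
If you have many experiment files to convert, the mapping from the old command-line strings to attribute-value pairs can be sketched as follows. This assumes that every option in your <prep> and <args> strings takes exactly one value, as in the examples above; options without values would need separate handling. The function name is ours.

```python
import shlex


def options_to_attributes(command_line):
    """Turn "--name value --name2 value2" into a dictionary of attributes."""
    tokens = shlex.split(command_line)  # honors shell-style quoting
    if len(tokens) % 2 != 0:
        raise ValueError("expected every option to be followed by a value")
    attributes = {}
    for name, value in zip(tokens[0::2], tokens[1::2]):
        if not name.startswith("--"):
            raise ValueError("not an option: %r" % name)
        attributes[name[2:]] = value
    return attributes


prep = options_to_attributes(
    "--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'")
```

The resulting dictionary holds the attribute names and values for the version 1.1 <prep> element shown above.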

New training engine configuration in task.xml

Version 1.1 adds the ability to define different training engines. Because of this change, if you've defined your own task and you specified model build settings in your task.xml file, you must add a class attribute to the model_build_settings element. This attribute is not optional, and there is no default. If you're using the default Carafe engine, the value you should use for this attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the following example:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
                        engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6"/>

New folder in workspaces

Version 1.1 adds the ability to import into your workspaces MAT JSON documents which haven't yet been processed (as well as other annotation formats, like XML inline). Because of this change, if you have a workspace, you must add a directory to it. This directory is expected by the MAT workspace engine. For each workspace directory, do this:

% mkdir <workspace_dir>/folders/rich_incoming
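
If you have a number of workspaces, the same step can be scripted. Here's a minimal sketch; the function name is hypothetical. Note that, unlike the plain mkdir above, os.makedirs will also create the "folders" directory if it's missing, so make sure the paths you pass really are workspaces.

```python
import os


def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace; return the paths created."""
    created = []
    for workspace in workspace_dirs:
        target = os.path.join(workspace, "folders", "rich_incoming")
        if not os.path.isdir(target):
            os.makedirs(target)
            created.append(target)
    return created
```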

New command line option restriction for MATModelBuilder

In version 1.1, it's possible to have multiple model build configurations in your task.xml file. In order to ensure that the correct configuration adds the appropriate command line options to the MATModelBuilder executable, it was necessary to introduce a new restriction on the --task option for MATModelBuilder: if it appears, it must now be the first command-line option. In other words, the following will now raise an error:

% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
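
The fix is to move --task and its value to the front of the option list. If you generate MATModelBuilder command lines from a script, that reordering can be sketched like this; the helper is hypothetical and assumes --task is followed by exactly one value.

```python
def task_first(args):
    """Return a copy of args with --task and its value moved to the front."""
    args = list(args)
    if "--task" in args:
        index = args.index("--task")
        task = args[index:index + 2]
        del args[index:index + 2]
        args = task + args
    return args


reordered = task_first([
    "--input_files", "/path/to/my/docs/1[0-9][0-9].json",
    "--input_dir", "/path/to/my/other/docs",
    "--task", "Named Entity",
    "--lexicon_dir", "/path/to/my/lexicon/",
    "--save_as_default_model",
])
```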

Change to how the default model is specified in task.xml

In version 1.0, the default model was defined within the model build settings. In version 1.1, because of the presence of multiple model build configurations, we've separated the specification of the default model in task.xml.

Version 1.0:

  <model_build_settings engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6" default_model="default_model"/>

Version 1.1:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
                        engine="anonTrain.native" feature_set="ANON-1"
                        psa_iterations="6"/>
  <default_model>default_model</default_model>
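
This change, together with the new required class attribute, can be sketched with Python's standard xml.etree.ElementTree module. The function name is ours, and the sketch assumes you pass in whichever element directly contains <model_build_settings> and that you're using the default Carafe engine.

```python
import xml.etree.ElementTree as ET


def split_default_model(parent,
                        builder_class="MAT.CarafeModelBuilder.CarafeModelBuilder"):
    """Update each <model_build_settings> child of parent to 1.1 style, in place."""
    for child in list(parent):
        if child.tag != "model_build_settings":
            continue
        # The class attribute is required in 1.1; keep it if already present.
        child.attrib.setdefault("class", builder_class)
        default = child.attrib.pop("default_model", None)
        if default is not None:
            # The default model becomes a sibling element right after the settings.
            element = ET.Element("default_model")
            element.text = default
            parent.insert(list(parent).index(child) + 1, element)
    return parent
```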