Tutorial 2: Build a Model

Now that you've completed Tutorial 1, let's move on to how you might use your tagged documents to build a model. We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed. Where Tutorial 1 involved the UI, this tutorial (and the next one) uses one of the command-line tools. As in Tutorial 1, we're going to work in file mode. Because this tutorial involves the command line, make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.

As we saw in Tutorial 1, the sample task contains ten raw ASCII files in the directory MAT_PKG_HOME/sample/ne/resources/data/raw. The sample task also contains annotated versions of these files, in MAT_PKG_HOME/sample/ne/resources/data/json. (These files aren't necessarily correctly annotated; we prepared them using an automated tagger, and haven't corrected them. But that's not particularly important for this exercise.) Rather than ask you to hand-annotate all ten of these documents, we'll use the already-annotated versions to build a model.
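
Before building, it can be worth confirming that the sample data is where we expect it. The following is a minimal sketch, not part of MAT: count_files is a hypothetical helper name, and it simply counts the regular files directly inside a directory.

```shell
# Hypothetical helper (not part of MAT): count the regular files
# directly inside the directory given as $1 (non-recursive).
# The body runs in a subshell, so n and f stay local to the function.
count_files() (
    n=0
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)

# Usage, assuming MAT_PKG_HOME is set as in the Unix examples:
#   count_files "$MAT_PKG_HOME/sample/ne/resources/data/raw"
#   count_files "$MAT_PKG_HOME/sample/ne/resources/data/json"
```

If the sample task is intact, the first call should report the ten raw files described above, and we'd expect the second to report the same number of annotated files.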

The tool we're going to use here is MATModelBuilder.

Step 1: Build a model, version 1

In a shell:

Unix:

$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Named Entity' --model_file /tmp/ne_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Named Entity" --model_file %TMP%\ne_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"

Each call to the model builder requires a task, just as the UI required in Tutorial 1. The --model_file directive tells the tool where to save the model, and the --input_files directive tells the tool which files to use. There are many other arguments available to this tool; see the tool documentation for more details.
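
Note that the --input_files pattern is quoted in the commands above, so the shell passes it to the tool unexpanded. Once the build finishes, a quick check that something was actually written to the --model_file location can save confusion later. The sketch below is not part of MAT: check_model is a hypothetical helper, and it assumes the model is saved as a single non-empty file.

```shell
# Hypothetical helper (not part of MAT): succeed only if the path in $1
# exists and is non-empty. Assumes the model is written as a single file.
check_model() {
    if [ -s "$1" ]; then
        echo "model written: $1"
    else
        echo "model missing or empty: $1" >&2
        return 1
    fi
}

# Usage, after the Unix command above has finished:
#   check_model /tmp/ne_model
```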

When you run this, you should see something like the following output:

===== Initiating Stochastic Gradient Descent Training with Periodic Stepsize Adjustment (PSA) =====
batch_size = 1
period_size = 10
initial learning rate = 0.100000
eta (adjustment) range = 0.990000 to 0.999900
-------------------------
number of parameters = 39031
number of training seqs = 175
======================================================================================================

........Epoch 1 complete (of 6)
.........Epoch 2 complete (of 6)
.........Epoch 3 complete (of 6)
.........Epoch 4 complete (of 6)
........Epoch 5 complete (of 6)
.........Epoch 6 complete (of 6)

The default behavior of the model builder is specified in the task.xml file associated with this task. This task is configured to use periodic stepsize adjustment, which is significantly faster than the normal training mechanism, but which also requires the model builder to ensure that each document is segmented into "sentence-sized" chunks. This segmentation is temporary, and is used only within the model builder.
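
If you want to confirm that training ran through every epoch, the "Epoch N complete" lines in the output shown above can be counted. This is a minimal sketch under assumptions: count_epochs is a hypothetical helper, the pattern is taken from the sample output in this tutorial, and we don't know whether the tool writes its progress to stdout or stderr, so the usage line merges both.

```shell
# Hypothetical helper (not part of MAT): read builder output on stdin and
# print how many lines report a completed epoch. The pattern is based on
# the sample output in this tutorial and may need adjusting.
count_epochs() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -cE 'Epoch [0-9]+ complete \(of [0-9]+\)' || true
}

# Usage (Unix), merging stderr into stdout because the output stream
# used by the tool is an assumption here:
#   bin/MATModelBuilder --task 'Named Entity' --model_file /tmp/ne_model \
#     --input_files "$PWD/sample/ne/resources/data/json/*.json" 2>&1 | count_epochs
```

With the output shown above, the count would match the total in the "(of 6)" suffix.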

We've successfully built a model, but we're not going to use it quite yet.

Step 2: Build a model, version 2

Our task has also been configured, in the task.xml file, to recognize the location of a default model. The default model is a location that the MAT tools check by default when they look for a model in file mode. It is usually a relative pathname that points into the directory which contains the task.xml file, or into one of its descendants. You have the option of overwriting the default model when you call MATModelBuilder, using the --save_as_default_model option in place of --model_file. Let's do that, so we can make use of the default model in the next tutorial.

In a shell:

Unix:

$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Named Entity' --save_as_default_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Named Entity" --save_as_default_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"

The output you see should be similar to the output in Step 1.

Step 3: Clean up (optional)

If you're not planning on doing any other tutorials, and you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.

This concludes Tutorial 2.