Now that we've covered file mode
in the first five tutorials, we're going to address workspace mode. In workspace mode,
you don't have nearly as much control over
On the other hand, you don't need to worry about any of those
things, either.
We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
The only way to create a workspace is on the command line. We use MATWorkspaceEngine. The first
argument of MATWorkspaceEngine is the path of the affected workspace,
and the second argument is the operation. Options and arguments for the
chosen operation follow.
Creating a workspace requires a task, so we provide the --task
directive:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace create --task 'Named Entity'
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace create --task "Named Entity"
Created workspace for task 'Named Entity' in directory /tmp/ne_workspace.
You now have a workspace in the specified directory. If you're
interested in the structure of a workspace, look here.
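The argument order described above (workspace path first, then the operation, then that operation's own options and arguments) can be sketched as a small Python helper. This is only an illustration: the name workspace_command is hypothetical and not part of MAT, and the sketch builds the argument list without running anything.

```python
def workspace_command(workspace_dir, operation, *op_args,
                      engine="MATWorkspaceEngine"):
    """Build an argument list in the order the tutorial describes.

    engine is a placeholder; on a real installation it would be the
    full path to the script under $MAT_PKG_HOME/bin.
    """
    # Workspace path first, operation second, then the operation's
    # own options and arguments.
    return [engine, workspace_dir, operation, *op_args]


create_cmd = workspace_command("/tmp/ne_workspace", "create",
                               "--task", "Named Entity")
print(create_cmd)
```

Keeping the command as a list means the task name "Named Entity" stays a single argument without any shell quoting. With MAT installed, such a list could be handed to Python's subprocess.run.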
Workspaces organize files by putting them in folders. The three
folders we'll be concerned with in this tutorial are:
We'll begin by importing a single raw file.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" %CD%\sample\ne\resources\data\raw\voa2.txt
So here we use the "import" operation, which takes two arguments:
the folder name ("raw, unprocessed") and the file to import.
We've also used the --strip_suffix directive to modify the name by
which the workspace knows the file: with the ".txt" suffix stripped,
"voa2.txt" is known as "voa2". We can see the contents of the
workspace, or of a single folder, with the "list" operation:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list "raw, unprocessed"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list "raw, unprocessed"
raw, unprocessed:
voa2
If you try to import the file again, you'll get an error:
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample\ne\resources\data\raw\voa2.txt
Basename for sample/ne/resources/data/raw/voa2.txt already exists in workspace; not importing
In other words, once you create a particular basename in the
workspace using the "import" operation, you can't do it again.
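The relationship between --strip_suffix, the basename, and the rejected second import can be sketched in a few lines of Python. This is a guess at the logic based only on the behavior shown above, not MAT's actual code; workspace_basename and try_import are hypothetical names.

```python
import os


def workspace_basename(path, strip_suffix=""):
    """Derive the name the workspace would use for a file (assumed logic)."""
    name = os.path.basename(path)
    if strip_suffix and name.endswith(strip_suffix):
        name = name[:-len(strip_suffix)]
    return name


def try_import(known_basenames, path, strip_suffix=""):
    """Record the basename; return False if it is already present."""
    base = workspace_basename(path, strip_suffix)
    if base in known_basenames:
        return False  # mirrors the "already exists in workspace" refusal
    known_basenames.add(base)
    return True


known = set()
first = try_import(known, "sample/ne/resources/data/raw/voa2.txt", ".txt")
second = try_import(known, "sample/ne/resources/data/raw/voa2.txt", ".txt")
print(first, second, sorted(known))
```

Under these assumptions the first call succeeds and the second is refused, because both derive the same basename, "voa2".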
In this step, we're going to learn about the UI aspects of the
workspace.
First, see the documentation on starting the Web server and starting
the UI. We'll assume that you're running one of the tabbed terminal
applications. In the first pane, you should see something like:
Web server started on port 7801.
Web server command loop. Commands are:
exit - exit the command loop and stop the Web server
loopexit - exit the command loop, but leave the Web server running
taggerexit - shut down the tagger service, if it's running
restart - restart the Web server
ws_key - show the workspace key
help, ? - this message
Workspace key is XJ9dGBaCNveYHk9CZzw6wTM5WH8x05y1
Command:
Note the workspace key. This key is randomly generated, and known
only to the user who starts the Web server. This key must be provided
to the UI when the user opens the workspace. This simple security
feature ensures that the Web server modifies the workspace only if
the UI user has proved that s/he has the appropriate access.
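The general idea of a randomly generated key that the UI user must present can be sketched as follows. This is not how MAT generates or checks its key; it is a minimal illustration using Python's standard library, and the function names are hypothetical. The length of 32 matches the sample key shown above.

```python
import hmac
import secrets
import string

# Letters and digits, matching the look of the sample key.
KEY_ALPHABET = string.ascii_letters + string.digits


def generate_workspace_key(length=32):
    """Return a random alphanumeric key from a cryptographic source."""
    return "".join(secrets.choice(KEY_ALPHABET) for _ in range(length))


def key_matches(expected, supplied):
    """Compare two keys in constant time."""
    return hmac.compare_digest(expected, supplied)


server_key = generate_workspace_key()
print(key_matches(server_key, server_key))     # the user who saw the key
print(key_matches(server_key, "not-the-key"))  # anyone guessing
```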
Next:
You should see a window that looks like this:
Select "raw, unprocessed" from the folder menu. You should now see
this:
A single left click on the file name in the workspace window should
open the file:
Note how this file window differs from the one in file mode:
Operations make changes to files and move them around the workspace.
For instance, the "Prepare for hand tagging" operation removes a
document from the "raw, unprocessed" folder, applies the appropriate
engine steps, and saves it in the "in process" folder, at which point
the document is ready for hand tagging.
Make sure that the operations menu says "Prepare for hand tagging"
and press "Go". Your display should now look like this:
Note that the name of the folder in the file window has changed, and
the list of available operations has changed. Note, too, that the
workspace pane now shows that the "raw, unprocessed" folder is empty.
If you were to switch the folder using the folder menu to "in process",
you'd find this document there.
At this point, you can annotate your document as you did in Tutorial 1. If you want to leave the
workspace without finishing your annotation, just select the Save
operation in the operations menu and press Go; you can always return to
the document. Once you're satisfied with your annotations, select "Mark
completed" in the operations menu and press Go; your document will be
saved and moved to the completed folder.
You'd typically annotate several documents in the first round before
building a model, but we want to move directly to that step. Since we
only have one hand-annotated document at the moment, we're going to
import some other documents into the workspace: some of the annotated
documents that come with the Named Entity task go into the completed
folder, and one more raw document goes into the "raw, unprocessed"
folder.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa1.txt
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt.json" \
"completed" sample/ne/resources/data/json/voa[3-9].txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample\ne\resources\data\raw\voa1.txt
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt.json" \
"completed" sample\ne\resources\data\json\voa3.txt.json \
sample\ne\resources\data\json\voa4.txt.json \
sample\ne\resources\data\json\voa5.txt.json \
sample\ne\resources\data\json\voa6.txt.json \
sample\ne\resources\data\json\voa7.txt.json \
sample\ne\resources\data\json\voa8.txt.json \
sample\ne\resources\data\json\voa9.txt.json
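The Unix command relies on the shell to expand voa[3-9].txt.json, while the Windows command prompt does not expand such patterns, which is why the Windows version names every file. If typing the list by hand is tedious, it can be generated; the sketch below does so in Python. The helper name numbered_documents is hypothetical.

```python
from pathlib import PureWindowsPath


def numbered_documents(directory, first, last,
                       prefix="voa", suffix=".txt.json"):
    """Return Windows-style paths for prefix<first>..prefix<last>."""
    base = PureWindowsPath(directory)
    return [str(base / f"{prefix}{n}{suffix}")
            for n in range(first, last + 1)]


paths = numbered_documents(r"sample\ne\resources\data\json", 3, 9)
for p in paths:
    print(p)
```

PureWindowsPath only manipulates path strings, so this runs on any platform and does not check that the files exist.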
Now, let's list the workspace to see what we have:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
voa1
autotagged:
You can see that the document you tagged is in "completed", along
with the documents you just imported. You can also see that for each
annotated document, there's a raw copy of the document in "raw,
processed" (you can mostly ignore these). And finally, you can see that
there is one document in "raw, unprocessed" waiting to be annotated.
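If you want to inspect a listing like the one above from a script, a small parser is enough. The sketch below assumes the layout shown in this tutorial: a folder name followed by a colon on its own line, then zero or more lines of whitespace-separated basenames. The function name parse_workspace_listing is hypothetical, and the real output may differ in indentation or ordering.

```python
def parse_workspace_listing(text):
    """Map each folder name to the list of basenames printed under it."""
    folders = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Folder names may contain commas and spaces,
            # so only the trailing colon is stripped.
            current = line[:-1]
            folders[current] = []
        elif current is not None:
            folders[current].extend(line.split())
    return folders


sample = """rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
voa1
autotagged:
"""

listing = parse_workspace_listing(sample)
print(listing["raw, unprocessed"])
print(len(listing["completed"]))
```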
Now, we build a model. This is a command line operation only. We're
going to ask the workspace to autotag afterwards, which should move
"voa1" into the "autotagged" folder. Each time we build a model and
autotag, any documents that aren't in process or completed are
autotagged; documents which have already been autotagged are returned
to "raw, unprocessed" first.
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace modelbuild \
--do_autotag "completed"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace modelbuild \
--do_autotag "completed"
Once this is done, we can look at the contents of the workspace
again:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa1 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
autotagged:
voa1
So you can see that there's no longer anything in "raw,
unprocessed", but now there's one document in autotagged.
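The effect of autotagging on folder membership, as described before the modelbuild command, can be modeled with a toy function. This only tracks which basenames sit in which folder; it is not MAT's implementation, it ignores the "raw, processed" copies, and autotag_all is a hypothetical name.

```python
def autotag_all(folders):
    """Return a new folder mapping after one autotag pass (toy model)."""
    result = {name: list(docs) for name, docs in folders.items()}
    # Previously autotagged documents are returned to
    # "raw, unprocessed" first.
    result["raw, unprocessed"].extend(result["autotagged"])
    # Everything not in process or completed is then autotagged.
    result["autotagged"] = list(result["raw, unprocessed"])
    result["raw, unprocessed"] = []
    return result


before = {
    "completed": ["voa2", "voa3", "voa4"],
    "in process": [],
    "raw, unprocessed": ["voa1"],
    "autotagged": [],
}
after = autotag_all(before)
print(after["raw, unprocessed"], after["autotagged"])
```

Running the function again on its own result leaves "voa1" in "autotagged", which matches the statement that autotagged documents are re-tagged on each pass.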
Now, you'll want to hand-correct the autotagged document.
If the Web server has been running while you've performed the last
two steps, the UI won't know that the state of the workspace has
changed. The safe thing is to close all open workspace documents, and
press the "Refresh" button on the workspace folder window. Now, the
state of the UI and the state of the workspace will be synchronized.
Select the autotagged folder from the folder menu. You should see
"voa1". Open the document. If you want to hand correct it, select the
"Hand correct" operation and press Go, and the document will be moved
into the "in process" folder; if the document is correct, choose "Mark
completed" and press Go, and the document will be moved into the
"completed" folder.
Once the document is in the "in process" folder, its status is
identical to the document at the end of step 5 above, and at this
point, you should be able to produce completed documents either with
full hand annotation or corrected automated annotation, and repeat the
cycle of model building and automated tagging.
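The UI operations mentioned in this tutorial, and the folder each one leaves a document in, can be summarized as a small lookup table. The sketch below covers only the operations named here; the actual set of operations per folder may be larger, and apply_operation is a hypothetical name.

```python
# (current folder, operation) -> folder the document ends up in
TRANSITIONS = {
    ("raw, unprocessed", "Prepare for hand tagging"): "in process",
    ("in process", "Save"): "in process",
    ("in process", "Mark completed"): "completed",
    ("autotagged", "Hand correct"): "in process",
    ("autotagged", "Mark completed"): "completed",
}


def apply_operation(folder, operation):
    """Return the destination folder, or raise if the pair is unknown."""
    try:
        return TRANSITIONS[(folder, operation)]
    except KeyError:
        raise ValueError(
            f"{operation!r} is not listed for folder {folder!r}"
        ) from None


# One pass through the cycle for an autotagged document.
folder = "autotagged"
folder = apply_operation(folder, "Hand correct")
folder = apply_operation(folder, "Mark completed")
print(folder)
```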
In the next tutorial, we'll learn about the experiment engine. If
you want to learn how to use the experiment engine with workspaces,
don't remove your workspace.
If you're not planning on doing any other tutorials, remove the
workspace:
Unix:
% rm -rf /tmp/ne_workspace
Windows native:
> rd /s /q %TMP%\ne_workspace
If you don't want the "Named Entity" task hanging around, remove it as
shown in the final step of Tutorial 1.