Now that we've covered file mode in the first five tutorials, we're going to address workspace mode. In workspace mode, you don't have nearly as much control over the details of processing as you do in file mode. On the other hand, you don't need to worry about any of those details, either.
We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
The only way to create a workspace is on the command line, using MATWorkspaceEngine. The first argument of MATWorkspaceEngine is the path of the affected workspace, and the second argument is the operation. Options and arguments for the chosen operation follow.
Creating a workspace requires a task, so we provide the --task directive:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace create --task 'Named Entity'
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace create --task "Named Entity"
Created workspace for task 'Named Entity' in directory /tmp/ne_workspace.
You now have a workspace in the specified directory. If you're interested in the structure of a workspace, see the workspace documentation.
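The argument order described above (workspace path first, operation second, then the operation's options and arguments) can be captured in a small wrapper. This is a minimal sketch and not part of MAT: the function name `ws_engine` and the `DRY_RUN` variable are hypothetical, and it assumes `MAT_PKG_HOME` is set as in the commands above. With `DRY_RUN=1` it only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAT): wraps MATWorkspaceEngine so that the
# workspace path always comes first and the operation second.
# Assumes MAT_PKG_HOME is set, as in the commands above.
ws_engine() {
    workspace=$1
    operation=$2
    shift 2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it (shell quoting is not shown).
        echo "$MAT_PKG_HOME/bin/MATWorkspaceEngine" "$workspace" "$operation" "$@"
    else
        "$MAT_PKG_HOME/bin/MATWorkspaceEngine" "$workspace" "$operation" "$@"
    fi
}
```

For example, after setting DRY_RUN=1, calling ws_engine /tmp/ne_workspace create --task 'Named Entity' prints the "create" command from this step without running it.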
Workspaces organize files by putting them in folders. The three
folders we'll be concerned with in this tutorial are:
We'll begin by importing a single raw file.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" %CD%/sample\ne\resources\data\raw\voa2.txt
Here we use the "import" operation, which takes two arguments: the folder name ("raw, unprocessed") and the file to import. We've also used the --strip_suffix directive to modify the name by which the workspace knows the file. We can see the contents of the workspace, or of a single folder, with the "list" operation:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list "raw, unprocessed"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list "raw, unprocessed"
raw, unprocessed:
voa2
If you try to import the file again, you'll get an error:
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample\ne\resources\data\raw\voa2.txt
Basename for sample/ne/resources/data/raw/voa2.txt already exists in workspace; not importing
In other words, once a particular basename exists in the workspace, you can't use the "import" operation to create it again.
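The naming behavior above can be approximated outside MAT. The following sketch assumes that the workspace name is the file name with its directory and the --strip_suffix value removed, which matches the "voa2" shown in the listing; the real logic lives inside MATWorkspaceEngine and may differ. The function names `workspace_basename` and `already_imported` are hypothetical.

```shell
#!/bin/sh
# Sketch only: approximates how a workspace basename might be derived.
# workspace_basename FILE SUFFIX -> prints FILE without directory and SUFFIX.
workspace_basename() {
    # POSIX basename drops the directory part and, if given, a trailing suffix.
    basename "$1" "$2"
}

# already_imported NAME EXISTING... -> succeeds if NAME is among EXISTING.
already_imported() {
    name=$1
    shift
    for existing in "$@"; do
        [ "$existing" = "$name" ] && return 0
    done
    return 1
}

name=$(workspace_basename sample/ne/resources/data/raw/voa2.txt .txt)
echo "$name"
# Pretend the workspace already holds "voa2", as it does after the first import.
if already_imported "$name" voa2; then
    echo "$name is already in the workspace; skipping"
fi
```

A check like this could be run before calling "import" in a script, so that a repeated import is skipped rather than reported as an error.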
In this step, we're going to learn about the UI aspects of the workspace.
First, see the documentation on starting the Web server and starting the UI. We'll assume that you're running one of the tabbed terminal applications. In the first pane, you should see something like:
Web server started on port 7801.
Web server command loop. Commands are:
exit - exit the command loop and stop the Web server
loopexit - exit the command loop, but leave the Web server running
taggerexit - shut down the tagger service, if it's running
restart - restart the Web server
ws_key - show the workspace key
help, ? - this message
Workspace key is XJ9dGBaCNveYHk9CZzw6wTM5WH8x05y1
Command:
Note the workspace key. This key is randomly generated and known only to the user who starts the Web server. It must be provided to the UI when the user opens the workspace. This simple security feature ensures that, even though the Web server will be modifying the workspace, it does so only if the UI user has proved that they have the appropriate access.
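The general idea of a randomly generated key that must be presented before access is granted can be illustrated in a few lines. This is an illustration only: the tutorial does not show how MAT generates or checks its key, and the names `make_key` and `key_matches` are hypothetical. The 32-character length simply mirrors the sample key shown above.

```shell
#!/bin/sh
# Illustration only: MAT's actual key generation and checking are not shown
# in this tutorial. make_key and key_matches are hypothetical names.

# make_key -> prints a random 32-character key of letters and digits.
make_key() {
    # Take random bytes, keep only letters and digits, cut to 32 characters.
    head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c 1-32
}

# key_matches SERVER_KEY SUPPLIED_KEY -> succeeds only on an exact match.
key_matches() {
    [ "$1" = "$2" ]
}

server_key=$(make_key)
if key_matches "$server_key" "$server_key"; then
    echo "access granted"
fi
if ! key_matches "$server_key" "wrong-key"; then
    echo "access denied"
fi
```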
Next:
You should see a window that looks like this:
Select "raw, unprocessed" from the folder menu. You should now see this:
A single left click on the file name in the workspace window should open the file:
Note how this file window differs from the one in file mode:
Operations make changes to files and move them around the workspace. For instance, the "Prepare for hand tagging" operation removes a document from the "raw, unprocessed" folder, applies the appropriate engine steps, and saves it in the "in process" folder, at which point the document is ready for hand tagging.
Make sure that the operations menu says "Prepare for hand tagging" and press "Go". Your display should now look like this:
Note that the name of the folder in the file window has changed, and the list of available operations has changed. Note, too, that the workspace pane now shows that the "raw, unprocessed" folder is empty. If you were to switch the folder using the folder menu to "in process", you'd find this document there.
At this point, you can annotate your document as you did in Tutorial 1. If you want to leave the workspace without finishing your annotation, just select the "Save" operation in the operations menu and press "Go"; you can always return to the document. Once you're satisfied with your annotations, select "Mark completed" in the operations menu and press "Go"; your document will be saved and moved to the "completed" folder.
You'd typically annotate several documents in the first round before building a model, but we want to move directly to that step. Since we only have one hand-annotated document at the moment, we're going to import some other documents into the workspace. We're going to import some of the annotated documents that come with the Named Entity task into the "completed" folder, and we're going to import one raw document into the "raw, unprocessed" folder.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample/ne/resources/data/raw/voa1.txt
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt.json" \
"completed" sample/ne/resources/data/json/voa[3-9].txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
"raw, unprocessed" sample\ne\resources\data\raw\voa1.txt
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt.json" \
"completed" sample\ne\resources\data\json\voa3.txt.json \
sample\ne\resources\data\json\voa4.txt.json \
sample\ne\resources\data\json\voa5.txt.json \
sample\ne\resources\data\json\voa6.txt.json \
sample\ne\resources\data\json\voa7.txt.json \
sample\ne\resources\data\json\voa8.txt.json \
sample\ne\resources\data\json\voa9.txt.json
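The Unix command above passes the pattern voa[3-9].txt.json, which the shell expands into the matching file names before MATWorkspaceEngine runs; the Windows command prompt does not expand such patterns, which is presumably why the Windows version lists each file. The following sketch demonstrates the expansion with empty stand-in files in a temporary directory, so no MAT data or commands are involved.

```shell
#!/bin/sh
# Demonstrates shell glob expansion using empty stand-in files in a temporary
# directory; no MAT data or commands are involved.
demo_dir=$(mktemp -d)
for n in 1 2 3 4 5 6 7 8 9; do
    : > "$demo_dir/voa$n.txt.json"
done

# The shell replaces the pattern with the matching names before the command
# (here, echo) ever sees it.
expanded=$(cd "$demo_dir" && echo voa[3-9].txt.json)
echo "$expanded"

# Count the matches: the command receives one argument per file.
count=$(cd "$demo_dir" && set -- voa[3-9].txt.json && echo $#)
echo "$count files"

rm -rf "$demo_dir"
```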
Now, let's list the workspace to see what we have:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
voa1
autotagged:
You can see that the document you tagged is in "completed", along with the documents you just imported. You can also see that for each annotated document, there's a raw copy of the document in "raw, processed" (you can mostly ignore these). And finally, you can see that there is one document in "raw, unprocessed" waiting to be annotated.
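If you want a quick summary of a listing like the one above, the output can be post-processed. This sketch assumes the format shown here: a line ending in a colon names a folder, and any following lines hold whitespace-separated basenames. The function name `count_per_folder` is hypothetical, and the sample input is copied from the listing above.

```shell
#!/bin/sh
# Sketch: count documents per folder from "list" output.
# Assumes the format shown above: a "folder name:" line, optionally followed
# by one or more lines of whitespace-separated basenames.
count_per_folder() {
    awk '
        /:$/ {
            # Folder header line: strip the trailing colon and remember order.
            folder = substr($0, 1, length($0) - 1)
            order[++n] = folder
            count[folder] = 0
            next
        }
        # Any other line lists document basenames for the current folder.
        folder != "" { count[folder] += NF }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %d\n", order[i], count[order[i]]
        }
    '
}

# Sample input copied from the listing above.
count_per_folder <<'EOF'
rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
voa1
autotagged:
EOF
```

In practice you would pipe the output of the "list" operation into the function instead of using a here-document.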
Now, we build a model. This is a command line operation only. We're going to ask the workspace to autotag afterwards, which should move "voa1" into the "autotagged" folder. Each time we build a model and autotag, any documents that aren't in process or completed are autotagged; documents that have already been autotagged are returned to "raw, unprocessed" first.
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace modelbuild \
--do_autotag "completed"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace modelbuild \
--do_autotag "completed"
Once this is done, we can look at the contents of the workspace again:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
rich, incoming:
raw, processed:
voa6 voa7 voa4 voa5 voa2 voa3 voa1 voa8 voa9
completed:
voa6 voa7 voa4 voa5 voa2 voa3 voa8 voa9
in process:
raw, unprocessed:
autotagged:
voa1
You can see that there's no longer anything in "raw, unprocessed", but now there's one document in "autotagged".
Now, you'll want to hand-correct the autotagged document.
If the Web server has been running while you've performed the last two steps, the UI won't know that the state of the workspace has changed. The safe thing is to close all open workspace documents, and press the "Refresh" button on the workspace folder window. Now, the state of the UI and the state of the workspace will be synchronized.
Select the "autotagged" folder from the folder menu. You should see "voa1". Open the document. If you want to hand-correct it, select the "Hand correct" operation and press "Go", and the document will be moved into the "in process" folder; if the document is correct, choose "Mark completed" and press "Go", and the document will be moved into the "completed" folder.
Once the document is in the "in process" folder, its status is identical to the document at the end of step 5 above, and at this point, you should be able to produce completed documents either with full hand annotation or corrected automated annotation, and repeat the cycle of model building and automated tagging.
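The folder transitions described in this tutorial can be summarized as a small lookup table. This is a sketch of the behavior as described here, not of MAT's implementation: the function name `next_folder` is hypothetical, the operation names are the UI menu entries, and "autotag" is a shorthand assumed by this sketch for running modelbuild with --do_autotag.

```shell
#!/bin/sh
# Sketch of the folder transitions described in this tutorial.
# next_folder FOLDER OPERATION -> prints the folder the document ends up in,
# or fails if the tutorial does not describe that combination.
# "autotag" is shorthand (an assumption of this sketch) for modelbuild with
# --do_autotag; the other operation names are the UI menu entries.
next_folder() {
    case "$1|$2" in
        "raw, unprocessed|Prepare for hand tagging") echo "in process" ;;
        "raw, unprocessed|autotag")                  echo "autotagged" ;;
        "autotagged|autotag")                        echo "autotagged" ;;
        "autotagged|Hand correct")                   echo "in process" ;;
        "autotagged|Mark completed")                 echo "completed" ;;
        "in process|Save")                           echo "in process" ;;
        "in process|Mark completed")                 echo "completed" ;;
        *) return 1 ;;
    esac
}

# Trace the path of a fully hand-annotated document, such as "voa2".
folder="raw, unprocessed"
for op in "Prepare for hand tagging" "Save" "Mark completed"; do
    folder=$(next_folder "$folder" "$op") || exit 1
    echo "$op -> $folder"
done
```

The "autotagged|autotag" entry reflects the note that already-autotagged documents are returned to "raw, unprocessed" and then autotagged again, so they end up back in "autotagged".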
In the next tutorial, we'll learn about the experiment engine. If you want to learn how to use the experiment engine with workspaces, don't remove your workspace.
If you're not planning on doing any other tutorials, remove the
workspace:
Unix:
% rm -rf /tmp/ne_workspace
Windows native:
> rd /s /q %TMP%\ne_workspace
If you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.