You can work with documents in MAT either in file mode or in workspace mode. In this section, we
describe each mode and the differences between the two.
In file mode, you work with documents on an individual basis. MAT
doesn't care where they're loaded from, or where they're saved to. If
they're in MAT's rich standoff annotation format, they'll know what
steps have already been applied to them, but other than that, the user
must specify all the other parameters of any file mode operation:
File mode is provided by MATEngine on
the command line, and via "File -> Open file..." in the Web UI.
From the point of view of the UI, file mode in the Web server is stateless. Files are loaded from the
client, and saved to the client, and the Web server has no access to
the file system to load and save the files.
A workspace is a directory that contains a set of predefined
subdirectories for storing documents. We call these subdirectories folders. Each folder has a set of
operations that you can perform on documents in that folder; these
operations may create versions of the file in other folders, or move
the file to another folder as a result of the operation. Unlike file
mode, the way you interact with a workspace is almost entirely defined
for you.
Workspace mode is provided by MATWorkspaceEngine on the command
line, and via "File -> Open workspace..." in the Web UI.
Unlike file mode, workspace mode is stateful
from the point of view of the UI. It is the server, rather than the
client, which loads and saves the files. However, we don't want just
anybody to be able to cause the server to perform these stateful
operations, so the MAT web server implements
a very simple security mechanism.
The MAT web server doesn't support accounts, logins, or special file
permissions. Instead, when the server starts, it generates and prints a
workspace key. This key is a
32-character random alphanumeric sequence. When the user wants to
interact with the server in workspace mode, the UI prompts the user for
this key, and transmits it to the server, which compares it to the key
it generated. If they match, workspace mode is enabled; if they don't,
it's not. This mechanism guarantees that only the person who started
the Web server, or someone that person has transmitted the workspace
key to, can access the workspace via that Web server.
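The key check described above can be sketched in a few lines of Python. This is an illustration only, not MAT's actual code: the function names `generate_workspace_key` and `key_matches` are hypothetical, and the constant-time comparison is a choice made for this sketch rather than something the MAT documentation states.

```python
import hmac
import secrets
import string

KEY_LENGTH = 32  # the documentation describes a 32-character key
ALPHABET = string.ascii_letters + string.digits


def generate_workspace_key() -> str:
    """Return a random alphanumeric key, generated once at server startup."""
    return "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))


def key_matches(server_key: str, submitted_key: str) -> bool:
    """Return True only if the key sent by the UI equals the server's key.

    hmac.compare_digest compares in constant time (an assumption of this
    sketch, not a documented property of the MAT server).
    """
    return hmac.compare_digest(
        server_key.encode("utf-8"), submitted_key.encode("utf-8")
    )
```

In this sketch the server would call `generate_workspace_key()` once, print the result, and enable workspace mode for a client only when `key_matches` returns `True`.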
This mechanism, although simple and straightforward, has some
drawbacks. For instance, permissions aren't issued per workspace; the
Web server has exactly as much file access as the user who started the
Web server, which means that any user who has the workspace key can
modify any workspaces that the user who started the Web server can
modify. On the other hand, there's no account management required, and
the server can only be interrogated for the workspace key via its
command loop, which means that unless you have console access to the
Web server, you can't discover the key. We believe that for the
purposes of MAT, this security mechanism is good enough.
There is one more simple security mechanism involving workspaces and
the Web server. By default, the Web server only allows local clients to
access workspaces; if you're contacting the Web server from another
machine, you won't be able to open any workspaces. However, you can
override this behavior using the --allow_remote_workspace_access option.
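As a rough sketch of this rule, the check might look like the following. The function `may_open_workspace` is hypothetical, its `allow_remote` parameter stands in for the --allow_remote_workspace_access option, and treating "local" as "loopback address" is an assumption of the sketch; the actual server may define local clients differently.

```python
import ipaddress


def may_open_workspace(client_address: str, allow_remote: bool = False) -> bool:
    """Decide whether a client at client_address may open workspaces.

    allow_remote mirrors the --allow_remote_workspace_access option.
    Assumption: a "local" client is one connecting from a loopback address.
    """
    if allow_remote:
        return True
    return ipaddress.ip_address(client_address).is_loopback
```

For example, `may_open_workspace("127.0.0.1")` is `True`, while `may_open_workspace("192.168.1.20")` is `False` unless `allow_remote=True` is passed.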
Workspaces maintain an internal lock to ensure that any operations
which change the state of the workspace are exclusive. This locking
mechanism is quite simple: it relies on the presence or absence of the
"opLockfile" file. If something goes horribly wrong, the workspace
can end up in a stranded state, where it fails
to remove "opLockfile" at the end of the operation. If you're getting a
notification that the workspace is in use, and you're sure it's not,
you can remove the file by hand. As an added bonus, the file contents
will tell you what operation was being performed by which user, and
what time the lock was established.
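A presence-based lock of this kind can be sketched as follows. This is not MAT's implementation: only the file name "opLockfile" comes from the documentation, while the `workspace_lock` helper, the `WorkspaceInUse` exception, and the layout of the file contents are assumptions made for illustration.

```python
import contextlib
import getpass
import os
import time

LOCK_NAME = "opLockfile"  # file name taken from the documentation


class WorkspaceInUse(Exception):
    """Raised when the lock file already exists (hypothetical name)."""


@contextlib.contextmanager
def workspace_lock(workspace_dir, operation, user=None):
    """Hold an exclusive lock on the workspace for one operation."""
    lock_path = os.path.join(workspace_dir, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file already exists, so
        # testing for the lock and taking it happen in a single step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            holder = f.read().strip()
        raise WorkspaceInUse(holder) from None
    with os.fdopen(fd, "w") as f:
        # Record operation, user, and time; this layout is an assumption.
        f.write(f"operation: {operation}\n")
        f.write(f"user: {user or getpass.getuser()}\n")
        f.write(f"time: {time.ctime()}\n")
    try:
        yield lock_path
    finally:
        # If the process dies before reaching this point, the file is
        # left behind: the stranded state that needs manual removal.
        os.remove(lock_path)
```

A state-changing operation would then run inside `with workspace_lock(path, "autotag"):`, and a second caller would get `WorkspaceInUse` carrying the holder's details until the first one finishes.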
As we said above, workspaces are just directories. The structure of
these directories looks like this:
With this background, let's see how you can use workspaces. Tutorial 6 presents examples of most of the
steps below, and more examples can be found in the documentation for MATWorkspaceEngine.
First, you create the workspace. The workspace must have an assigned
task, which you specify when you create it. Creating the workspace
creates the directory, the folder subdirectories, a place to store the
models, and some administrative information.
Workspace creation is currently only available on the command line.
Next, you import documents into the workspace. You'll import
documents into any one of a number of predefined folders:
You can import documents as many times as you like, at any point
while you work
with your workspace. For instance, you can import some documents, hand
annotate them, build a model, and then import more raw
documents to autotag.
File import is currently only available on the command line.
The vast majority of your time in the workspace will be spent
interacting with your documents. Each folder has predefined operations
which you can perform on documents in the folder.
| folder | operation | availability | description | flag | value |
|---|---|---|---|---|---|
| raw, unprocessed | autotag | UI, command line | Automatically tag documents with the current model. Deposit the results in the "autotagged" folder. If no specific basenames are specified, all eligible documents are autotagged, including those which have already been autotagged and those in the "rich, incoming" directory. Already autotagged documents will be unwound according to the engine settings for the autotag operation in the task.xml file. Note: this operation does not use the Carafe server, even in the UI. So the startup cost is incurred each time the autotag step is executed. | | |
| | tagprep | UI, command line | Prepare the documents for hand tagging. Deposit the results in the "in process" folder. | | |
| rich, incoming | autotag | UI, command line | Automatically tag documents with the current model. Deposit the results in the "autotagged" folder. If no specific basenames are specified, all eligible documents are autotagged, including those which have already been autotagged and those in the "raw, unprocessed" directory. Already autotagged documents will be unwound according to the engine settings for the autotag operation in the task.xml file. Note: this operation does not use the Carafe server, even in the UI. So the startup cost is incurred each time the autotag step is executed. | | |
| | tagprep | UI, command line | Prepare the documents for hand tagging. Deposit the results in the "in process" folder. | | |
| in process | markcompleted | UI, command line | Move the documents into the "completed" folder. In the UI, save the document if hand tagging has been done. | | |
| | save | UI | Save the current hand tagging. | mark_completed | if present and the value is "yes", the markcompleted operation will be applied immediately after the save. |
| completed | modelbuild | command line | Create a model based on the specified files in the folder (all of them, by default). Optionally, perform the autotag step on other documents after the model is built. | do_autotag | if present and the value is "yes", the autotag operation will be applied in the "raw, unprocessed" folder immediately afterward. |
| | | | | autotag_basenames | if do_autotag is specified, a space-separated sequence of basenames which are in "raw, unprocessed" to autotag, rather than the entire contents of the "raw, unprocessed" folder. |
| | | | | autotag_basename | if do_autotag is specified, a basename which is in "raw, unprocessed" to autotag, rather than the entire contents of the "raw, unprocessed" folder. |
| | markincomplete | UI, command line | Move the documents into the "in process" folder. | | |
| autotagged | handcorrect | UI, command line | Move the documents into the "in process" folder. | | |
On the command line, these operations are applied by default to all
the files in the folder, and optionally to a specified subset. In the UI,
on the other hand, these operations are only available on a
file-by-file basis. We haven't yet tackled managing the more
time-consuming folder-level
operations in the UI.
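The folder transitions in the table above can be summarized as a small state machine. The Python sketch below is only a reading aid: the names `TRANSITIONS`, `next_folder`, and `follow` are hypothetical, and the entries for "save" and "modelbuild" (which the table does not describe as moving documents) reflect an assumption that the documents stay where they are.

```python
# (folder, operation) -> folder where the result ends up, per the table.
TRANSITIONS = {
    ("raw, unprocessed", "autotag"): "autotagged",
    ("raw, unprocessed", "tagprep"): "in process",
    ("rich, incoming", "autotag"): "autotagged",
    ("rich, incoming", "tagprep"): "in process",
    ("in process", "markcompleted"): "completed",
    ("in process", "save"): "in process",       # assumption: no move
    ("completed", "modelbuild"): "completed",   # assumption: no move
    ("completed", "markincomplete"): "in process",
    ("autotagged", "handcorrect"): "in process",
}


def next_folder(folder, operation):
    """Return the destination folder, or raise if the operation is not
    available in the given folder."""
    try:
        return TRANSITIONS[(folder, operation)]
    except KeyError:
        raise ValueError(
            f"{operation!r} is not available in folder {folder!r}"
        ) from None


def follow(folder, operations):
    """Apply a sequence of operations and return the final folder."""
    for operation in operations:
        folder = next_folder(folder, operation)
    return folder
```

For example, `follow("raw, unprocessed", ["autotag", "handcorrect", "markcompleted"])` returns `"completed"`, while `next_folder("autotagged", "modelbuild")` raises `ValueError` because model building is only offered in the "completed" folder.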
Because interacting with the workspace means switching between
longer-duration batch operations (e.g., model building) and quicker
file-level operations (e.g., hand tagging), the user will end up
moving back and forth between the UI and the terminal. This is
currently unavoidable. Here's what a typical interaction might look
like.
(Alternatively, steps 3 and 6 can happen, per document, in the UI.)
Steps 5 and 6 can be repeated with newly imported documents, so you can
iteratively expand the model and your supply of hand-corrected
documents.
File mode requires more of the user at each step, but is also
significantly more flexible than workspace mode. Workspace mode, on the
other hand, provides considerably more structured support and
bookkeeping for the user, at the sacrifice of flexibility. For instance:
It's important to stress that file mode and workspace mode cannot be freely mixed. You can
invoke the file mode engine on a file in a workspace, but you'll likely
make a mess of things if you save it back to the workspace. Similarly,
you can't invoke the workspace engine on any file that hasn't been
imported into it. You can, for instance, process some documents in file
mode, and then import them into the workspace, but you can make a mess
of things by importing them into the wrong folder in the workspace.
Ideally, you'll load raw documents into the "raw, unprocessed" folder
in the workspace and do all your
operations on those documents starting from there.