Files and workspaces

You can work with documents in MAT either in file mode or in workspace mode. In this section, we describe each mode and the differences between the two.

File mode

In file mode, you work with documents on an individual basis. MAT doesn't care where they're loaded from, or where they're saved to. Documents in MAT's rich standoff annotation format record which steps have already been applied to them, but other than that, the user must specify all the parameters of any file mode operation:

the task
the workflow
the input encoding of raw documents
the exact location of the document or directory of documents to be processed
the exact name of the input and output documents, including whatever suffixes should be removed or added in order to distinguish them from one another
whether the input and output are rich or raw
the workflow steps to apply to the document
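
As an illustration of the naming bookkeeping that file mode leaves to you, here's a minimal sketch of deriving an output name from an input name by removing one suffix and adding another. The function name `output_name` and its parameters are hypothetical; this is not part of MATEngine.

```python
def output_name(input_name: str, strip_suffix: str = "", add_suffix: str = "") -> str:
    """Derive an output file name from an input file name.

    Hypothetical helper: removes strip_suffix if the name ends with it,
    then appends add_suffix.
    """
    stem = input_name
    if strip_suffix and stem.endswith(strip_suffix):
        stem = stem[: -len(strip_suffix)]
    return stem + add_suffix


if __name__ == "__main__":
    # A raw input document and a rich output document, told apart by suffix.
    print(output_name("report.txt", strip_suffix=".txt", add_suffix=".json"))
```

In workspace mode, none of this is needed, because the workspace decides where each version of a document lives.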

File mode is provided by MATEngine on the command line, and via "File -> Open file..." in the Web UI.

From the point of view of the UI, file mode in the Web server is stateless. Files are loaded from the client, and saved to the client, and the Web server has no access to the file system to load and save the files.

Workspace mode

A workspace is a directory that contains a set of predefined subdirectories for storing documents. We call these subdirectories folders. Each folder has a set of operations that you can perform on documents in that folder; an operation may create a version of the document in another folder, or move the document to another folder. Unlike file mode, the way you interact with a workspace is almost entirely defined for you.

Workspace mode is provided by MATWorkspaceEngine on the command line, and via "File -> Open workspace..." in the Web UI.

Statefulness and the workspace key

Unlike file mode, workspace mode is stateful from the point of view of the UI. It is the server, rather than the client, which loads and saves the files. However, we don't want just anybody to be able to cause the server to perform these stateful operations, so the MAT web server implements a very simple security mechanism.

The MAT web server doesn't support accounts, logins, or special file permissions. Instead, when the server starts, it generates and prints a workspace key. This key is a 32-character random alphanumeric sequence. When the user wants to interact with the server in workspace mode, the UI prompts the user for this key, and transmits it to the server, which compares it to the key it generated. If they match, workspace mode is enabled; if they don't, it's not. This mechanism guarantees that only the person who started the Web server, or someone that person has transmitted the workspace key to, can access the workspace via that Web server.
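
To make the key exchange concrete, here's a minimal sketch of how a key of this shape could be generated and checked. It's an illustration only, not MAT's actual code: the names `generate_workspace_key` and `keys_match` are hypothetical, and the use of Python's `secrets` and `hmac` modules is our assumption.

```python
import hmac
import secrets
import string

# Hypothetical names; MAT's own implementation may differ.
KEY_ALPHABET = string.ascii_letters + string.digits
KEY_LENGTH = 32


def generate_workspace_key() -> str:
    """Return a random 32-character alphanumeric key."""
    return "".join(secrets.choice(KEY_ALPHABET) for _ in range(KEY_LENGTH))


def keys_match(server_key: str, submitted_key: str) -> bool:
    """Compare the submitted key with the server's key in constant time."""
    return hmac.compare_digest(
        server_key.encode("utf-8"), submitted_key.encode("utf-8")
    )


if __name__ == "__main__":
    key = generate_workspace_key()  # generated once, at server startup
    print("Workspace key:", key)
    print("Correct key accepted:", keys_match(key, key))
    print("Wrong key accepted:", keys_match(key, "not-the-key"))
```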

This mechanism, although simple and straightforward, has some drawbacks. For instance, permissions aren't issued per workspace; the Web server has exactly as much file access as the user who started the Web server, which means that any user who has the workspace key can modify any workspaces that the user who started the Web server can modify. On the other hand, there's no account management required, and the server can only be interrogated for the workspace key via its command loop, which means that unless you have console access to the Web server, you can't discover the key. We believe that for the purposes of MAT, this security mechanism is good enough.

There is one more simple security mechanism involving workspaces and the Web server. By default, the Web server only allows local clients to access workspaces; if you're contacting the Web server from another machine, you won't be able to open any workspaces. However, you can override this behavior using the --allow_remote_workspace_access option.
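
The check described here might be sketched as follows. We're assuming that "local" means a loopback address, which may not match exactly how the MAT Web server decides; the function name `workspace_access_allowed` and its `allow_remote` parameter are hypothetical stand-ins for the server's behavior with and without --allow_remote_workspace_access.

```python
import ipaddress


def workspace_access_allowed(client_address: str, allow_remote: bool = False) -> bool:
    """Decide whether a client may open workspaces.

    Assumption: a "local" client is one connecting from a loopback address.
    """
    if allow_remote:
        return True
    return ipaddress.ip_address(client_address).is_loopback


if __name__ == "__main__":
    for addr in ("127.0.0.1", "::1", "192.168.1.20"):
        print(addr, workspace_access_allowed(addr))
```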

Workspace locking

Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple: it relies on the presence or absence of the "opLockfile" file. If something goes badly wrong, the workspace may be left in a stranded state, where "opLockfile" wasn't removed at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed, by which user, and at what time the lock was established.
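
For readers unfamiliar with presence-based locking, here's a minimal sketch of the general technique. Only the file name "opLockfile" comes from MAT; the function name `workspace_lock`, the layout of the file contents, and the error handling are all assumptions made for illustration.

```python
import os
import time
from contextlib import contextmanager

LOCK_NAME = "opLockfile"  # the file name used by MAT workspaces


@contextmanager
def workspace_lock(workspace_dir: str, operation: str):
    """Hold an exclusive lock on the workspace for the duration of a block."""
    lock_path = os.path.join(workspace_dir, LOCK_NAME)
    try:
        # O_EXCL makes the creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            holder = f.read().strip()
        raise RuntimeError(f"workspace is in use: {holder}")
    user = os.environ.get("USER") or os.environ.get("USERNAME") or "unknown"
    with os.fdopen(fd, "w") as f:
        # The content layout is an assumption, not MAT's actual format.
        f.write(f"operation={operation} user={user} time={time.ctime()}\n")
    try:
        yield lock_path
    finally:
        # If the process dies before this point, the lock file is stranded
        # and has to be removed by hand.
        os.remove(lock_path)
```

A caller would wrap each state-changing operation in `with workspace_lock(path, "autotag"): ...`; a second caller arriving in the meantime gets an error naming the operation that holds the lock.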

The structure of the workspace directory

As we said above, workspaces are just directories. The structure of these directories looks like this:

filenames.txt (file) - a file containing a list of every file basename in the workspace (see step 2 below)
folders (dir) - a directory containing the workspace folders, which are themselves directories (see step 1 below). This directory will contain, at least:

    raw_unprocessed (dir)
    in_process (dir)
    completed (dir)
    autotagged (dir)

models (dir) - a directory containing any models that are built during workspace operations

    model (file) - the most recently generated Carafe model (may not exist)
    model_basenames (file) - a file containing a list of the basenames used to create the most recent model

properties.txt (file) - the properties of the workspace
opLockfile (file) - if present, the workspace is locked.
last_import (file) - a file containing the timestamp of the last workspace import operation (see step 2 below)
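
Since a workspace is just a directory, you can inspect one with ordinary file system tools. Here's a minimal sketch that reports on the layout listed above. The function name `describe_workspace` is hypothetical, and we're assuming that the "model" file sits inside the "models" directory.

```python
from pathlib import Path

# Folder directories listed above; a task may define more.
EXPECTED_FOLDERS = ("raw_unprocessed", "in_process", "completed", "autotagged")


def describe_workspace(workspace_dir: str) -> dict:
    """Summarize what is present in a workspace directory."""
    root = Path(workspace_dir)
    folders_dir = root / "folders"
    present = []
    if folders_dir.is_dir():
        present = sorted(p.name for p in folders_dir.iterdir() if p.is_dir())
    return {
        "folders": present,
        "missing_folders": [f for f in EXPECTED_FOLDERS if f not in present],
        "locked": (root / "opLockfile").exists(),
        # Assumption: the model file lives inside the models directory.
        "has_model": (root / "models" / "model").exists(),
        "has_properties": (root / "properties.txt").is_file(),
    }
```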

With this background, let's see how you can use workspaces. Tutorial 6 presents examples of most of the steps below, and more examples can be found in the documentation for MATWorkspaceEngine.

Step 1: create the workspace

First, you create the workspace. The workspace must have an assigned task, which you specify when you create it. Creating the workspace creates the directory, the folder subdirectories, a place to store the models, and some administrative information.

Workspace creation is currently only available on the command line.

Step 2: import documents

Next, you import documents into the workspace. You can import documents into any one of the following predefined folders:

"raw, unprocessed": this is the folder for raw documents to which nothing has been done.
"rich, incoming": this is the folder for rich documents which can't be imported into any other folder (e.g., they're zoned, but the "in process" folder requires both zoning and tokenization to be completed).
"in process": this is the folder for documents for which hand tagging is underway. If you've already hand-tagged some documents and you're not done with them, import them into this folder. Documents imported into this folder must be in "mat-json" format (not raw).
"completed": this is the folder for documents for which hand tagging is done. If you've already hand-tagged some documents and you're done with them, import them into this folder. Documents imported into this folder must be in "mat-json" format (not raw).
"autotagged": this is the folder for documents which have been automatically tagged with the Carafe engine. If you've applied the "tag" step to documents in file mode, and gone no farther, import them into this folder. Documents imported into this folder must be in "mat-json" format (not raw).

There are other predefined folders (e.g., "raw, processed" contains raw versions of documents which have already been processed), but these are the only folders you can import documents into. Your task may also define additional folders.

When a document is imported, it is assigned a unique basename, which is usually the basename of the path of the imported file (i.e., the final path component). All versions of this file in the various workspace folders have the identical basename.
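
Here's a minimal sketch of basename assignment. The usual case, taking the final path component, follows the description above. What happens when that name is already taken isn't specified here, so the numeric suffix scheme below is purely our assumption, as is the function name `assign_basename`.

```python
import os


def assign_basename(import_path: str, taken: set) -> str:
    """Pick a basename for an imported file that is unique in the workspace."""
    base = os.path.basename(import_path)  # the final path component
    if base not in taken:
        return base
    # Assumption: on a collision, add a numeric suffix before the extension.
    stem, ext = os.path.splitext(base)
    n = 1
    while f"{stem}_{n}{ext}" in taken:
        n += 1
    return f"{stem}_{n}{ext}"


if __name__ == "__main__":
    taken = set()
    for path in ("/data/batch1/doc1.txt", "/data/batch2/doc1.txt"):
        name = assign_basename(path, taken)
        taken.add(name)
        print(path, "->", name)
```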

You can import documents as many times as you like, and at any point while you work with your workspace. For instance, you can import some documents, hand annotate them, build a model, and then import more raw documents to autotag.

File import is currently only available on the command line.

Step 3: perform operations on documents

The vast majority of your time in the workspace will be spent interacting with your documents. Each folder has predefined operations which you can perform on documents in the folder.

folder	operation	availability	description	flag	value
raw, unprocessed	autotag	UI, command line	Automatically tag documents with the current model. Deposit the results in the "autotagged" folder. If no specific basenames are specified, all eligible documents are autotagged, including those which have already been autotagged and those in the "rich, incoming" directory. Already autotagged documents will be unwound according to the engine settings for the autotag operation in the task.xml file. Note: this operation does not use the Carafe server, even in the UI. So the startup cost is incurred each time the autotag step is executed.
raw, unprocessed	tagprep	UI, command line	Prepare the documents for hand tagging. Deposit the results in the "in process" folder.
rich, incoming	autotag	UI, command line	Automatically tag documents with the current model. Deposit the results in the "autotagged" folder. If no specific basenames are specified, all eligible documents are autotagged, including those which have already been autotagged and those in the "raw, unprocessed" directory. Already autotagged documents will be unwound according to the engine settings for the autotag operation in the task.xml file. Note: this operation does not use the Carafe server, even in the UI. So the startup cost is incurred each time the autotag step is executed.
rich, incoming	tagprep	UI, command line	Prepare the documents for hand tagging. Deposit the results in the "in process" folder.
in process	markcompleted	UI, command line	Move the documents into the "completed" folder. In the UI, save the document if hand tagging has been done.
in process	save	UI	Save the current hand tagging.	mark_completed	if present and the value is "yes", the markcompleted operation will be applied immediately after the save.
completed	modelbuild	command line	Create a model based on the specified files in the folder (all of them, by default). Optionally, perform the autotag step on other documents after the model is built.	do_autotag	if present and the value is "yes", the autotag operation will be applied in the "raw, unprocessed" folder immediately afterward.
				autotag_basenames	if do_autotag is specified, a space-separated sequence of basenames which are in "raw, unprocessed" to autotag, rather than the entire contents of the "raw, unprocessed" folder.
				autotag_basename	if do_autotag is specified, a basename which is in "raw, unprocessed" to autotag, rather than the entire contents of the "raw, unprocessed" folder.
completed	markincomplete	UI, command line	Move the documents into the "in process" folder.
autotagged	handcorrect	UI, command line	Move the documents into the "in process" folder.
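
One way to read the table above is as a set of transitions between folders. The sketch below encodes that reading as a lookup table; it's derived from the table, not from MAT's source, and the names `OPERATIONS`, `result_folder` and `follow` are hypothetical. We've taken "markincomplete" to belong to the "completed" folder, and we ignore the optional flags (e.g., mark_completed on save).

```python
# (folder, operation) -> folder that receives the result, or None if the
# operation doesn't place a document in a folder.
OPERATIONS = {
    ("raw, unprocessed", "autotag"): "autotagged",
    ("raw, unprocessed", "tagprep"): "in process",
    ("rich, incoming", "autotag"): "autotagged",
    ("rich, incoming", "tagprep"): "in process",
    ("in process", "markcompleted"): "completed",
    ("in process", "save"): "in process",
    ("completed", "modelbuild"): None,
    ("completed", "markincomplete"): "in process",
    ("autotagged", "handcorrect"): "in process",
}


def result_folder(folder: str, operation: str):
    """Return the folder where the result of an operation ends up."""
    try:
        return OPERATIONS[(folder, operation)]
    except KeyError:
        raise ValueError(
            f"{operation!r} is not available in folder {folder!r}"
        ) from None


def follow(start: str, operations: list) -> list:
    """Trace the folders a document passes through for a list of operations."""
    path = [start]
    current = start
    for op in operations:
        dest = result_folder(current, op)
        if dest is not None:
            current = dest
        path.append(current)
    return path


if __name__ == "__main__":
    print(follow("raw, unprocessed", ["autotag", "handcorrect", "markcompleted"]))
```

Asking for an operation that a folder doesn't offer, such as "handcorrect" in "completed", raises an error, which mirrors the way each folder only exposes its own operations.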

On the command line, these operations are applied by default to all the files in the folder, and optionally to a specified subset. In the UI, on the other hand, these operations are only available on a file-by-file basis. We haven't yet tackled managing the more time-consuming folder-level operations in the UI.

A typical interaction

Because interacting with the workspace means switching between longer-duration batch operations (e.g., model building) and quicker file-level operations (e.g., hand tagging), the user will end up moving back and forth between the UI and the terminal. This is currently unavoidable. Here's what a typical interaction might look like.

1. Command line: Create a workspace
2. Command line: Import a batch of documents
3. Command line: Prepare the documents for hand tagging
4. UI: Hand annotate some documents
5. Command line: Build a model and autotag the raw documents you haven't hand annotated
6. Command line: Make the autotagged documents available for hand correction

(Alternatively, steps 3 and 6 can happen, per document, in the UI.) Steps 5 and 6 can be repeated with newly imported documents, so you can iteratively expand the model and your supply of hand-corrected documents.

Comparing the two modes

File mode requires more of the user at each step, but is also significantly more flexible than workspace mode. Workspace mode, on the other hand, provides considerably more structured support and bookkeeping for the user, at the sacrifice of flexibility. For instance:

In file mode, you can undo steps in the UI. You can't undo steps at all in workspace mode.
In file mode, you can apply an arbitrary sequence of steps from the current workflow at the same time. In workspace mode, on the other hand, you must move forward one operation at a time.
In workspace mode, you never have to worry about where your documents live, or what they're named. In file mode, you have to manage this yourself.

It's important to stress that file mode and workspace mode cannot be freely mixed. You can invoke the file mode engine on a file in a workspace, but you'll likely make a mess of things if you save it back to the workspace. Similarly, you can't invoke the workspace engine on any file that hasn't been imported into it. You can, for instance, process some documents in file mode, and then import them into the workspace, but you can make a mess of things by importing them into the wrong folder in the workspace. Ideally, you'll load raw documents into the "raw, unprocessed" folder in the workspace and do all your operations on those documents starting from there.