Tutorial 6: Workspaces

Now that we've covered file mode in the first five tutorials, we're going to address workspace mode. In workspace mode, you have much less control over the following:

what your documents are named
how their annotation status is managed
where they live in the file system
where models are stored

On the other hand, you don't need to worry about any of those things, either.

We're going to use the same simple 'Named Entity' task, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.

Step 1: Create your workspace

The only way to create a workspace is on the command line. We use MATWorkspaceEngine. The first argument of MATWorkspaceEngine is the path of the affected workspace, and the second argument is the operation. Options and arguments for the chosen operation follow.
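To make the argument order concrete, here is a minimal sketch of a wrapper around the engine. The function name mat_ws and the WS_DRY_RUN variable are hypothetical and are not part of MAT; in dry-run mode the wrapper only prints the command it would run, so you can check the order without touching a workspace.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of MAT. With WS_DRY_RUN=1 it only prints
# the command it would run, so the argument order can be checked safely.
mat_ws() {
    workspace=$1    # first argument: path of the affected workspace
    operation=$2    # second argument: the operation (create, import, list)
    shift 2         # the rest: options and arguments for that operation
    if [ "${WS_DRY_RUN:-0}" = 1 ]; then
        echo "MATWorkspaceEngine $workspace $operation $*"
    else
        "$MAT_PKG_HOME/bin/MATWorkspaceEngine" "$workspace" "$operation" "$@"
    fi
}

WS_DRY_RUN=1
mat_ws /tmp/ne_workspace list core
```

Note that the dry-run output does not re-quote arguments that contain spaces, such as a task name; it's meant for reading, not for pasting back into the shell.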

Creating a workspace requires a task, so we provide the --task directive. Workspaces also track annotation progress by user, so we need at least one user name to create the workspace:

Unix:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace create \
--task 'Named Entity' --initial_users user1

Windows native:

> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace create \
--task "Named Entity" --initial_users user1

Created workspace for task 'Named Entity' in directory ...

You now have a workspace in the specified directory. If you're interested in the structure of a workspace, look here.

Step 2: Import files into your workspace

Workspaces organize files by folders, and they track the status of the files as they're processed. The "core" folder supports all the normal annotation functions. We'll begin by importing a single raw file into the core folder.

Unix:

% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa2.txt 

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" %CD%\sample\ne\resources\data\raw\voa2.txt

So here we use the "import" operation, which takes two arguments: the folder name ("core") and the file to import. We've also used the --strip_suffix directive to modify the name by which the workspace knows the file. Finally, we've told the workspace engine, via the --file_type option, that the file we're importing is a raw file (rather than a rich MAT JSON file). For more details on importing documents, see here.
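As a rough illustration of what --strip_suffix does to the name, the sketch below mimics the renaming with plain shell parameter expansion. It assumes that the workspace name is simply the file's name with the given suffix removed, which is what the listings in this tutorial suggest; the function name ws_basename is hypothetical and is not part of MAT.

```shell
#!/bin/sh
# Hypothetical helper, not part of MAT: mimics the name a workspace would
# show for an imported file, assuming --strip_suffix simply removes the
# given suffix from the file's name.
ws_basename() {
    file=${1##*/}                   # drop the directory part
    printf '%s\n' "${file%"$2"}"    # drop the suffix, if the name ends with it
}

ws_basename sample/ne/resources/data/raw/voa2.txt .txt
ws_basename sample/ne/resources/data/json/voa3.txt.json .txt.json
```

Under that assumption, the two calls print voa2 and voa3. A suffix that doesn't match the end of the name leaves the name unchanged in this sketch.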

We can see the contents of the workspace (and of each folder), with the "list" operation:

Unix:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list "core"

Windows native:

> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list "core"

core:
  voa2 (unannotated)

Note that the listing tells you the status of the document.
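If you want to use the status in a script, the listing is easy to pick apart. The sketch below assumes the layout shown above (a folder line, then indented "name (status)" lines); the function name ws_statuses is hypothetical and is not part of MAT.

```shell
#!/bin/sh
# Hypothetical helper, not part of MAT: reads "list" output on stdin and
# prints one "name<TAB>status" line per document. It assumes the
# two-space-indented "name (status)" layout shown in this tutorial.
ws_statuses() {
    awk '/^  [^ ]+ \(.*\)$/ {
        name = $1
        status = $0
        sub(/^  [^ ]+ \(/, "", status)   # strip the leading "  name ("
        sub(/\)$/, "", status)           # strip the trailing ")"
        print name "\t" status
    }'
}

# Sample listing copied from above; with a real workspace you would pipe
# the output of the "list" operation into ws_statuses instead.
ws_statuses <<'EOF'
core:
  voa2 (unannotated)
EOF
```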

You can import a file under a given name only once. If you try to import the same file again, you'll get an error:

Unix: 

% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa2.txt 

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample\ne\resources\data\raw\voa2.txt 

Basename for sample/ne/resources/data/raw/voa2.txt already exists in workspace; not importing

In other words, once you create a particular basename in the workspace using the "import" operation, you can't do it again.
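A script that imports many files might check the listing first rather than rely on the error. Here is a minimal sketch of such a check, assuming the listing layout shown earlier; the function name ws_has_basename is hypothetical and is not part of MAT.

```shell
#!/bin/sh
# Hypothetical helper, not part of MAT: succeeds if the "list" output on
# stdin already contains the given basename. Assumes the indented
# "name (status)" layout shown earlier in this tutorial.
ws_has_basename() {
    awk -v name="$1" '/^  / && $1 == name { found = 1 } END { exit !found }'
}

# Sample listing; with a real workspace, use the output of "list" instead.
listing='core:
  voa2 (unannotated)'

for name in voa1 voa2; do
    if printf '%s\n' "$listing" | ws_has_basename "$name"; then
        echo "$name: already in workspace, skipping import"
    else
        echo "$name: safe to import"
    fi
done
```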

Step 3: Open the workspace in the UI

In this step, we're going to learn about the UI aspects of the workspace.

First, start up the UI as we described in tutorial 1.

Note: when you start up the Web server in its default mode, workspaces will only be accessible from a browser client running on the same host. There are many options available to the Web server at startup which affect the workspaces, so if you want to use workspaces in the UI, we recommend that you familiarize yourself with the MATWeb documentation.

In the terminal in which you're running the Web server, you'll see this when it starts up:

Web server started on port 7801.

Web server command loop. Commands are:

exit       - exit the command loop and stop the Web server
loopexit   - exit the command loop, but leave the Web server running
taggerexit - shut down the tagger service, if it's running
restart    - restart the Web server
ws_key     - show the workspace key
help, ?    - this message

Workspace key is XJ9dGBaCNveYHk9CZzw6wTM5WH8x05y1
Command:

Note the workspace key. This key is randomly generated and known only to the user who starts the Web server, and it must be provided to the UI when the user opens the workspace. This simple security feature ensures that even though the Web server will be modifying the workspace, it does so only if the UI user has proved that he or she has the appropriate access. For more about workspace security and the UI, see here.

In the UI, select File -> Open workspace... . You'll see a popup window.
In the "User ID" field, specify "user1" (without the quotes; this is the user name we provided when we created the workspace), and press <tab> to advance and activate the next input field.
Copy the workspace key from the Web server output. If you can't see it due to the output from the Web server, type "ws_key" in the Web server terminal, and then press <return>. Paste the key into the "Workspace key" field in the UI. Press <tab> to advance to the next input field.
In the "Directory:" field, type "/tmp/ne_workspace". Press <tab>.
Press the "Open" button.

You should see a window that looks like this:

[core folder]

Step 4: Open a document

A single left click on the file name in the workspace tab should open the file. You'll see that this document has been prepared for annotation (it has been zoned and tokenized, in particular). You'll see in the controls on the right that its status, as shown in the listing above, is "unannotated", which means that no human annotator has touched it yet:

[core view]

Note how the controls area here differs from the one in file mode:

The workspace is listed, instead of the task.
The workflow menu is missing, and the folder is listed instead.
The status fields and forward and backward buttons are missing, and there's an operation menu instead.
There's no reload or save button.

If you select the folder tab now, you'll see that the document is listed as "unannotated, locked by user1". Workspaces maintain document locks to ensure that no one else overwrites your changes. This lock will be freed when you close the document.

Step 5: Hand annotate

At this point, you can annotate your document as you did in Tutorial 1. If you want to leave the workspace without finishing your annotation, just select the Save operation in the operations menu and press Go; you can always return to the document. Once you're satisfied with your annotations, select "Mark gold" in the operations menu and press Go; your document will be saved and the document status updated.

Finally, close the document. In a minute, we're going to do some automated tagging in the workspace, and currently this is not possible while documents are locked.

Step 6: Import more documents

You'd typically annotate several documents in the first round before building a model, but we want to move directly to that step. Since we only have one hand-annotated document at the moment, what we're going to do is import some other documents into the workspace. We're going to import some of the annotated documents that come with the Named Entity task into the core folder; these documents are already marked internally as gold-standard reconciled documents (i.e., in addition to being marked gold, their correctness has been validated by further review). We're also going to import one of them as a raw document.

Unix:

% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa1.txt
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt.json" \
"core" sample/ne/resources/data/json/voa[3-9].txt.json

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" \
--file_type raw  "core" sample\ne\resources\data\raw\voa1.txt
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt.json" \
"core" sample\ne\resources\data\json\voa3.txt.json \
sample\ne\resources\data\json\voa4.txt.json \
sample\ne\resources\data\json\voa5.txt.json \
sample\ne\resources\data\json\voa6.txt.json \
sample\ne\resources\data\json\voa7.txt.json \
sample\ne\resources\data\json\voa8.txt.json \
sample\ne\resources\data\json\voa9.txt.json
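On Unix, it's the shell rather than MATWorkspaceEngine that turns voa[3-9].txt.json into seven separate paths. The Windows command prompt doesn't expand patterns like this, which is presumably why the Windows command lists each file. The sketch below shows the expansion using empty stand-in files in a scratch directory; the stand-ins, including voa10, are placeholders and not a claim about what the sample directory contains.

```shell
#!/bin/sh
# Build a scratch directory with empty stand-in files. The real files
# live under sample/ne/resources/data/json.
scratch=$(mktemp -d)
for n in 1 2 3 4 5 6 7 8 9 10; do
    : > "$scratch/voa$n.txt.json"
done

# The shell expands the bracket pattern before any command sees it.
set -- "$scratch"/voa[3-9].txt.json
expanded_count=$#
echo "pattern expanded to $expanded_count paths"
for path in "$@"; do
    echo "${path##*/}"
done

rm -rf "$scratch"
```

The pattern matches exactly one character from 3 to 9, so voa1, voa2 and voa10 are left out.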

Now, let's list the workspace to see what we have:

Unix:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list

Windows native:

> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list

core:
  voa1 (unannotated)
  voa2 (gold)
  voa3 (reconciled)
  voa4 (reconciled)
  voa5 (reconciled)
  voa6 (reconciled)
  voa7 (reconciled)
  voa8 (reconciled)
  voa9 (reconciled)

export:

You can see that the document you tagged is marked gold, and the annotated documents you just imported are marked reconciled. Finally, you can see that there is one document, the raw document you just imported, which is marked unannotated.

Step 7: Build a model

Now, we build a model. Workspace models are completely distinct from default task models, like the one we built in Tutorial 2. They're built exclusively from the documents in the workspace.

This is a command line operation only. We're going to ask the workspace to autotag afterwards, which should mark "voa1" as uncorrected (since now it's been automatically annotated). Each time we build a model and autotag, any documents that are either unannotated or uncorrected are autotagged.

Unix:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace modelbuild \
--do_autotag "core"

Windows native:

> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace modelbuild \
--do_autotag "core"

Once this is done, we can look at the contents of the workspace again:

Unix:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list

Windows native:

> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list

core:
  voa1 (uncorrected)
  voa2 (gold)
  voa3 (reconciled)
  voa4 (reconciled)
  voa5 (reconciled)
  voa6 (reconciled)
  voa7 (reconciled)
  voa8 (reconciled)
  voa9 (reconciled)

export:

Note that voa1, which was previously unannotated, is now uncorrected - i.e., it's been autotagged but not hand-corrected. The other documents, because they're gold or reconciled, were used to create the model which the workspace applied to voa1.
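The rule this step describes can be written down as a small filter over the listing: gold and reconciled documents feed the model, and unannotated and uncorrected documents get autotagged. The sketch below only restates that rule; it is not how MAT decides internally, the function name ws_roles is hypothetical, and any status this tutorial doesn't mention is reported as "unknown".

```shell
#!/bin/sh
# Hypothetical helper, not part of MAT: reads "list" output on stdin and
# labels each document with the role this tutorial describes for its status.
ws_roles() {
    awk '/^  [^ ]+ \(.*\)$/ {
        status = $2
        gsub(/[(),]/, "", status)    # "(gold)" -> "gold"
        if (status == "gold" || status == "reconciled")
            role = "training"        # used to build the model
        else if (status == "unannotated" || status == "uncorrected")
            role = "autotag"         # tagged when --do_autotag is given
        else
            role = "unknown"         # statuses this tutorial does not cover
        print $1, role
    }'
}

# Sample listing; with a real workspace, use the output of "list" instead.
ws_roles <<'EOF'
core:
  voa1 (unannotated)
  voa2 (gold)
  voa3 (reconciled)

export:
EOF
```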

Step 8: Hand correct

Now, you'll want to hand-correct the autotagged document.

If your UI has been open while you've performed the last two steps on the command line, the UI won't know that the state of the workspace has changed. You can select the workspace tab and press the "Refresh" button in the controls area. Now, the state of the UI and the state of the workspace will be synchronized.

Select the core folder from the folder menu. You should see "voa1", among other documents. Open it. Review the annotations and correct whatever is needed. When the document is correct, choose "Mark gold" and press Go, and the document will be marked gold.

Step 9: Clean up (optional)

In the next tutorial, we'll learn about the experiment engine. If you want to learn how to use the experiment engine with workspaces, don't remove your workspace.

If you're not planning on doing any other tutorials, remove the workspace:

Unix:

% rm -rf /tmp/ne_workspace

Windows native:

> rd /s /q %TMP%\ne_workspace

If you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.

This concludes Tutorial 6.