Using workspaces

Workspaces provide a guided, structured way of managing and processing your documents. Make sure that this is what you want. Workspace mode is provided by MATWorkspaceEngine on the command line, and via "File -> Open workspace..." in the Web UI. You can find a summary of the highlights about using workspaces here; this document provides the details.

The structure of the workspace directory
Workspace users
Workspace operations
Workspace security
Workspace locking
Troubleshooting
Advanced topic: workspace reconciliation
Advanced topic: the workspace database

The structure of the workspace directory

Workspaces are just directories. The structure of these directories looks like this:

folders (dir) - a directory containing the workspace folders, which are themselves directories. This directory will contain, at least:

core (dir)
reconciliation (dir) (not yet used)
export (dir) (not yet used)

models (dir) - a directory containing any models that are built during workspace operations

model (file) - the most recently generated Carafe model (may not exist)
model_basenames (file) - a file containing a list of the basenames used to create the most recent model

opLockfile (file) - if present, the workspace is locked.
_checkpoint (dir) - if logging is enabled, contains the log
ws_db.db (file) - the workspace SQLite database, which contains metadata about the documents and the workspace itself
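
As a rough illustration of the layout above, the following Python sketch reports which of the listed entries are present in a directory. The function name describe_workspace is hypothetical and not part of MAT. The sketch assumes that core, reconciliation and export sit under folders, and that model and model_basenames sit under models, as the grouping of the listing suggests.

```python
import os

# Entries taken from the listing above; the nesting is an assumption
# based on how the listing is grouped.
EXPECTED_ENTRIES = [
    ("folders", "dir"),
    (os.path.join("folders", "core"), "dir"),
    (os.path.join("folders", "reconciliation"), "dir"),
    (os.path.join("folders", "export"), "dir"),
    ("models", "dir"),
    (os.path.join("models", "model"), "file"),
    (os.path.join("models", "model_basenames"), "file"),
    ("opLockfile", "file"),
    ("_checkpoint", "dir"),
    ("ws_db.db", "file"),
]

def describe_workspace(ws_dir):
    """Return a dict mapping each expected relative path to True/False."""
    report = {}
    for rel_path, kind in EXPECTED_ENTRIES:
        full = os.path.join(ws_dir, rel_path)
        if kind == "dir":
            report[rel_path] = os.path.isdir(full)
        else:
            report[rel_path] = os.path.isfile(full)
    return report
```

In such a report, a True value for opLockfile would mean the workspace is currently locked, and a False value for the model file would simply mean that no model has been built yet.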

Workspace users

One of the innovations in workspaces in MAT 2.0 is their close connection with segments and annotation progress. All documents in workspaces are now closely tracked for their annotation state (including, ultimately, annotation of subsections of documents), which includes tracking who modified the annotations in the various document portions. As a result, every document edit in workspaces is linked to a workspace user.

The inventory of users of a workspace is entirely up to its creators and managers. Every workspace must be created with at least one initial user. The names of these users are not bound to any external resource; they're not required to be the same as login names, for instance. They're merely there to provide a way of attributing document changes. There's no account management or passwords; you can "claim" to be any registered user you want to claim to be when you edit a workspace. We're assuming that you're using MAT workspaces in a cooperative environment in which this sort of inappropriate behavior won't arise.

Although there's no requirement that registered user names correspond to external resources like login names, you may find it easiest to use login names anyway, so that your workspace annotators don't have to remember a different name when they open a workspace.

Workspace users are assigned roles, which indicate what they can do within the workspace. By default, all users can annotate documents in the core folder. If workspace reconciliation is configured, workspace users are also assigned reconciliation roles. By default, each user has all reconciliation roles except "human_decision" (the ability to make an enforceable judgment about a reconciliation choice).

Workspace operations

The available operations are:

topic	operation	availability	folder
creation	create	command line	(global)
file management	import	command line	(global)
	remove	command line	(global)
	assign	command line	(global)
	open_file	UI, command line debug	(global)
	markgold	UI, command line debug	core
	unmarkgold	UI, command line debug	core
	save	UI, command line debug	core, reconciliation
inspection	list	UI, command line	(global)
	workspace_configuration	command line	(global)
	dump_database	command line	(global)
logging	enable_logging	command line	(global)
	disable_logging	command line	(global)
	rerun_log	command line	(global)
users	register_users	command line	(global)
	list_users	command line	(global)
	add_roles	command line	(global)
	remove_roles	command line	(global)
automated tagging	modelbuild	command line	core
	autotag	UI, command line	core
experimentation	list_basename_sets	command line	(global)
	add_to_basename_set	command line	(global)
	remove_from_basename_set	command line	(global)
	run_experiment	command line	(global)
reconciliation (not yet enabled)	configure_reconciliation	command line	(global)
	submit_to_reconciliation	command line	core
	remove_from_reconciliation	command line	reconciliation
administration	force_unlock	command line	core

There are also internal operations which are not publicly visible (release_lock, update_ui_log).

We'll review each of these operations in turn.

Creation

create

The create operation creates a workspace. It requires a task and an initial user.

This operation is available only on the command line.

File management

import

The import operation ingests documents into the workspace. The documents are all converted to MAT JSON format, and are prepared for annotation. You can optionally assign documents to users.

This operation is only available on the command line.

Historically, the import operation could target multiple folders, but in MAT 2.0, only the core folder is eligible for import.

Configuring the import operation in task.xml

In task.xml, you can specify the default process by which documents are prepared for annotation when they're imported. Here's an example:

  <workspace>
    ...
    <operation name="import">
      <settings workflow="Demo" steps="zone,tokenize"/>
    </operation>
    ...
  </workspace>

As described here, these settings can be overridden using the --workflows and --steps options described in MATWorkspaceEngine.

remove

The remove operation removes all copies of the specified basenames from the workspace. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

This operation is only available on the command line.

assign

This operation assigns the specified basenames to the specified users. Each user gets his or her own copy of the document to annotate. If the document's annotations have already been altered by a human, the basename cannot be assigned.

This operation is only available on the command line.

open_file

This operation opens a workspace file and returns its contents. It also locks the workspace file in the workspace database. This lock is typically released when a file is closed in the UI, using the private release_lock operation. If this document is "stranded" - if, for instance, a user forgets to close the document - you can use the force_unlock operation to fix this.

This operation is available in the MAT UI, or on the command line if --debug is provided.

markgold

This operation marks all of the "non-gold" segments in a document "human gold".

This operation is available in the MAT UI, or indirectly on the command line via the import operation, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

unmarkgold

This operation marks all of the "human gold" or "reconciled" segments in a document "non-gold".

This operation is available in the MAT UI, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes.
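
The status transitions performed by markgold and unmarkgold can be sketched in Python as follows. This is an illustration only: segments are modeled as plain dictionaries with a "status" key, which is an assumption made for the example rather than MAT's internal representation, and the function names are hypothetical.

```python
def mark_gold(segments):
    """Mark every "non-gold" segment "human gold"; leave others alone."""
    for seg in segments:
        if seg["status"] == "non-gold":
            seg["status"] = "human gold"
    return segments

def unmark_gold(segments):
    """Mark every "human gold" or "reconciled" segment "non-gold"."""
    for seg in segments:
        if seg["status"] in ("human gold", "reconciled"):
            seg["status"] = "non-gold"
    return segments
```

Note that the two are not exact inverses in this sketch: a "reconciled" segment is left untouched by mark_gold, but is reset by unmark_gold.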

save

This operation saves the contents of a workspace file.

This operation is available in the MAT UI, or on the command line if --debug is provided.

Logging

MAT provides a rich and extensive logging infrastructure specifically for workspaces. When logging is enabled, MAT workspace operations log every action and data modification, so that the activities in the workspace can be rerun from the point that logging was enabled, exactly as they were originally performed.

Workspace logging is distinct from UI logging. The MAT UI has the capability of capturing all the user gestures, and saving these gestures to a CSV file at the user's request. If workspace logging is enabled, the UI turns on this capability specifically for the current workspace, and uploads the log fragments to the MAT server with every save operation in the "core" folder. The format of this log is identical to the format of the UI logger. Unlike general UI logging, this logging cannot be configured or controlled from the UI. Finally, this logging does not interfere with general UI logging; if you choose to enable UI logging, you'll still get all the user gestures, including those that are captured for workspace logging.

enable_logging

This operation enables the logging. The log will be saved in the _checkpoint subdirectory of the workspace directory.

This operation is available on the command line.

disable_logging

This operation disables logging. If a log is being collected, by default it is moved to the first available _checkpoint_<n> path. However, the user can force the log to be discarded instead if she chooses. In either case, this ensures that _checkpoint never contains a discontinuous log.
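
One way to picture "the first available _checkpoint_<n> path" is the Python sketch below. It is not MAT's code: the function name is hypothetical, and it assumes that n starts at 1 and counts upward, which this document does not state.

```python
import os

def first_available_checkpoint(ws_dir, start=1):
    """Return the first _checkpoint_<n> path under ws_dir that does not exist.

    Assumption: n starts at `start` (1 here) and increases by one.
    """
    n = start
    while True:
        candidate = os.path.join(ws_dir, "_checkpoint_%d" % n)
        if not os.path.exists(candidate):
            return candidate
        n += 1
```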

This operation is available on the command line.

rerun_log

This operation allows you to rerun the log. It will use the _checkpoint/_rerun subdirectory of the workspace directory to store the rerun state. You can use this capability to recreate any intermediate state of your workspace, e.g., for experiment analysis.

This operation is available on the command line.

Inspection

list

This operation shows you the contents of the folders in the workspace. The listing shows you the status of the document, as well as who it's assigned to.

It is available both on the command line, and in the MAT UI as part of the workspace interface.

workspace_configuration

This operation describes a number of properties of the workspace. Most of these properties are capabilities of MAT which are currently in development, but not yet publicly released. We've included the infrastructure for supporting these emerging capabilities in order to ensure that users of MAT will not have to update their workspaces when these capabilities are released. The properties reported are:

Task: the name of the task that the workspace uses.
Users: the workspace users that are registered.
Reconciliation phases: in a future release, MAT will support reconciliation within workspaces, in some fairly flexible, powerful configurations. In MAT 2.0, this capability is disabled.
Logging: in a future release, MAT will support an elaborate workspace logging capability, which includes the ability to capture all workspace actions and rerun workspace activity from the point that logging was enabled. In MAT 2.0, this capability is disabled.
Prioritization: in a future release, MAT may support prioritization queues, to enable techniques such as active learning. In MAT 2.0, this capability is disabled.

dump_database

This operation describes all the tables in the workspace database. It is a useful debugging tool for the technically inclined.

This operation is only available on the command line.

Users

Workspace users have roles which say what they can do in the workspace, but unless workspace reconciliation is enabled, users have only one available role, "core_annotation", which means the user is eligible to perform annotation. If reconciliation is enabled, each reconciliation phase is also recognized as a role. The role "all" is a shorthand for all available roles.

You can explicitly specify user roles when you register the users, or afterward. You may want to vary the available roles for annotators because, e.g., you may want only some of them to participate in particular reconciliation phases; say, you might want only some annotators to be able to perform the decisive human_decision reconciliation step.
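
The role rules above can be sketched in Python roughly as follows. The function names and the list-based representation are hypothetical; only the role names "core_annotation", "all" and "human_decision" come from this document.

```python
def available_roles(reconciliation_phases=()):
    """All roles a workspace offers: core annotation plus one per phase."""
    return ["core_annotation"] + list(reconciliation_phases)

def expand_roles(requested, reconciliation_phases=()):
    """Expand the "all" shorthand and reject unknown role names."""
    available = available_roles(reconciliation_phases)
    expanded = []
    for role in requested:
        names = available if role == "all" else [role]
        for name in names:
            if name not in available:
                raise ValueError("unknown role: %s" % name)
            if name not in expanded:
                expanded.append(name)
    return expanded

def default_roles(reconciliation_phases=()):
    """Default described earlier: everything except "human_decision"."""
    return [r for r in available_roles(reconciliation_phases)
            if r != "human_decision"]
```

With no reconciliation phases configured, "all" expands to just "core_annotation", which matches the statement that users have only one available role in that case.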

register_users

This operation allows you to add registered users to your workspace. Perhaps you want to be able to track the contributions of multiple annotators, or you might want to actually assign documents to multiple annotators and do multiple annotation. You may also want to assign roles to your users. You cannot unregister users once they're registered, although you can remove all their roles.

This operation is only available on the command line.

list_users

This operation lists the users in a workspace. It is also available as part of the workspace_configuration operation.

This operation is only available on the command line.

add_roles

The add_roles operation adds roles to existing users.

This operation is only available on the command line.

remove_roles

The remove_roles operation removes roles from existing users.

This operation is only available on the command line.

Automated tagging

modelbuild

This operation builds a model which can be used to autotag other documents. Every document segment in the workspace which has been touched by a human annotator is used to build this model. If there are multiple copies of a document because the document is multiply assigned, all copies will be used (so that document will be overrepresented in the model, and all conflicting annotations will be used as well). You can optionally ask the workspace to autotag documents after the model is built.

Note: the workspace model is completely distinct from the default task model.

This operation is only available on the command line.

Configuring the modelbuild operation in task.xml

If you want to customize your modelbuild operation, e.g., restrict it to just the gold segments, you can do so in task.xml. You can use any setting that's available to the training engine.

  <workspace>
    ...
    <operation name="modelbuild">
      <settings partial_training_on_gold_only="yes"/>
    </operation>
    ...
  </workspace>

autotag

This operation automatically tags documents using the current workspace model. You can specify individual basenames to tag, or tag all documents. The tagging engine will only tag those document segments which have not yet been touched by a human annotator. Existing (machine-generated) annotations in those segments will be discarded and new ones added.

Note: this operation does not use the Carafe tagging server, even in the UI. So the startup cost of the tagging engine is incurred each time the autotag operation is executed. This operation also does not use the default task model, ever; it only uses models constructed using the modelbuild operation.

This operation is available in the MAT UI (for individual documents) and on the command line. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

Experimentation

We can establish basename sets which we can reference when we run experiments.

list_basename_sets

This operation lists the basename sets and their contents. This operation is only available on the command line.

add_to_basename_set

This operation adds basenames to a given basename set (and implicitly creates the set if necessary). This operation is only available on the command line.

remove_from_basename_set

This operation removes basenames from a given basename set (and implicitly removes the set if necessary). This operation is only available on the command line.

run_experiment

This operation allows you to run an experiment based on this workspace, either using an experiment file or by specifying the properties of the test set in terms of properties of the workspace basenames.

Reconciliation

Each MAT workspace has the ability to support reconciliation, which is the process by which the consistency of annotations is checked, and conflicts are possibly resolved. This process is not yet available due to UI limitations, but will be in our next release. As part of this process, you'll be able to perform cross-validation of the input documents, to help identify inconsistencies in human annotation.

You can submit any document to reconciliation at any point in the annotation process (as long as it isn't being annotated by someone). You must configure reconciliation before you submit any documents.

configure_reconciliation

Use this operation to establish the active reconciliation phases for your workspace.

This operation is only available on the command line.

submit_to_reconciliation

Use this operation to submit documents for reconciliation. The phases that are assigned to the documents will be the phases provided in the most recent configure_reconciliation operation.

This operation is only available on the command line.

remove_from_reconciliation

If, for some reason, a document fails to exit reconciliation naturally (if some of the users fail to complete their reconciliation steps, for example), you can use this operation to remove the document forcibly from reconciliation. You have the option of discarding the reconciliation decisions that were made.

This operation is only available on the command line.

Administration

force_unlock

This operation forces a basename in the named folder to be unlocked. Warning: be very certain that you apply the force_unlock operation only to basenames whose locks have been stranded. If you unlock a basename which is being annotated, the annotator will not be able to save her changes.

This operation is only available on the command line.

Workspace security

Unlike file mode, workspace mode is stateful from the point of view of the UI. It is the server, rather than the client, which loads and saves the files. However, we don't want just anybody to be able to cause the server to perform these stateful operations, so the MAT web server implements some security mechanisms.

Note, however, that the MAT workspace functionality is not an enterprise-secure implementation, and will never be one. It does not use SSL; it does not perform any sort of user authentication beyond the workspace key; it does not provide any security logging or traceability; and it does not currently implement transactions. You should assume that anyone who has access to your network can see your workspace traffic, and overwrite your data.

Note that workspace users play no role in workspace security.

Workspace locking

Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple - it relies on the presence or absence of the "opLockfile" file. If something goes horribly wrong, it's possible that the workspace may get into a stranded state, where it fails to remove "opLockfile" at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed by which user, and at what time the lock was established.
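
A lock of this kind can be sketched in Python as below. This is a minimal illustration, not MAT's implementation: the function names are hypothetical, and the one-line text format written into the file is an assumption (this document only says that the file records the operation, the user and the time).

```python
import os
import time

LOCK_NAME = "opLockfile"

def acquire_workspace_lock(ws_dir, operation, user):
    """Create opLockfile atomically; return False if it already exists."""
    path = os.path.join(ws_dir, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fp:
        # Hypothetical format: operation, user, timestamp.
        fp.write("%s %s %s\n" % (operation, user, time.ctime()))
    return True

def release_workspace_lock(ws_dir):
    """Remove opLockfile; a crash before this call strands the lock."""
    os.remove(os.path.join(ws_dir, LOCK_NAME))
```

The sketch also shows why a stranded lock is possible: if the process dies between the two calls, nothing removes the file, and every later attempt to acquire the lock fails until someone deletes it by hand.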

Troubleshooting

Failed import

You may realize, once you've completed an import operation, that you didn't import the basenames the way you'd wanted; perhaps you'd intended to strip a suffix, or you assigned them to the wrong workspace user. You can use the remove operation to remove the basenames from the workspace in preparation for re-importing. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove basename1...

If you're not sure what basenames are available, the --help option will list them:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove --help

Locked files

The workspaces do not permit documents to be edited by more than one annotator at a time. The workspaces achieve this exclusivity through the use of file locks, which are recorded in the workspace database. When an annotator opens a document for annotation, the annotation UI is given a lock ID which it can use to release the document when the editing session is over. In some circumstances, unfortunately, the document is not unlocked; for instance, if the UI encounters an unexpected error and crashes before unlocking the document. You can use the force_unlock operation to clear this lock from the database.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> force_unlock --user user1 core basename1

If you just want to unlock everything, don't specify any basenames. If you want to know what's locked, use the dump_database operation:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> dump_database

This will show you the content of the workspace database tables.

Warning: be very certain that you apply the force_unlock operation only to basenames whose locks have been stranded. If you unlock a basename which is being annotated, the annotator will not be able to save her changes.
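
The lock ID handshake described in this section can be sketched with an in-memory SQLite table. Everything here is illustrative: the table name doc_locks, its columns and the function names are hypothetical, and are not the schema used by the workspace database.

```python
import sqlite3
import uuid

def make_lock_table():
    """Create an in-memory stand-in for the lock records (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc_locks ("
                 "basename TEXT PRIMARY KEY, locked_by TEXT, lock_id TEXT)")
    return conn

def open_document(conn, basename, user):
    """Lock a basename and return a lock ID, or None if already locked."""
    lock_id = uuid.uuid4().hex
    try:
        with conn:
            conn.execute("INSERT INTO doc_locks VALUES (?, ?, ?)",
                         (basename, user, lock_id))
    except sqlite3.IntegrityError:
        # The primary key already exists: someone else holds the lock.
        return None
    return lock_id

def release_lock(conn, lock_id):
    """Normal path: the UI hands back its lock ID when the document closes."""
    with conn:
        conn.execute("DELETE FROM doc_locks WHERE lock_id = ?", (lock_id,))

def force_unlock(conn, basename):
    """Recovery path: clear a stranded lock without knowing its ID."""
    with conn:
        conn.execute("DELETE FROM doc_locks WHERE basename = ?", (basename,))
```

In a design like this one, a forced unlock leaves the original holder with a lock ID that no longer matches any record, which is one plausible reason an annotator whose lock was cleared would be unable to save.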

Error "workspace is currently unavailable (processing another request)"

If you get this error message, and you're absolutely certain that no one else is working on the workspace, something horrible has happened: a previous operation has failed in such a way that it did not remove the "opLockfile" file. The "Workspace locking" section describes how to deal with this.

Advanced topic: workspace reconciliation

Each MAT workspace has the ability to support reconciliation, which is the process by which the consistency of annotations is checked, and conflicts are possibly resolved. This facility is not yet available due to UI limitations, but it will be in our next release. This section describes the behavior that we plan to make available.

You can submit any document to reconciliation at any point in the annotation process (as long as it isn't being annotated by someone). You'll use the submit_to_reconciliation operation to submit the documents. This operation will lock the basenames in the core folder (so no one can open those documents for annotation) and prepare a document, called a reconciliation document, which contains all the annotations in all the documents that correspond to that basename, sorted into "votes" which indicate, for each segment of the document in conflict, which annotator produced which pattern of annotations. The workspace annotators will then follow the reconciliation steps which are configured at the time that the documents are submitted for reconciliation.

Reconciliation steps

There are three possible reconciliation steps currently supported. You can set up your workspace to use any or all of these reconciliation steps, and you can change what steps are enabled for future submissions at any given time. The available steps, in order, are:

crossvalidation_challenge: in this step, the available documents are divided into several subsets, and a set of candidate annotations for each document is prepared by building an annotation model using n-1 subsets and applying it to the nth subset. For each segment that a given annotator has annotated, these candidate annotations are compared to the annotations that the annotator provided. If they match, that segment is marked as reconciled. If they don't match, the annotator is presented with the option of preferring the automatically generated annotations. If the annotator accepts that option, the segment is marked as reconciled; if not, the segment remains unreconciled.
human_vote: in this step, each annotator is given the opportunity to vote on the available annotation patterns for each unreconciled segment. The annotator also has the option of adding a new pattern. If one of the patterns garners the votes of more than half of the assigned annotators, the segment is marked as reconciled; if not, the segment remains unreconciled.
human_decision: in this step, a designated reviewer makes a choice among the available annotation patterns, or introduces her own. This choice is final, and the segment is marked as reconciled.
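
The majority rule in the human_vote step can be sketched in Python as follows. The function name and the representation of votes as a mapping from annotator to pattern label are assumptions made for the example.

```python
from collections import Counter

def winning_pattern(votes, assigned_annotators):
    """Return the pattern chosen by more than half of the assigned
    annotators, or None if the segment stays unreconciled.

    votes maps annotator name -> pattern label; annotators who have
    not voted are simply absent from the mapping.
    """
    if not votes:
        return None
    pattern, count = Counter(votes.values()).most_common(1)[0]
    # "More than half" is a strict majority of the assigned annotators,
    # not merely of the votes cast.
    if 2 * count > len(assigned_annotators):
        return pattern
    return None
```

With two assigned annotators who disagree, no pattern can reach a strict majority, so the segment would remain unreconciled and fall through to human_decision if that step is enabled.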

The system will advance documents through these steps automatically if possible (so, for instance, if an annotator makes a choice during the crossvalidation_challenge step, and no annotator adds any new annotation patterns, the system assumes that the annotator's vote will not change). Once all segments have been marked as reconciled, or the document has passed through all assigned annotators and steps, it exits reconciliation, and the agreed-upon changes are folded back into the documents in the core annotation folder which were submitted to reconciliation. So if the same document is assigned to two annotators, and it passes through reconciliation and the conflicts are resolved, those assigned documents will be altered to reflect the reconciliations.
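
The partitioning behind the crossvalidation_challenge step can be pictured with the short Python sketch below. It only shows how n subsets yield n train/held-out splits; the round-robin assignment and the function name are assumptions, and model building and tagging are left out entirely.

```python
def crossvalidation_splits(basenames, n):
    """Yield (training, held_out) lists for each of n subsets.

    Documents are dealt round-robin into n subsets (an assumption);
    each split trains on n-1 subsets and holds out the remaining one.
    """
    subsets = [basenames[i::n] for i in range(n)]
    for i, held_out in enumerate(subsets):
        training = [b for j, s in enumerate(subsets) if j != i for b in s]
        yield training, held_out
```

Every document lands in exactly one held-out subset, so each document receives candidate annotations from a model that never saw it during training.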

SEGMENTs in reconciliation

The use of SEGMENTs in reconciliation differs slightly from their use in core annotation, especially with respect to the value of the "status" attribute. The three significant "status" attribute values in reconciliation are:

"ignore during reconciliation" (some input document has this segment marked "non-gold", so it's not reconcilable yet)
"human gold" (all input documents have this segment marked "human gold", but not all the segments match in their annotations)
"reconciled" (all input documents have this segment marked "human gold" or "reconciled", and all the segments match in their annotations)

The segment also carries administrative information that records the state of the reconciliation.
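
The three status values can be derived as in the Python sketch below. It is illustrative only: each input segment is modeled as a (status, annotations) pair, the function name is hypothetical, and treating every mismatch as "human gold" is an assumption where the list above is silent (for example, when some inputs are already "reconciled" but the annotations differ).

```python
def reconciliation_status(input_segments):
    """Derive the reconciliation status for one segment.

    input_segments is a non-empty list of (status, annotations) pairs,
    one per input document; annotations can be any comparable value.
    """
    statuses = [status for status, _ in input_segments]
    if "non-gold" in statuses:
        # Some input document has not finished this segment yet.
        return "ignore during reconciliation"
    first = input_segments[0][1]
    if all(annots == first for _, annots in input_segments):
        return "reconciled"
    return "human gold"
```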

Stranded reconciliation documents

If you submit a document to reconciliation, it may remain in reconciliation because, e.g., an annotator who was registered with one of the relevant roles is no longer working on the project. Or you may have submitted it to reconciliation in error. You can use the remove_from_reconciliation operation to remove the document.

Keep in mind that the document may already be partially reconciled. If you want to remove the document and preserve the decisions already made, you can use the operation as follows:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove_from_reconciliation reconciliation basename1

This will migrate the agreed-upon document segments back into the documents which were used to create the reconciliation document. If you do not want to preserve those decisions, and simply want to stop the document from being reconciled, do this instead:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove_from_reconciliation --dont_reintegrate reconciliation basename1

Advanced topic: the workspace database

The workspace database is an SQLite database which tracks the status of documents, users, and the workspace itself. The schema can be found in MAT_PKG_HOME/lib/mat/python/MAT/ws_db.sql. The tables are:

document_info: contains the basenames and document names in the core folder, the user they're assigned to, the transaction ID, and the document status
users: lists the users in the workspace
workspace_state: specifies the workspace-level metadata, including the task and the number of retained models
basename_sets: specifies the basename sets and basenames in them

There are other tables and columns which relate to workspace features we have yet to enable. We will document those features of the database as the corresponding workspace features are enabled.
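
Because the workspace database is an ordinary SQLite file, it can be inspected with Python's standard sqlite3 module. The sketch below lists table names and row counts using SQLite's built-in sqlite_master catalog, so it makes no assumptions about column names; the function name is hypothetical. It is probably safest to run it on a copy of ws_db.db, or while no one is using the workspace.

```python
import sqlite3

def summarize_tables(db_path):
    """Return {table_name: row_count} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        summary = {}
        for name in names:
            # Table names come from the catalog, not from user input;
            # quote them anyway so unusual names still work.
            quoted = '"' + name.replace('"', '""') + '"'
            summary[name] = conn.execute(
                "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
        return summary
    finally:
        conn.close()
```

For a quick look at a workspace, this gives a lighter-weight view than dump_database: only the table names and how many rows each one holds.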