Workspaces provide a guided, structured way of managing and
processing your documents. Make
sure that this is what you want. Workspace mode is provided
by MATWorkspaceEngine on
the command line, and via "File -> Open workspace..." in the
Web UI. You can find a summary of the highlights about using
workspaces here; this document
provides the details.
Workspaces are just directories. The structure of these
directories looks like this:
One of the innovations in workspaces in MAT 2.0 is their close
connection with segments
and annotation progress. All documents in workspaces are now
closely tracked for their annotation state (including, ultimately,
annotation of subsections of documents), which includes tracking
who modified the annotations in the various document portions. As
a result, every document edit in workspaces is linked to a
workspace user.
The inventory of users of a workspace is entirely up to its
creators and managers. Every workspace must be created with at
least one initial user. The names of these users are not bound to
any external resource; they're not required to be the same as
login names, for instance. They're merely there to provide a way
of attributing document changes. There's no account management or
passwords; when you edit a workspace, you can "claim" to be any
registered user. We're assuming that you're
using MAT workspaces in a cooperative environment in which this
sort of inappropriate behavior won't arise.
Although there's no requirement that registered user names
correspond to external resources like login names, you may find it
easiest to use login names anyway, so that your workspace
annotators don't have to remember a different name when they open
a workspace.
Workspace users are assigned roles, which indicate what
they can do within the workspace. By default, all users can
annotate documents in the core folder. If workspace
reconciliation is configured, workspace users are also
assigned reconciliation roles. By default, each user has all
reconciliation roles except "human_decision" (the ability to make
an enforceable judgment about a reconciliation choice).
The available operations are:
| topic | operation | availability | folder |
|---|---|---|---|
| creation | create | command line | (global) |
| file management | import | command line | (global) |
| | remove | command line | (global) |
| | assign | command line | (global) |
| | open_file | UI, command line debug | (global) |
| | markgold | UI, command line debug | core |
| | unmarkgold | UI, command line debug | core |
| | save | UI, command line debug | core, reconciliation |
| inspection | list | UI, command line | (global) |
| | workspace_configuration | command line | (global) |
| | dump_database | command line | (global) |
| logging | enable_logging | command line | (global) |
| | disable_logging | command line | (global) |
| | rerun_log | command line | (global) |
| users | register_users | command line | (global) |
| | list_users | command line | (global) |
| | add_roles | command line | (global) |
| | remove_roles | command line | (global) |
| automated tagging | modelbuild | command line | core |
| | autotag | UI, command line | core |
| experimentation | list_basename_sets | command line | (global) |
| | add_to_basename_set | command line | (global) |
| | remove_from_basename_set | command line | (global) |
| | run_experiment | command line | (global) |
| reconciliation (not yet enabled) | configure_reconciliation | command line | (global) |
| | submit_to_reconciliation | command line | core |
| | remove_from_reconciliation | command line | reconciliation |
| administration | force_unlock | command line | core |
There are also internal operations which are not publicly visible (release_lock, update_ui_log).
We'll review each of these operations in turn.
The create operation creates a
workspace. It requires a task
and an initial user.
This operation is available only on
the command line.
The import operation ingests
documents into the workspace. The documents are all converted to
MAT JSON format, and are prepared for annotation. You can
optionally assign documents to users.
This operation is only available on the command line.
Historically, the import operation could target multiple folders, but in MAT 2.0, only the core folder is eligible for import.
In task.xml, you can specify the
default process by which documents are prepared for annotation
when they're imported. Here's an example:
<workspace>
...
<operation name="import">
<settings workflow="Demo" steps="zone,tokenize"/>
</operation>
...
</workspace>
As described here, these settings can
be overridden using the --workflows and --steps options described
in MATWorkspaceEngine.
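If you want to check what a task.xml fragment like the one above actually declares, a short script can read it back. The following is a minimal sketch using only the Python standard library; it is not how MAT itself reads task.xml, and the helper name `operation_settings` is hypothetical. The elided "..." lines are dropped so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, minus the elided "..." lines.
TASK_FRAGMENT = """
<workspace>
  <operation name="import">
    <settings workflow="Demo" steps="zone,tokenize"/>
  </operation>
</workspace>
"""

def operation_settings(xml_text, operation_name):
    """Return the <settings> attributes of one workspace operation as a dict.

    Hypothetical helper for inspection only; returns {} if the operation
    or its <settings> element is absent.
    """
    root = ET.fromstring(xml_text)
    for op in root.iter("operation"):
        if op.get("name") == operation_name:
            settings = op.find("settings")
            return dict(settings.attrib) if settings is not None else {}
    return {}

import_settings = operation_settings(TASK_FRAGMENT, "import")
# The steps attribute is a comma-separated list.
steps = [s for s in import_settings.get("steps", "").split(",") if s]
print(import_settings.get("workflow"), steps)
```

The same helper can be pointed at any other operation name that appears in the workspace block, such as modelbuild.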
The remove operation removes all
copies of the basename from the workspace. Warning: this operation will
remove all traces of the basenames from the workspace folders and
the database. Do not use it unless you really want them removed.
This operation is only available on the command line.
The assign operation assigns the specified
basenames to the specified users. Each user gets his or her own
copy of the document to annotate. If the document's annotations
have already been altered by a human, the basename cannot be
assigned.
This operation is only available on
the command line.
The open_file operation opens a workspace file
and returns its contents. It also locks the workspace file in the
workspace database. This lock is typically released when a file is
closed in the UI, using the private release_lock operation. If
this document is "stranded" - if, for instance, a user forgets to
close the document - you can use the force_unlock
operation to fix this.
This operation is available in the MAT UI, or on the command line if
--debug is provided.
The markgold operation marks all of the
"non-gold" segments in a document "human
gold".
This operation is available in the MAT
UI, or indirectly on the command line via the import operation, or
on the command line
if --debug is provided. When used in the UI, it will trigger a
save operation first if the document has unsaved changes.
The unmarkgold operation marks all of the "human
gold" or "reconciled" segments in a document "non-gold".
The save operation saves the contents of a
workspace file.
This operation is available in the MAT UI, or on the command line if --debug
is provided.
MAT provides a rich and extensive logging infrastructure
specifically for workspaces. When logging is enabled, MAT
workspace operations log every action and data modification, so
that the activities in the workspace can be rerun from the point
that logging was enabled, exactly as they were originally
performed.
Workspace logging is distinct from UI logging. The MAT UI can capture all user gestures and save them to a CSV file at the user's request. If workspace logging is enabled, the UI turns on this capability specifically for the current workspace, and uploads the log fragments to the MAT server with every save operation in the "core" folder. The format of this log is identical to the format of the UI logger. Unlike general UI logging, this logging cannot be configured or controlled from the UI. Finally, this logging does not interfere with general UI logging; if you choose to enable UI logging, you'll still get all the user gestures, including those that are captured for workspace logging.
The enable_logging operation enables workspace logging. The log will be saved in the _checkpoint subdirectory of the workspace directory.
This operation is available on the command line.
The disable_logging operation disables logging. If a log is being collected, by default it is moved to the first available _checkpoint_<n> path. However, the user can force the log to be disabled if she chooses. In either case, this ensures that _checkpoint never contains a discontinuous log.
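The "first available _checkpoint_<n> path" rule can be pictured as a simple search for the lowest unused number. The sketch below illustrates that idea only; the function name is hypothetical, and the assumption that numbering starts at 1 is ours, not something the text states.

```python
import os
import tempfile

def first_available_checkpoint(workspace_dir, start=1):
    """Return the first _checkpoint_<n> path that does not exist yet.

    Hypothetical helper; starting the count at 1 is an assumption.
    """
    n = start
    while True:
        candidate = os.path.join(workspace_dir, "_checkpoint_%d" % n)
        if not os.path.exists(candidate):
            return candidate
        n += 1

# Demonstration in a throwaway directory standing in for a workspace.
with tempfile.TemporaryDirectory() as workspace:
    os.mkdir(os.path.join(workspace, "_checkpoint"))    # the active log
    os.mkdir(os.path.join(workspace, "_checkpoint_1"))  # an earlier archived log
    target = first_available_checkpoint(workspace)
    # Archive the active log so that _checkpoint can start fresh later.
    os.rename(os.path.join(workspace, "_checkpoint"), target)
    print(os.path.basename(target))  # _checkpoint_2
```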
This operation is available on the command line.
The rerun_log operation allows you to rerun the log. It will use the _checkpoint/_rerun subdirectory of the workspace directory to store the rerun state. You can use this capability to recreate any intermediate state of your workspace, e.g., for experiment analysis.
This operation is available on the command line.
The list operation shows you the contents
of the folders in the workspace. The listing shows you the status
of the document, as well as who it's assigned
to.
It is available both on the command line, and in the MAT UI as part of the workspace interface.
The workspace_configuration operation describes a number of
properties of the workspace. Most of these properties are
capabilities of MAT which are currently in development, but not
yet publicly released. We've included the infrastructure for
supporting these emerging capabilities in order to ensure that
users of MAT will not have to update their workspaces when these
capabilities are released. The properties reported are:
The dump_database operation describes all the
tables in the workspace
database. It is a useful debugging tool for the technically
inclined.
This operation is only available on
the command line.
Workspace users have roles which say what they can do in the
workspace, but unless workspace
reconciliation is enabled, users have only one available
role, "core_annotation", which means the user is eligible to
perform annotation. If reconciliation is enabled, each reconciliation phase is
also recognized as a role. The role "all" is a shorthand for all
available roles.
You can explicitly specify user roles when you register the
users, or afterward. You may want to vary the available roles for
annotators because, e.g., you may want only some of them to
participate in particular reconciliation phases; say, you might
want only some annotators to be able to perform the decisive
human_decision reconciliation step.
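The role rules described above (one "core_annotation" role, one extra role per reconciliation phase, and "all" as a shorthand) can be sketched as follows. This is an illustration of the rules as stated, not MAT's implementation; the function name `expand_roles` is hypothetical, and the phase names used in the demonstration are simply the ones mentioned elsewhere in this document.

```python
def expand_roles(requested, reconciliation_phases=()):
    """Resolve a list of requested role names against the available roles.

    Hypothetical helper: "core_annotation" is always available, each
    configured reconciliation phase is also a role, and "all" stands
    for every available role.
    """
    available = ["core_annotation"] + list(reconciliation_phases)
    if "all" in requested:
        return available
    unknown = [r for r in requested if r not in available]
    if unknown:
        raise ValueError("unknown roles: %s" % ", ".join(unknown))
    # Keep the result in a stable order.
    return [r for r in available if r in requested]

# Without reconciliation, "all" is just core annotation.
print(expand_roles(["all"]))
# With reconciliation phases configured, each phase is also a role.
phases = ["crossvalidation_challenge", "human_decision"]
print(expand_roles(["all"], phases))
print(expand_roles(["human_decision"], phases))
```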
The register_users operation allows you to add
registered users to your
workspace. Perhaps you want to be able to track the contributions
of multiple annotators, or you might want to actually assign documents to multiple annotators and
do multiple annotation. You may also want to assign roles to your
users. You cannot unregister users once they're registered,
although you can remove all their roles.
This operation is only available on
the command line.
The list_users operation lists the users in a workspace. It is also
available as part of the workspace_configuration
operation.
This operation is only available on
the command line.
The add_roles operation adds roles to existing users.
This operation is only available on the command line.
The remove_roles operation removes roles from existing users.
This operation is only available on the command line.
The modelbuild operation builds a model which
can be used to autotag other documents.
Every document
segment in the workspace which has been touched by a human
annotator is used to build this model. If there are multiple
copies of a document because the document is multiply assigned,
all copies will be used (so that document will be overrepresented
in the model, and all conflicting annotations will be used as
well). You can optionally ask the workspace to autotag documents
after the model is built.
Note:
the workspace model is completely
distinct from the default task model.
This operation is only available on
the command line.
If you want to customize your
modelbuild operation, e.g., restrict it to just the gold segments,
you can do so in task.xml. You can use any setting that's
available to the training
engine.
<workspace>
...
<operation name="modelbuild">
<settings partial_training_on_gold_only="yes"/>
</operation>
...
</workspace>
The autotag operation automatically tags
documents using the current workspace model. You can specify
individual basenames to tag, or tag all documents. The tagging engine
will only tag those document segments which have not yet been
touched by a human annotator. Existing (machine-generated)
annotations in those segments will be discarded and new ones
added.
Note: this operation does
not use the Carafe tagging server, even in the UI. So the startup
cost of the tagging engine is incurred each time the autotag
operation is executed. This operation also does not use the
default task model, ever; it only uses models constructed using
the modelbuild operation.
This operation is available in the MAT UI (for individual documents) and on
the command line.
When used in the UI, it will trigger a save operation first if the
document has unsaved changes.
We can establish basename sets which we can reference when we run
experiments.
The list_basename_sets operation lists the basename sets and their contents. This operation is only available on the command line.
The add_to_basename_set operation adds basenames to a given basename set (and implicitly creates the set if necessary). This operation is only available on the command line.
The remove_from_basename_set operation removes basenames from a given basename set (and implicitly removes the set if necessary). This operation is only available on the command line.
The run_experiment operation allows you to run an
experiment based on this workspace, either using an experiment file or by specifying
the properties of the test set in terms of properties of the
workspace basenames.
Each MAT workspace has the ability to support reconciliation,
which is the process by which the consistency of annotations is
checked, and conflicts are possibly resolved. This process is not
yet available due to UI limitations, but will be in our next
release. As part of this process, you'll be able to perform
cross-validation of the input documents, to help identify
inconsistencies in human annotation.
You can submit any document to reconciliation at any point in the
annotation process (as long as it isn't being annotated by
someone). You must configure reconciliation before you submit any
documents.
Use the configure_reconciliation operation to establish the active reconciliation phases for your workspace.
This operation is only available on the command line.
Use the submit_to_reconciliation operation to submit documents for reconciliation. The phases that are assigned to the documents will be the phases provided in the most recent configure_reconciliation operation.
This operation is only available on the command line.
If, for some reason, a document fails to exit reconciliation naturally (if some of the users fail to complete their reconciliation steps, for example), you can use the remove_from_reconciliation operation to remove the document forcibly from reconciliation. You have the option of discarding the reconciliation decisions that were made.
This operation is only available on the command line.
The force_unlock operation forces a basename in
the named folder to be unlocked. Warning:
be very certain that you apply the force_unlock operation only to basenames whose locks
have been stranded. If you unlock a basename which is being
annotated, the annotator will not be able to save her changes.
This operation is only available on
the command line.
Unlike file mode, workspace mode is stateful from the point of view of the UI. It is
the server, rather than the client, which loads and saves the
files. However, we don't want just anybody to be able to cause the
server to perform these stateful operations, so the MAT web server implements some security mechanisms.
Note, however, that the MAT workspace functionality is not an enterprise-secure implementation, and will never be one. It does not use SSL; it does not perform any sort of user authentication beyond the workspace key; it does not provide any security logging or traceability; and it does not currently implement transactions. You should assume that anyone who has access to your network can see your workspace traffic, and overwrite your data.
Note that workspace users play no
role in workspace security.
Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple: it relies on the presence or absence of the "opLockfile" file. If something goes badly wrong, the workspace may end up in a stranded state, in which it has failed to remove "opLockfile" at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed, by which user, and at what time the lock was established.
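To make the presence-or-absence idea concrete, here is a minimal sketch of a lock of this kind. It is not MAT's code: the function names are hypothetical, and the line written into the file is an invented layout, since the text says only that the file records the operation, the user, and the time.

```python
import os
import tempfile
import time

LOCK_NAME = "opLockfile"  # file name taken from the text

def acquire(workspace_dir, operation, user):
    """Try to take the workspace lock; return False if it is already held."""
    path = os.path.join(workspace_dir, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file already exists, so the
        # existence check and the creation happen as one step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fp:
        # Assumed layout: operation, user, timestamp on one line.
        fp.write("%s %s %s\n" % (operation, user, time.ctime()))
    return True

def release(workspace_dir):
    """Remove the lock file; this is also what removing it by hand does."""
    os.remove(os.path.join(workspace_dir, LOCK_NAME))

with tempfile.TemporaryDirectory() as workspace:
    print(acquire(workspace, "import", "user1"))   # True: lock taken
    print(acquire(workspace, "autotag", "user2"))  # False: workspace in use
    release(workspace)
    print(acquire(workspace, "autotag", "user2"))  # True: lock is free again
    release(workspace)
```

If the process dies between acquire and release, the file is left behind, which is the stranded state described above.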
You may realize, once you've completed an import operation, that
you didn't import the basenames the way you'd wanted; perhaps
you'd intended to strip a suffix, or you assigned them to the
wrong workspace user. You can use the remove operation to remove
the basenames from the workspace in preparation for re-importing.
Warning: this operation
will remove all traces of the basenames from the workspace folders
and the database. Do not use it unless you really want them
removed.
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove basename1...
If you're not sure what basenames are available, the --help
option will list them:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove --help
More on the remove operation here.
The workspaces do not permit documents to be edited by more than
one annotator at a time. The workspaces achieve this exclusivity
through the use of file locks, which are recorded in the workspace
database. When an annotator opens a document for annotation, the
annotation UI is given a lock ID which it can use to release the
document when the editing session is over. In some circumstances,
unfortunately, the document is not unlocked; for instance, if the
UI encounters an unexpected error and crashes before unlocking the
document. You can use the force_unlock operation to clear this
lock from the database.
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> force_unlock --user user1 core basename1
If you just want to unlock everything, don't specify any
basenames. If you want to know what's locked, use the
dump_database operation:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> dump_database
This will show you the content of the workspace database tables.
Warning: be very certain
that you apply the force_unlock operation only to basenames whose locks
have been stranded. If you unlock a basename which is being
annotated, the annotator will not be able to save her changes.
More on force_unlock here.
If you get this error message, and you're absolutely certain that
no one else is working on the workspace, then a previous operation
has failed in such a way that it did not remove the "opLockfile"
file. More on how to deal with
this here.
The system will advance documents through these steps
automatically if possible (so, for instance, if an annotator makes
a choice during the crossvalidation_challenge step, and no
annotator adds any new annotation patterns, the system assumes
that the annotator's vote will not change). Once all segments have
been marked as reconciled, or the document has passed through all
assigned annotators and steps, it exits reconciliation, and the
agreed-upon changes are folded back into the documents in the core
annotation folder which were submitted to reconciliation. So if
the same document is assigned to two annotators, and it passes
through reconciliation and the conflicts are resolved, those
assigned documents will be altered to reflect the reconciliations.
The use of the SEGMENTs
in reconciliation differs slightly from its use in core
annotation, especially with respect to the value of its "status"
attribute. The three significant "status" attribute values in
reconciliation are:
In addition, the segment carries administrative information that records the state of the reconciliation.
If you submit a document to reconciliation, it may remain in
reconciliation because, e.g., an annotator who was registered with
one of the relevant roles is no longer working on the project. Or
you may have submitted it to reconciliation in error. You
can use the remove_from_reconciliation operation to remove
the document.
Keep in mind that the document may already be partially
reconciled. If you want to remove the document and preserve
the decisions already made, you can use the operation as follows:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove_from_reconciliation reconciliation basename1
This will migrate the agreed-upon document segments back into the
documents which were used to create the reconciliation document.
If you do not want to preserve those decisions, and simply want to
stop the document from being reconciled, do this instead:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove_from_reconciliation --dont_reintegrate reconciliation basename1
More on this operation here.
The workspace database is an SQLite database which tracks the
status of documents, users, and the workspace itself. The schema
can be found in MAT_PKG_HOME/lib/mat/python/MAT/ws_db.sql. The
tables are:
There are other tables and columns which relate to workspace
features we have yet to enable. We will document those features of
the database as the corresponding workspace features are enabled.
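Because the workspace database is an ordinary SQLite database, you can also look at it with any SQLite client. The sketch below uses Python's standard sqlite3 module to list the tables in a database along with their row counts. It makes no assumptions about the MAT schema: the table created in the demonstration is a made-up stand-in, and the location of the database file inside a workspace is not given here, so you would need to supply your own connection.

```python
import sqlite3

def table_summary(conn):
    """Return {table name: row count} for every table in an SQLite database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    summary = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote them instead.
        quoted = '"%s"' % name.replace('"', '""')
        summary[name] = conn.execute(
            "SELECT COUNT(*) FROM %s" % quoted).fetchone()[0]
    return summary

# Demonstration against an in-memory database with a hypothetical table;
# this is NOT the MAT schema, which lives in ws_db.sql.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_lock (doc_name TEXT, locked_by TEXT)")
conn.execute("INSERT INTO demo_lock VALUES ('basename1', 'user1')")
print(table_summary(conn))  # {'demo_lock': 1}
conn.close()
```

Open the database read-only, or work on a copy, if the workspace is in active use.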