Retokenizer

Description

Version 1.2 of MAT introduced a new tokenizer, which did not produce exactly the same tokenization as the original OCaml tokenizer from MAT 1.0. We created this tool to support retokenization of documents to conform to this new tokenizer, but this utility can be used to retokenize any existing document using the current MAT tokenizer.

This tool operates either on files, in which case they're all processed according to the same task, or on workspaces, in which case the task is inferred from the workspace. When operating on files, you must specify an --output_dir argument for the location of the converted files. When operating on workspaces, the old files will be copied to <filename>.oldtok before the new files are written to the workspace.

Note: this tool always applies the default MAT tokenizer. If your task has a custom tokenizer, this tool will not work for you. A future release of MAT will provide a more general and flexible solution.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd

Usage: MATRetokenize [core_options] files [file_options]
       MATRetokenize [core_options] workspaces <workspace>...

Core options

MATRetokenize makes the common options available.

File options

--task <task>	Name of the task to use. Obligatory if the system knows of more than one task.
--input_dir <dir>	A directory, all of whose files will be retokenized. Can be repeated. The --input_files option may simultaneously be used to specify additional files.
--input_files <re>	A glob-style pattern describing the files to be retokenized. Can be repeated. The --input_dir option may be simultaneously used to specify additional directories of files.
--output_dir <dir>	A directory in which to place the retokenized documents. Required when operating on files.

Examples

Example 1

Let's say you have a set of files in /data/myfiles, all of which are MAT JSON files that need to be updated. Your task is named 'My Task', and you want to store the results in /data/myconvertedfiles:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myconvertedfiles

Example 2

Let's say you're feeling brave, and instead of saving the files to a new location, you decide to overwrite the old ones:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myfiles
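
Because this example overwrites its input, and only the workspaces mode is described as keeping <filename>.oldtok copies, you may want to copy the directory first. The following is a minimal sketch for Unix shells, not part of MAT: backup_dir is a hypothetical helper name, and the "<dir>.bak" destination is an assumption chosen for illustration.

```shell
# Hypothetical helper (not shipped with MAT): copy a directory to a sibling
# named "<dir>.bak" before retokenizing in place. It refuses to overwrite
# an existing backup so an earlier copy is never lost.
backup_dir() {
    src=$1
    dest="${src%/}.bak"   # strip one trailing slash, then append .bak
    if [ ! -d "$src" ]; then
        echo "backup_dir: $src is not a directory" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup_dir: $dest already exists" >&2
        return 1
    fi
    # -R copies recursively, -p preserves timestamps and permissions.
    cp -Rp "$src" "$dest"
}

# Possible usage with the paths from this example:
#   backup_dir /data/myfiles && \
#   "$MAT_PKG_HOME/bin/MATRetokenize" files --task 'My Task' \
#       --input_dir /data/myfiles --output_dir /data/myfiles
```

Chaining with && means the retokenizer only runs if the backup succeeded.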

Example 3

Let's say that only some of the files in the directory are MAT JSON files: all the ones that end in .json, and all the ones that end in .mat.

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--output_dir c:\data\myconvertedfiles
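
Before running a pattern-based command like the one above, it can help to preview which files the patterns select. The sketch below is for Unix shells and uses the shell's own globbing as a stand-in; this is an assumption, since MAT's glob-style matching may differ in edge cases. preview_matches is a hypothetical helper name, and the approach assumes paths without whitespace.

```shell
# Hypothetical helper (not shipped with MAT): print the regular files that
# the shell's globbing matches for each pattern given as an argument.
# Assumption: shell globbing approximates how --input_files is expanded.
preview_matches() {
    for pattern in "$@"; do
        # $pattern is deliberately unquoted so that the shell expands it.
        # This also word-splits, so paths containing spaces are not supported.
        for f in $pattern; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}

# Possible usage with the patterns from this example (quoted, as in the
# MATRetokenize call, so they reach the function unexpanded):
#   preview_matches '/data/myfiles/*.json' '/data/myfiles/*.mat'
```

A pattern that matches nothing prints nothing, because the unexpanded pattern is not an existing file.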

Example 4

Let's say that you want to process your files as in Example 3, but you also have a directory /data/myotherfiles, containing only MAT JSON files, that you want to process at the same time:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--input_dir /data/myotherfiles --output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--input_dir c:\data\myotherfiles --output_dir c:\data\myconvertedfiles

Example 5

Let's say you have workspaces in /data/myworkspace and /data/myworkspace2:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize workspaces /data/myworkspace /data/myworkspace2

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd workspaces c:\data\myworkspace c:\data\myworkspace2

The MAT JSON files in each workspace will be converted, and copies of the original will be placed in <filename>.oldtok. It doesn't matter if the two workspaces have different tasks; the task will be inferred from the workspace.
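
To confirm afterwards which documents were backed up, you can search a workspace for the .oldtok copies. This is a minimal sketch for Unix shells: list_oldtok is a hypothetical helper name, and it assumes only that the copies live somewhere beneath the workspace directory, without relying on any particular workspace layout.

```shell
# Hypothetical helper (not shipped with MAT): list every file whose name
# ends in .oldtok anywhere beneath the given workspace directory.
list_oldtok() {
    if [ ! -d "$1" ]; then
        echo "list_oldtok: $1 is not a directory" >&2
        return 1
    fi
    find "$1" -type f -name '*.oldtok' -print
}

# Possible usage with the workspaces from this example:
#   list_oldtok /data/myworkspace
#   list_oldtok /data/myworkspace2
```

The sketch only lists files. It does not restore them, since copying files back by hand may not be enough to return a workspace to its previous state.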