Retokenizer

Description

Version 1.2 of MAT introduced a new tokenizer, which does not produce exactly the same tokenization as the previous OCaml tokenizer. To ensure the tools perform optimally, you should retokenize your existing documents using this tool.

This tool operates either on files, in which case they're all processed according to the same task, or on workspaces, in which case the task is inferred from the workspace. When operating on files, you must specify an --output_dir argument for the location of the converted files. When operating on workspaces, the old files will be copied to <filename>.oldtok before the new files are written to the workspace.
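
For instance, with a purely illustrative filename, retokenizing /data/myfiles/doc1.json in files mode with --output_dir /data/myconvertedfiles leaves the original alone and writes the retokenized document, under the same name, to the output directory:

/data/myfiles/doc1.json              (original, untouched)
/data/myconvertedfiles/doc1.json     (retokenized copy)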

Usage

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd

Usage: MATRetokenize files [file_options]
       MATRetokenize workspaces <workspace>...

File options

--task <task>
Name of the task to use. Obligatory if the system knows of more than one task.
--input_dir <dir>
A directory, all of whose files will be retokenized. Can be repeated. The --input_files option may be used at the same time to specify additional files.
--input_files <pattern>
A glob-style pattern describing the files to be retokenized. Can be repeated. The --input_dir option may be used at the same time to specify additional directories of files.
--output_dir <dir>
A directory in which to place the retokenized documents. Obligatory when processing files.
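
Putting these options together, a files-mode invocation has the following general shape (the angle-bracketed items are placeholders; --input_dir and --input_files can each be repeated, and either can be omitted if the other supplies all the input). The Windows form is analogous; see the examples below for concrete values:

% $MAT_PKG_HOME/bin/MATRetokenize files --task <task> \
--input_dir <dir> --input_files <pattern> --output_dir <dir>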

Examples

Example 1

Let's say you have a set of files in /data/myfiles, all of which are MAT JSON files that need to be updated, your task is named 'My Task', and you want to store the results in /data/myconvertedfiles:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myconvertedfiles

Example 2

Let's say you're feeling brave, and instead of saving the files to a new location, you decide to overwrite the old ones:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myfiles
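
If you want a safety net before overwriting in place, you can copy the directory first with an ordinary shell command (this backup step is not part of MATRetokenize; the paths are simply the ones from this example):

Unix:

% cp -r /data/myfiles /data/myfiles.bak

Windows native:

> xcopy /E /I c:\data\myfiles c:\data\myfiles.bak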

Example 3

Let's say that only some of the files in the directory are MAT JSON files: all the ones that end in .json, and all the ones that end in .mat.

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--output_dir c:\data\myconvertedfiles

Example 4

Let's say that you want to process your files as in Example 3, but you also have a directory /data/myotherfiles that contains only MAT JSON files, which you want to process at the same time:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--input_dir /data/myotherfiles --output_dir /data/myconvertedfiles

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--input_dir c:\data\myotherfiles --output_dir c:\data\myconvertedfiles

Example 5

Let's say you have workspaces in /data/myworkspace and /data/myworkspace2:

Unix:

% $MAT_PKG_HOME/bin/MATRetokenize workspaces /data/myworkspace /data/myworkspace2

Windows native:

> %MAT_PKG_HOME%\bin\MATRetokenize.cmd workspaces c:\data\myworkspace c:\data\myworkspace2

The MAT JSON files in each workspace will be converted, and copies of the originals will be placed in <filename>.oldtok. It doesn't matter if the two workspaces have different tasks; in each case, the task is inferred from the workspace.
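
For instance, with a purely illustrative filename, a workspace document named report1.json would appear alongside its backup after conversion:

report1.json            (retokenized document)
report1.json.oldtok     (copy of the original, pre-retokenization)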