Version 1.2 of MAT introduced a new tokenizer, which does not
produce exactly the same tokenization as the previous OCaml tokenizer.
To ensure optimal performance of the tools, you should
retokenize your documents using this tool.
This tool operates either on files, in which case they're all
processed according to the same task, or on workspaces, in which case
the task is inferred from each workspace. When operating on files, you
must specify an --output_dir argument giving the location for the
converted files. When operating on workspaces, the old files will be
copied to <filename>.oldtok before the new files are written to the
workspace.
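If a workspace retokenization ever needs to be undone, those .oldtok
copies can be restored by hand. A minimal sketch in a Unix shell, using a
temporary directory as a stand-in for a workspace folder (the filenames
are illustrative; MATRetokenize itself is not run here):

```shell
# Stand-in for a workspace document folder with one retokenized file
# and the backup that workspace mode would leave behind.
ws=$(mktemp -d)
printf 'retokenized' > "$ws/doc1.json"
printf 'original'    > "$ws/doc1.json.oldtok"

# Move each backup back over its retokenized counterpart.
for backup in "$ws"/*.oldtok; do
    mv "$backup" "${backup%.oldtok}"
done
```

After the loop, each document holds its pre-retokenization contents and
the .oldtok copies are gone.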
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd
Usage: MATRetokenize files [file_options]
MATRetokenize workspaces <workspace>...
--task <task>
        Name of the task to use. Obligatory if the system knows of
        more than one task.
--input_dir <dir>
        A directory, all of whose files will be retokenized. Can be
        repeated. The --input_files option may be used simultaneously
        to specify additional files.
--input_files <re>
        A glob-style pattern describing the files to be retokenized.
        Can be repeated. The --input_dir option may be used
        simultaneously to specify additional directories of files.
--output_dir <dir>
        A directory in which to place the retokenized documents.
Let's say you have a set of files in /data/myfiles, all of which are
MAT JSON files that need to be updated; your task is named 'My Task',
and you want to store the results in /data/myconvertedfiles:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myconvertedfiles
Let's say you're feeling brave, and instead of saving the files to a
new location, you decide to overwrite the old ones:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myfiles
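Since this run overwrites its inputs, it may be worth snapshotting the
directory first so the run can be undone. A hedged sketch in a Unix
shell (the temporary directory stands in for /data/myfiles):

```shell
# Stand-in for the input directory; in practice this would be the
# directory you are about to retokenize in place.
src=$(mktemp -d)
printf '{}' > "$src/doc.json"

# Keep a restorable copy before letting the tool overwrite the originals.
cp -R "$src" "$src.bak"
```

If the retokenization goes wrong, the .bak copy can simply be moved
back into place.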
Let's say that only some of the files in the directory are MAT JSON
files: all the ones that end in .json, and all the ones that end in
.mat.
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--output_dir c:\data\myconvertedfiles
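Note that the glob patterns above are quoted: the quotes keep the Unix
shell from expanding the pattern itself, so the tool receives the
pattern as a single argument and does its own matching. A small shell
demonstration of the difference, independent of MATRetokenize:

```shell
# Two matching files in a scratch directory.
dir=$(mktemp -d)
touch "$dir/a.json" "$dir/b.json"

# Quoted: the pattern is passed through literally as one argument.
quoted=$(set -- "$dir/*.json"; echo $#)

# Unquoted: the shell expands the glob first, yielding two arguments.
unquoted=$(set -- $dir/*.json; echo $#)

echo "$quoted $unquoted"   # prints "1 2"
```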
Let's say that you want to process your files as in the previous
example, but you also have a directory /data/myotherfiles, which
contains only MAT JSON files that you want to process at the same time:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--input_dir /data/myotherfiles --output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--input_dir c:\data\myotherfiles --output_dir c:\data\myconvertedfiles
Let's say you have workspaces in /data/myworkspace and /data/myworkspace2:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize workspaces /data/myworkspace /data/myworkspace2
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd workspaces c:\data\myworkspace c:\data\myworkspace2
The MAT JSON files in each workspace will be converted, and a copy
of each original will be placed in <filename>.oldtok. It doesn't
matter if the two workspaces have different tasks; the task will be
inferred from each workspace.