Version 1.2 of MAT introduced a new tokenizer, which did not
produce exactly the same tokenization as the original OCaml
tokenizer from MAT 1.0. We created this tool to support
retokenization of documents to conform to this new tokenizer, but
this utility can be used to retokenize any existing document using
the current MAT tokenizer.
This tool operates either on files, in which case they're all
processed according to the same task, or on workspaces, in which
case the task is inferred from the workspace. When operating on
files, you must specify an --output_dir argument for the location
of the converted files. When operating on workspaces, the old
files will be copied to <filename>.oldtok before the new
files are written to the workspace.
Note: this tool always applies the default MAT tokenizer. If your task has a custom tokenizer, this tool will not work for you. A future release of MAT will provide a more general and flexible solution.
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd
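Both invocations above depend on the MAT_PKG_HOME variable pointing at your MAT installation. As a minimal sketch for Unix shells, you could check this before running the tool. The function name check_mat_retokenize is a hypothetical helper and not part of MAT; it assumes only the $MAT_PKG_HOME/bin/MATRetokenize path shown above.

```shell
# Hypothetical helper (not part of MAT): confirm that MAT_PKG_HOME is set
# and that the MATRetokenize script is present and executable.
check_mat_retokenize() {
    if [ -z "${MAT_PKG_HOME:-}" ]; then
        echo "MAT_PKG_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$MAT_PKG_HOME/bin/MATRetokenize" ]; then
        echo "MATRetokenize not found or not executable in $MAT_PKG_HOME/bin" >&2
        return 1
    fi
    return 0
}
```

You might then guard a run with something like check_mat_retokenize && "$MAT_PKG_HOME/bin/MATRetokenize" followed by the arguments you need.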
Usage: MATRetokenize [core_options] files [file_options]
MATRetokenize [core_options] workspaces <workspace>...
MATRetokenize makes the common options available.
--task <task> | Name of the task to use. Obligatory if the system knows of more than one task. |
--input_dir <dir> | A directory, all of whose files will be retokenized. Can be repeated. The --input_files option may simultaneously be used to specify additional files. |
--input_files <re> | A glob-style pattern describing the files to be retokenized. Can be repeated. The --input_dir option may be simultaneously used to specify additional directories of files. |
--output_dir <dir> | A directory in which to place the retokenized documents. |
Let's say you have a set of files in /data/myfiles, all of which
are MAT JSON files that need to be updated, your task is named 'My
Task', and you want to store the results in
/data/myconvertedfiles:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myconvertedfiles
Let's say you're feeling brave, and instead of saving the files
to a new location, you decide to overwrite the old ones:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_dir /data/myfiles --output_dir /data/myfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_dir c:\data\myfiles --output_dir c:\data\myfiles
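When you overwrite files in place as above, the files mode has nothing like the .oldtok copies that the workspaces mode makes, so you may want your own backup first. Here is a minimal Unix sketch, assuming a plain recursive copy is enough for your data. The name backup_dir is a hypothetical helper, not a MAT command.

```shell
# Hypothetical helper (not part of MAT): copy a directory to a backup
# location before retokenizing it in place. Refuses to overwrite an
# existing backup so an earlier copy is never silently lost.
backup_dir() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup target already exists: $dest" >&2
        return 1
    fi
    # -R copies recursively; -p preserves modes and timestamps.
    cp -Rp "$src" "$dest"
}
```

For the example above, this could be called as backup_dir /data/myfiles /data/myfiles.bak before running MATRetokenize.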
Let's say that only some of the files in the directory are MAT
JSON files: all the ones that end in .json, and all the ones that
end in .mat.
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--output_dir c:\data\myconvertedfiles
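In the Unix example above the patterns are single-quoted, so the shell passes them to MATRetokenize unexpanded. To preview which files such patterns select before running the tool, you could let the shell expand them, as in this sketch. The name preview_matches is a hypothetical helper, and the sketch assumes that the tool's glob matching agrees with the shell's for simple patterns like these, which may not hold in every detail.

```shell
# Hypothetical helper (not part of MAT): print the regular files that
# each glob pattern matches under the shell's own expansion rules.
preview_matches() {
    for pattern in "$@"; do
        (
            # Empty IFS disables word splitting, so paths containing
            # spaces survive; pathname expansion still takes place.
            IFS=
            for f in $pattern; do
                if [ -f "$f" ]; then
                    printf '%s\n' "$f"
                fi
            done
        )
    done
    return 0
}
```

For example, preview_matches '/data/myfiles/*.json' '/data/myfiles/*.mat' lists the candidates one per line, and prints nothing for a pattern that matches no files.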
Let's say that you want to process your files as in example 3,
but you also have a directory /data/myotherfiles which contains
only MAT JSON files which you want to process at the same time:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize files --task 'My Task' \
--input_files '/data/myfiles/*.json' --input_files '/data/myfiles/*.mat' \
--input_dir /data/myotherfiles --output_dir /data/myconvertedfiles
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd files --task "My Task" ^
--input_files "c:\data\myfiles\*.json" --input_files "c:\data\myfiles\*.mat" ^
--input_dir c:\data\myotherfiles --output_dir c:\data\myconvertedfiles
Let's say you have workspaces in /data/myworkspace and /data/myworkspace2:
Unix:
% $MAT_PKG_HOME/bin/MATRetokenize workspaces /data/myworkspace /data/myworkspace2
Windows native:
> %MAT_PKG_HOME%\bin\MATRetokenize.cmd workspaces c:\data\myworkspace c:\data\myworkspace2
The MAT JSON files in each workspace will be converted, and a
copy of each original will be saved as <filename>.oldtok.
It doesn't matter if the two workspaces have different tasks; the
task will be inferred from the workspace.
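If you later want to inspect or undo a workspace conversion, the .oldtok copies are the place to start. The following Unix sketch assumes that each backup sits next to the file it was copied from, which is what the <filename>.oldtok naming suggests. The names list_oldtok and restore_oldtok are hypothetical helpers, not MAT commands. The sketch restores file contents only; whether a workspace keeps other state that would also need resetting is not covered here.

```shell
# Hypothetical helper (not part of MAT): list every .oldtok backup
# found anywhere under the given workspace directory.
list_oldtok() {
    find "$1" -type f -name '*.oldtok' | sort
}

# Hypothetical helper (not part of MAT): copy each <filename>.oldtok
# back over <filename>, leaving the backup itself in place.
restore_oldtok() {
    list_oldtok "$1" | while IFS= read -r backup; do
        # ${backup%.oldtok} strips the trailing .oldtok suffix.
        cp -p "$backup" "${backup%.oldtok}"
    done
}
```

For the example above, list_oldtok /data/myworkspace shows what was backed up, and restore_oldtok /data/myworkspace puts the original contents back.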