Creating reader/writers

It's not too difficult to define your own reader/writer. The file MAT_PKG_HOME/lib/mat/python/MAT/XMLIO.py provides a good example. The template looks like this:

from MAT.DocumentIO import declareDocumentIO, DocumentFileIO, SaveError
from MAT.Document import LoadError

class MyIO(DocumentFileIO):

    def deserialize(self, s, annotDoc):
        ....

    def writeToUnicodeString(self, annotDoc):
        ....

declareDocumentIO("my-io", MyIO, True, True)

The arguments to deserialize() are the input data from the file and an annotated document to populate; see XMLIO.py for an example of how to populate it. writeToUnicodeString() should return a Unicode string which serializes the annotated document passed in. In order to do this, you'll have to familiarize yourself with the API for manipulating documents and annotations, which is not documented but reasonably easy to understand from the source code. Once you do all this, the file type name you assign to the class via the call to declareDocumentIO() will be globally available.

You can also define command-line arguments which will be accepted by the tools when this file type is used. XMLIO.py also exemplifies this.

Finally, you can register a document convertor, typically a file containing document conversion XML, to apply whenever a document is read in the context of a particular task.

In the remainder of this document, we'll explore creating a reader for a fairly complex document format, the one associated with the brat annotation tool.

The annotation format for brat 1.2, which we digest here, looks like this:

T2	TITLE 1123 1131	Chairman
T4	IDEOLOGY 1147 1157	Republican
T6	PERSON 1132 1143	Lamar Smith
R1	Has-Ideology Arg1:T6 Arg2:T4
R2	Has-Ideology Arg1:T2 Arg2:T4

It also supports events, which have spans (as opposed to relations, indicated by R elements here, which don't), and supports string and boolean features on entities (indicated here with T) and events. An additional challenge with brat is that it's a standoff representation which does not include the document signal. (The brat format also supports declaring the annotation format, but we're going to ignore that here for the moment.)
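
To make the line structure concrete, here is a minimal standalone sketch (plain Python, no MAT dependency) that splits brat lines into their parts. brat separates the ID, the annotation body and the covered text with tab characters. The function name parse_brat_line and the tuple layout are illustrative assumptions, not part of MAT or brat, and the sketch handles only the T and R line types from the sample.

```python
def parse_brat_line(line):
    """Split one brat standoff line into a small tuple.

    Hypothetical helper for illustration only; it covers just the
    T (entity) and R (relation) line types.
    """
    ident, rest = line.split("\t", 1)
    if ident.startswith("T"):
        # Entity body: "LABEL start end<TAB>covered text"
        span, text = rest.split("\t", 1)
        label, start, end = span.split()
        return ("entity", ident, label, int(start), int(end), text)
    if ident.startswith("R"):
        # Relation body: "LABEL Role1:ID1 Role2:ID2"
        toks = rest.split()
        args = dict(tok.split(":", 1) for tok in toks[1:])
        return ("relation", ident, toks[0], args)
    raise ValueError("unsupported brat line: %r" % line)


entity = parse_brat_line("T6\tPERSON 1132 1143\tLamar Smith")
relation = parse_brat_line("R1\tHas-Ideology Arg1:T6 Arg2:T4")
```

Note that the relation only refers to its arguments by ID (T6, T4); resolving those IDs to actual annotations is the part that requires the two-step approach described later in this document.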

We're only going to consider the reader here. We'll refer to the following listing by its line numbers; the module's import statements are omitted:

  1  class BratIO(DocumentFileIO):
  2
  3      inputArgs = OptionTemplate([OpArgument("signal_file_location", hasArg = True,
  4                                             help = "The directory where the signal files are located. If missing, the directory of the annotated file is assumed. The signal file is assumed to have a .txt extension instead of the .ann extension of the annotation file.")],
  5                                 heading = "Options for brat input")
  6
  7      def __init__(self, signal_file_location = None, encoding = None,
  8                   annotation_conf_location = None, write_annotation_conf = False, **kw):
  9          # Ignore the encoding.
 10          DocumentFileIO.__init__(self, encoding = "utf-8", **kw)
 11          self.signalFileLocation = signal_file_location
 12          self.signalFileName = None
 13
 14      def readFromSource(self, source, **kw):
 15          if (type(source) in (str, unicode)) and (source != "-"):
 16              if os.path.splitext(source)[1] != ".ann":
 17                  raise LoadError("brat annotation files must end with .ann")
 18              if self.signalFileLocation is None:
 19                  self.signalFileLocation = os.path.dirname(source)
 20              self.signalFileName = os.path.basename(source)
 21          if self.signalFileName is None:
 22              raise LoadError("can't find the signal file")
 23          return DocumentFileIO.readFromSource(self, source, **kw)
 24
 25      def deserialize(self, s, annotDoc):
 26          if self.signalFileLocation is None:
 27              raise LoadError("Can't figure out where the signal is located")
 28          # OK, now, try to find the signal.
 29          import codecs
 30          fp = codecs.open(os.path.join(self.signalFileLocation, os.path.splitext(self.signalFileName)[0] + ".txt"), "r", "utf8")
 31          newSignal = fp.read()
 32          fp.close()
 33          if annotDoc.signal and (annotDoc.signal != newSignal):
 34              raise LoadError("signal from brat signal file doesn't match original signal")
 35          annotDoc.signal = newSignal
 36          annHash = {}
 37          annotAttrs = []
 38          stringAttrs = []
 39          boolAttrs = []
 40          equivSets = []
 41          for line in s.splitlines():
 42              if (not line) or (line[0] == "#"):
 43                  continue
 44              [t1, tRest] = line.split("\t", 1)
 45              if t1[0] == "T":
 46                  spanReg = tRest.split("\t")[0]
 47                  [lab, startI, endI] = spanReg.split()
 48                  a = annotDoc.createAnnotation(int(startI), int(endI), lab)
 49                  annHash[t1] = a
 50                  a.setID(t1)
 51              elif t1[0] in "RE":
 52                  # Take it apart.
 53                  rToks = tRest.split()
 54                  lab = rToks[0]
 55                  args = [t.split(":") for t in rToks[1:]]
 56                  if t1[0] == "R":
 57                      # Create it.
 58                      a = annotDoc.createSpanlessAnnotation(lab)
 59                      annHash[t1] = a
 60                      a.setID(t1)
 61                      idx = t1
 62                  else:
 63                      [lab, idx] = lab.split(":")
 64                  annotAttrs.append((t1, idx, args))
 65              elif t1[0] in "MA":
 66                  toks = tRest.split()
 67                  if len(toks) == 3:
 68                      stringAttrs.append((toks[1], toks[0], toks[2]))
 69                  else:
 70                      boolAttrs.append((toks[1], toks[0]))
 71              elif t1 == "*":
 72                  # What do I do with equivs? Establish an equiv relation, I
 73                  # suppose, with a single attribute.
 74                  equivSets.append(tRest.split()[1:])
 75          # So now, everyone is created.
 76          for (idx, attrName, val) in stringAttrs:
 77              a = annHash[idx]
 78              a.atype.ensureAttribute(attrName, aType = "string")
 79              a[attrName] = val
 80          for (idx, attrName) in boolAttrs:
 81              a = annHash[idx]
 82              a.atype.ensureAttribute(attrName, aType = "boolean")
 83              a[attrName] = True
 84          # brat can reuse the event triggers, but MAT can't.
 85          eventTriggersSaturated = set()
 86          for (eid, idx, args) in annotAttrs:
 87              a = annHash[idx]
 88              if idx in eventTriggersSaturated:
 89                  newA = annotDoc.createAnnotation(a.start, a.end, a.atype.lab)
 90                  newA.setID(eid)
 91                  for attr, val in zip(a.atype.attr_list, a.attrs):
 92                      if attr._typename_ != "annotation":
 93                          newA[attr.name] = val
 94                  a = newA
 95              for [attrName, argIdx] in args:
 96                  a.atype.ensureAttribute(attrName, aType = "annotation")
 97                  a[attrName] = annHash[argIdx]
 98              eventTriggersSaturated.add(idx)
 99          if equivSets:
100              atype = annotDoc.findAnnotationType("_Equiv", hasSpan = False)
101              atype.ensureAttribute("annots", aType = "annotation", aggregation = "set")
102              for equivSet in equivSets:
103                  annotDoc.createSpanlessAnnotation("_Equiv", {"annots": AttributeValueSet([annHash[idx] for idx in equivSet])})

Step 1: (optional) handle external signals

The MAT reader infrastructure does not yet provide built-in support for dealing with external signals. Lines 3 - 35 provide a pattern for handling this case. You provide a command-line option for the location of the external signal (and, if necessary, you'd probably want to add options for the encoding and how to compute the signal pathname). You must specialize the readFromSource() method and locate the signal file (note that lines 16 - 17 are specific to brat, since we're looking for a specific file extension which contains the annotations themselves). Finally, in the beginning of the deserialize() method, you must read the signal (lines 29 - 32), ensure that it doesn't clash with any existing signal (lines 33 - 34), and set it in the document (line 35).
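
The path computation in this pattern can be isolated into a small function. The following is a minimal sketch in standalone Python; signal_path_for is a hypothetical helper name, not part of MAT, and it assumes the same convention as the listing: the signal file shares the annotation file's base name, has a .txt extension, and lives either in an explicitly given directory or next to the annotation file.

```python
import os


def signal_path_for(ann_path, signal_dir=None):
    """Return the path of the .txt signal file that pairs with ann_path.

    Hypothetical helper: it mirrors the convention used in the listing,
    but raises ValueError rather than MAT's LoadError so that it can be
    run without MAT installed.
    """
    base, ext = os.path.splitext(os.path.basename(ann_path))
    if ext != ".ann":
        raise ValueError("brat annotation files must end with .ann")
    if signal_dir is None:
        # Default: the signal sits in the same directory as the annotations.
        signal_dir = os.path.dirname(ann_path)
    return os.path.join(signal_dir, base + ".txt")
```

For a format with a different naming scheme, this is the function you would replace; the rest of the pattern (reading the signal, checking it against any existing signal, and setting it) stays the same.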

Step 2: initial assembly of annotations

If the format you're reading allows annotation-valued attributes, you need to do the deserialization in two steps: first, create actual annotations for each annotation reference, and second, set the annotation-valued attributes appropriately. Lines 36 - 75 perform this initial step.

For instance, at line 45, we recognize that the first character of the element ID is "T", indicating a spanned entity, and so on line 48, we create a new annotation, using the start and end character indices in the annotation file. (It just so happens that the brat offsets are identical to the MAT offsets. In some formats, the end index might be one less than the MAT end index, due to how the format is intended to do its counting; in other formats, the counts may be in bytes instead of characters. So the offset computation may be considerably more involved than it is here.) Once we create the annotation, we store it in a dictionary under its brat ID, and we assign this brat ID to the annotation on line 50.
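
As an illustration of the offset adjustments mentioned above, here is a sketch of two common conversions, written as standalone Python 3. Both function names are hypothetical, and the sketch assumes that the target offsets are character indices with an exclusive end, which is what the remark about end indices above suggests for MAT.

```python
def inclusive_to_exclusive(start, end_inclusive):
    """Convert a span whose end index names the last character into a
    half-open (start, end) span, where end is one past the last character."""
    return start, end_inclusive + 1


def byte_to_char_offset(signal, byte_offset, encoding="utf-8"):
    """Convert an offset counted in encoded bytes to a character offset.

    Assumes byte_offset falls on a character boundary; if it does not,
    decode() raises UnicodeDecodeError.
    """
    return len(signal.encode(encoding)[:byte_offset].decode(encoding))


signal = "caf\u00e9 au lait"
# "au" starts at byte 6 in UTF-8, because the accented character takes two bytes.
char_start = byte_to_char_offset(signal, 6)
```

The byte conversion re-encodes the signal on every call, which is fine for a sketch; for large documents you would probably precompute a byte-to-character table once.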

On lines 51 - 64, we deal with events and relations. In brat, relations are spanless annotations, so we create such an annotation on line 58, and record it as we did the entity. (Events in brat, on the other hand, are links between spanned entities and arguments, so we don't need to introduce a new annotation for events.) In both these cases, we postpone setting the annotation-valued attributes which serve as the arguments, since there's no guarantee that the annotations they refer to have been created yet; instead, on line 64, we add the arguments to a list for later processing.

On lines 65 - 70, we deal with attributes. Again, we don't add them to the annotations; we record them for future augmentation.

Once we reach line 76, we've read all the brat entries, and we're ready to add the attributes. On lines 76 - 79, we deal with the string attributes; first, we ensure the attribute exists with the proper type (line 78), and then we set the attribute (line 79). We do the same for boolean attributes on lines 80 - 83.
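
The create-first, augment-later pattern does not depend on MAT. Here is a minimal sketch using plain dictionaries in place of MAT annotations; the function name, the entry tuples and the dictionary layout are all assumptions made for this illustration.

```python
def build_annotations(entries):
    """Build plain-dict annotations from a list of entries.

    Each entry is either
        ("entity", id, label, start, end)   or
        ("attr", target_id, name, value)    (value None means boolean).
    Returns a dict mapping each ID to its annotation.
    """
    by_id = {}
    pending = []
    # Pass 1: create every annotation, but only record the attributes.
    for entry in entries:
        if entry[0] == "entity":
            _, ident, label, start, end = entry
            by_id[ident] = {"label": label, "start": start, "end": end, "attrs": {}}
        else:
            pending.append(entry[1:])
    # Pass 2: every target now exists, so the attributes can be set safely.
    for target_id, name, value in pending:
        by_id[target_id]["attrs"][name] = True if value is None else value
    return by_id


# The first attribute entry precedes the entity it refers to.
annots = build_annotations([
    ("attr", "T1", "Confidence", "High"),
    ("entity", "T1", "PERSON", 0, 5),
    ("attr", "T1", "Negated", None),
])
```

Because the attributes are applied only after the first loop finishes, the order of the entries in the input does not matter.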

Step 3: annotation-valued attributes

At this point, we're ready to create the annotation-valued attributes. Lines 88 - 94 deal with a feature of brat that MAT does not have: because a brat event is a set of arguments declared against a spanned entity defined elsewhere, multiple events can be defined against the same spanned entity. Because MAT treats these as distinct event annotations, we must create a copy of any entity which has already been "claimed" by an event. Once we've dealt with that detail, we handle the annotation-valued attributes much like the others: we ensure that the attribute exists (line 96), and then set its value (line 97), in this case pulling the annotation from the dictionary we built when we created the annotations in step 2.
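
The copy-on-reuse logic can be sketched independently of MAT, again with plain dictionaries standing in for annotations. The function name attach_events and the dictionary keys are hypothetical; only the control flow is meant to mirror the listing.

```python
def attach_events(entities, events):
    """Attach event arguments to trigger annotations, copying reused triggers.

    entities: dict mapping ID -> annotation dict with the keys
              "id", "label", "start", "end", "attrs" and "args".
    events:   list of (event_id, trigger_id, [(role, arg_id), ...]).
    Returns the annotations that received arguments, in event order.
    """
    claimed = set()
    result = []
    for event_id, trigger_id, args in events:
        annot = entities[trigger_id]
        if trigger_id in claimed:
            # The trigger already carries another event's arguments, so copy
            # its span, label and plain attributes into a fresh annotation.
            annot = {"id": event_id, "label": annot["label"],
                     "start": annot["start"], "end": annot["end"],
                     "attrs": dict(annot["attrs"]), "args": {}}
        for role, arg_id in args:
            annot["args"][role] = entities[arg_id]
        claimed.add(trigger_id)
        result.append(annot)
    return result


def make_entity(ident, label, start, end):
    return {"id": ident, "label": label, "start": start, "end": end,
            "attrs": {}, "args": {}}


entities = {"T1": make_entity("T1", "Attack", 0, 6),
            "T2": make_entity("T2", "PERSON", 10, 15),
            "T3": make_entity("T3", "PERSON", 20, 25)}
# Two events share the trigger T1, so the second one gets its own copy.
attached = attach_events(entities, [("E1", "T1", [("Victim", "T2")]),
                                    ("E2", "T1", [("Victim", "T3")])])
```

The first event reuses the trigger annotation itself, and only later events against the same trigger get a copy, so a document in which no trigger is shared produces no extra annotations.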

There are other details of the brat format that we've skipped over in this description; for instance, brat has a notion of entity equivalences which we model as spanless _Equiv annotations. But this overview of an example reader should provide guidance on how to implement these readers.