tcflib.tcf module

This module provides an API for TCF documents.

class tcflib.tcf.AnnotationLayerBase(initialdata=None)[source]

Bases: object

Base class for annotation layers.

corpus = None

The corpus this layer belongs to.

parent = None

The parent layer, in case of nested layers.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.AnnotationLayer(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayerBase, collections.UserList

Annotation layer that acts like a list of Annotations.

append(item)[source]

S.append(value) – append value to the end of the sequence

class tcflib.tcf.AnnotationLayerWithIDs(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayerBase, collections.UserDict

Annotation layer that holds IDs of annotations.

This class acts like a hybrid of a list and a dict: it can be used like a list, e.g. it has an append method and it iterates over its values. But its items can also be set and retrieved using annotation IDs with dict-like element access.
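The list/dict hybrid described above can be illustrated with a small, self-contained sketch. It is not the tcflib implementation: the class name IDLayerSketch, the ID scheme ("t_0", "t_1", ...) and the prefix parameter are assumptions made for illustration only.

```python
from collections import UserDict


class IDLayerSketch(UserDict):
    """Hypothetical stand-in for a layer keyed by annotation IDs."""

    def __init__(self, prefix="t"):
        super().__init__()
        self.prefix = prefix  # assumed ID prefix, not taken from tcflib

    def append(self, item):
        # List-like: derive the next ID and store the item under it.
        key = "{}_{}".format(self.prefix, len(self.data))
        self.data[key] = item
        return key

    def __iter__(self):
        # List-like: iterate over the stored values instead of the keys.
        return iter(self.data.values())

    def keys(self):
        # The inherited keys() view relies on __iter__, which now yields
        # values, so keys() has to be redirected to the underlying dict.
        return self.data.keys()


layer = IDLayerSketch()
layer.append("Hello")
layer.append("world")
items = list(layer)      # the values, in insertion order
first = layer["t_0"]     # dict-like access by ID
```

Once iteration yields values, the inherited keys() view would no longer return IDs, which is a plausible reason for a class like this to override keys().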

keys()[source]

Return a set-like object providing a view on the layer's keys.

class tcflib.tcf.AnnotationElement(*, tokens=None)[source]

Bases: object

Base class for annotation elements.

parent = None

The annotation layer the element belongs to.

tcf

Return the element as an etree.Element.

class tcflib.tcf.TokenList(initialdata=None)[source]

Bases: collections.UserList

Proxy token list that sets token attributes.

Used for the token lists of AnnotationElements that maintain a relation between the element and its tokens. For example, appending a token to reference.tokens sets the token’s reference attribute.
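A minimal sketch of this proxy pattern follows. All class names (TokenSketch, ProxyTokenList, ReferenceSketch) and the owner/attribute parameters are hypothetical; the actual TokenList constructor only takes initialdata, and how it learns about its owning element is not documented here.

```python
from collections import UserList


class TokenSketch:
    """Hypothetical token with a back-reference slot."""

    def __init__(self, text):
        self.text = text
        self.reference = None


class ProxyTokenList(UserList):
    """List that sets an attribute on every token appended to it."""

    def __init__(self, owner, attribute, initialdata=None):
        super().__init__()
        self.owner = owner          # the element the tokens belong to
        self.attribute = attribute  # name of the attribute set on each token
        for token in initialdata or []:
            self.append(token)

    def append(self, token):
        # Maintain the relation in both directions: the token points back
        # to the owning element, and the element lists the token.
        setattr(token, self.attribute, self.owner)
        super().append(token)


class ReferenceSketch:
    """Hypothetical element that owns a proxy token list."""

    def __init__(self):
        self.tokens = ProxyTokenList(owner=self, attribute="reference")


reference = ReferenceSketch()
token = TokenSketch("she")
reference.tokens.append(token)
```

After the append, token.reference is the reference object itself, so callers never have to update both sides of the relation by hand.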

append(token)[source]

S.append(value) – append value to the end of the sequence

class tcflib.tcf.TextCorpus(input_data=None, *, layers=None)[source]

Bases: object

The main class that represents a TextCorpus.

A TextCorpus consists of a series of AnnotationLayers.

Parameters:
  • input_data (str or None) – The XML input.
  • layers (list or None) – A list of layers that should be parsed.

tree

Return the corpus as an etree.ElementTree.

The original XML tree is kept in memory, so that only newly added layers get serialized. This makes sure that the original tree is not touched.

write(file_or_path, *, encoding='utf-8', pretty_print=True)[source]

Write the XML tree into a file.

This method writes each layer successively and discards it afterwards. This is more memory efficient than building the whole tree at once.
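The idea of writing layer by layer can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the tcflib code: the function names are hypothetical, the element names are reduced to bare tags without TCF namespaces, and xml.etree.ElementTree stands in for whatever etree implementation the library uses.

```python
import io
import xml.etree.ElementTree as ET


def build_layers():
    """Yield simplified layer elements one at a time (hypothetical data)."""
    text = ET.Element("text")
    text.text = "Hello world"
    yield text

    tokens = ET.Element("tokens")
    for index, word in enumerate(["Hello", "world"]):
        token = ET.SubElement(tokens, "token", ID="t_{}".format(index))
        token.text = word
    yield tokens


def write_incrementally(stream, layers, encoding="utf-8"):
    """Write the root element by hand and stream each layer into it."""
    declaration = '<?xml version="1.0" encoding="{}"?>\n'.format(encoding)
    stream.write(declaration.encode(encoding))
    stream.write("<TextCorpus>\n".encode(encoding))
    for layer in layers:
        # Serialize a single layer, write it, then drop the reference so
        # only one layer's element tree is alive at any time.
        stream.write(ET.tostring(layer, encoding="unicode").encode(encoding))
        stream.write("\n".encode(encoding))
    stream.write("</TextCorpus>\n".encode(encoding))


buffer = io.BytesIO()
write_incrementally(buffer, build_layers())
root = ET.fromstring(buffer.getvalue())  # parse back to check well-formedness
```

Because the layers come from a generator, each one is built, written, and released before the next is created, which is the memory benefit the description refers to.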

Parameters: file_or_path (file object or file path) – The target to which to write the XML tree.

add_layer(layer)[source]

Add an AnnotationLayerBase object to the corpus.

class tcflib.tcf.Text(text)[source]

Bases: tcflib.tcf.AnnotationLayerBase

The text annotation layer.

text = None

The unannotated text.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.Tokens(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayerWithIDs

The tokens annotation layer.

It holds a sequence of Token objects.

class tcflib.tcf.Token(text)[source]

Bases: tcflib.tcf.AnnotationElement

The token annotation element.

text = None

The token text.

lemma = None

The token lemma.

tag = None

The POS tag value.

entity = None

The NamedEntity object for the token.

reference = None

The Reference object for the token.

wordsenses = None

The list of word senses for the token.

tcf

Return the element as an etree.Element.

postag

The POS tag as a POSTagBase object.

semantic_unit

The semantic unit for a token.

The semantic unit can be the (disambiguated) lemma, a named entity, or a referenced semantic unit.

class tcflib.tcf.Lemmas(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayer

The lemmas annotation layer.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.Wsd(source)[source]

Bases: tcflib.tcf.AnnotationLayer

The word senses (wsd) annotation layer.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.POStags(tagset)[source]

Bases: tcflib.tcf.AnnotationLayer

The POStags annotation layer.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.DepParsing(tagset, emptytoks=False, multigovs=False)[source]

Bases: tcflib.tcf.AnnotationLayerWithIDs

The depparsing annotation layer.

It holds a sequence of DepParse objects.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.DepParse[source]

Bases: tcflib.tcf.AnnotationLayer

The parse annotation element.

It holds a sequence of Dependency objects.

append(item)[source]

S.append(value) – append value to the end of the sequence

class tcflib.tcf.Dependency(func, gov_tokens=None, dep_tokens=None)[source]

Bases: tcflib.tcf.AnnotationElement

The dependency annotation element.

tcf

Return the element as an etree.Element.

class tcflib.tcf.NamedEntities(type)[source]

Bases: tcflib.tcf.AnnotationLayerWithIDs

The namedEntities annotation layer.

It holds a sequence of NamedEntity objects.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.NamedEntity(class_=None, tokens=None)[source]

Bases: tcflib.tcf.AnnotationElement

The named entity annotation element.

tcf

Return the element as an etree.Element.

class tcflib.tcf.References(typetagset, reltagset, extrefs)[source]

Bases: tcflib.tcf.AnnotationLayer

The references annotation layer.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.Entity[source]

Bases: tcflib.tcf.AnnotationLayerWithIDs

The entity annotation element.

This class represents a coreference entity inside the references annotation layer. The entity inside the namedEntities annotation layer is represented by the NamedEntity class. In TCF, both share the entity tag name.

An entity holds a sequence of Reference objects.

tcf

Return the layer as an etree.Element.

class tcflib.tcf.Reference(*, type=None, rel=None, target=None, tokens=None)[source]

Bases: tcflib.tcf.AnnotationElement

The reference annotation element.

target = None

The target Reference.

tokens

The tokens for this reference.

entity

The Entity this reference belongs to.

tcf

Return the element as an etree.Element.

class tcflib.tcf.Sentences(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayerWithIDs

The sentences annotation layer.

It holds a sequence of Sentence objects.

class tcflib.tcf.Sentence(*, tokens=None)[source]

Bases: tcflib.tcf.AnnotationElement

The sentence annotation element.

class tcflib.tcf.TextStructure(initialdata=None)[source]

Bases: tcflib.tcf.AnnotationLayer

The textstructure annotation layer.

It holds a sequence of TextSpan objects.

class tcflib.tcf.TextSpan(type=None)[source]

Bases: tcflib.tcf.AnnotationElement

The textspan annotation element.

type = None

The type of span.

tcf

Return the element as an etree.Element.

class tcflib.tcf.Graph(*, label='lemma', weight='count')[source]

Bases: tcflib.tcf.AnnotationLayerBase

The graph annotation layer.

This layer implements a graph API to store graph representations of the text (e.g., cooccurrence graphs).

tcf

Return the layer as an etree.Element.

exception tcflib.tcf.LoopError[source]

Bases: Exception

This exception is raised if a request to add an edge would result in a loop.
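The following sketch shows one way such a check could look. It assumes that "loop" means an edge whose source and target are the same node; the names LoopErrorSketch, GraphSketch and add_edge, as well as the weight handling, are hypothetical and not taken from the Graph class.

```python
class LoopErrorSketch(Exception):
    """Hypothetical stand-in for the exception described above."""


class GraphSketch:
    """Minimal weighted edge store that rejects self-loops."""

    def __init__(self):
        self.edges = {}

    def add_edge(self, source, target, weight=1):
        # Assumption: a loop is an edge from a node to itself.
        if source == target:
            raise LoopErrorSketch(
                "Edge {!r} -> {!r} would be a loop.".format(source, target))
        key = (source, target)
        # Repeated edges accumulate their weight instead of duplicating.
        self.edges[key] = self.edges.get(key, 0) + weight
        return key


graph = GraphSketch()
graph.add_edge("cat", "mouse")
try:
    graph.add_edge("cat", "cat")
    rejected = False
except LoopErrorSketch:
    rejected = True
```

Raising a dedicated exception lets callers that build cooccurrence graphs skip self-pairs explicitly instead of silently storing them.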

tcflib.tcf.serialize(obj)[source]

Serialize an object into a byte string.

Parameters: obj – A TextCorpus, etree.ElementTree or string.
Return type: bytes
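A rough sketch of this kind of type dispatch, using only the standard library, might look as follows. The function name serialize_sketch and its encoding parameter are assumptions; the sketch covers strings and element trees only and leaves out the TextCorpus case, whose serialization is specific to tcflib.

```python
import xml.etree.ElementTree as ET


def serialize_sketch(obj, encoding="utf-8"):
    """Turn a string or an element tree into a byte string."""
    if isinstance(obj, bytes):
        # Already serialized, pass through unchanged.
        return obj
    if isinstance(obj, str):
        return obj.encode(encoding)
    if isinstance(obj, ET.ElementTree):
        obj = obj.getroot()
    if ET.iselement(obj):
        # Serialize to text first, then encode, so no XML declaration
        # is added implicitly.
        return ET.tostring(obj, encoding="unicode").encode(encoding)
    raise TypeError("Cannot serialize {!r}".format(type(obj)))


tree = ET.ElementTree(ET.fromstring("<a><b>x</b></a>"))
tree_bytes = serialize_sketch(tree)
text_bytes = serialize_sketch("héllo")
```

Returning bytes in every branch gives callers one output type regardless of what they pass in, which matches the documented return type.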