target_extraction.allen.dataset_readers package¶

Submodules¶

target_extraction.allen.dataset_readers.target_conll module¶

class target_extraction.allen.dataset_readers.target_conll.TargetConllDatasetReader(token_indexers=None, coding_scheme='BIO', label_namespace='labels', **kwargs)[source]¶

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Dataset reader designed to read a CONLL formatted file that is produced from target_extraction.data_types.TargetTextCollection.to_conll. The CONLL file should have the following structure:

TOKEN#GOLD LABEL

Where each text is separated by a blank line and each text has an associated # {"text_id": "value"} line at the start of the text. An example of the file is below:

# {"text_id": "0"}
The O
laptop B-0
case I-0
was O
great O
and O
cover O
was O
rubbish O

# {"text_id": "2"}
The O
laptop B-0
case I-0
was O
great O
and O
cover B-1
was O
rubbish O
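The file layout described above can be parsed with a few lines of plain Python. The following is a minimal sketch, not the reader's actual implementation: the function name read_conll_texts is hypothetical, and it assumes that a token and its label are separated by whitespace, as in the example file.

```python
import json
from typing import Dict, Iterator, List, Tuple


def read_conll_texts(
    lines: List[str],
) -> Iterator[Tuple[Dict[str, str], List[str], List[str]]]:
    """Yield (metadata, tokens, labels) for each text in a CONLL-style file."""
    metadata: Dict[str, str] = {}
    tokens: List[str] = []
    labels: List[str] = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            # A blank line closes the current text.
            if tokens:
                yield metadata, tokens, labels
            metadata, tokens, labels = {}, [], []
        elif line.startswith("# {"):
            # The comment line carries the text_id as a JSON object.
            metadata = json.loads(line[1:].strip())
        else:
            # Assumption: the label is the last whitespace-separated field.
            token, label = line.rsplit(maxsplit=1)
            tokens.append(token)
            labels.append(label)
    if tokens:
        # The last text may not be followed by a blank line.
        yield metadata, tokens, labels


example_file = """# {"text_id": "0"}
The O
laptop B-0
case I-0
was O

# {"text_id": "2"}
cover B-1
was O
rubbish O
"""

texts = list(read_conll_texts(example_file.splitlines()))
for metadata, tokens, labels in texts:
    print(metadata["text_id"], list(zip(tokens, labels)))
```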

token_indexers: Dict[str, TokenIndexer], optional (default={"tokens": SingleIdTokenIndexer()}): We use this to define the input representation for the text. See TokenIndexer.
coding_scheme: str, optional (default=BIO): Specifies the coding scheme for the labels. Valid options are BIO and BIOUL. The BIO default maintains the original BIO scheme in the data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.
label_namespace: str, optional (default=labels): Specifies the namespace for the sequence labels.
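To illustrate the difference between the two coding schemes, the sketch below converts BIO tags to BIOUL, where U marks a single-token span and L marks the last token of a longer span. It is an illustration only and not the conversion the reader uses internally; the function name bio_to_bioul is hypothetical, and the input is assumed to be well-formed BIO.

```python
from typing import List


def bio_to_bioul(tags: List[str]) -> List[str]:
    """Convert well-formed BIO tags (e.g. B-0, I-0, O) to BIOUL."""
    converted: List[str] = []
    for index, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        # Split "B-0" into the prefix "B" and the label "0" (label may be empty).
        prefix, _, label = tag.partition("-")
        suffix = f"-{label}" if label else ""
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        span_continues = next_tag == f"I{suffix}"
        if span_continues:
            # B and I are unchanged while the span carries on.
            converted.append(tag)
        elif prefix == "B":
            # A span of exactly one token becomes U (unit).
            converted.append(f"U{suffix}")
        else:
            # The final token of a multi-token span becomes L (last).
            converted.append(f"L{suffix}")
    return converted


bioul_tags = bio_to_bioul(["O", "B-0", "I-0", "O", "B-1", "O"])
print(bioul_tags)
```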

text_to_instance(tokens, tags=None)[source]¶

We take pre-tokenized input here, because we don’t have a tokenizer in this class.

Return type: Instance

target_extraction.allen.dataset_readers.target_conll.logger = <Logger target_extraction.allen.dataset_readers.target_conll (WARNING)>¶

target_extraction.allen.dataset_readers.target_extraction module¶

class target_extraction.allen.dataset_readers.target_extraction.TargetExtractionDatasetReader(token_indexers=None, pos_tags=False, **kwargs)[source]¶

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Dataset reader designed to read a list of JSON like objects of the following type:

{"tokenized_text": ["This", "Camera", "lens", "is", "great"], "text": "This Camera lens is great", "tags": ["O", "B", "I", "O", "O"], "pos_tags": ["DET", "NOUN", "NOUN", "AUX", "ADJ"]}

Where the pos_tags are optional. This type of JSON can be created from exporting a target_extraction.data_types.TargetTextCollection using the to_json_file method.

If the pos_tags are given, they can be used as either features or for joint learning.

The only sequence labelling scheme currently supported is BIO, also known as IOB-2.

Params pos_tags: Whether or not to extract POS tags if available.
Returns: A Dataset of Instances for Target Extraction.
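As an illustration of how the tags field relates to tokenized_text, the sketch below recovers the target strings from a JSON object of the kind described above. The helper name bio_to_targets is hypothetical and is not part of the package; it assumes unlabelled B, I and O tags as in the example object.

```python
import json
from typing import List


def bio_to_targets(tokens: List[str], tags: List[str]) -> List[str]:
    """Join the tokens of each B/I span into one target string."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    targets: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close any span that was open.
            if current:
                targets.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                targets.append(" ".join(current))
            current = []
    if current:
        targets.append(" ".join(current))
    return targets


record = json.loads(
    '{"tokenized_text": ["This", "Camera", "lens", "is", "great"], '
    '"text": "This Camera lens is great", '
    '"tags": ["O", "B", "I", "O", "O"]}'
)
extracted_targets = bio_to_targets(record["tokenized_text"], record["tags"])
print(extracted_targets)
```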

text_to_instance(tokens, text, tags=None, pos_tags=None)[source]¶

The tokens are expected to be pre-tokenised.

The original token text and the text itself are stored in a MetadataField.

Parameters

tokens (List[Token]) – Tokenised text that either has target extraction labels or is to be tagged.
text (str) – The text that the tokenised text has come from.
tags (Optional[List[str]]) – The target extraction BIO labels.
pos_tags (Optional[List[str]]) – POS tags to be used either as features or for joint learning.

Return type

Instance

Returns

An Instance object with all of the above encoded for a PyTorch model.

target_extraction.allen.dataset_readers.target_extraction.logger = <Logger target_extraction.allen.dataset_readers.target_extraction (WARNING)>¶

target_extraction.allen.dataset_readers.target_sentiment module¶

class target_extraction.allen.dataset_readers.target_sentiment.TargetSentimentDatasetReader(token_indexers=None, tokenizer=None, left_right_contexts=False, reverse_right_context=False, incl_target=False, use_categories=False, target_sequences=False, position_embeddings=False, position_weights=False, max_position_distance=None, **kwargs)[source]¶

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Dataset reader designed to read a list of JSON like objects of the following type:

{"text": "This Camera lens is great", "targets": ["Camera"], "target_sentiments": ["positive"]}

or

{"text": "This Camera lens is great", "categories": ["CAMERA"], "category_sentiments": ["positive"]}

or

{"text": "This Camera lens is great", "targets": ["Camera"], "categories": ["CAMERA"], "target_sentiments": ["positive"]}

or

{"text": "This Camera lens is great", "targets": ["Camera"], "target_sentiments": ["positive"], "spans": [[5,11]]}

This type of JSON can be created from exporting a target_extraction.data_types.TargetTextCollection using the to_json_file method.

The difference between the four objects depends on the objective of the model being trained:

1. The first version is for a purely Target based sentiment classifier.
2. The second version is for a purely Aspect or latent based sentiment classifier.
3. The third version is for making use of the relationship between the Target and Aspect in the sentiment classifier.
4. The fourth version is for a Target based sentiment classifier that requires knowledge of where the target is.
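For the version that includes spans, each span is a pair of character offsets into the text. The sketch below builds such an object by searching the text for each target. It is an illustration under stated assumptions: the helper name make_target_sentiment_object is hypothetical, and the targets are assumed to appear in the text in the order given.

```python
import json
from typing import Dict, List


def make_target_sentiment_object(
    text: str, targets: List[str], target_sentiments: List[str]
) -> Dict[str, object]:
    """Build a JSON-like object with character spans for each target."""
    spans: List[List[int]] = []
    search_from = 0
    for target in targets:
        # str.index raises ValueError if the target is not in the text.
        start = text.index(target, search_from)
        end = start + len(target)
        spans.append([start, end])
        search_from = end
    return {
        "text": text,
        "targets": targets,
        "target_sentiments": target_sentiments,
        "spans": spans,
    }


sentiment_object = make_target_sentiment_object(
    "This Camera lens is great", ["Camera"], ["positive"]
)
print(json.dumps(sentiment_object))
```

Slicing the text with a span, as in text[5:11], returns the target string, which is a quick way to check that spans and targets agree.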

Parameters

lazy – Whether or not instances can be read lazily.
token_indexers (Optional[Dict[str, TokenIndexer]]) – We use this to define the input representation for the text. See allennlp.data.token_indexers.TokenIndexer.
tokenizer (Optional[Tokenizer]) – Tokenizer to use to split the sentence text as well as the text of the target.
left_right_contexts (bool) – If True, the instance returned by text_to_instance will contain the sentence context to the left and right of the target.
reverse_right_context (bool) – If True this will reverse the text that is in the right context. NOTE left_right_contexts has to be True.
incl_target (bool) – If both left_right_contexts and this are True, the left and right contexts will include the target word(s) as well.
use_categories (bool) – Whether or not to return the categories in the instances even if they do occur in the dataset. This is a temporary solution to the following issue. The number of categories does not have to match the number of targets; there only has to be at least one category per sentence.
target_sequences (bool) – Whether or not to generate target_sequences, which are a sequence of masks per target for all target texts. This allows the model to know which tokens in the context relate to the target. An example of this is shown below. This requires the span of each target.
position_embeddings (bool) – Whether or not to create distance values that can be converted to embeddings. These are similar to position_weights, but instead of the model using them as weights it uses the distances to learn position embeddings. This requires the span of each target. See A Position-aware Bidirectional Attention Network for Aspect-level Sentiment Analysis.
position_weights (bool) – If True, the instances will have an extra key position_weights, which will be an array of integers representing the linear distance between each token and its target. For example, if the text contains two targets, where each token is represented by a number and the 1’s are target tokens = [[0,0,0,1], [1,1,0,0]], then the position_weights will be [[4,3,2,1], [1,1,2,3]]. This requires the span of each target. An example of position weighting is in section 3.3 of Modeling Sentiment Dependencies with Graph Convolutional Networks for Aspect-level Sentiment Classification.
max_position_distance (Optional[int]) – The maximum position distance given to a token from the target. For example, given [0,0,0,0,0,1,0,0], where each value represents a token and 1’s represent target tokens, the distance array would be [6,5,4,3,2,1,2,3]. If max_position_distance is 5 then the distance array will be [5,5,4,3,2,1,2,3]. For this to work, either position_embeddings or position_weights has to be True.
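The distance values in the position_weights and max_position_distance examples can be reproduced with the sketch below. It is not the reader's implementation: the function name position_distances is hypothetical, and the rule used (one plus the distance to the nearest target token, capped at max_position_distance) is inferred from the worked examples in the parameter descriptions.

```python
from typing import List, Optional


def position_distances(
    target_mask: List[int], max_position_distance: Optional[int] = None
) -> List[int]:
    """Linear distance from each token to the nearest target token, starting at 1."""
    target_indices = [i for i, flag in enumerate(target_mask) if flag == 1]
    if not target_indices:
        raise ValueError("The mask must mark at least one target token.")
    distances: List[int] = []
    for token_index in range(len(target_mask)):
        distance = min(abs(token_index - t) for t in target_indices) + 1
        if max_position_distance is not None:
            # Cap distances that exceed the maximum.
            distance = min(distance, max_position_distance)
        distances.append(distance)
    return distances


weights = [position_distances(mask) for mask in [[0, 0, 0, 1], [1, 1, 0, 0]]]
capped = position_distances([0, 0, 0, 0, 0, 1, 0, 0], max_position_distance=5)
print(weights)
print(capped)
```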

Raises

ValueError – If left_right_contexts is not True while either the incl_target or reverse_right_context argument is True.
ValueError – If the left_right_contexts and target_sequences are True at the same time.
ValueError – If the max_position_distance when set is less than 2.
ValueError – If max_position_distance is set but neither position_embeddings nor position_weights are True.
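The constraints listed under Raises can be summarised as a single validation function. This is a sketch of the documented rules only, with a hypothetical function name and hypothetical error messages; the real constructor may check them in a different order or with different wording.

```python
from typing import Optional


def validate_reader_options(
    left_right_contexts: bool = False,
    reverse_right_context: bool = False,
    incl_target: bool = False,
    target_sequences: bool = False,
    position_embeddings: bool = False,
    position_weights: bool = False,
    max_position_distance: Optional[int] = None,
) -> None:
    """Raise ValueError for the option combinations the documentation forbids."""
    if (incl_target or reverse_right_context) and not left_right_contexts:
        raise ValueError(
            "incl_target and reverse_right_context require left_right_contexts"
        )
    if left_right_contexts and target_sequences:
        raise ValueError(
            "left_right_contexts and target_sequences cannot both be True"
        )
    if max_position_distance is not None:
        if max_position_distance < 2:
            raise ValueError("max_position_distance must be at least 2")
        if not (position_embeddings or position_weights):
            raise ValueError(
                "max_position_distance requires position_embeddings "
                "or position_weights"
            )


# A valid combination passes silently.
validate_reader_options(left_right_contexts=True, incl_target=True)

try:
    validate_reader_options(max_position_distance=5)
except ValueError as error:
    print(f"Rejected: {error}")
```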

Example of target_sequences

{"text": "This Camera lens is great but the screen is rubbish",
 "targets": ["Camera", "screen"], "target_sentiments": ["positive", "negative"],
 "target_sequences": [[0,1,0,0,0,0,0,0,0,0],
                      [0,0,0,0,0,0,0,1,0,0]],
 "spans": [[5,11], [34,40]]}
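The masks in this example can be derived from the character spans. The sketch below does so with whitespace tokenisation, which is an assumption made for illustration; the reader uses whichever tokenizer it was given, so token boundaries may differ. The function name build_target_sequences is hypothetical.

```python
import re
from typing import List


def build_target_sequences(text: str, spans: List[List[int]]) -> List[List[int]]:
    """Return one token mask per span; 1 marks tokens overlapping the span."""
    # Character offsets (start, end) of each whitespace-separated token.
    token_offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    sequences: List[List[int]] = []
    for span_start, span_end in spans:
        mask = [
            1 if token_start < span_end and token_end > span_start else 0
            for token_start, token_end in token_offsets
        ]
        sequences.append(mask)
    return sequences


example_text = "This Camera lens is great but the screen is rubbish"
sequences = build_target_sequences(example_text, [[5, 11], [34, 40]])
for mask in sequences:
    print(mask)
```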

text_to_instance(text, targets=None, target_sentiments=None, spans=None, categories=None, category_sentiments=None, **kwargs)[source]¶

The original text, text tokens as well as the targets and target tokens are stored in the MetadataField.

NOTE

At least one of targets or categories must be present.

NOTE

The left and right contexts returned in the instance are a list of lists of tokens, with one list for each target.
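The shape of the left and right contexts can be illustrated with the sketch below. It is a simplified stand-in rather than the reader's code: the function name split_contexts is hypothetical, whitespace tokenisation is assumed, and reversing the right context token by token is an assumption about what reverse_right_context does.

```python
from typing import List, Tuple


def split_contexts(
    text: str,
    spans: List[List[int]],
    incl_target: bool = False,
    reverse_right_context: bool = False,
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (left_contexts, right_contexts), each with one token list per target."""
    left_contexts: List[List[str]] = []
    right_contexts: List[List[str]] = []
    for start, end in spans:
        # Optionally extend each context so that it covers the target itself.
        left_end = end if incl_target else start
        right_start = start if incl_target else end
        left_tokens = text[:left_end].split()
        right_tokens = text[right_start:].split()
        if reverse_right_context:
            right_tokens.reverse()
        left_contexts.append(left_tokens)
        right_contexts.append(right_tokens)
    return left_contexts, right_contexts


left, right = split_contexts("This Camera lens is great", [[5, 11]])
print(left, right)
```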

Parameters

text (str) – The text that contains the target(s) and/or categories.
targets (Optional[List[str]]) – The targets that are within the text
target_sentiments (Optional[List[Union[int, str]]]) – The sentiment of the targets. To be used if training the classifier
spans (Optional[List[List[int]]]) – The spans that represent the character offsets for each of the targets given in the targets list.
categories (Optional[List[str]]) – The categories that are within the text
category_sentiments (Optional[List[Union[int, str]]]) – The sentiment of the categories

Return type

Instance

Returns

An Instance object with all of the above encoded for a PyTorch model.

Raises

ValueError – If targets and categories are both None.
ValueError – If self._target_sequences is True and the passed spans argument is None.
ValueError – If self._left_right_contexts is True and the passed spans argument is None.

target_extraction.allen.dataset_readers.target_sentiment.logger = <Logger target_extraction.allen.dataset_readers.target_sentiment (WARNING)>¶

target_extraction.allen.dataset_readers.text_sentiment module¶

class target_extraction.allen.dataset_readers.text_sentiment.TextSentimentReader(label_name='label', **kwargs)[source]¶

Bases: allennlp.data.dataset_readers.text_classification_json.TextClassificationJsonReader

Subclasses allennlp.data.dataset_readers.TextClassificationJsonReader. The only difference is the label_name construction parameter, which is explained in the Parameters section below.

Reads tokens and their labels from a labeled text classification dataset. Expects a “text” field and a “label” field in JSON format. The output of read is a list of Instances with the fields:

tokens: TextField and label: LabelField

token_indexers: Dict[str, TokenIndexer], optional (default={"tokens": SingleIdTokenIndexer()}): We use this to define the input representation for the text. See TokenIndexer.
tokenizer: Tokenizer, optional (default = {"tokens": SpacyTokenizer()}): Tokenizer to use to split the input text into words or other kinds of tokens.
segment_sentences: bool, optional (default = False): If True, we will first segment the text into sentences using SpaCy and then tokenize words. Necessary for some models that require pre-segmentation of sentences, like the Hierarchical Attention Network (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).
max_sequence_length: int, optional (default = None): If specified, will truncate tokens to specified maximum length.
skip_label_indexing: bool, optional (default = False): Whether or not to skip label indexing. You might want to skip label indexing if your labels are numbers, so the dataset reader doesn’t re-number them starting from 0.
lazy: bool, optional (default = False): Whether or not instances can be read lazily.
label_name: str, optional (default = label): The name of the label field in the JSON objects that are read.
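The effect of label_name can be shown without AllenNLP. The sketch below reads JSON objects, one per line, and takes the label from a configurable field. The function name read_text_and_label and the example field name sentiment are hypothetical, and the one-object-per-line layout is an assumption about the input file.

```python
import json
from typing import Iterable, List, Optional, Tuple


def read_text_and_label(
    json_lines: Iterable[str], label_name: str = "label"
) -> List[Tuple[str, Optional[str]]]:
    """Return (text, label) pairs, reading the label from the field label_name."""
    examples: List[Tuple[str, Optional[str]]] = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # The label is optional, for example when reading unlabelled data.
        examples.append((record["text"], record.get(label_name)))
    return examples


lines = [
    '{"text": "This Camera lens is great", "sentiment": "positive"}',
    '{"text": "The screen is rubbish", "sentiment": "negative"}',
]
pairs = read_text_and_label(lines, label_name="sentiment")
print(pairs)
```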

text_to_instance(text, label=None, **kwargs)[source]¶

Parameters

text: str, required: The text to classify.
label: str, optional (default = None): The label for this text.

Returns

An Instance containing the following fields:

tokens (TextField) : The tokens in the sentence or phrase.
label (LabelField) : The label of the sentence or phrase.

Return type: Instance

target_extraction.allen.dataset_readers.text_sentiment.logger = <Logger target_extraction.allen.dataset_readers.text_sentiment (WARNING)>¶
