target_extraction.allen.dataset_readers package¶
Submodules¶
target_extraction.allen.dataset_readers.target_conll module¶
-
class
target_extraction.allen.dataset_readers.target_conll.
TargetConllDatasetReader
(token_indexers=None, coding_scheme='BIO', label_namespace='labels', **kwargs)[source]¶ Bases:
allennlp.data.dataset_readers.dataset_reader.DatasetReader
Dataset reader designed to read a CONLL formatted file that is produced from target_extraction.data_types.TargetTextCollection.to_conll. The CONLL file should have the following structure:
TOKEN#GOLD LABEL
Where each text is separated by a blank line and each text has an associated # {text_id: 'value'} line at the start of the text. An example of the file is below:
# {"text_id": "0"}
The O
laptop B-0
case I-0
was O
great O
and O
cover O
was O
rubbish O

# {"text_id": "2"}
The O
laptop B-0
case I-0
was O
great O
and O
cover B-1
was O
rubbish O
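The file layout described above can be parsed with the standard library alone. The following is a minimal sketch, not the reader's actual implementation: the function name `parse_target_conll` is hypothetical, and it assumes the token and label on each line are separated by a single space, as in the example.

```python
import json
from typing import Dict, Iterator, List, Tuple


def parse_target_conll(
    lines: List[str],
) -> Iterator[Tuple[Dict[str, str], List[str], List[str]]]:
    """Yield (metadata, tokens, labels) for each blank-line separated text.

    Hypothetical helper, assuming "TOKEN LABEL" lines separated by a space.
    """
    metadata: Dict[str, str] = {}
    tokens: List[str] = []
    labels: List[str] = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            # A blank line closes the current text, if one was started.
            if tokens:
                yield metadata, tokens, labels
            metadata, tokens, labels = {}, [], []
        elif line.startswith("# {"):
            # Metadata line, e.g. # {"text_id": "0"}
            metadata = json.loads(line[2:])
        else:
            # The label is the last space-separated field on the line.
            token, label = line.rsplit(" ", 1)
            tokens.append(token)
            labels.append(label)
    # The final text may not be followed by a blank line.
    if tokens:
        yield metadata, tokens, labels


example = """# {"text_id": "0"}
The O
laptop B-0
case I-0

# {"text_id": "2"}
cover B-1
was O
rubbish O
"""
parsed = list(parse_target_conll(example.splitlines()))
print(parsed[0])
```

Each yielded tuple keeps the text_id metadata next to its tokens and labels, so a text can be traced back to the collection it was exported from.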
- token_indexers: Dict[str, TokenIndexer], optional (default={"tokens": SingleIdTokenIndexer()}) We use this to define the input representation for the text. See TokenIndexer.
- coding_scheme: str, optional (default=BIO) Specifies the coding scheme for the labels. Valid options are BIO and BIOUL. The BIO default maintains the original BIO scheme in the data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.
- label_namespace: str, optional (default=labels) Specifies the namespace for the sequence labels.
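To make the difference between the two coding schemes concrete, the sketch below converts a BIO sequence to BIOUL, where L marks the last token of a multi-token span and U marks a single-token span. It is an illustration only, not the conversion code the reader uses, and the function name `bio_to_bioul` is hypothetical.

```python
from typing import List


def bio_to_bioul(tags: List[str]) -> List[str]:
    """Convert BIO tags to BIOUL (hypothetical helper for illustration)."""
    converted: List[str] = []
    for index, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, _, suffix = tag.partition("-")
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        next_prefix, _, next_suffix = next_tag.partition("-")
        # The span continues only if the next tag is an I tag of the same type.
        continues = next_prefix == "I" and next_suffix == suffix
        if prefix == "B":
            new_prefix = "B" if continues else "U"
        else:  # prefix is "I"
            new_prefix = "I" if continues else "L"
        converted.append(new_prefix + ("-" + suffix if suffix else ""))
    return converted


print(bio_to_bioul(["O", "B-0", "I-0", "O", "B-1", "O"]))
# → ['O', 'B-0', 'L-0', 'O', 'U-1', 'O']
```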
-
target_extraction.allen.dataset_readers.target_conll.
logger
= <Logger target_extraction.allen.dataset_readers.target_conll (WARNING)>¶
target_extraction.allen.dataset_readers.target_extraction module¶
-
class
target_extraction.allen.dataset_readers.target_extraction.
TargetExtractionDatasetReader
(token_indexers=None, pos_tags=False, **kwargs)[source]¶ Bases:
allennlp.data.dataset_readers.dataset_reader.DatasetReader
Dataset reader designed to read a list of JSON like objects of the following type:
- {tokenized_text: [This, Camera, lens, is, great],
text: This Camera lens is great, tags: [O, B, I, O, O], pos_tags: [DET, NOUN, NOUN, AUX, ADJ]}
Where the pos_tags are optional. This type of JSON can be created from exporting a target_extraction.data_types.TargetTextCollection using the to_json_file method.
If the pos_tags are given, they can be used as either features or for joint learning.
The only sequence labelling scheme currently supported is BIO, also known as IOB-2.
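A record of the type described above can be checked before it is handed to a model. The sketch below is an assumption-laden illustration rather than the reader's code: it assumes the exported file holds one JSON object per line, and the function name `load_extraction_records` is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def load_extraction_records(
    lines: Iterable[str], pos_tags: bool = False
) -> List[Dict[str, Any]]:
    """Parse JSON lines and keep tags (and optionally POS tags) per token.

    Hypothetical helper, assuming one JSON object per line.
    """
    records: List[Dict[str, Any]] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        raw = json.loads(line)
        tokens = raw["tokenized_text"]
        record: Dict[str, Any] = {"tokens": tokens, "text": raw["text"]}
        for key in ("tags", "pos_tags"):
            if key == "pos_tags" and not pos_tags:
                continue  # POS tags are only extracted when requested
            if key in raw:
                if len(raw[key]) != len(tokens):
                    raise ValueError(f"{key} must have one entry per token")
                record[key] = raw[key]
        records.append(record)
    return records


line = json.dumps({
    "tokenized_text": ["This", "Camera", "lens", "is", "great"],
    "text": "This Camera lens is great",
    "tags": ["O", "B", "I", "O", "O"],
    "pos_tags": ["DET", "NOUN", "NOUN", "AUX", "ADJ"],
})
records = load_extraction_records([line], pos_tags=True)
print(records[0]["tags"])
```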
- Params pos_tags
Whether or not to extract POS tags if available.
- Returns
A Dataset of Instances for Target Extraction.
-
text_to_instance
(tokens, text, tags=None, pos_tags=None)[source]¶ The tokens are expected to be pre-tokenised.
The original token text and the text itself are stored in a MetadataField.
- Parameters
tokens (List[Token]) – Tokenised text that either has target extraction labels or is to be tagged.
text (str) – The text that the tokenised text has come from.
tags (Optional[List[str]]) – The target extraction BIO labels.
pos_tags (Optional[List[str]]) – POS tags to be used either as features or for joint learning.
- Return type
Instance
- Returns
An Instance object with all of the above encoded for a PyTorch model.
-
target_extraction.allen.dataset_readers.target_extraction.
logger
= <Logger target_extraction.allen.dataset_readers.target_extraction (WARNING)>¶
target_extraction.allen.dataset_readers.target_sentiment module¶
-
class
target_extraction.allen.dataset_readers.target_sentiment.
TargetSentimentDatasetReader
(token_indexers=None, tokenizer=None, left_right_contexts=False, reverse_right_context=False, incl_target=False, use_categories=False, target_sequences=False, position_embeddings=False, position_weights=False, max_position_distance=None, **kwargs)[source]¶ Bases:
allennlp.data.dataset_readers.dataset_reader.DatasetReader
Dataset reader designed to read a list of JSON like objects of the following type:
- {text: This Camera lens is great,
targets: [Camera], target_sentiments: [positive]}
or
- {text: This Camera lens is great,
categories: [CAMERA], category_sentiments: [positive]}
or
- {text: This Camera lens is great,
targets: [Camera], categories: [CAMERA], target_sentiments: [positive]}
or
- {text: This Camera lens is great,
targets: [Camera], target_sentiments: [positive], spans: [[5,11]]}
This type of JSON can be created from exporting a target_extraction.data_types.TargetTextCollection using the to_json_file method.
The difference between the four objects depends on the objective of the model being trained:
1. The first version is for a purely Target based sentiment classifier.
2. The second version is for a purely Aspect or latent based sentiment classifier.
3. The third version is for making use of the relationship between the Target and Aspect in the sentiment classifier.
4. The fourth version is for a Target based sentiment classifier that requires knowledge of where the target is.
- Parameters
lazy – Whether or not instances can be read lazily.
token_indexers (Optional[Dict[str, TokenIndexer]]) – We use this to define the input representation for the text. See allennlp.data.token_indexers.TokenIndexer.
tokenizer (Optional[Tokenizer]) – Tokenizer to use to split the sentence text as well as the text of the target.
left_right_contexts (bool) – If True, the instance returned by text_to_instance will contain the sentence context to the left and right of the target.
reverse_right_context (bool) – If True, this will reverse the text that is in the right context. NOTE left_right_contexts has to be True.
incl_target (bool) – If left_right_contexts is True and this is also True, the left and right contexts will include the target word(s) as well.
use_categories (bool) – Whether or not to return the categories in the instances even if they do occur in the dataset. This is a temporary solution to the following issue: the number of categories does not have to match the number of targets, there only has to be at least one category per sentence.
target_sequences (bool) – Whether or not to generate target_sequences, which are a sequence of masks per target for all target texts. This allows the model to know which tokens in the context relate to the target. An example of this is shown below (this requires the span of each target).
position_embeddings (bool) – Whether or not to create distance values that can be converted to embeddings. These are similar to the position_weights, but instead of the model later using them as weights, it uses the distances to learn position embeddings (this requires the span of each target). See: A Position-aware Bidirectional Attention Network for Aspect-level Sentiment Analysis.
position_weights (bool) – In the instances there will be an extra key position_weights, which will be an array of integers representing the linear distance between each token and its target. E.g. if the text contains two targets, where each token is represented by a number and the 1's mark target tokens = [[0,0,0,1], [1,1,0,0]], then the position_weights will be [[4,3,2,1], [1,1,2,3]] (this requires the span of each target). An example of position weighting is in section 3.3 of Modeling Sentiment Dependencies with Graph Convolutional Networks for Aspect-level Sentiment Classification.
max_position_distance (Optional[int]) – The maximum position distance given to a token from the target. E.g. for [0,0,0,0,0,1,0,0], where each value represents a token and 1's represent target tokens, the distance array would be [6,5,4,3,2,1,2,3]; if the max_position_distance is 5 then the distance array will be [5,5,4,3,2,1,2,3] (for this to work either position_embeddings or position_weights has to be True).
- Raises
ValueError – If left_right_contexts is not True while either the incl_target or reverse_right_context argument is True.
ValueError – If the left_right_contexts and target_sequences are True at the same time.
ValueError – If the max_position_distance when set is less than 2.
ValueError – If max_position_distance is set but neither position_embeddings nor position_weights are True.
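The distance values described for position_weights and max_position_distance can be reproduced with a few lines of plain Python. This is a minimal sketch consistent with the worked examples in the parameter descriptions, not the reader's own code; the function name `position_distances` is hypothetical.

```python
from typing import List, Optional


def position_distances(
    target_mask: List[int], max_position_distance: Optional[int] = None
) -> List[int]:
    """Linear distance of each token to the nearest target token, starting at 1.

    Hypothetical helper; target_mask uses 1 for target tokens and 0 otherwise.
    """
    target_indices = [i for i, value in enumerate(target_mask) if value == 1]
    if not target_indices:
        raise ValueError("The mask must mark at least one target token")
    distances: List[int] = []
    for index in range(len(target_mask)):
        # Target tokens get distance 1, their neighbours 2, and so on.
        distance = min(abs(index - t) for t in target_indices) + 1
        if max_position_distance is not None:
            # Clip far away tokens to the maximum distance.
            distance = min(distance, max_position_distance)
        distances.append(distance)
    return distances


print(position_distances([0, 0, 0, 1]))
# → [4, 3, 2, 1]
print(position_distances([0, 0, 0, 0, 0, 1, 0, 0], max_position_distance=5))
# → [5, 5, 4, 3, 2, 1, 2, 3]
```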
- Example of target_sequences
- {text: This Camera lens is great but the screen is rubbish,
targets: [Camera, screen], target_sentiments: [positive, negative], target_sequences: [[0,1,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,1,0,0]],
spans: [[5,11], [34,40]]}
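The masks in the example above follow from the character spans once the text is tokenised. The sketch below assumes simple whitespace tokenisation, which is not necessarily the tokenizer the reader is configured with, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def whitespace_token_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace separated tokens."""
    return [(match.start(), match.end()) for match in re.finditer(r"\S+", text)]


def target_sequence_masks(text: str, spans: List[List[int]]) -> List[List[int]]:
    """One mask per target: 1 where a token overlaps the target span, else 0.

    Hypothetical helper assuming whitespace tokenisation.
    """
    offsets = whitespace_token_offsets(text)
    masks: List[List[int]] = []
    for span_start, span_end in spans:
        mask = [
            1 if start < span_end and end > span_start else 0
            for start, end in offsets
        ]
        masks.append(mask)
    return masks


text = "This Camera lens is great but the screen is rubbish"
print(target_sequence_masks(text, [[5, 11], [34, 40]]))
# → [[0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
```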
-
text_to_instance
(text, targets=None, target_sentiments=None, spans=None, categories=None, category_sentiments=None, **kwargs)[source]¶ The original text, text tokens as well as the targets and target tokens are stored in the MetadataField.
- NOTE
At least targets and/or categories must be present.
- NOTE
That the left and right contexts returned in the instance are a List of a List of tokens. A list for each Target.
- Parameters
text (str) – The text that contains the target(s) and/or categories.
targets (Optional[List[str]]) – The targets that are within the text.
target_sentiments (Optional[List[Union[int, str]]]) – The sentiment of the targets. To be used if training the classifier.
spans (Optional[List[List[int]]]) – The spans that represent the character offsets for each of the targets given in the targets list.
categories (Optional[List[str]]) – The categories that are within the text.
category_sentiments (Optional[List[Union[int, str]]]) – The sentiment of the categories.
- Return type
Instance
- Returns
An Instance object with all of the above encoded for a PyTorch model.
- Raises
ValueError – If targets and categories are both None.
ValueError – If self._target_sequences is True and the passed spans argument is None.
ValueError – If self._left_right_contexts is True and the passed spans argument is None.
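The left and right contexts depend on the spans, which is why spans must be given when left_right_contexts is True. The sketch below shows one way the contexts could be derived, with one list of tokens per target. It assumes whitespace tokenisation and that reversing the right context means reversing its token order; both are assumptions, and the function name `left_right_contexts` is hypothetical.

```python
from typing import List, Tuple


def left_right_contexts(
    text: str,
    spans: List[List[int]],
    incl_target: bool = False,
    reverse_right_context: bool = False,
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (left contexts, right contexts), one token list per target.

    Hypothetical helper assuming whitespace tokenisation.
    """
    lefts: List[List[str]] = []
    rights: List[List[str]] = []
    for start, end in spans:
        # With incl_target both contexts extend over the target itself.
        left_text = text[:end] if incl_target else text[:start]
        right_text = text[start:] if incl_target else text[end:]
        right_tokens = right_text.split()
        if reverse_right_context:
            right_tokens.reverse()
        lefts.append(left_text.split())
        rights.append(right_tokens)
    return lefts, rights


lefts, rights = left_right_contexts("This Camera lens is great", [[5, 11]])
print(lefts, rights)
# → [['This']] [['lens', 'is', 'great']]
```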
-
target_extraction.allen.dataset_readers.target_sentiment.
logger
= <Logger target_extraction.allen.dataset_readers.target_sentiment (WARNING)>¶
target_extraction.allen.dataset_readers.text_sentiment module¶
-
class
target_extraction.allen.dataset_readers.text_sentiment.
TextSentimentReader
(label_name='label', **kwargs)[source]¶ Bases:
allennlp.data.dataset_readers.text_classification_json.TextClassificationJsonReader
Subclasses allennlp.data.dataset_readers.TextClassificationJsonReader; the only difference is the label_name constructor parameter, which is explained in the Parameters section below.
Reads tokens and their labels from a labeled text classification dataset. Expects a "text" field and a "label" field in JSON format. The output of read is a list of Instances with the fields: tokens: TextField and label: LabelField
- token_indexers: Dict[str, TokenIndexer], optional (default={"tokens": SingleIdTokenIndexer()}) We use this to define the input representation for the text. See TokenIndexer.
- tokenizer: Tokenizer, optional (default = SpacyTokenizer()) Tokenizer to use to split the input text into words or other kinds of tokens.
- segment_sentences: bool, optional (default = False) If True, we will first segment the text into sentences using SpaCy and then tokenize words. Necessary for some models that require pre-segmentation of sentences, like the Hierarchical Attention Network (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).
- max_sequence_length: int, optional (default = None) If specified, will truncate tokens to specified maximum length.
- skip_label_indexing: bool, optional (default = False) Whether or not to skip label indexing. You might want to skip label indexing if your labels are numbers, so the dataset reader doesn't re-number them starting from 0.
- lazy: bool, optional (default = False) Whether or not instances can be read lazily.
- label_name: str, optional (default = label) The name of the label field in the JSON objects that are read.
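The effect of label_name can be illustrated without AllenNLP. The sketch below assumes each input line is a single JSON object and simply looks the label up under a configurable key; it is not the reader's implementation, and the function name `read_text_and_label` is hypothetical.

```python
import json
from typing import Any, Optional, Tuple


def read_text_and_label(line: str, label_name: str = "label") -> Tuple[str, Optional[Any]]:
    """Return the text and the value stored under label_name, if any.

    Hypothetical helper, assuming one JSON object per line.
    """
    record = json.loads(line)
    # A missing label is allowed, e.g. when predicting on unlabelled text.
    return record["text"], record.get(label_name)


line = json.dumps({"text": "This Camera lens is great", "text_sentiment": "positive"})
print(read_text_and_label(line, label_name="text_sentiment"))
# → ('This Camera lens is great', 'positive')
```

With the default label_name the same line would yield no label, because the record stores its label under a different key.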
-
text_to_instance
(text, label=None, **kwargs)[source]¶
- Parameters
text (str, required) – The text to classify.
label (str, optional, default = None) – The label for this text.
- Returns
An Instance containing the following fields:
tokens (TextField): The tokens in the sentence or phrase.
label (LabelField): The label of the sentence or phrase.
- Return type
Instance
-
target_extraction.allen.dataset_readers.text_sentiment.
logger
= <Logger target_extraction.allen.dataset_readers.text_sentiment (WARNING)>¶