target_extraction package

Subpackages

Submodules

target_extraction.data_types module

Module that contains the two main data types target_extraction.data_types.TargetText and target_extraction.data_types.TargetTextCollection where the latter is a container for the former.

classes:

  1. target_extraction.data_types.TargetText

  2. target_extraction.data_types.TargetTextCollection

class target_extraction.data_types.TargetText(text, text_id, targets=None, spans=None, target_sentiments=None, categories=None, category_sentiments=None, anonymised=False, **additional_data)[source]

Bases: collections.abc.MutableMapping

This is a data structure that inherits from MutableMapping which is essentially a python dictionary.

The following are the default keys that are in all TargetText objects; additional items can be added through __setitem__:

  1. text – The text associated to all of the other items

  2. text_id – The unique ID associated to this object

  3. targets – List of all target words that occur in the text. A special placeholder of None (the Python None value) can exist where the target does not exist but a related category does; in that case the related span is Span(0, 0). This special placeholder is in place for the SemEval 2016 Restaurant dataset, where the categories are linked to the targets but not all categories have related targets, hence the None.

  4. spans – List of Span NamedTuples where each one specifies the start and end of the respective targets within the text.

  5. target_sentiments – List specifying the sentiment of the respective targets within the text.

  6. categories – List of categories that exist in the data which may or may not link to the targets (this is dataset specific). NOTE: depending on the dataset and how it is parsed, the category can exist while the target does not, as the category is a latent variable. In these cases the category and category sentiments will be the same size, which can differ from the target and target sentiments size. This can happen where the dataset has targets and categories but they do not map to each other in a one to one manner, e.g. the SemEval 2014 restaurant dataset, where some samples contain categories but no targets. Another word for category is aspect.

  7. category_sentiments – List of the sentiments associated to the categories. If the categories and targets map to each other then this will be empty and you will only use the target_sentiments.
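The parallel-list layout described by these keys can be sketched with a plain dictionary (illustrative only; a real instance is constructed via TargetText(...), and the text and values below are invented):

```python
# A plain-dict sketch of the parallel-list layout a TargetText holds.
sample = {
    "text": "The laptop case was great and cover was rubbish",
    "text_id": "0",
    "targets": ["laptop case", "cover"],
    "spans": [(4, 15), (30, 35)],
    "target_sentiments": ["positive", "negative"],
}

# Each span holds the character offsets of its target within the text,
# so slicing the text with a span recovers the target word(s).
for target, (start, end) in zip(sample["targets"], sample["spans"]):
    assert sample["text"][start:end] == target
```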

Attributes:

  1. anonymised – If True then the data within the TargetText object has no text but the rest of the metadata should exist.

Methods:

  1. to_json – Returns the object as a dictionary and then encoded using json.dumps

  2. to_conll – Returns a CONLL formatted string where the format will be the following: TOKEN#GOLD LABEL#PREDICTION 1#PREDICTION 2. Where each token and its relevant labels are on separate new lines. The first line will always contain the following: # {text_id: `value`} where the text_id represents the text_id of this TargetText; this allows the CONLL string to be uniquely identified back to this TargetText object.

  3. from_conll – Adds the gold labels and/or predicted sequence labels from the CONLL formatted string.

  4. tokenize – This will add a new key tokenized_text to this TargetText instance that will store the tokens of the text that is associated to this TargetText instance.

  5. pos_text – This will add a new key pos_tags to this TargetText instance. This key will store the pos tags of the text that is associated to this Target Text instance.

  6. force_targets – Does not return anything but modifies the spans and text values so that whitespace is prefixed and suffixed to the target unless the prefix or suffix is already whitespace. NOTE that this is currently the only method that can change the spans and text key values after they have been set.

  7. sequence_labels – Adds the sequence_labels key to this TargetText instance which can be used to train a machine learning algorithm to detect targets.

  8. get_sequence_indexs – The indexes related to the tokens, pos tags etc. for each labelled sequence span.

  9. get_sequence_spans – The span indexes from the sequence labels given, assuming that the sequence labels are in BIO format.

  10. get_targets_from_sequence_labels – Retrieves the target words given the sequence labels.

  11. one_sample_per_span – This returns a similar TargetText instance where the new instance will only contain one target per span.

  12. left_right_target_contexts – This will return the sentence that is left and right of the target as well as the words in the target for each target in the sentence.

  13. replace_target – Given an index and a new target word it will replace the target at the index with the new target word and return a new TargetText object with everything the same apart from this new target.

  14. de_anonymise – This will set the anonymised attribute to False from True and set the text key value to the value in the text key within the text_dict argument.

  15. in_order – True if all the targets within this TargetText are in sequential left to right order within the text.

  16. re_order – Re-orders the TargetText object targets so that they are in a left to right order within the text; this will then re-order all values within this object that are in a list format into this order. Once the TargetText has been re-ordered it will return True when :py:meth:`target_extraction.data_types.TargetText.in_order` is called.

  17. add_unique_key – Given a key e.g. targets it will create a new value in the TargetText object that is a list of strings which are unique IDs based on the text_id and the index the targets occur in, e.g. if the targets contain [food, service] and the text_id is 12a5 then the target_id created will contain [`12a5$$0`, `12a5$$1`]

Static Functions:

  1. from_json – Returns a TargetText object given a json string. For example the json string can be the return of TargetText.to_json.

  2. targets_from_spans – Given a sequence of spans and the associated text it will return the targets that are within the text based on the spans

  3. target_text_from_prediction – Creates a TargetText object from data that has come from predictions of a Target Extract tagger

add_unique_key(id_key, id_key_name, id_delimiter='::')[source]
Parameters
  • id_key (str) – The name of the key within this TargetText that requires unique ids that will be stored in id_key_name.

  • id_key_name (str) – The name of the key to associate to these new unique ids.

  • id_delimiter (str) – The delimiter to separate the text_id and the index of the id_key that is being represented by this unique id.

Raises
  • KeyError – If the id_key_name already exists within the TargetText.

  • TypeError – If the value of id_key is not of type List.

Example

self.add_unique_key(`targets`, `targets_id`) where targets = [`food`, `service`] and text_id = `12a5`; the following key will be added to self: targets_id with the value [`12a5::0`, `12a5::1`]

Return type

None
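The id scheme can be sketched as a small stand-alone helper (hypothetical, not the actual implementation, which stores the result on the object):

```python
def make_unique_ids(text_id, values, delimiter="::"):
    """Sketch of the id scheme: text_id, then delimiter, then the
    positional index of each value in the list."""
    return [f"{text_id}{delimiter}{index}" for index in range(len(values))]

make_unique_ids("12a5", ["food", "service"])
# → ["12a5::0", "12a5::1"]
```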

property anonymised
Return type

bool

Returns

True if the data within the TargetText has been anonymised. Anonymised data means that there is no text associated with the TargetText object but all of the metadata is there.

de_anonymise(text_dict)[source]

This will set the anonymised attribute to False from True and set the text key value to the value in the text key within the text_dict argument.

Parameters

text_dict (Dict[str, str]) – A dictionary that contains the following two keys: 1. text and 2. text_id, where the text_id has to match the current TargetText object's text_id and the text value will become the new value of the text key for this TargetText object.

Raises
  • ValueError – If the TargetText object text_id does not match the text_id within text_dict argument.

  • AnonymisedError – If the text given does not pass the sanitize() test.

Return type

None

force_targets()[source]
NOTE

As this affects the spans, text, and targets attributes it has to modify them through self._storage, as these attributes are within self._protected_keys.

Does not return anything but modifies the spans and text values so that whitespace is prefixed and suffixed to the target unless the prefix or suffix is already whitespace.

Motivation: ensures that the target tokens are not within another separate string. E.g. if the target is priced, the sentence is the laptop;priced is high, and the tokenizer splits on whitespace, then priced will not be separated and the BIO tagging is not deterministic; forcing will add whitespace around the target word, e.g. the laptop; priced. This was mainly added for the TargetText.sequence_labels method.

Raises

AnonymisedError – If the object has been anonymised then this method cannot be used.

Return type

None
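The whitespace-forcing idea can be sketched for a single span as follows (a hypothetical helper; the real method mutates the object in place and handles all of its spans):

```python
def force_whitespace(text, span):
    """Pad a target with spaces unless whitespace (or a text boundary)
    already borders it; return the new text and the shifted span."""
    start, end = span
    prefix = "" if start == 0 or text[start - 1].isspace() else " "
    suffix = "" if end == len(text) or text[end].isspace() else " "
    new_text = text[:start] + prefix + text[start:end] + suffix + text[end:]
    new_span = (start + len(prefix), end + len(prefix))
    return new_text, new_span

# The motivating example: `priced` is stuck to the `;` before it.
force_whitespace("the laptop;priced is high", (11, 17))
# → ("the laptop; priced is high", (12, 18))
```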

from_conll(conll_str, tokens_key='tokenized_text', gold_label_key=None, prediction_key=None)[source]
Parameters
  • conll_str (str) – CONLL formatted string formatted like so: TOKEN#GOLD LABEL#PREDICTION 1#PREDICTION 2

  • tokens_key (str) – Key to save the CONLL tokens to.

  • gold_label_key (Optional[str]) – Key to save the gold labels to. At least one of gold_label_key and prediction_key must not be None.

  • prediction_key (Optional[str]) – Key to save the prediction labels to. The value will be of shape (number runs, number tokens)

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • ValueError – If both gold_label_key and prediction_key are None.

  • ValueError – If the number of labels are not consistent in the CONLL string e.g. the first token has 3 predicted labels and the second token has 2 predicted labels.

  • ValueError – If the text within this TargetText does not match the tokens in the CONLL string. (CASE SENSITIVE)

Return type

None

static from_json(json_text, anonymised=False)[source]

This is required as the ‘spans’ are Span objects which are not JSON serializable but are required for TargetText, therefore this handles that special case.

This function is also required as we have had to avoid using the __set__ function and add objects via the underlying _storage dictionary so that we could add values to this object that are not within the constructor, like tokenized_text. To ensure that it is compatible with the TargetText concept we call the TargetText.sanitize method at the end.

Parameters
  • json_text (str) – JSON representation of TargetText (can be from TargetText.to_json)

  • anonymised (bool) – Whether or not the TargetText object being loaded is an anonymised version.

Return type

TargetText

Returns

A TargetText object

Raises

KeyError – If within the JSON representation there is no text_id key. Or if anonymised is False raises a KeyError if there is no text key in the JSON representation.
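The Span special case can be sketched with the standard json module (illustrative only; this assumes Span is the package's (start, end) NamedTuple):

```python
import json
from collections import namedtuple

Span = namedtuple("Span", ["start", "end"])

# Span namedtuples serialise as plain JSON arrays, so loading has to
# rebuild them; from_json handles this for real TargetText objects.
record = {"text_id": "1", "spans": [Span(4, 15)]}
encoded = json.dumps(record)

decoded = json.loads(encoded)          # spans come back as [[4, 15]]
decoded["spans"] = [Span(*span) for span in decoded["spans"]]
assert decoded["spans"] == [Span(start=4, end=15)]
```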

get_sequence_indexs(sequence_key)[source]

The following sequence label tags are supported: IOB-2. These are the tags that are currently generated by sequence_labels.

Parameters

sequence_key (str) – Key to sequence labels such as BIO sequence labels. An example key name would be sequence_labels after the sequence_labels function has been called, or more appropriately predicted_sequence_labels when you have predicted sequence labels.

Return type

List[List[int]]

Returns

A list of lists of integers where each list of integers represents the token/pos tag/sequence label indexes of each sequence label span. Example: the sequence labels [O, B, I, O, B] would return the following list of integer lists: [[1, 2], [4]]

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • ValueError – If the sequence labels that are contained in the sequence key value contain values other than B, I, or O.

  • ValueError – If the number of tokens in the current TargetText object is not the same as the number of sequence labels.
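The IOB-2 grouping this method performs can be sketched as a stand-alone function (hypothetical helper; the real method also validates the labels and raises the errors listed above):

```python
def sequence_label_indexes(labels):
    """Group token indexes per labelled span for IOB-2 labels:
    B opens a new span, I extends it, O closes it."""
    spans, current = [], []
    for index, label in enumerate(labels):
        if label == "B":
            if current:
                spans.append(current)
            current = [index]
        elif label == "I":
            current.append(index)
        else:  # "O" closes any open span
            if current:
                spans.append(current)
                current = []
    if current:
        spans.append(current)
    return spans

# The docstring's example above:
assert sequence_label_indexes(["O", "B", "I", "O", "B"]) == [[1, 2], [4]]
```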

get_sequence_spans(sequence_key, confidence=None)[source]

The following sequence label tags are supported: IOB-2. These are the tags that are currently generated by sequence_labels

Parameters
  • sequence_key (str) – Key to sequence labels such as BIO sequence labels. An example key name would be sequence_labels after the sequence_labels function has been called, or more appropriately predicted_sequence_labels when you have predicted sequence labels.

  • confidence (Optional[float]) –

    Optional argument that will return only spans that have been predicted with a confidence higher than this. NOTE: as it is BIO labelling, in the case where all but one of the B and I labels are greater than the threshold, that span would not be returned, as one of the words in the multi-word target is less than the threshold.

Return type

List[Span]

Returns

The span indexes from the sequence labels given, assuming that the sequence labels are in BIO format.

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • KeyError – If no confidence key is found. However the confidence key is only required if the confidence argument is set.

  • ValueError – If the sequence labels that are contained in the sequence key value contain values other than B, I, or O.

  • ValueError – If the confidence value is not between 0 and 1

get_targets_from_sequence_labels(sequence_key, confidence=None)[source]

This function's main use is when the sequence labels have been predicted on a piece of text that has no gold annotations.

Parameters
  • sequence_key (str) – Key to sequence labels such as BIO sequence labels. An example key name would be sequence_labels after the sequence_labels function has been called, or more appropriately predicted_sequence_labels when you have predicted sequence labels.

  • confidence (Optional[float]) –

    Optional argument that will return only target texts that have been predicted with a confidence higher than this. NOTE: as it is BIO labelling, in the case where all but one of the B and I labels are greater than the threshold, that target word would not be returned, as one of the words in the multi-word target is less than the threshold.

Return type

List[str]

Returns

The target texts that the sequence labels have predicted.

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • KeyError – If no tokenized_text or confidence keys are found. However the confidence key is only required if the confidence argument is set.

  • ValueError – If the confidence value is not between 0 and 1
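Recovering target strings from tokens and their IOB-2 labels can be sketched as follows (hypothetical helper; joining tokens with a single space is an assumption, the real method works from the stored tokens):

```python
def targets_from_labels(tokens, labels):
    """Rebuild target strings from tokens and their IOB-2 labels:
    B starts a target, I continues it, O ends it."""
    targets, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                targets.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                targets.append(" ".join(current))
            current = []
    return targets + ([" ".join(current)] if current else [])

targets_from_labels(["the", "laptop", "case", "was", "great"],
                    ["O", "B", "I", "O", "O"])
# → ["laptop case"]
```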

in_order()[source]
Return type

bool

Returns

True if all the targets within this TargetText are in sequential left to right order within the text.

left_right_target_contexts(incl_target)[source]
Parameters

incl_target (bool) – Whether or not the left and right sentences should also include the target word.

Return type

List[Tuple[List[str], List[str], List[str]]]

Returns

The sentence that is left and right of the target as well as the words in the target for each target in the sentence.

Raises

AnonymisedError – If the object has been anonymised then this method cannot be used.

one_sample_per_span(remove_empty=False)[source]

This returns a similar TargetText instance where the new instance will only contain one target per span.

This is for the cases where you can have a target e.g. food that has a different related category attached to it e.g. TargetText(text=`$8 and there is much nicer, food, all of it great and continually refilled.`, text_id=`1`, targets=[`food`, `food`, `food`], categories=[`style`, `quality`, `price`], target_sentiments=[`pos`, `pos`, `pos`], spans=[Span(27, 31), Span(27, 31), Span(27, 31)])

As we can see the targets and the categories are linked; this is only really the case in SemEval 2016 datasets from what I know currently. In the example case above it will transform it to the following: TargetText(text=`$8 and there is much nicer, food, all of it great and continually refilled.`, text_id=`1`, targets=[`food`], spans=[Span(27, 31)])

This type of pre-processing is perfect for the Target Extraction task.

Parameters

remove_empty (bool) – If True, any None targets in the TargetText instance will be removed along with their respective Spans.

Return type

TargetText

Returns

This returns a similar TargetText instance where the new instance will only contain one target per span.

Raises

AnonymisedError – If the object has been anonymised then this method cannot be used.
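The de-duplication over spans can be sketched with plain lists (hypothetical helper; the real method returns a new TargetText and keeps the other keys consistent):

```python
def one_sample_per_span(targets, spans, remove_empty=False):
    """Keep only the first target for each distinct span; optionally
    drop None placeholder targets along with their spans."""
    seen, new_targets, new_spans = set(), [], []
    for target, span in zip(targets, spans):
        if remove_empty and target is None:
            continue
        if span not in seen:
            seen.add(span)
            new_targets.append(target)
            new_spans.append(span)
    return new_targets, new_spans

# The food/food/food example above collapses to a single sample:
one_sample_per_span(["food", "food", "food"], [(27, 31)] * 3)
# → (["food"], [(27, 31)])
```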

pos_text(tagger, perform_type_checks=False)[source]

This will add a new key pos_tags to this TargetText instance. This key will store the pos tags of the text that is associated to this Target Text instance. NOTE: It will also replace the current tokens in the tokenized_text key with the tokens produced from the pos tagger.

For a set of pos taggers that are definitely compatible see the target_extraction.pos_taggers module. The pos tagger will have to produce both a list of tokens and pos tags.

Parameters
  • tagger (Callable[[str], Tuple[List[str], List[str]]]) – POS tagger.

  • perform_type_checks (bool) – Whether or not to perform type checks to ensure the POS tagger returns a tuple containing two lists both containing Strings.

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • TypeError – If the POS tagger given does not return a Tuple

  • TypeError – If the POS tagger given does not return a List of Strings for both the tokens and the pos tags.

  • TypeError – If the POS tagger tokens or pos tags are not lists

  • ValueError – If the POS tagger return is not a tuple of length 2

  • ValueError – This is raised if the Target Text text is empty

  • ValueError – If the number of pos tags for this instance does not have the same number of tokens that has been generated by the tokenizer function.

Return type

None

re_order(keys_not_to_order=None)[source]

Re-orders the TargetText object so that the targets are in a left to right order within the text; this will then re-order all values within this object that are in a list format into this order. Once the TargetText has been re-ordered it will return True when :py:meth:`target_extraction.data_types.TargetText.in_order` is called.

Parameters

keys_not_to_order (Optional[List[str]]) – Any key values not to re-order using this function e.g. pos_tags, tokenized_text, etc

Raises

AssertionError – If running :py:meth:`target_extraction.data_types.TargetText.in_order` after being re-ordered does not return True.

Return type

None

replace_target(target_index, replacement_target_word)[source]
Parameters
  • target_index – The target index of the target word to replace.

  • replacement_target_word (str) – The target word to replace the target word at the given index.

Return type

TargetText

Returns

Given the target index and replacement target word it will replace the target at the index with the new target word and return a new TargetText object with everything the same apart from this new target.

Raises
  • ValueError – If the target_index is less than 0 or an index number that does not exist.

  • OverLappingTargetsError – If the target to replace is contained within another target e.g. what a great day if this has two targets great and great day then it will raise this error if you replace either word as each is within the other.

  • AnonymisedError – If the object has been anonymised then this method cannot be used.

Example

Given the following TargetText Object
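The docstring's example is truncated here; the replacement logic can be sketched on raw text and spans (a hypothetical helper, not the actual method, which returns a new TargetText):

```python
def replace_target(text, spans, target_index, new_word):
    """Splice a replacement word into the text and shift later spans.
    Overlapping spans are not handled; the real method raises
    OverLappingTargetsError for those."""
    start, end = spans[target_index]
    new_text = text[:start] + new_word + text[end:]
    shift = len(new_word) - (end - start)
    # Spans ending before the replaced target keep their offsets;
    # spans after it move by the length difference.
    new_spans = [(s, e) if e <= start else (s + shift, e + shift)
                 for s, e in spans]
    new_spans[target_index] = (start, start + len(new_word))
    return new_text, new_spans

replace_target("great laptop and great screen", [(6, 12), (23, 29)],
               0, "phone")
# → ("great phone and great screen", [(6, 11), (22, 28)])
```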

sanitize()[source]

This performs a check on all of the lists that can be given at object construction time to ensure that the following conditions are met:

  1. The target, spans and target_sentiments lists are all of the same size if set.

  2. The categories and the category_sentiments lists are all of the same size if set.

Furthermore it checks the following:

  1. If targets or spans are set then both have to exist.

  2. If targets and spans are set, that the spans' text matches the associated target words, e.g. if the target is barry davies in today barry davies went then the spans should be [[6,18]]

  3. If anonymised, ensures that the text key does not exist.

The 2nd check is not performed if self.anonymised is True.

Raises

ValueError – If any of the above conditions are not True.

Return type

None

sequence_labels(per_target=False, label_key=None)[source]

Adds the sequence_labels key to this TargetText instance which can be used to train a machine learning algorithm to detect targets. The value associated to the sequence_labels key will be a list of B, I, or O labels, where each label is associated to a token.

The force_targets method might come in useful here for training and validation data to ensure that fewer targets are affected by tokenization errors, as only tokens that are fully within the target span are labelled with B or I tags. Another use for force_targets is to ensure that targets are not affected by tokenization and therefore can be used to state where the targets are in the sequence for sentiment classification, e.g. in the case of getting contextualised target tokens or to create [TD-BERT Gao et al. 2019](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8864964).

Currently the only sequence labels supported are IOB-2 labels for the targets only. Future plans will look into different sequence label formats e.g. IOB; see the link below for more details on the difference between the two formats, of which there are more again. https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

Parameters
  • per_target (bool) – Whether the value associated to the sequence_labels key should be one list for all of the targets (False), or a list of labels per target where each list of labels will only be associated to its represented target (True).

  • label_key (Optional[str]) – Optional label key, where the key represents a list of values that are associated with each token. These values then become the class labels to attach to each B, I, O tag. E.g. if the label key is target_sentiments this creates the joint sequence labelling task of target extraction and sentiment prediction.

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • KeyError – If the current TargetText has not been tokenized, or if label_key is not None and label_key is not a key in self.

  • ValueError – If label_key is not None and the number of labels does not match the number of targets that the labels should be associated to.

  • ValueError – If two targets overlap the same token(s) e.g Laptop cover was great if Laptop and Laptop cover are two separate targets this should raise a ValueError as a token should only be associated to one target.

Return type

None
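The rule that a token is labelled B or I only when it lies fully inside a target span can be sketched as (hypothetical helper; token offsets here are character (start, end) pairs, which the real method derives from tokenization):

```python
def bio_labels(token_offsets, target_spans):
    """IOB-2 labels: a token is tagged B (first) or I (continuation)
    only if it lies fully inside a target span, so tokenization errors
    leave partially-covered tokens as O."""
    labels = ["O"] * len(token_offsets)
    for span_start, span_end in target_spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start >= span_start and tok_end <= span_end:
                labels[i] = "B" if first else "I"
                first = False
    return labels

# "the laptop case was great" with target "laptop case" at (4, 15):
bio_labels([(0, 3), (4, 10), (11, 15), (16, 19), (20, 25)], [(4, 15)])
# → ["O", "B", "I", "O", "O"]
```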

static target_text_from_prediction(text, text_id, sequence_labels, tokenized_text, confidence=None, confidences=None, **additional_data)[source]

Creates a TargetText object from data that has come from predictions of a Target Extract tagger e.g. the dictionaries that are returned from target_extraction.allen.allennlp_model.predict_sequences()

Parameters
  • text (str) – Text to give to the TargetText object

  • text_id (str) – Text ID to give to the TargetText object

  • sequence_labels (List[str]) – The predicted sequence labels

  • tokenized_text (List[str]) – The tokens that were used to produce the predicted sequence labels (should be returned by the Target Extract tagger predictor).

  • confidence (Optional[float]) – The level of confidence from the tagger that is required for a target to be a target e.g. 0.9

  • confidences (Optional[List[float]]) – The list of confidence values produced by the Target Extract tagger predictor to be used with the confidence argument. The list of confidence values should be the same size as the sequence labels list and tokenized text.

  • additional_data – Any other keyword arguments to provide to the TargetText object

Return type

TargetText

Returns

A TargetText object with spans and targets values

Raises
  • ValueError – If sequence labels, tokenized text and confidences are not of the same length

  • ValueError – If any of the following keys are in the additional data: 1. confidence, 2. text, 3. text_id, 4. tokenized_text, 5. sequence_labels, 6. targets, 7. spans. These keys will be populated within the TargetText object automatically.

static targets_from_spans(text, spans)[source]
Parameters
  • text (str) – The text that the spans are associated to.

  • spans (List[Span]) – A list of Span values that represent the character index of the target words to be returned.

Return type

List[str]

Returns

The target words that are associated to the spans and text given.
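The slicing this implies can be sketched as follows (a minimal stand-in, assuming each span is a (start, end) character pair as with the package's Span NamedTuple):

```python
def targets_from_spans(text, spans):
    """Recover target strings by slicing the text with each
    (start, end) character span."""
    return [text[start:end] for start, end in spans]

# The sanitize docstring's example: barry davies at [6, 18].
targets_from_spans("today barry davies went", [(6, 18)])
# → ["barry davies"]
```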

to_conll(gold_label_key, prediction_key=None)[source]
Parameters
  • gold_label_key – A key that contains a sequence of labels e.g. [B, I, O]. This can come from the return of the sequence_labels()

  • prediction_key (Optional[str]) – Key to the predicted labels of the gold_label. Where the prediction key values is a list of a list of predicted labels. Each list is therefore a different model run hence creating the PREDICTION 1, ‘PREDICTION 2’ etc. Thus the values of prediction_key must be of shape (number runs, number tokens)

Return type

str

Returns

A CONLL formatted string where the format will be the following: TOKEN#GOLD LABEL#PREDICTION 1#PREDICTION 2. Where each token and its relevant labels are on separate new lines. The first line will always contain the following: # {text_id: `value`} where the text_id represents the text_id of this TargetText; this allows the CONLL string to be uniquely identified back to this TargetText object.

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • KeyError – If the object has not been tokenized using tokenize()

  • KeyError – If the prediction_key or gold_label_key do not exist.

  • ValueError – If the gold_label_key or prediction_key values are not of the same length as the tokens, as the labels will not be able to match tokens etc.

  • ValueError – If the values in prediction_key are not of shape (number runs, number tokens)
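The layout can be sketched with a small formatter (hypothetical helper, not the actual method; it only illustrates the header line plus the #-delimited token lines described above):

```python
def to_conll_sketch(text_id, tokens, gold, predictions):
    """Sketch of the CONLL layout: a '# {text_id: value}' header, then
    one 'TOKEN#GOLD#PRED1#PRED2...' line per token; predictions is a
    list of runs, each run a list of labels of shape (runs, tokens)."""
    lines = [f"# {{text_id: {text_id}}}"]
    for i, (token, label) in enumerate(zip(tokens, gold)):
        preds = "#".join(run[i] for run in predictions)
        lines.append(f"{token}#{label}#{preds}" if preds
                     else f"{token}#{label}")
    return "\n".join(lines)

to_conll_sketch("1", ["great", "laptop"], ["O", "B"],
                [["O", "B"], ["O", "O"]])
# → "# {text_id: 1}\ngreat#O#O#O\nlaptop#B#B#O"
```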

to_json()[source]

Required as TargetText is not JSON serializable due to the ‘spans’.

Return type

str

Returns

The object as a dictionary and then encoded using json.dumps

tokenize(tokenizer, perform_type_checks=False)[source]

This will add a new key tokenized_text to this TargetText instance that will store the tokens of the text that is associated to this TargetText instance.

For a set of tokenizers that are definitely compatible see the target_extraction.tokenizers module.

Ensures that the tokenization is character preserving.

Parameters
  • tokenizer (Callable[[str], List[str]]) – The tokenizer to use to tokenize the text for each TargetText instance in the current collection

  • perform_type_checks (bool) – Whether or not to perform type checks to ensure the tokenizer returns a List of Strings

Raises
  • AnonymisedError – If the object has been anonymised then this method cannot be used.

  • TypeError – If the tokenizer given does not return a List of Strings.

  • ValueError – This is raised if the TargetText instance contains empty text.

  • ValueError – If the tokenization is not character preserving.

Return type

None
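One way to sketch the character-preserving check (an assumed simplification: it treats a tokenization as preserving when the joined tokens reproduce the text with whitespace ignored):

```python
def is_character_preserving(text, tokens):
    """Sketch: joining the tokens must reproduce the text once
    whitespace is stripped from both sides of the comparison."""
    return "".join(tokens) == "".join(text.split())

assert is_character_preserving("the laptop case", ["the", "laptop", "case"])
assert not is_character_preserving("don't", ["do", "nt"])  # dropped the '
```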

class target_extraction.data_types.TargetTextCollection(target_texts=None, name=None, metadata=None, anonymised=False)[source]

Bases: collections.abc.MutableMapping

This is a data structure that inherits from MutableMapping which is essentially a python dictionary, however the underlying storage is a OrderedDict therefore if you iterate over it, the iteration will always be in the same order.

This structure only contains TargetText instances.

Attributes:

  1. name – Name associated to the TargetTextCollection.

  2. metadata – Any metadata to associate to the object e.g. domain of the dataset, all metadata is stored in a dictionary. By default the metadata will always have the name attribute within the metadata under the key name. If anonymised is also True then this will also be in the metadata under the key anonymised

  3. anonymised – If True then the data within the TargetText objects have no text but the rest of the metadata should exist.

Methods:

  1. to_json – Writes each TargetText instance as a dictionary using its own to_json function on a new line within the returned String. The returned String is not json compatible but each line is when split by new line, and each line is also compatible with the from_json method of TargetText.

  2. to_conll – A CONLL formatted string where the format will be the following: TOKEN#GOLD LABEL#PREDICTION 1#PREDICTION 2. Where each token and its relevant labels are on separate new lines. The first line will always contain the following: # {text_id: `value`} where the text_id represents the text_id of the relevant TargetText; this allows the CONLL string to be uniquely identified back to that TargetText object. Each TargetText CONLL string is separated by a new line.

  3. to_conll_file – Saves the TargetTextCollection to CONLL format. Useful for Sequence Labelling tasks.

  4. load_conll – Loads the CONLL information into the collection.

  5. add – Wrapper around __setitem__. Given as an argument a TargetText instance it will be added to the collection.

  6. to_json_file – Saves the current TargetTextCollection to a json file which won’t be strictly json but each line in the file will be and each line in the file can be loaded in from String via TargetText.from_json. Also the file can be reloaded into a TargetTextCollection using TargetTextCollection.load_json.

  7. tokenize – This applies the TargetText.tokenize method across all of the TargetText instances within the collection.

  8. pos_text – This applies the TargetText.pos_text method across all of the TargetText instances within the collection.

  9. sequence_labels – This applies the TargetText.sequence_labels method across all of the TargetText instances within the collection.

  10. force_targets – This applies the TargetText.force_targets method across all of the TargetText instances within the collection.

  11. exact_match_score – Recall, Precision, and F1 score in a Tuple. All of these measures are based on exact span matching rather than the matching of the sequence label tags, this is due to the annotation spans not always matching tokenization therefore this removes the tokenization error that can come from the sequence label measures.

  12. samples_with_targets – Returns all of the samples that have target spans as a TargetTextCollection.

  13. target_count – A dictionary of target text as key and values as the number of times the target text occurs in this TargetTextCollection

  14. one_sample_per_span – This applies the TargetText.one_sample_per_span method across all of the TargetText instances within the collection to create a new collection with those new TargetText instances within it.

  15. number_targets – Returns the total number of targets.

  16. number_categories – Returns the total number of categories.

  17. category_count – Returns a dictionary of categories as keys and values as the number of times the category occurs.

  18. target_sentiments – A dictionary where the keys are target texts and the values are a List of sentiment values that have been associated to that target.

  19. dict_iter – Returns an iterator of all of the TargetText objects within the collection as dictionaries.

  20. unique_distinct_sentiments – A set of the distinct sentiments within the collection. The length of the set represents the number of distinct sentiments within the collection.

  21. de_anonymise – This will set the anonymised attribute to False from True and set the text key value to the value in the text key within the text_dict argument for each of the TargetTexts in the collection. If any Error is raised this collection will revert back fully to being anonymised.

  22. sanitize – This applies the TargetText.sanitize function to all of the TargetText instances within this collection, effectively ensuring that all of the instances follow the specified rules that TargetText instances should follow.

  23. in_order – This returns True if all TargetText objects within the collection contain a list of targets that are in order of appearance within the text from left to right, e.g. if the only TargetText in the collection contains two targets, and the first target in the targets list is the first (left most) target in the text, then this method would return True.

  24. re_order – This will apply target_extraction.data_types.TargetText.re_order() to each TargetText within the collection.

  25. add_unique_key – Applies target_extraction.data_types.TargetText.add_unique_key() to each TargetText within this collection.

  26. key_difference – Given this collection and another it will return all of the keys that the other collection contains which this does not.

  27. combine_data_on_id – Given this collection and another it will add all of the data from the other collection into this collection based on the unique key given.

  28. one_sentiment_text – Adds the text_sentiment_key to each TargetText within the collection where the value will represent the sentiment value for the text based on the sentiment_key values and average_sentiment determining how to handle multiple sentiments. This will allow text level classifiers to be trained on target/aspect/category data.

Static Functions:

  1. from_json – Returns a TargetTextCollection object given the json like String from to_json. For example the json string can be the return of TargetTextCollection.to_json.

  2. load_json – Returns a TargetTextCollection based on each new line in the given json file.

  3. combine – Returns a TargetTextCollection that is the combination of all of those given.

  4. same_data – Given a List of TargetTextCollections it will return a list of tuples specifying the overlap between the collections based on the samples text_id and text key values. If it returns an empty list then there is no overlap between the collections. This is useful to find duplicates beyond the text_id as it checks the text value as well.

add(value)[source]

Wrapper around set item. Instead of having to add the value the usual way, by finding the instance's 'text_id' and setting this container's key to this value, it does this for you.

e.g. performs self[value[‘text_id’]] = value

Parameters

value (TargetText) – The TargetText instance to store in the collection. Will anonymise the TargetText object if the collection’s anonymised attribute is True.

Return type

None
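As an illustrative sketch of what add does, using a plain dict in place of a real TargetTextCollection and a plain dict in place of a TargetText (names here are hypothetical, not the library's own):

```python
# A minimal stand-in for a TargetTextCollection: a mapping from
# text_id to the TargetText-like dictionary itself.
collection = {}

def add(value):
    # `add` wraps __setitem__: the key is read from the value's own
    # 'text_id' field, i.e. it performs self[value['text_id']] = value.
    collection[value['text_id']] = value

sample = {'text_id': '1', 'text': 'The camera is great',
          'targets': ['camera']}
add(sample)
```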

add_unique_key(id_key, id_key_name, id_delimiter='::')[source]

Applies target_extraction.data_types.TargetText.add_unique_key() to each TargetText within this collection.

Parameters
  • id_key (str) – The name of the key within this TargetText that requires unique ids that will be stored in id_key_name.

  • id_key_name (str) – The name of the key to associate to these new unique ids.

  • id_delimiter (str) – The delimiter to separate the text_id and the index of the id_key that is being represented by this unique id.

Return type

None
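As a hypothetical illustration of this id scheme (the helper name is an assumption of this sketch): each element of the id_key list is identified by the sample's text_id, the delimiter (default '::'), and the element's index within that list.

```python
def make_unique_ids(text_id, id_key_values, id_delimiter='::'):
    # One unique id per element: text_id, delimiter, then list index.
    return [f'{text_id}{id_delimiter}{index}'
            for index in range(len(id_key_values))]

unique_ids = make_unique_ids('12', ['camera', 'battery life'])
```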

property anonymised
Return type

bool

Returns

True if the data within the TargetTextCollection has been anonymised. Anonymised data means that there is no text associated with any of the TargetText objects within the collection, but all of the metadata is there.

category_count()[source]
Return type

Dict[str, int]

Returns

A dictionary of categories as keys and values as the number of times the category occurs in this TargetTextCollection

Raises

ValueError – If any category has the value of None.

static combine(*collections)[source]
Parameters

collections – An iterator containing one or more TargetTextCollections

Return type

TargetTextCollection

Returns

A TargetTextCollection that is the combination of all of those given.

NOTE

If any of the collections are anonymised then the returned collection will also be anonymised, even if only one of the collections has been anonymised.

combine_data_on_id(other_collection, id_key, data_keys, raise_on_overwrite=True, check_same_ids=True)[source]
Parameters
  • other_collection (TargetTextCollection) – The collection that contains the data that is to be copied to this collection.

  • id_key (str) – The key that indicates in each TargetText within this and the other_collection how the values are to be copied from the other_collection to this collection.

  • data_keys (List[str]) – The keys of the values in each TargetText within the other_collection that are to be copied to the relevant TargetTexts within this collection. It assumes that if any of the key/values are a list of lists then the inner lists relate to the targets and the outer list is not related to the targets.

  • raise_on_overwrite (bool) – If True will raise the target_extraction.data_types_util.OverwriteError if any of the data_keys exist in any of the TargetTexts within this collection.

  • check_same_ids (bool) – If True will ensure that this collection and the other collection are of the same length and check that each has the same unique ids.

Raises
  • AssertionError – If the number of IDs from the id_key does not match the number of data to be added to a data key

  • ValueError – If check_same_ids is True and the two collections are either not of the same length or have different unique ids according to id_key within the TargetText objects.

  • OverwriteError – If raise_on_overwrite is True and any of the data_keys exist in any of the TargetTexts within this collection.

Return type

None

de_anonymise(text_dicts)[source]

This will set the anonymised attribute to False from True and set the text key value to the value in the text key within the text_dict argument for each of the TargetTexts in the collection. If any Error is raised this collection will revert back fully to being anonymised.

Parameters

text_dicts (Iterable[Dict[str, str]]) – An iterable of dictionaries that contain the following two keys: 1. text and 2. text_id where the text_id has to be a key within the current collection. The text associated to that id will become that TargetText object’s text value.

Raises
  • ValueError – If the length of the text_dicts does not match that of the collection.

  • KeyError – If any of the text_ids in the text_dicts do not match those within this collection.

Return type

None

dict_iterator()[source]
Return type

Iterable[Dict[str, Any]]

Returns

An iterator of all of the TargetText objects within the collection as dictionaries.

exact_match_score(predicted_sequence_key='predicted_sequence_labels')[source]

Just for clarification, we use the sequence label tags to find the predicted spans. However, even a perfect sequence label score does not mean you will have a perfect exact span score, as the tokenizer used for the sequence labelling might not align perfectly with the annotated spans.

The False Positive mistakes, False Negative mistakes, and correct True Positives are dictionary keys of those names, with the values being a List of Tuples where each Tuple is made up of the TargetText instance ID and the Span that was incorrect (FP), not tagged (FN), or correct (TP). An example of this is as follows: {'FP': [('1', Span(0, 4))], 'FN': [], 'TP': []}

Parameters

predicted_sequence_key (str) – Key of the predicted sequence labels within this TargetText instance.

Return type

Tuple[float, float, float, Dict[str, List[Tuple[str, Span]]]]

Returns

Recall, Precision, and F1 score, with the False Positive mistakes, False Negative mistakes, and correct True Positives in a Dict. All of these measures are based on exact span matching rather than the matching of the sequence label tags; this is because the annotation spans do not always match the tokenization, therefore this removes the tokenization error that can come from the sequence label measures.

Raises
  • KeyError – If there is no predicted sequence label key within this TargetText.

  • ValueError – If the predicted or true spans contain multiple spans that have the same span e.g. [Span(4, 15), Span(4, 15)]
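The exact-span scoring can be sketched as follows, using plain (start, end) tuples where the library's Span NamedTuple would appear; the function name and set-based layout are assumptions of this sketch, not the library's implementation:

```python
def exact_match(true_spans, predicted_spans):
    # Each element is a (text_id, (start, end)) pair.
    true_set, pred_set = set(true_spans), set(predicted_spans)
    tp = true_set & pred_set   # correctly predicted spans
    fp = pred_set - true_set   # predicted but not annotated
    fn = true_set - pred_set   # annotated but missed
    recall = len(tp) / len(true_set) if true_set else 0.0
    precision = len(tp) / len(pred_set) if pred_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1, {'TP': sorted(tp), 'FP': sorted(fp),
                                   'FN': sorted(fn)}

recall, precision, f1, errors = exact_match(
    [('1', (0, 4)), ('1', (10, 15))],
    [('1', (0, 4)), ('1', (20, 25))])
```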

force_targets()[source]

This applies the TargetText.force_targets method across all of the TargetText instances within the collection.

Return type

None

static from_json(json_text, **target_text_collection_kwargs)[source]

Required as the json text is expected to be the return of the self.to_json method. This string is not parsable by a standard json decoder.

Parameters
  • json_text (str) – This is expected to be a dictionary like object for each new line in this text

  • target_text_collection_kwargs – Key word arguments to give to the TargetTextCollection constructor.

Return type

TargetTextCollection

Returns

A TargetTextCollection based on each new line in the given text, where each line is parsed by the TargetText.from_json method.

Raises

AnonymisedError – If the TargetText object that is being loaded is anonymised but the target_text_collection_kwargs argument contains anonymised False, as you cannot de-anonymise without performing target_extraction.data_types.TargetTextCollection.de_anonymise().

in_order()[source]

This returns True if all TargetText objects within the collection contain a list of targets that are in order of appearance within the text from left to right, e.g. if the only TargetText in the collection contains two targets, and the first target in the targets list is the first (left most) target in the text, then this method would return True.

Return type

bool

Returns

True if all the targets within all the TargetText objects in this collection are in sequential left to right order within the text.
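The ordering check can be sketched as follows, with targets represented by plain (start, end) span tuples; the function name is an assumption of this sketch:

```python
def spans_in_order(spans):
    # Targets are in order when consecutive spans are non-decreasing,
    # i.e. sorted by start offset from left to right in the text.
    return all(earlier <= later
               for earlier, later in zip(spans, spans[1:]))

ordered = spans_in_order([(0, 3), (4, 10)])
unordered = spans_in_order([(4, 10), (0, 3)])
```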

key_difference(other_collection)[source]
Parameters

other_collection (TargetTextCollection) – The collection that is being compared to this.

Return type

List[str]

Returns

A list of keys that represent all of the keys that are in the other (compared) collection and not in this collection.

load_conll(conll_fp, tokens_key='tokenized_text', gold_label_key=None, prediction_key=None)[source]

This takes the conll_fp and loads the CONLL data into the relevant TargetText samples in this collection using the TargetText from_conll function. The matching of a TargetText with its CONLL data is done through the CONLL string containing # {text_id: _id} for each CONLL sentence/text.

Parameters
  • tokens_key (str) – Key to save the CONLL tokens to, for the TargetText.

  • gold_label_key (Optional[str]) – Key to save the gold labels to, for the TargetText. At least one of gold_label_key and prediction_key must not be None.

  • prediction_key (Optional[str]) – Key to save the prediction labels to, for the TargetText. The value will be of shape (number runs, number tokens).

Return type

None

static load_json(json_fp, **target_text_collection_kwargs)[source]

Allows loading a dataset from json. Where the json file is expected to be output from TargetTextCollection.to_json_file as the file will be a json String on each line generated from TargetText.to_json. This will also load any meta data that was stored within the TargetTextCollection.

Parameters
  • json_fp (Path) – File that contains json strings generated from TargetTextCollection.to_json_file

  • target_text_collection_kwargs – Key word arguments to give to the TargetTextCollection constructor. If there was any meta data stored within the loaded json then these key word arguments would override the meta data stored.

Return type

TargetTextCollection

Returns

A TargetTextCollection based on each new line in the given json file, and the optional meta data on the last line.

property name
Return type

str

Returns

The name attribute.

number_categories()[source]
Return type

int

Returns

The total number of categories in the collection

Raises

ValueError – If one of the category values in the list is of value None

number_targets(incl_none_targets=False)[source]
Parameters

incl_none_targets (bool) – Whether to include targets that are None and are therefore associated to the categories in the count.

Return type

int

Returns

The total number of targets in the collection.

one_sample_per_span(remove_empty=False)[source]

This applies the TargetText.one_sample_per_span method across all of the TargetText instances within the collection to create a new collection with those new TargetText instances within it.

Parameters

remove_empty (bool) – If the TargetText instance contains any None targets then these will be removed along with their respective Spans.

Return type

TargetTextCollection

Returns

A new TargetTextCollection that has samples that come from this collection but has had the TargetText.one_sample_per_span method applied to it.

one_sentiment_text(sentiment_key, average_sentiment=False, text_sentiment_key='text_sentiment')[source]

Adds the text_sentiment_key to each TargetText within the collection where the value will represent the sentiment value for the text based on the sentiment_key values and average_sentiment determining how to handle multiple sentiments. This will allow text level classifiers to be trained on target/aspect/category data.

Parameters
  • sentiment_key (str) – The key in the TargetTexts that represent the sentiment for the TargetTexts sentence.

  • average_sentiment (bool) – If False it will only add the text_sentiment_key to TargetTexts that have one unique sentiment in the sentiment_key, e.g. there can be more than one sentiment value in the sentiment_key but each of those values has to be the same. If True it will choose the most frequent sentiment, with ties decided by random choice. If there are no values in sentiment_key then text_sentiment_key will not be added to the TargetText.

  • text_sentiment_key (str) – The key to add the text level sentiment value to.

Return type

None
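The reduction from many target sentiments to one text-level sentiment can be sketched as below; the helper name is an assumption of this sketch, and ties here fall to Counter's ordering rather than random choice:

```python
from collections import Counter

def text_level_sentiment(sentiments, average_sentiment=False):
    if not sentiments:
        return None  # text_sentiment_key would not be added
    if average_sentiment:
        # Most frequent sentiment wins.
        return Counter(sentiments).most_common(1)[0][0]
    unique = set(sentiments)
    # Without averaging, only texts whose sentiments all agree qualify.
    return unique.pop() if len(unique) == 1 else None

label = text_level_sentiment(['positive', 'positive', 'negative'],
                             average_sentiment=True)
```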

pos_text(tagger)[source]

This applies the TargetText.pos_text method across all of the TargetText instances within the collection.

For a set of POS taggers that are definitely compatible see the target_extraction.pos_taggers module.

Parameters

tagger (Callable[[str], List[str]]) – POS tagger.

Raises
  • TypeError – If the POS tagger given does not return a List of Strings.

  • ValueError – This is raised if any of the TargetText instances in the collection contain an empty string.

  • ValueError – If the Target Text instance has not been tokenized.

  • ValueError – If the number of pos tags for a Target Text instance does not have the same number of tokens that has been generated by the tokenizer function.

Return type

None

re_order(keys_not_to_order=None)[source]

This will apply target_extraction.data_types.TargetText.re_order() to each TargetText within the collection.

Parameters

keys_not_to_order (Optional[List[str]]) – Any keys within the TargetTexts that do not need re-ordering

Return type

None

static same_data(collections)[source]
Parameters

collections (List[TargetTextCollection]) – A list of TargetTextCollections to test if there are any duplicates based on text_id and text key values.

Return type

List[Tuple[List[Tuple[TargetText, TargetText]], Tuple[str, str]]]

Returns

If the list is empty then there are no duplicates. Else it is a list of tuples containing 1. A list of tuples of duplicate TargetText instances, and 2. A tuple of the names of the collections that the duplicate TargetText instances came from.

samples_with_targets()[source]
Return type

TargetTextCollection

Returns

All of the samples that have targets as a TargetTextCollection for this TargetTextCollection.

Raises

KeyError – If either spans or targets does not exist in one or more of the TargetText instances within this collection. These keys are protected keys, thus they should always exist; this is just a warning in case you have got around the protected keys.

sanitize()[source]

This applies the TargetText.sanitize function to all of the TargetText instances within this collection, effectively ensuring that all of the instances follow the specified rules that TargetText instances should follow.

Return type

None

sequence_labels(return_errors=False, **target_sequence_label_kwargs)[source]

This applies the TargetText.sequence_labels method across all of the TargetText instances within the collection.

Parameters
  • return_errors (bool) – Returns TargetText objects that have caused the ValueError to be raised.

  • target_sequence_label_kwargs – Any Keyword arguments to give to the TargetText sequence_labels function.

Return type

List[TargetText]

Returns

A list of TargetText objects that have caused the ValueError to be raised if return_errors is True else an empty list will be returned.

Raises
  • KeyError – If the current TargetText has not been tokenized.

  • ValueError – If two targets overlap the same token(s), e.g. in Laptop cover was great, if Laptop and Laptop cover are two separate targets this should raise a ValueError as a token should only be associated to one target.
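Sequence labelling of this kind can be sketched as a span-to-BIO mapping; token offsets as (start, end) character positions and the function name are assumptions of this sketch:

```python
def bio_labels(token_offsets, target_spans):
    # Every token starts as outside ('O'); tokens fully covered by a
    # target span become 'B' (first token) or 'I' (subsequent tokens).
    labels = ['O'] * len(token_offsets)
    for span_start, span_end in target_spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start >= span_start and tok_end <= span_end:
                labels[i] = 'I' if inside else 'B'
                inside = True
    return labels

# 'The camera works' tokenized with character offsets; 'camera' is the target.
labels = bio_labels([(0, 3), (4, 10), (11, 16)], [(4, 10)])
```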

target_count(lower=False, target_key='targets')[source]
Note

The target may not exist, e.g. it can be a None target, as the target can be combined with the category like in the SemEval 2016 Restaurant dataset. In these cases we do not include these targets in the target count.

Parameters
  • lower (bool) – Whether or not to lower the target text.

  • target_key (str) – The key in each TargetText sample that contains the list of target words.

Return type

Dict[str, int]

Returns

A dictionary of target text as key and values as the number of times the target text occurs in this TargetTextCollection
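Counting targets across a collection can be sketched as below, with samples as plain dicts; the function name and dict representation are assumptions of this sketch:

```python
from collections import Counter

def target_count(samples, lower=False):
    counts = Counter()
    for sample in samples:
        for target in sample.get('targets') or []:
            if target is None:
                continue  # category-only placeholder, not counted
            counts[target.lower() if lower else target] += 1
    return dict(counts)

counts = target_count([{'targets': ['Camera', 'lens']},
                       {'targets': ['camera', None]}], lower=True)
```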

target_sentiments(lower=False, unique_sentiment=False)[source]
Note

The target may not exist, e.g. it can be a None target, as the target can be combined with the category like in the SemEval 2016 Restaurant dataset. In these cases we do not include these targets in the returned dictionary.

Parameters
  • lower (bool) – Whether or not to lower the target text.

  • unique_sentiment (bool) – Whether or not the return is a dictionary whose values are a List of Strings or if True a Set of Strings.

Return type

Dict[str, Union[List[str], Set[str]]]

Returns

A dictionary where the keys are target texts and the values are a List of sentiment values that have been associated to that target. The sentiment value can occur more than once indicating the number of times that target has been associated with that sentiment unless unique_sentiment is True then instead of a List of sentiment values a Set is used instead.

Explanation

If the target camera has occurred with the sentiment positive twice and negative once then it will return {camera: [positive, positive, negative]}. However if unique_sentiment is True then it will return: {camera: {positive, negative}}.
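The explanation above can be sketched as follows, with samples as plain dicts; the function name is an assumption of this sketch:

```python
def target_sentiments(samples, unique_sentiment=False):
    result = {}
    for sample in samples:
        for target, sentiment in zip(sample['targets'],
                                     sample['target_sentiments']):
            if target is None:
                continue  # category-only placeholder, skipped
            if unique_sentiment:
                result.setdefault(target, set()).add(sentiment)
            else:
                result.setdefault(target, []).append(sentiment)
    return result

samples = [{'targets': ['camera'], 'target_sentiments': ['positive']},
           {'targets': ['camera'], 'target_sentiments': ['positive']},
           {'targets': ['camera'], 'target_sentiments': ['negative']}]
sentiments = target_sentiments(samples)
```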

to_conll(gold_label_key, prediction_key=None)[source]

This in effect performs the to_conll function for each TargetText within the collection and separates the CONLL strings with a new line.

Parameters
  • gold_label_key – A key that contains a sequence of labels e.g. [B, I, O]. This can come from the return of the sequence_labels()

  • prediction_key (Optional[str]) – Key to the predicted labels of the gold_label_key, where the prediction key's value is a list of lists of predicted labels. Each inner list is a different model run, hence creating PREDICTION 1, PREDICTION 2, etc. Thus the value of prediction_key must be of shape (number runs, number tokens)

Return type

str

Returns

A CONLL formatted string where the format will be the following: TOKEN#GOLD LABEL#PREDICTION 1#PREDICTION 2, where each token and its relevant labels are on separate new lines. The first line will always contain the following: # {text_id: value}, where the text_id represents the text_id of this TargetText; this allows the CONLL string to be uniquely linked back to this TargetText object. Also each TargetText CONLL string will be separated by a new line.
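An illustrative sketch of this CONLL rendering for a single sample follows; the helper name and the space used as the column separator are assumptions of this sketch:

```python
def sample_to_conll(text_id, tokens, gold_labels, prediction_runs=None):
    # Header line identifying the sample, then one line per token with
    # its gold label and one column per prediction run.
    lines = [f'# {{text_id: {text_id}}}']
    prediction_runs = prediction_runs or []
    for i, (token, gold) in enumerate(zip(tokens, gold_labels)):
        columns = [token, gold] + [run[i] for run in prediction_runs]
        lines.append(' '.join(columns))
    return '\n'.join(lines)

conll = sample_to_conll('1', ['The', 'camera'], ['O', 'B'],
                        prediction_runs=[['O', 'B'], ['O', 'O']])
```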

to_conll_file(conll_fp, gold_label_key, prediction_key=None)[source]

Writes the output of to_conll to the conll_fp file.

Parameters
  • conll_fp (Path) – Write the CONLL string to this file path.

  • gold_label_key – A key that contains a sequence of labels e.g. [B, I, O]. This can come from the return of the sequence_labels()

  • prediction_key (Optional[str]) – Key to the predicted labels of the gold_label_key, where the prediction key's value is a list of lists of predicted labels. Each inner list is a different model run, hence creating PREDICTION 1, PREDICTION 2, etc. Thus the value of prediction_key must be of shape (number runs, number tokens)

Return type

None

to_json()[source]

Required as TargetTextCollection is not json serializable due to the 'spans' in the TargetText instances.

Return type

str

Returns

The object as a list of dictionaries, where each of the TargetText instances is a dictionary. It will also JSON serialize any meta data.

to_json_file(json_fp, include_metadata=False)[source]

Saves the current TargetTextCollection to a json file. The file won't be strictly json, but each line in the file will be, and each line can be loaded from a String via TargetText.from_json. The file can also be reloaded into a TargetTextCollection using TargetTextCollection.load_json.

Parameters
  • json_fp (Path) – File path to the json file to save the current data to.

  • include_metadata (bool) – Whether or not to include the metadata when writing to file.

Return type

None

tokenize(tokenizer)[source]

This applies the TargetText.tokenize method across all of the TargetText instances within the collection.

For a set of tokenizers that are definitely compatible see the target_extraction.tokenizers module.

Ensures that the tokenization is character preserving.

Parameters

tokenizer (Callable[[str], List[str]]) – The tokenizer to use to tokenize the text of each TargetText instance in the current collection

Raises
  • TypeError – If the tokenizer given does not return a List of Strings.

  • ValueError – This is raised if any of the TargetText instances in the collection contain an empty string.

  • ValueError – If the tokenization is not character preserving.

Return type

None

unique_distinct_sentiments(sentiment_key='target_sentiments')[source]
Parameters

sentiment_key (str) – The key that represents the sentiment value for each TargetText object

Return type

Set[int]

Returns

A set of the distinct sentiments within the collection. The length of the set represents the number of distinct sentiments within the collection.

Raises

TypeError – If the value in the sentiment_key is not of type list
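The set-union behaviour described above can be sketched as below, with samples as plain dicts; the function name is an assumption of this sketch:

```python
def distinct_sentiments(samples, sentiment_key='target_sentiments'):
    distinct = set()
    for sample in samples:
        values = sample.get(sentiment_key)
        if not isinstance(values, list):
            raise TypeError(f'{sentiment_key} value must be a list')
        distinct.update(values)  # union across all samples
    return distinct

sentiments = distinct_sentiments(
    [{'target_sentiments': ['positive', 'negative']},
     {'target_sentiments': ['positive']}])
```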

target_extraction.data_types.check_anonymised(func)[source]

Assumes the first argument in the given function is a TargetText object defined by self.

Raises

AnonymisedError – If the TargetText object given to func has its anonymised attribute set to True.

target_extraction.data_types_util module

Module that contains helpful classes and methods for target_extraction.data_types

classes:

  1. Span

  2. OverLappingTargetsError

  3. AnonymisedError

  4. OverwriteError

exception target_extraction.data_types_util.AnonymisedError(error_string)[source]

Bases: Exception

If something cannot be performed because the target_extraction.data_types.TargetText or target_extraction.data_types.TargetTextCollection object has been anonymised.

exception target_extraction.data_types_util.OverLappingTargetsError[source]

Bases: Exception

If two targets within the same sentence overlap with each other when they shouldn’t.

exception target_extraction.data_types_util.OverwriteError(error_string)[source]

Bases: Exception

If some key exists in a dictionary-like object and the intended action is to write data to that key when it should not be overwritten, this error is raised to indicate that this action was going to be performed.

class target_extraction.data_types_util.Span[source]

Bases: tuple

Span is a named tuple. It has two fields:

  1. start – An integer that specifies the start of a target word within a text.

  2. end – An integer that specifies the end of a target word within a text.

property end

Alias for field number 1

property start

Alias for field number 0
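An equivalent definition of this named tuple, shown here for illustration using typing.NamedTuple:

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int  # character offset where the target starts in the text
    end: int    # character offset where the target ends in the text

# Spans slice the original text directly to recover the target word.
span = Span(4, 10)
text = 'The camera works'
target = text[span.start:span.end]
```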

target_extraction.dataset_parsers module

This module contains all the functions that will parse a particular dataset into a target_extraction.data_types.TargetTextCollection object.

Functions:

  1. semeval_2014

target_extraction.dataset_parsers.CACHE_DIRECTORY = PosixPath('/home/travis/.bella_tdsa')
target_extraction.dataset_parsers.download_election_folder(cache_dir=None)[source]

Downloads the data for the Election Twitter dataset by Wang et al, 2017 <https://www.aclweb.org/anthology/E17-1046> that can be found here

This is then further used in the following functions target_extraction.dataset_parsers.wang_2017_election_twitter_train() and target_extraction.dataset_parsers.wang_2017_election_twitter_test() as a way to get the data.

Parameters

cache_dir (Optional[Path]) – The directory where all of the data is stored for this code base. If None then the cache directory is dataset_parsers.CACHE_DIRECTORY

Return type

Path

Returns

The Path to the Wang 2017 Election Twitter folder within the cache_dir.

Raises

FileNotFoundError – If not all of the files were downloaded the first time. This will require the user to delete either the cache directory or the Wang 2017 Election Twitter folder within the cache directory.

target_extraction.dataset_parsers.multi_aspect_multi_sentiment_acsa(dataset, cache_dir=None)[source]

The data for this function when downloaded is stored within: Path(cache_dir, 'Jiang 2019 MAMS ACSA')

NOTE

As each sentence/TargetText object has to have a text_id, and no ids exist in this dataset, the ids are created based on where the sentence occurs in the dataset, e.g. the first sentence/TargetText object's id is '0'

For reference this dataset has 8 different aspect categories.

Parameters
  • dataset (str) – Either train, val or test, determines the dataset that is returned.

  • cache_dir (Optional[Path]) – The directory where all of the data is stored for this code base. If None then the cache directory is dataset_parsers.CACHE_DIRECTORY

Return type

TargetTextCollection

Returns

The train, val, or test dataset from the Multi-Aspect-Multi-Sentiment dataset (MAMS) ACSA version. Dataset came from the paper A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis, EMNLP 2019

Raises

ValueError – If the dataset value is not train, val, or test

target_extraction.dataset_parsers.multi_aspect_multi_sentiment_atsa(dataset, cache_dir=None, original=True)[source]

The data for this function when downloaded is stored within: Path(cache_dir, 'Jiang 2019 MAMS ATSA')

NOTE

As each sentence/TargetText object has to have a text_id, and no ids exist in this dataset, the ids are created based on where the sentence occurs in the dataset, e.g. the first sentence/TargetText object's id is '0'

Parameters
  • dataset (str) – Either train, val or test, determines the dataset that is returned.

  • cache_dir (Optional[Path]) – The directory where all of the data is stored for this code base. If None then the cache directory is dataset_parsers.CACHE_DIRECTORY

  • original (bool) – This does not affect val or test. If True then it will download the original training data from the original paper. Else it will download the cleaned training dataset version. The cleaned version only contains a few sample differences, but these differences are with respect to overlapping targets. See this notebook for the full differences:

Return type

TargetTextCollection

Returns

The train, val, or test dataset from the Multi-Aspect-Multi-Sentiment dataset (MAMS) ATSA version. Dataset came from the paper A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis, EMNLP 2019

Raises

ValueError – If the dataset value is not train, val, or test

target_extraction.dataset_parsers.semeval_2014(data_fp, conflict)[source]

The sentiment labels are the following: 1. negative, 2. neutral, 3. positive, and 4. conflict. conflict will not appear if the argument conflict is False.

Parameters
  • data_fp (Path) – Path to the SemEval 2014 formatted file.

  • conflict (bool) – Whether or not to include targets or categories that have the conflict sentiment value. True is to include conflict targets and categories.

Return type

TargetTextCollection

Returns

The SemEval 2014 data formatted into a target_extraction.data_types.TargetTextCollection object.

Raises
  • SyntaxError – If the File passed is detected as not a SemEval formatted file.

  • xml.etree.ElementTree.ParseError – If the File passed is not formatted correctly e.g. mismatched tags

target_extraction.dataset_parsers.semeval_2016(data_fp, conflict)[source]

This is only for subtask 1 files where the review is broken down into sentences. Furthermore, if the data contains targets and not just categories, the targets and category sentiments are linked and are all stored in the targets_sentiments field. As some of the datasets only contain category information, to make it the same across domains the sentiment values here will always be in the targets_sentiments field.

The sentiment labels are the following: 1. negative, 2. neutral, 3. positive, and 4. conflict. conflict will not appear if the argument conflict is False.

Parameters
  • data_fp (Path) – Path to the SemEval 2016 formatted file.

  • conflict (bool) – Whether or not to include targets and categories that have the conflict sentiment value. True is to include conflict targets and categories.

Return type

TargetTextCollection

Returns

The SemEval 2016 data formatted into a target_extraction.data_types.TargetTextCollection object.

Raises
  • SyntaxError – If the File passed is detected as not a SemEval formatted file.

  • xml.etree.ElementTree.ParseError – If the File passed is not formatted correctly e.g. mismatched tags

target_extraction.dataset_parsers.wang_2017_election_twitter_test(cache_dir=None)[source]

The data for this function when downloaded is stored within: Path(cache_dir, ‘Wang 2017 Election Twitter’)

Parameters

cache_dir (Optional[Path]) – The directory where all of the data is stored for this code base. If None then the cache directory is dataset_parsers.CACHE_DIRECTORY

Return type

TargetTextCollection

Returns

The Test dataset of the Election Twitter dataset by Wang et al, 2017 <https://www.aclweb.org/anthology/E17-1046> that can be found here

Raises

FileNotFoundError – If not all of the files were downloaded the first time. This will require the user to delete either the cache directory or the Wang 2017 Election Twitter folder within the cache directory.

target_extraction.dataset_parsers.wang_2017_election_twitter_train(cache_dir=None)[source]

The data for this function when downloaded is stored within: Path(cache_dir, ‘Wang 2017 Election Twitter’)

Parameters

cache_dir (Optional[Path]) – The directory where all of the data is stored for this code base. If None then the cache directory is dataset_parsers.CACHE_DIRECTORY

Return type

TargetTextCollection

Returns

The Training dataset of the Election Twitter dataset by Wang et al, 2017 <https://www.aclweb.org/anthology/E17-1046> that can be found here

Raises

FileNotFoundError – If not all of the files were downloaded the first time. This will require the user to delete either the cache directory or the Wang 2017 Election Twitter folder within the cache directory.

target_extraction.pos_taggers module

This module contains a set of functions that return POS tagger functions, which can be defined by the following typing: Callable[[str], Tuple[List[str], List[str]]]. All of the functions take no positional arguments but can take keyword arguments.

All of the returned functions take in a String, perform tokenisation and POS tagging at the same time, and return both as Lists of Strings where the first List is the tokens and the second the POS tags.

Functions:

  1. stanford – Returns both UPOS and XPOS tags, where UPOS is the default. Stanford Neural Network POS tagger. The tagger can be loaded with models trained on different languages and treebanks.

  2. spacy – Returns both UPOS and XPOS tags, where UPOS is the default. Spacy Neural Network POS tagger. The tagger can be loaded with models trained on different languages.
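The shared contract above can be illustrated with a toy tagger factory. This is a sketch of the typing contract only; the placeholder tagging logic bears no relation to how the real spacy/stanford taggers work:

```python
from typing import Callable, List, Tuple

def toy_tagger(fine: bool = False) -> Callable[[str], Tuple[List[str], List[str]]]:
    # Placeholder tags standing in for UPOS (coarse) and XPOS (fine) output
    default_tag = 'NN' if fine else 'NOUN'
    def tagger(text: str) -> Tuple[List[str], List[str]]:
        # Real taggers tokenise and tag jointly; here we fake both steps
        tokens = text.split()
        tags = [default_tag] * len(tokens)
        return tokens, tags
    return tagger
```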

target_extraction.pos_taggers.spacy_tagger(fine=False, spacy_model_name='en_core_web_sm')[source]

Spacy Neural Network POS tagger which returns both UPOS and XPOS tags.

Choice of two different POS tags: 1. UPOS - Universal POS tags, coarse grained POS tags. 2. XPOS - Target language fine grained POS tags.

The XPOS tag set for English is the Penn Treebank tag set.

If the whitespace between two words is itself tokenised as a token, the Spacy tagger tags it as a space; these space tokens and their tags are removed.

Languages supported: https://spacy.io/usage/models

Parameters
  • fine (bool) – If True then returns XPOS else returns UPOS tags.

  • spacy_model_name (str) – Name of the Spacy model e.g. en_core_web_sm

Return type

Callable[[str], Tuple[List[str], List[str]]]

Returns

A callable that takes a String and returns the tokens and associated POS tags for that String.

target_extraction.pos_taggers.stanford(fine=False, lang='en', treebank=None, download=False)[source]

Stanford Neural Network (NN) tagger that uses a highway BiLSTM that has as input: 1. Word2Vec and FastText embeddings, 2. Trainable Word Vector, and 3. Uni-Directional LSTM over character embeddings. The UPOS predicted tag is used as a feature to predict the XPOS tag within the NN.

Choice of two different POS tags: 1. UPOS - Universal POS tags, coarse grained POS tags. 2. XPOS - Target language fine grained POS tags.

The XPOS tag set for English is the Penn Treebank tag set.

ASSUMPTIONS: The returned callable POS tagger assumes that each text given to it is a single sentence. The underlying method performs sentence splitting, but the splits are ignored.

Languages supported: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

Reference paper: https://www.aclweb.org/anthology/K18-2016

Parameters
  • fine (bool) – If True then returns XPOS else returns UPOS tags.

  • lang (str) – Language of the Neural Network tokeniser

  • treebank (Optional[str]) – The neural network model to use based on the treebank it was trained from. If not given the default treebank will be used. To see which is the default treebank and the treebanks available for each language go to: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

  • download (bool) – Whether to re-download the model.

Return type

Callable[[str], Tuple[List[str], List[str]]]

Returns

A callable that takes a String and returns the tokens and associated POS tags for that String.

target_extraction.taggers_helper module

This module contains code that will help the following modules:

  1. tokenizers

  2. pos_taggers

Functions:

  1. stanford_downloader - Downloads the specific Stanford NLP Neural Network pipeline.

  2. spacy_downloader - This in effect downloads the relevant spacy model and loads it with the relevant taggers, e.g. the POS, Parse and NER taggers for that spacy model, which are language dependent.

target_extraction.taggers_helper.LOADED_SPACY_MODELS = {}
target_extraction.taggers_helper.spacy_downloader(spacy_model_name, pos_tags, parse, ner)[source]

This is a copy of the allennlp.common.util.get_spacy_model function. It in effect downloads the relevant spacy model and loads it with the relevant taggers, e.g. the POS, Parse and NER taggers for that spacy model, which are language dependent.

Spacy can have multiple trained models per language based on size.

Parameters
  • spacy_model_name (str) – Name of the Spacy model e.g. en_core_web_sm

  • pos_tags (bool) – Whether or not the returned Spacy model should perform POS tagging.

  • parse (bool) – Whether or not the returned Spacy model should perform Parsing.

  • ner (bool) – Whether or not the returned Spacy model should perform NER.

Return type

Language

Returns

The relevant Spacy model.
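The module-level LOADED_SPACY_MODELS dictionary suggests loaded models are cached so repeated calls do not reload them. A minimal sketch of that pattern, with a stubbed loader standing in for the real spacy model loading (an assumption about the internals, based on the allennlp function it copies):

```python
LOADED_SPACY_MODELS = {}

def spacy_downloader_sketch(spacy_model_name, pos_tags, parse, ner,
                            _load=lambda name, opts: object()):
    # The cache key mirrors all arguments so differently configured
    # pipelines (e.g. with/without NER) are cached separately. `_load`
    # is a stand-in for the actual model loading.
    key = (spacy_model_name, pos_tags, parse, ner)
    if key not in LOADED_SPACY_MODELS:
        LOADED_SPACY_MODELS[key] = _load(spacy_model_name, key[1:])
    return LOADED_SPACY_MODELS[key]
```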

target_extraction.taggers_helper.stanford_downloader(lang, treebank=None, download=False)[source]

Downloads the Stanford NLP Neural Network pipelines that can be used for the following tagging tasks:

  1. tokenizing

  2. Multi Word Tokens (MWT)

  3. POS tagging - Universal POS (UPOS) tags and depending on the language, language specific POS tags (XPOS)

  4. Lemmatization

  5. Dependency Parsing

Each pipeline is trained per language and per treebank, hence the language and treebank are required as arguments. When the treebank is not given, the default treebank is used.

If download is True then it will re-download the pipeline even if it already exists; this might be useful if a new version has become available.

Languages supported: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

Reference paper: https://www.aclweb.org/anthology/K18-2016

Parameters
  • lang (str) – Language of the Neural Network Pipeline to download.

  • treebank (Optional[str]) – The neural network model to use based on the treebank it was trained from. If not given the default treebank will be used. To see which is the default treebank and the treebanks available for each language go to: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

  • download (bool) – Whether to re-download the model.

Return type

str

Returns

The treebank's full name, which this method resolves in order to find the model's directory.

Raises

ValueError – If the treebank does not exist for the given language. Also raised if there is no pipeline for the given language.
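The default-treebank lookup and ValueError behaviour described above can be sketched as follows. The mapping contents here are illustrative examples only, not the package's actual lookup tables:

```python
# Hypothetical subsets of the language -> treebank mappings
DEFAULT_TREEBANKS = {'en': 'en_ewt', 'fr': 'fr_gsd'}
AVAILABLE_TREEBANKS = {'en': {'en_ewt', 'en_gum'}, 'fr': {'fr_gsd'}}

def resolve_treebank(lang, treebank=None):
    # Mirrors the documented behaviour: default treebank when none is
    # given, ValueError for an unknown language or unknown treebank.
    if lang not in DEFAULT_TREEBANKS:
        raise ValueError(f'There is no pipeline for the language: {lang}')
    if treebank is None:
        return DEFAULT_TREEBANKS[lang]
    if treebank not in AVAILABLE_TREEBANKS[lang]:
        raise ValueError(f'Treebank {treebank} does not exist for {lang}')
    return treebank
```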

target_extraction.tokenizers module

This module contains a set of functions that return tokenization functions, each with the following typing: Callable[[str], List[str]]. None of the functions take positional arguments, but they can take keyword arguments.

target_extraction.tokenizers.ark_twokenize()[source]

A Twitter tokeniser from CMU ARK (the Twokenize tool and its associated paper)

Return type

Callable[[str], List[str]]

Returns

A callable that takes a String and returns the tokens for that String.

target_extraction.tokenizers.is_character_preserving(original_text, text_tokens)[source]
Parameters
  • original_text (str) – Text that has been tokenized

  • text_tokens (List[str]) – List of tokens after the text has been tokenized

Return type

bool

Returns

True if the tokenized text, when all of its characters are joined together, is equal to the original text with all of its characters joined together.
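A minimal sketch of this check. It assumes whitespace is ignored on both sides of the comparison (an assumption: a whitespace tokenizer does not keep the spaces as tokens, so the joined tokens could never match otherwise):

```python
def is_character_preserving_sketch(original_text, text_tokens):
    # Join all token characters and all original characters, dropping
    # whitespace on both sides before comparing.
    tokens_joined = ''.join(''.join(text_tokens).split())
    original_joined = ''.join(original_text.split())
    return tokens_joined == original_joined
```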

target_extraction.tokenizers.spacy_tokenizer(lang='en')[source]

Given optionally the language (default English), returns the Spacy rule based tokeniser for that language, but the returned function yields a List of Strings rather than Spacy tokens.

If the whitespace between two words is itself tokenised as a token, the Spacy tokenizer in effect emits a special space token; these special space tokens are removed.

Parameters

lang (str) – Language of the rule based Spacy tokeniser to use.

Return type

Callable[[str], List[str]]

Returns

A callable that takes a String and returns the tokens for that String.

target_extraction.tokenizers.stanford(lang='en', treebank=None, download=False)[source]

Stanford neural network tokeniser that uses a BiLSTM and CNN at the character and token level.

ASSUMPTIONS: The returned callable tokeniser assumes that each text given to it is a single sentence. The underlying method performs sentence splitting, but the splits are ignored.

For Vietnamese, syllables are used instead of characters.

Languages supported: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

Reference paper: https://www.aclweb.org/anthology/K18-2016

Parameters
  • lang (str) – Language of the Neural Network tokeniser

  • treebank (Optional[str]) – The neural network model to use based on the treebank it was trained from. If not given the default treebank will be used. To see which is the default treebank and the treebanks available for each language go to: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

  • download (bool) – Whether to re-download the model.

Return type

Callable[[str], List[str]]

Returns

A callable that takes a String and returns the tokens for that String.

target_extraction.tokenizers.token_index_alignment(text, tokens)[source]
Parameters
  • text (str) – text that has been tokenized

  • tokens (List[str]) – The tokens that were the output of tokenizing the text (the tokenizer has to be character preserving)

Return type

List[Span]

Returns

A list of Span tuples, where each tuple contains two ints representing the start and end index of the associated token within the text.
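A sketch of how such spans can be recovered when the tokenizer was character preserving: each token is searched for starting from the end of the previous token's span. This is an illustrative implementation, not necessarily the package's:

```python
from typing import List, Tuple

def token_index_alignment_sketch(text: str,
                                 tokens: List[str]) -> List[Tuple[int, int]]:
    # Assumes a character-preserving tokenizer, so every token occurs
    # in order within the text.
    spans = []
    search_from = 0
    for token in tokens:
        start = text.index(token, search_from)
        end = start + len(token)
        spans.append((start, end))
        search_from = end
    return spans
```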

target_extraction.tokenizers.whitespace()[source]

Standard whitespace tokeniser

Return type

Callable[[str], List[str]]

Returns

A callable that takes a String and returns the tokens for that String.
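A minimal sketch of what such a tokenizer factory might look like (the package's real implementation may differ):

```python
from typing import Callable, List

def whitespace_sketch() -> Callable[[str], List[str]]:
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty strings.
    def tokenizer(text: str) -> List[str]:
        return text.split()
    return tokenizer
```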

Module contents