bella.models package

Submodules

bella.models.base module

Module contains all of the main base classes for the machine learning models. These are grouped into three categories: 1. Mixin, 2. Abstract, and 3. Concrete.

Mixin classes - Function-based classes that contain functions which do not rely on the type of model and are therefore useful for all models:

  1. bella.models.base.ModelMixin

Abstract classes - These are used to enforce the functions that all the machine learning models must have. The abstract class is also the class that inherits the Mixin class:

  1. bella.models.base.BaseModel

Concrete classes - These are more concrete classes that still contain some abstract methods. However, they are the classes to inherit from to create a machine learning model based on a certain framework, e.g. scikit-learn or Keras:

  1. bella.models.base.SKLearnModel
  2. bella.models.base.KerasModel
class bella.models.base.BaseModel[source]

Bases: bella.models.base.ModelMixin, abc.ABC

Abstract class for all of the machine learning models.

Attributes:

  1. model – Machine learning model that is associated with this instance.
  2. fitted – If the machine learning model has been fitted (default False)

Methods:

  1. fit – Fit the model according to the given training data.
  2. predict – Predict class labels for samples in X.
  3. probabilities – The probability of each class label for all samples in X.
  4. __repr__ – Name of the machine learning model.

Class Methods:

  1. name – Returns the name of the model.

Functions:

  1. save – Saves the given machine learning model instance to a file.
  2. load – Loads the entire machine learning model from a file.
  3. evaluate_parameter – Fits and predicts the given model on the given training, validation and test data when the given parameter is changed on the model.
  4. evaluate_parameters – Same as evaluate_parameter, however it evaluates many parameter values for the same parameter.
static evaluate_parameter(model, train, val, test, parameter_name, parameter)[source]

Given a model, sets parameter_name to parameter, fits the model using the train and validation data, and returns a tuple of the parameter value and the predictions of the model on the test data.

Parameters:
  • model (BaseModel) – bella.models.base.BaseModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (Union[None, Tuple[ndarray, ndarray]]) – Tuple of (X_val, y_val), or None if not required. This is only required if the model requires validation data, as the bella.models.base.KerasModel models do.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. optimiser
  • parameter (Any) – value to assign to the parameter e.g. keras.optimizers.RMSprop
Return type:

Tuple[Any, ndarray]

Returns:

A tuple of (parameter value, predictions)
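As a minimal sketch, assuming a Keras-based bella model instance and pre-prepared numpy arrays (all placeholder names here), a single call could look like:

    from keras.optimizers import RMSprop

    # `model`, `X_train`, `y_train`, `X_val`, `y_val` and `X_test` are
    # hypothetical placeholders prepared elsewhere.
    param_value, predictions = model.evaluate_parameter(
        model, (X_train, y_train), (X_val, y_val), X_test,
        parameter_name='optimiser', parameter=RMSprop)

The parameter name and value mirror the examples given above; SKLearn-based models would pass None for the validation data.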

static evaluate_parameters(model, train, val, test, parameter_name, parameters, n_jobs)[source]

Performs bella.models.base.BaseModel.evaluate_parameter() on one parameter_name but with multiple parameter values.

This is useful if you would like to know the effect of changing the values of a parameter. It can also perform the task in a multiprocessing manner if n_jobs > 1.

Parameters:
  • model (BaseModel) – bella.models.base.BaseModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (Union[None, Tuple[ndarray, ndarray]]) – Tuple of (X_val, y_val), or None if not required. This is only required if the model requires validation data, as the bella.models.base.KerasModel models do.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. optimiser
  • parameters (List[Any]) – A list of values to assign to the parameter e.g. [keras.optimizers.RMSprop]
  • n_jobs (int) – Number of CPUs to use for multiprocessing. If 1, it will not multiprocess.
Return type:

List[Tuple[Any, ndarray]]

Returns:

A list of tuples of (parameter value, predictions)

fit(X, y)[source]

Fit the model according to the given training data.

Parameters:
  • X (ndarray) – Training samples matrix, shape = [n_samples, n_features]
  • y (ndarray) – Training targets, shape = [n_samples]
Return type:

None

Returns:

The model attribute will now be trained.

fitted

If the machine learning model has been fitted (default False)

Return type:bool
Returns:True or False
static load(load_fp)[source]

Loads the entire machine learning model from a file.

Parameters:load_fp (Path) – File path of the location that the model was saved to.
Return type:BaseModel
Returns:self
model

Machine learning model that is associated with this instance.

Return type:Any
Returns:The machine learning model
classmethod name()[source]

Returns the name of the model.

Return type:str
Returns:Name of the model
predict(X)[source]

Predict class labels for samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Return type:ndarray
Returns:Predicted class label per sample, shape = [n_samples]
probabilities(X)[source]

The probability of each class label for all samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Return type:ndarray
Returns:Probability of each class label for all samples, shape = [n_samples, n_classes]
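The fit, predict and probabilities methods above follow scikit-learn's shape conventions. A shape-level sketch, where `model` stands in for any concrete subclass and the random matrices are placeholders for real features:

    import numpy as np

    X_train = np.random.rand(100, 20)            # [n_samples, n_features]
    y_train = np.random.randint(0, 3, size=100)  # [n_samples]
    X_test = np.random.rand(10, 20)

    model.fit(X_train, y_train)          # after this, model.fitted is True
    labels = model.predict(X_test)       # shape = [10]
    probs = model.probabilities(X_test)  # shape = [10, n_classes]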
static save(model, save_fp)[source]

Saves the entire machine learning model to a file.

Parameters:
  • model (BaseModel) – The machine learning model instance to be saved.
  • save_fp (Path) – File path of the location that the model is to be saved to.
Return type:

None

Returns:

Nothing.
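A save/load round trip, assuming `model` is a fitted concrete instance and the file path is arbitrary; since save and load are defined per concrete class, they are called through the model's own class here:

    from pathlib import Path

    save_fp = Path('my_model.joblib')     # hypothetical location
    type(model).save(model, save_fp)      # concrete classes raise ValueError
                                          # if the model is not fitted
    restored = type(model).load(save_fp)
    assert restored.fitted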

class bella.models.base.KerasModel[source]

Bases: bella.models.base.BaseModel

Concrete class that is designed to be used as the base class for all machine learning models that are based on the Keras library.

Attributes:

  1. tokeniser – Tokeniser the model uses, e.g. str.split().
  2. embeddings – The word embeddings the model uses, e.g. bella.word_vectors.SSWE
  3. lower – Whether the model lower cases the words when pre-processing the data
  4. reproducible – Whether to be reproducible. If None then it is quicker to run. Else provide an int that will represent the random seed value.
  5. patience – Number of epochs with no improvement before training is stopped.
  6. batch_size – Number of samples per gradient update.
  7. epochs – Number of times to train over the entire training set before stopping.
  8. optimiser – Optimiser the model uses, e.g. keras.optimizers.SGD
  9. optimiser_params – Parameters for the optimiser. If None uses the defaults for the optimiser being used.

Abstract Methods:

  1. keras_model – Keras machine learning model that represents the class, e.g. a single forward LSTM.
  2. create_training_text – Converts the training and validation data into a format that the keras model can take as input.
  3. create_training_y – Converts the training and validation targets into a format that can be used by the keras model.

Methods:

  1. fit – Fit the model according to the given training and validation data.
  2. probabilities – The probability of each class label for all samples in X.
  3. predict – Predict class labels for samples in X.

Functions:

  1. save – Given an instance of this class, saves it to a file.
  2. load – Loads an instance of this class from a file.
  3. evaluate_parameter – Fits and predicts the given model on the given training, validation and test data when the given parameter is changed on the model.
  4. evaluate_parameters – Same as evaluate_parameter, however it evaluates many parameter values for the same parameter.
batch_size

batch_size attribute

Return type:int
Returns:The batch_size used in the model
create_training_text(train_data, validation_data)[source]

Converts the training and validation data into a format that the keras model can take as input.

Return type:Tuple[Any, Any]
Returns:A tuple of length two containing the keras model training and validation input respectively.
create_training_y(train_y, validation_y)[source]

Converts the training and validation targets into a format that can be used by the keras model

Return type:Tuple[ndarray, ndarray]
Returns:A tuple of length two containing two arrays, the first for training and the second for validation.
embeddings

embeddings attribute

Return type:WordVectors
Returns:The embeddings used in the model
epochs

epochs attribute

Return type:int
Returns:The epochs used in the model
static evaluate_parameter(model, train, val, test, parameter_name, parameter)[source]

Given a model, sets parameter_name to parameter, fits the model using the train and validation data, and returns a tuple of the parameter value and the predictions of the model on the test data.

Parameters:
  • model (KerasModel) – KerasModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (Tuple[ndarray, ndarray]) – Tuple of (X_val, y_val). Used to evaluate the model at each epoch. Will not be trained on this data.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. optimiser
  • parameter (Any) – value to assign to the parameter e.g. keras.optimizers.RMSprop
Return type:

Tuple[Any, ndarray]

Returns:

A tuple of (parameter value, predictions)

static evaluate_parameters(model, train, val, test, parameter_name, parameters, n_jobs)[source]

Performs bella.models.base.KerasModel.evaluate_parameter() on one parameter_name but with multiple parameter values.

This is useful if you would like to know the effect of changing the values of a parameter. It can also perform the task in a multiprocessing manner if n_jobs > 1.

Parameters:
  • model (KerasModel) – bella.models.base.KerasModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (Tuple[ndarray, ndarray]) – Tuple of (X_val, y_val). Used to evaluate the model at each epoch. Will not be trained on this data.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. optimiser
  • parameters (List[Any]) – A list of values to assign to the parameter e.g. [keras.optimizers.RMSprop]
  • n_jobs (int) – Number of CPUs to use for multiprocessing. If 1, it will not multiprocess.
Return type:

List[Tuple[Any, ndarray]]

Returns:

A list of tuples of (parameter value, predictions)

fit(X, y, validation_data, verbose=0, continue_training=False)[source]

Fit the model according to the given training and validation data.

Parameters:
  • X (ndarray) – Training samples matrix, shape = [n_samples, n_features]
  • y (ndarray) – Training targets, shape = [n_samples]
  • validation_data (Tuple[ndarray, ndarray]) – Tuple of (x_val, y_val). Used to evaluate the model at each epoch. Will not be trained on this data.
  • verbose (int) – 0 = silent, 1 = progress
  • continue_training (bool) – Whether the model that has already been trained should be trained further.
Return type:

History

Returns:

A record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values.
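A sketch of a typical training call, with placeholder arrays prepared elsewhere; the returned keras History object exposes per-epoch metrics through its history dict under Keras's usual keys:

    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        verbose=1)
    print(history.history['loss'])      # training loss per epoch
    print(history.history['val_loss'])  # validation loss per epoch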

keras_model(num_classes)[source]

Keras machine learning model that represents the class, e.g. a single forward LSTM.

Return type:Model
Returns:Keras machine learning model
static load(load_fp)[source]

Loads an instance of this class from a file.

Parameters:load_fp (Path) – File path of the location that the model was saved to.
Return type:KerasModel
Returns:self
lower

lower attribute

Return type:bool
Returns:The lower used in the model
optimiser

optimiser attribute

Return type:OptimizerV2
Returns:The optimiser used in the model
optimiser_params

optimiser_params attribute

Return type:Optional[Dict[str, Any]]
Returns:The optimiser_params used in the model
patience

patience attribute

Return type:int
Returns:The patience used in the model
predict(X)[source]

Predict class labels for samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Return type:ndarray
Returns:Predicted class label per sample, shape = [n_samples]
probabilities(X)[source]

The probability of each class label for all samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Return type:ndarray
Returns:Probability of each class label for all samples, shape = [n_samples, n_classes]
process_text(texts, max_length, padding='pre', truncate='pre')[source]

Given a list of strings, tokenises the text, lower cases it if set, and then converts the tokens into integers representing the tokens in the embeddings. Lastly it pads the data based on the max_length parameter.

If max_length is smaller than the size of a sentence, the sentence is truncated. If max_length = -1 then the max_length is that of the longest sentence in the texts.

Parameters:
  • texts – List of texts
  • max_length – How many tokens a sentence can contain. If it is -1 then it uses the sentence with the most tokens as the max_length parameter.
  • padding – Which side of the sentence to pad: pre beginning, post end.
  • truncate – Which side of the sentence to truncate: pre beginning, post end.
Return type:Tuple[int, ndarray]
Returns:A tuple of length 2 containing: 1. The max_length parameter, 2. A matrix of shape [n_samples, pad_size] where each integer in the matrix represents the word embedding lookup.
Raises:ValueError – If the max_length argument is equal to or less than 0, or if the calculated max_length is 0.
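A small sketch, assuming `model` is an instance of a concrete KerasModel subclass whose tokeniser and embeddings are already set:

    texts = ['This phone is great', 'terrible battery life']
    # -1 pads to the longest sentence; pad and truncate at the start ('pre')
    max_length, matrix = model.process_text(texts, max_length=-1,
                                            padding='pre', truncate='pre')
    # matrix.shape == (2, max_length); each integer is an embedding lookup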
reproducible

reproducible attribute

Return type:Optional[int]
Returns:The reproducible used in the model
static save(model, save_fp)[source]

Given a KerasModel instance and a file path to save to, saves the data required to restore the model.

Parameters:
  • model (KerasModel) – The machine learning model instance to be saved.
  • save_fp (Path) – File path of the location that the model is to be saved to.
Return type:

None

Returns:

Nothing.

Raises:

ValueError – If the model has not been fitted or if the model is not of type bella.models.base.KerasModel

tokeniser

tokeniser attribute

Return type:Callable[[str], List[str]]
Returns:The tokeniser used in the model
class bella.models.base.ModelMixin[source]

Bases: object

Mixin class for all of the machine learning models. It contains only functions so that they are as generic as possible.

Functions:

  1. train_val_split – Splits the training dataset into a train and validation set in a stratified split.
static train_val_split(train, split_size=0.2, seed=42)[source]

Splits the training dataset into a train and validation set in a stratified split.

Parameters:
  • train (TargetCollection) – The training dataset that is to be split into train and validation sets.
  • split_size (float) – Fraction of the dataset to assign to the validation set.
  • seed (Union[None, int]) – Seed value to give to the stratified splitter. If None then it uses the random state of numpy.
Return type:

Tuple[Tuple[ndarray, ndarray], Tuple[ndarray, ndarray]]

Returns:

Two tuples of length two where each tuple is the train and validation splits respectively, and each tuple contains the data (X) and class labels (y) respectively. Returns ((X_train, y_train), (X_val, y_val))
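Because train_val_split is static, it can be called straight off the class; `dataset` below is a placeholder for a bella TargetCollection:

    from bella.models.base import ModelMixin

    # 80/20 stratified split with a fixed seed
    (X_train, y_train), (X_val, y_val) = ModelMixin.train_val_split(
        dataset, split_size=0.2, seed=42)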

class bella.models.base.SKLearnModel(*args, **kwargs)[source]

Bases: bella.models.base.BaseModel

Concrete class that is designed to be used as the base class for all machine learning models that are based on the scikit-learn library.

At the moment it expects all of the machine learning models to use an SVM as their classifier. This is due to assuming the model will have the sklearn.svm.SVC.decision_function() method to get probabilities.

NOTE: each time the model_parameters are set, the model is reset, i.e. the fitted attribute becomes False.

Attributes:

  1. model – Machine learning model. Expects it to be a sklearn.pipeline.Pipeline instance.
  2. fitted – If the machine learning model has been fitted (default False)
  3. model_parameters – The parameters that are set in the machine learning model. E.g. Parameter could be the tokeniser used.

Abstract Class Methods:

  1. get_parameters – Transform the given parameters into a dictionary that is accepted as model parameters.
  2. get_cv_parameters – Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV
  3. normalise_parameter_names – Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters(). This is required so that evaluate_parameters() can work with this class.

Methods:

  1. fit – Fit the model according to the given training data.
  2. predict – Predict class labels for samples in X.
  3. probabilities – The probability of each class label for all samples in X.
  4. __repr__ – Name of the machine learning model.

Functions:

  1. save – Given an instance of this class, saves it to a file.
  2. load – Loads an instance of this class from a file.
  3. evaluate_parameter – Fits and predicts the given model on the given training, validation and test data when the given parameter is changed on the model.
  4. evaluate_parameters – Same as evaluate_parameter, however it evaluates many parameter values for the same parameter.
  5. grid_search_model – Given a model class it will perform a Grid Search over the parameters you give to the model's bella.models.base.SKLearnModel.get_cv_parameters() function via the keyword arguments. Returns a pandas dataframe representation of the grid search results.
  6. get_grid_score – Given the return value of grid_search_model(), returns the grid scores as a List of the mean test accuracy results.
  7. models_best_parameter – Given a list of models and their base model arguments, it will find the best parameter value out of the values given for that parameter while keeping the base model arguments constant for each model.

Abstract Functions:

  1. pipeline – Machine Learning model that is used as the base template for the model attribute. Expects it to be a sklearn.pipeline.Pipeline instance.
__init__(*args, **kwargs)[source]
Return type:None
static evaluate_parameter(model, train, val, test, parameter_name, parameter)[source]

Given a model, sets parameter_name to parameter, fits the model using the train and validation data, and returns a tuple of the parameter value and the predictions of the model on the test data.

Parameters:
  • model (SKLearnModel) – bella.models.base.SKLearnModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (None) – Use None. This argument exists only to keep the API consistent.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. word_vectors
  • parameter (Any) – value to assign to the parameter e.g. bella.word_vectors.SSWE
Return type:

Tuple[Any, ndarray]

Returns:

A tuple of (parameter value, predictions)

static evaluate_parameters(model, train, val, test, parameter_name, parameters, n_jobs)[source]

Performs bella.models.base.SKLearnModel.evaluate_parameter() on one parameter_name but with multiple parameter values.

This is useful if you would like to know the effect of changing the values of a parameter. It can also perform the task in a multiprocessing manner if n_jobs > 1.

Parameters:
  • model (SKLearnModel) – bella.models.base.SKLearnModel instance
  • train (Tuple[ndarray, ndarray]) – Tuple of (X_train, y_train). Used to fit the model.
  • val (None) – Use None. This argument exists only to keep the API consistent.
  • test (ndarray) – X_test data to predict on.
  • parameter_name (str) – Name of the parameter to change e.g. word_vectors
  • parameters (List[Any]) – A list of values to assign to the parameter e.g. [bella.word_vectors.SSWE]
  • n_jobs (int) – Number of CPUs to use for multiprocessing. If 1, it will not multiprocess.
Return type:

List[Tuple[Any, ndarray]]

Returns:

A list of tuples of (parameter value, predictions)

fit(X, y)[source]

Fit the model according to the given training data.

Parameters:
  • X (ndarray) – Training samples matrix, shape = [n_samples, n_features]
  • y (ndarray) – Training targets, shape = [n_samples]
Returns:

The model attribute will now be trained.

classmethod get_cv_parameters()[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Return type:List[Dict[str, List[Any]]]
static get_grid_score(grid_scores, associated_param=None)[source]

Given the return value of grid_search_model(), returns the grid scores as a List of the mean test accuracy results.

Parameters:
  • grid_scores (DataFrame) – Return of the grid_search_model()
  • associated_param (Optional[str]) – Optional. The name of the parameter you want to associate to the score, e.g. lexicon if you have grid searched over different lexicons and you want the return to be associated with the lexicon name, e.g. [(0.68, 'MPQA'), (0.70, 'NRC')]
Return type:

Union[List[float], List[Tuple[float, str]]]

Returns:

A list of test scores from the grid search and if associated_param is not None a list of scores and parameter names.

classmethod get_parameters()[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Return type:Dict[str, Any]
static grid_search_model(model, X, y, n_cpus=1, num_folds=5, **kwargs)[source]

Given a model class it will perform a Grid Search over the parameters you give to the model's bella.models.base.SKLearnModel.get_cv_parameters() function via the keyword arguments. Returns a pandas dataframe representation of the grid search results.

Parameters:
  • model (SKLearnModel) – The class of the model to use not an instance of the model.
  • X (ndarray) – Training samples matrix, shape = [n_samples, n_features]
  • y (ndarray) – Training targets, shape = [n_samples]
  • n_cpus (int) – Number of estimators to fit in parallel. Default 1.
  • num_folds (int) – Number of Stratified cross validation folds. Default 5.
  • kwargs – Keyword arguments to give to the model's bella.models.base.SKLearnModel.get_cv_parameters() function.
Return type:

DataFrame

Returns:

Pandas dataframe representation of the grid search results.
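A hedged sketch combining grid_search_model() with get_grid_score(), using TargetInd and SSWE (the classes this documentation uses as its running examples); the data arrays are placeholders, and the word_vectors keyword follows the list-of-lists convention documented for get_cv_parameters:

    from bella.models.target import TargetInd
    from bella.word_vectors import SSWE

    # Note: the first argument is the class itself, not an instance.
    results = TargetInd.grid_search_model(
        TargetInd, X_train, y_train, n_cpus=1, num_folds=5,
        word_vectors=[[SSWE()]])
    mean_accuracies = TargetInd.get_grid_score(results)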

static load(load_fp)[source]

Loads an instance of this class from a file.

Parameters:load_fp (Path) – File path of the location that the model was saved to.
Return type:SKLearnModel
Returns:self
model_parameters

The parameters that are set in the machine learning model. E.g. Parameter could be the tokeniser used.

Return type:Dict[str, Any]
Returns:parameters of the machine learning model
static models_best_parameter(models_kwargs, param_name, param_values, X, y, n_cpus=1, num_folds=5)[source]

Given a list of models and their base model arguments, it will find the best parameter value out of the values given for that parameter while keeping the base model arguments constant for each model.

This essentially performs 5 fold cross validation grid search for the one parameter given, across all models given.

Parameters:
  • models_kwargs (List[Tuple[SKLearnModel, Dict[str, Any]]]) – A list of tuples where each tuple contains a model and the model's keyword arguments to give to its get_cv_parameters method. These arguments are the model's standard arguments that are not to be changed.
  • param_name (str) – Name of the parameter to be changed. This name has to be the name of the keyword argument in the model's get_cv_parameters method.
  • param_values (List[Any]) – The different values to assign to the param_name argument.
  • X (List[Any]) – The training samples.
  • y (ndarray) – The training target samples.
Return type:

Dict[SKLearnModel, str]

Returns:

A dictionary of model and the name of the best parameter.
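A sketch under the documented signature; whether the parameter values need the same nesting as the matching get_cv_parameters keyword is an assumption here, and the data arrays are placeholders:

    from bella.models.base import SKLearnModel
    from bella.models.target import TargetDep, TargetInd
    from bella.word_vectors import SSWE

    models_kwargs = [(TargetInd, {'word_vectors': [[SSWE()]]}),
                     (TargetDep, {'word_vectors': [[SSWE()]]})]
    best = SKLearnModel.models_best_parameter(
        models_kwargs, 'C', [0.01, 0.1], X_train, y_train)
    # e.g. {TargetInd: '0.01', TargetDep: '0.1'}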

classmethod normalise_parameter_names(parameter_dict)[source]

Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters().

Return type:Dict[str, Any]
Returns:A dictionary that can be used as keyword arguments into the get_parameters() method
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
predict(X)[source]

Predict class labels for samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Returns:Predicted class label per sample, shape = [n_samples]
Raises:ValueError – If the model has not been fitted
probabilities(X)[source]

The probability of each class label for all samples in X.

Parameters:X (ndarray) – Test samples matrix, shape = [n_samples, n_features]
Returns:Probability of each class label for all samples, shape = [n_samples, n_classes]
Raises:ValueError – If the model has not been fitted
static save(model, save_fp, compress=0)[source]

Given an instance of this class, saves it to a file.

Parameters:
  • model (SKLearnModel) – The machine learning model instance to be saved.
  • save_fp (Path) – File path of the location that the model is to be saved to.
  • compress (int) – Optional (default 0). Level of compression: 0 is no compression and 9 is the most compressed. The more compressed, the slower the read/write time.
Return type:

None

Returns:

Nothing.

Raises:

ValueError – If the model has not been fitted or if the model is not of type bella.models.base.SKLearnModel

bella.models.target module

Module contains all of the classes that represent Machine Learning models that are within the Vo and Zhang 2015 paper (a short usage sketch follows the class list):

  1. bella.models.target.TargetInd – Target Independent model
  2. bella.models.target.TargetDepMinus – Target Dependent Minus model
  3. bella.models.target.TargetDep – Target Dependent model
  4. bella.models.target.TargetDepPlus – Target Dependent Plus model
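A hedged end-to-end sketch for these models, using SSWE (the word vectors used as the running example throughout this documentation); the training and test variables are placeholders prepared elsewhere:

    from bella.models.target import TargetInd
    from bella.word_vectors import SSWE

    model = TargetInd(word_vectors=[SSWE()])  # defaults: ark_twokenize, C=0.01
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)       # one label per sample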
class bella.models.target.TargetDep(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.target.TargetInd

Target-dep model from Vo and Zhang 2015 paper.

__init__(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod get_cv_parameters(word_vectors, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.01], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors – A list of a list of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • tokenisers – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lowers – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.01]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – List of scale values. The list can include sklearn.preprocessing.MinMaxScaler type of classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:

Parameters to explore through cross validation

classmethod get_parameters(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

Dict[str, Any]

Returns:

Model parameters

classmethod name()[source]
Return type:str
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
class bella.models.target.TargetDepMinus(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.025, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.target.TargetInd

__init__(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.025, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod get_cv_parameters(word_vectors, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.025], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors – A list of a list of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • tokenisers – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lowers – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.025]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – List of scale values. The list can include sklearn.preprocessing.MinMaxScaler type of classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:

Parameters to explore through cross validation

classmethod get_parameters(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.025, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

Dict[str, Any]

Returns:

Model parameters

classmethod name()[source]
Return type:str
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
class bella.models.target.TargetDepPlus(word_vectors, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.target.TargetInd

__init__(word_vectors, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • senti_lexicon (Lexicon) – Sentiment Lexicon to be used for the Left and Right sentiment context (LS and RS).
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod get_cv_parameters(word_vectors, senti_lexicon, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.01], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors (List[List[WordVectors]]) – A list of a list of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • senti_lexicon (List[Lexicon]) – A list of Sentiment Lexicons to be explored for the Left and Right sentiment context (LS and RS). Default None, use the sentiment lexicons already within the model.
  • tokenisers – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lowers – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.01]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – List of scale values. The list can include sklearn.preprocessing.MinMaxScaler type of classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:

Parameters to explore through cross validation

classmethod get_parameters(word_vectors, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • senti_lexicon (Lexicon) – Sentiment Lexicon to be used for the Left and Right sentiment context (LS and RS).
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

Dict[str, Any]

Returns:

Model parameters

classmethod name()[source]
Return type:str
classmethod normalise_parameter_names(parameter_dict)[source]

Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters().

Return type:Dict[str, Any]
Returns:A dictionary that can be used as keyword arguments into the get_parameters() method
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
class bella.models.target.TargetInd(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.base.SKLearnModel

Attributes:

  1. model – Machine learning model. Expects it to be a sklearn.pipeline.Pipeline instance.
  2. fitted – If the machine learning model has been fitted (default False)
  3. model_parameters – The parameters that are set in the machine learning model. E.g. Parameter could be the tokeniser used.

Methods:

  1. fit – Fit the model according to the given training data.
  2. predict – Predict class labels for samples in X.
  3. probabilities – The probability of each class label for all samples in X.
  4. __repr__ – Name of the machine learning model.

Class Methods:

  1. get_parameters – Transform the given parameters into a dictionary that is accepted as model parameters.
  2. get_cv_parameters – Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV
  3. name – Returns the name of the model.

Functions:

  1. save – Given an instance of this class, saves it to a file.
  2. load – Loads an instance of this class from a file.
  3. pipeline – Machine Learning model that is used as the base template for the model attribute.
__init__(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod get_cv_parameters(word_vectors, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.01], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors – A list of a list of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • tokenisers – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lowers – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.01]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – List of scale values. The list can include sklearn.preprocessing.MinMaxScaler type of classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:

Parameters to explore through cross validation

classmethod get_parameters(word_vectors, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

Dict[str, Any]

Returns:

Model parameters

classmethod name()[source]
Return type:str
classmethod normalise_parameter_names(parameter_dict)[source]

Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters().

Return type:Dict[str, Any]
Returns:A dictionary that can be used as keyword arguments into the get_parameters() method
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model

bella.models.tdlstm module

Module contains all of the classes that represent Machine Learning models that are within the Tang et al. 2016 paper (a short usage sketch follows the class list):

  1. bella.models.tdlstm.LSTM – LSTM model.
  2. bella.models.tdlstm.TDLSTM – TDLSTM model.
  3. bella.models.tdlstm.TCLSTM – TCLSTM model.
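A hedged construction sketch for the LSTM variant; str.split and SSWE are the documented example tokeniser and embeddings, and the fit data placeholders are assumed to already be in the format create_training_text() expects:

    from bella.models.tdlstm import LSTM
    from bella.word_vectors import SSWE

    model = LSTM(tokeniser=str.split, embeddings=SSWE(),
                 reproducible=42,       # fixed seed for reproducibility
                 patience=5, epochs=100)
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val))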
class bella.models.tdlstm.LSTM(tokeniser, embeddings, reproducible=None, pad_size=-1, lower=True, patience=10, batch_size=32, epochs=300, embedding_layer_kwargs=None, lstm_layer_kwargs=None, dense_layer_kwargs=None, optimiser=<class 'tensorflow.python.keras.optimizer_v2.gradient_descent.SGD'>, optimiser_params=None)[source]

Bases: bella.models.base.KerasModel

Attributes:

  1. pad_size – The max number of tokens to use per sequence. If -1 use the text sequence in the training data that has the most tokens as the pad size.
  2. embedding_layer_kwargs – Keyword arguments to pass to the embedding layer which is a keras.layers.Embedding object. Can be None if no parameters to pass.
  3. lstm_layer_kwargs – Keyword arguments to pass to the lstm layer(s) which is a keras.layers.LSTM object. Can be None if no parameters to pass.
  4. dense_layer_kwargs – Keyword arguments to pass to the dense (final layer) which is a keras.layers.Dense object. Can be None if no parameters to pass.

Methods:

  1. model_parameters – Returns a dictionary containing the attributes of the class instance, the parameters to give to the class constructor to re-create this instance, and the class itself.
  2. create_training_text – Converts the training and validation data into a format that the keras model can take as input.
  3. create_training_y – Converts the training and validation target values from a vector of class labels into a matrix of binary values of shape [n_samples, n_classes].
  4. keras_model – The model that represents this class. This is a single forward LSTM.
__init__(tokeniser, embeddings, reproducible=None, pad_size=-1, lower=True, patience=10, batch_size=32, epochs=300, embedding_layer_kwargs=None, lstm_layer_kwargs=None, dense_layer_kwargs=None, optimiser=<class 'tensorflow.python.keras.optimizer_v2.gradient_descent.SGD'>, optimiser_params=None)[source]
Parameters:
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split().
  • embeddings (WordVectors) – Embedding (Word vectors) to be used e.g. bella.word_vectors.SSWE
  • reproducible (Optional[int]) – Whether to be reproducible. If None then it is quicker to run. Else provide an int that will represent the random seed value.
  • pad_size (int) – The max number of tokens to use per sequence. If -1 use the text sequence in the training data that has the most tokens as the pad size.
  • lower (bool) – Whether to lower case the words being processed.
  • patience (int) – Number of epochs with no improvement before training is stopped.
  • batch_size (int) – Number of samples per gradient update.
  • epochs (int) – Number of times to train over the entire training set before stopping. If patience is set, then it may stop before reaching the number of epochs specified here.
  • embedding_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the embedding layer which is a keras.layers.Embedding object. If no parameters to pass leave as None.
  • lstm_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the lstm layer(s) which is a keras.layers.LSTM object. If no parameters to pass leave as None.
  • dense_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the dense (final layer) which is a keras.layers.Dense object. If no parameters to pass leave as None.
  • optimiser (OptimizerV2) – Optimiser to be used accepts any keras optimiser. Default is keras.optimizers.SGD
  • optimiser_params (Optional[Dict[str, Any]]) – Parameters for the optimiser. If None uses default optimiser parameters.
Return type:

None

create_training_text(train_data, validation_data)[source]

Converts the training and validation data into a format that the keras model can take as input.

Parameters:
  • train_data (List[Dict[str, str]]) – Data to be trained on. Which is a list of dictionaries where each dictionary has a text field containing text.
  • validation_data (List[Dict[str, str]]) – Data to evaluate the model at training time. Which is a list of dictionaries where each dictionary has a text field containing text.
Return type:

Tuple[ndarray, ndarray]

Returns:

A tuple of length two containing the train and validation input that are both the output of _pre_process()

create_training_y(train_y, validation_y)[source]

Converts the training and validation target values from a vector of class labels into a matrix of binary values of shape [n_samples, n_classes].

To convert the vector of classes to a matrix we use the keras.utils.to_categorical() function.

Parameters:
  • train_y (ndarray) – Vector of class labels, shape = [n_samples]
  • validation_y (ndarray) – Vector of class labels, shape = [n_samples]
Return type:

Tuple[ndarray, ndarray]

Returns:

A tuple of length two containing the train and validation matrices respectively. The shape of each matrix is: [n_samples, n_classes]
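For reference, keras.utils.to_categorical() performs exactly this label-to-matrix conversion:

    import numpy as np
    from keras.utils import to_categorical

    train_y = np.array([0, 2, 1])   # vector of class labels, shape [3]
    print(to_categorical(train_y))
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]]  -> shape = [n_samples, n_classes]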

dense_layer_kwargs

dense_layer_kwargs attribute

Return type:Dict[str, Any]
Returns:The dense_layer_kwargs used in the model
embedding_layer_kwargs

embedding_layer_kwargs attribute

Return type:Dict[str, Any]
Returns:The embedding_layer_kwargs used in the model
keras_model(num_classes)[source]

The model that represents this class. This is a single forward LSTM.

Parameters:num_classes (int) – Number of classes to predict.
Return type:Model
Returns:Forward LSTM keras model.
lstm_layer_kwargs

lstm_layer_kwargs attribute

Return type:Dict[str, Any]
Returns:The lstm_layer_kwargs used in the model
model_parameters()[source]

Returns a dictionary containing the attributes of the class instance, the parameters to give to the class constructor to re-create this instance, and the class itself.

This is used by the save() method so that the instance can be re-created when loaded by the load() method.

Return type:Dict[str, Any]
classmethod name()[source]
Return type:str
pad_size

pad_size attribute

Return type:int
Returns:The pad_size used in the model
class bella.models.tdlstm.TCLSTM(tokeniser, embeddings, reproducible=None, pad_size=-1, lower=True, patience=10, batch_size=32, epochs=300, embedding_layer_kwargs=None, lstm_layer_kwargs=None, dense_layer_kwargs=None, optimiser=<class 'tensorflow.python.keras.optimizer_v2.gradient_descent.SGD'>, optimiser_params=None, include_target=True)[source]

Bases: bella.models.tdlstm.TDLSTM

create_training_text(train_data, validation_data)[source]

Converts the training and validation data into a format that the keras model can take as input.

Parameters:
  • train_data (List[Dict[str, Any]]) – See the bella.models.tdlstm.TDLSTM.create_training_text() train_data parameter.
  • validation_data (List[Dict[str, Any]]) – See the bella.models.tdlstm.TDLSTM.create_training_text() validation_data parameter.
Return type:

Tuple[List[ndarray], List[ndarray]]

Returns:

A tuple of length two containing the train and validation input that are both the output of _pre_process()

keras_model(num_classes)[source]

The model that represents this class. This is the same as the bella.models.tdlstm.TDLSTM.keras_model() model, however the word embeddings, before being input into the LSTM, are concatenated with the word embedding of the target. If the target is more than one word then the word embedding of the target is the average (median in our case) of the embeddings of the target words.

Parameters:num_classes (int) – Number of classes to predict.
Return type:Model
Returns:Two LSTMs one forward from the left context and the other backward from the right context taking into account the target vector embedding.
classmethod name()[source]
Return type:str
class bella.models.tdlstm.TDLSTM(tokeniser, embeddings, reproducible=None, pad_size=-1, lower=True, patience=10, batch_size=32, epochs=300, embedding_layer_kwargs=None, lstm_layer_kwargs=None, dense_layer_kwargs=None, optimiser=<class 'tensorflow.python.keras.optimizer_v2.gradient_descent.SGD'>, optimiser_params=None, include_target=True)[source]

Bases: bella.models.tdlstm.LSTM

Attributes:

  1. include_target – Whether to include the target in the LSTM representations.
__init__(tokeniser, embeddings, reproducible=None, pad_size=-1, lower=True, patience=10, batch_size=32, epochs=300, embedding_layer_kwargs=None, lstm_layer_kwargs=None, dense_layer_kwargs=None, optimiser=<class 'tensorflow.python.keras.optimizer_v2.gradient_descent.SGD'>, optimiser_params=None, include_target=True)[source]
Parameters:
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split().
  • embeddings (WordVectors) – Embedding (Word vectors) to be used e.g. bella.word_vectors.SSWE
  • reproducible (Optional[int]) – Whether to be reproducible. If None then it is quicker to run. Else provide an int that will represent the random seed value.
  • pad_size (int) – The max number of tokens to use per sequence. If -1 use the text sequence in the training data that has the most tokens as the pad size.
  • lower (bool) – Whether to lower case the words being processed.
  • patience (int) – Number of epochs with no improvement before training is stopped.
  • batch_size (int) – Number of samples per gradient update.
  • epochs (int) – Number of times to train over the entire training set before stopping. If patience is set, then it may stop before reaching the number of epochs specified here.
  • embedding_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the embedding layer which is a keras.layers.Embedding object. If no parameters to pass leave as None.
  • lstm_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the lstm layer(s) which is a keras.layers.LSTM object. If no parameters to pass leave as None.
  • dense_layer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the dense (final layer) which is a keras.layers.Dense object. If no parameters to pass leave as None.
  • optimiser (OptimizerV2) – Optimiser to be used, accepts any keras optimiser. Default is keras.optimizers.SGD
  • optimiser_params (Optional[Dict[str, Any]]) – Parameters for the optimiser. If None uses default optimiser parameters.
  • include_target (bool) – Whether to include the target in the LSTM representations.
Return type:

None

create_training_text(train_data, validation_data)[source]

Converts the training and validation data into a format that the keras model can take as input.

Parameters:
  • train_data (List[Dict[str, Any]]) – Data to be trained on. This is a list of dictionaries where each dictionary has a text field containing text and a spans field containing a list of Tuples, where each Tuple represents an occurrence of the target and contains the starting and ending character indices. (The List is expected to be of size 1 as there should only be one target per target sample. This is not true for the Dong et al. dataset, therefore only the first target instance in the sentence is taken as the target.)
  • validation_data (List[Dict[str, Any]]) – Data to evaluate the model at training time. Expects the same data as the train_data parameter.
Return type:

Tuple[List[ndarray], List[ndarray]]

Returns:

A tuple of length two containing the train and validation input that are both the output of _pre_process()

include_target

include_target attribute

Return type:bool
Returns:The include_target used in the model
keras_model(num_classes)[source]

The model that represents this class. This is a custom combination of two LSTMs.

Parameters:num_classes (int) – Number of classes to predict.
Return type:Model
Returns:Two LSTMs, one forward from the left context and the other backward from the right context. The output of the two are concatenated and are input to the output layer.
model_parameters()[source]

Returns a dictionary containing the attributes of the class instance, the parameters to give to the class constructor to re-create this instance, and the class itself.

This is used by the save() method so that the instance can be re-created when loaded by the load() method.

Return type:Dict[str, Any]
classmethod name()[source]
Return type:str

bella.models.tdparse module

Module contains all of the classes that represent Machine Learning models that are within the Wang et al. paper (a short usage sketch follows the class list).

  1. bella.models.tdparse.TDParseMinus – TDParse Minus model
  2. bella.models.tdparse.TDParse – TDParse model
  3. bella.models.tdparse.TDParsePlus – TDParse Plus model
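A hedged sketch for the TDParse models; the parser below is a hypothetical placeholder, as this section only documents that a dependency parser is required, not which one to use:

    from bella.models.tdparse import TDParse
    from bella.word_vectors import SSWE

    # `my_parser` and the data arrays are hypothetical placeholders.
    model = TDParse(word_vectors=[SSWE()], parser=my_parser)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)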
class bella.models.tdparse.TDParse(word_vectors, parser, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.tdparse.TDParseMinus

__init__(word_vectors, parser, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • parser (Any) – The dependency parser to be used.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod name()[source]
Return type:str
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
class bella.models.tdparse.TDParseMinus(word_vectors, parser, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.target.TargetInd

__init__(word_vectors, parser, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • parser (Any) – The dependency parser to be used.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:

None

classmethod get_cv_parameters(word_vectors, parser, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.01], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors (List[List[WordVectors]]) – A list of lists of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • parser (List[Any]) – A list of dependency parsers to be used.
  • tokeniser – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lower – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.01]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – A list of scale values. The list can include sklearn.preprocessing.MinMaxScaler type classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:Parameters to explore through cross validation
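
For example, the returned list can be passed straight to GridSearchCV together with the template pipeline; SSWE and GloveCommonCrawl come from the parameter examples above, while my_parser and the training data are placeholders:

    from sklearn.model_selection import GridSearchCV

    param_grid = TDParseMinus.get_cv_parameters(
        word_vectors=[[SSWE()], [SSWE(), GloveCommonCrawl()]],
        parser=[my_parser],
        C=[0.01, 0.1, 1.0])
    search = GridSearchCV(TDParseMinus.pipeline(), param_grid=param_grid, cv=5)
    search.fit(X_train, y_train)  # best settings end up in search.best_params_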

classmethod get_parameters(word_vectors, parser, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • parser (Any) – The dependency parser to be used.
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:Dict[str, Any]
Returns:Model parameters

classmethod name()[source]
Return type:str
classmethod normalise_parameter_names(parameter_dict)[source]

Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters().

Return type:Dict[str, Any]
Returns:A dictionary that can be used as keyword arguments into the get_parameters() method
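
In other words the two methods form a round trip, sketched here with placeholder arguments (my_parser stands in for any dependency parser):

    # get_parameters() returns pipeline-style parameter names;
    # normalise_parameter_names() maps them back to plain keyword arguments.
    pipeline_params = TDParseMinus.get_parameters(word_vectors=[SSWE()],
                                                  parser=my_parser)
    plain_kwargs = TDParseMinus.normalise_parameter_names(pipeline_params)
    # Feeding them back in should reproduce the original dictionary:
    pipeline_params_again = TDParseMinus.get_parameters(**plain_kwargs)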
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model
class bella.models.tdparse.TDParsePlus(word_vectors, parser, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Bases: bella.models.tdparse.TDParseMinus

__init__(word_vectors, parser, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]
Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • parser (Any) – The dependency parser to be used.
  • senti_lexicon (Lexicon) – Sentiment Lexicon to be used for the Left and Right sentiment context (LS and RS).
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:None
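
Construction is the same as for TDParse apart from the extra senti_lexicon argument; my_lexicon and my_parser below are placeholders for whichever Lexicon and parser are used:

    from bella.models.tdparse import TDParsePlus

    model = TDParsePlus(word_vectors=[SSWE()],
                        parser=my_parser,
                        senti_lexicon=my_lexicon)  # drives the LS and RS features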

classmethod get_cv_parameters(word_vectors, parser, senti_lexicon, tokeniser=[<function ark_twokenize>], lower=[True], C=[0.01], random_state=[42], scale=[MinMaxScaler(copy=True, feature_range=(0, 1))])[source]

Transform the given parameters into a list of dictionaries that is accepted as the param_grid parameter in sklearn.model_selection.GridSearchCV

Parameters:
  • word_vectors (List[List[WordVectors]]) – A list of lists of word vectors e.g. [[SSWE()], [SSWE(), GloveCommonCrawl()]].
  • parser (List[Any]) – A list of dependency parsers to be used.
  • senti_lexicon (List[Lexicon]) – A list of Sentiment Lexicons to be explored for the Left and Right sentiment context (LS and RS).
  • tokeniser – A list of tokenisers to be used e.g. str.split(). Default [ark_twokenize]
  • lower – A list of bool values which indicate whether to lower case the input words. Default [True]
  • C – A list of C values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [0.01]
  • random_state – A list of random_state values for the sklearn.svm.SVC estimator that is used in the pipeline. Default [42]
  • scale – A list of scale values. The list can include sklearn.preprocessing.MinMaxScaler type classes or None if no scaling is to be used. Default [sklearn.preprocessing.MinMaxScaler]
Returns:Parameters to explore through cross validation

classmethod get_parameters(word_vectors, parser, senti_lexicon, tokeniser=<function ark_twokenize>, lower=True, C=0.01, random_state=42, scale=MinMaxScaler(copy=True, feature_range=(0, 1)))[source]

Transform the given parameters into a dictionary that is accepted as model parameters

Parameters:
  • word_vectors (List[WordVectors]) – A list of one or more word vectors to be used as feature vector lookups. If more than one is used the word vectors are concatenated together to create the feature vector for each word.
  • parser (Any) – The dependency parser to be used.
  • senti_lexicon (Lexicon) – Sentiment Lexicon to be used for the Left and Right sentiment context (LS and RS).
  • tokeniser (Callable[[str], List[str]]) – Tokeniser to be used e.g. str.split()
  • lower (bool) – Whether to lower case the words
  • C (float) – The C value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • random_state (int) – The random_state value for the sklearn.svm.SVC estimator that is used in the pipeline.
  • scale (Any) – How to scale the data before input into the estimator. If no scaling is to be used set this to None.
Return type:Dict[str, Any]
Returns:Model parameters

classmethod name()[source]
Return type:str
classmethod normalise_parameter_names(parameter_dict)[source]

Converts the output of get_parameters() into a dictionary that can be used as input into get_parameters().

Return type:Dict[str, Any]
Returns:A dictionary that can be used as keyword arguments into the get_parameters() method
static pipeline()[source]

Machine Learning model that is used as the base template for the model attribute.

Return type:Pipeline
Returns:The template machine learning model

Module contents