fiesta.TTTS

Top-Two Thompson Sampling (TTTS) (Russo, 2016) does not give each model the same number of evaluations; instead, it evaluates the better-performing models more often.

In more detail, after each of the \(N\) models has been evaluated \(3\) times, the following two steps are repeated until one model is better than the rest at a given confidence level:

  1. \(2\) models are sampled from the \(N\) models, where the sampling is weighted by each model's performance so far.

  2. One of the \(2\) sampled models is chosen at random and evaluated.
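The loop above can be sketched roughly as follows. This is an illustrative re-implementation, not fiesta's own code: it assumes a Gaussian belief over each model's mean score (see the caveat below), picks the challenger as the winner of a fresh belief draw with the top model excluded (a practical shortcut for re-sampling until a different model wins), and chooses between the two candidates uniformly at random.

```python
import math
import random

def _draw_beliefs(scores):
    """Draw one sample per model from a Gaussian belief over its mean
    score (mean = sample mean, spread = standard error of the mean)."""
    draws = []
    for s in scores:
        mean = sum(s) / len(s)
        var = sum((x - mean) ** 2 for x in s) / (len(s) - 1)
        draws.append(random.gauss(mean, math.sqrt(var / len(s))))
    return draws

def ttts_round(scores, beta=0.5):
    """One TTTS round: return the index of the model to evaluate next.

    `scores` is a list of lists: scores[i] holds the evaluation scores
    observed so far for model i (at least two each, so the variance is
    defined).  Step 1 Thompson-samples a 'top' model and a 'challenger';
    step 2 picks one of the two uniformly at random (beta = 0.5).
    """
    draws = _draw_beliefs(scores)
    top = max(range(len(scores)), key=draws.__getitem__)
    redraw = _draw_beliefs(scores)
    challenger = max((i for i in range(len(scores)) if i != top),
                     key=redraw.__getitem__)
    return top if random.random() < beta else challenger
```

Run over many rounds, the better-performing models are selected far more often than the weaker ones, which is exactly the adaptive allocation described above.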

Caveat

Assumes that the evaluation scores produced by the models follow a Gaussian (normal) distribution. For a good guide on determining which distribution your evaluation metric is likely to follow, see Dror and Reichart (2018).
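This caveat is why the logit_transform parameter below exists: bounded metrics such as accuracy or F1 live in \([0, 1]\) and tend to violate normality near the boundaries, and mapping them onto the real line with the logit function can make the Gaussian assumption more reasonable. A minimal sketch of that transform (the clipping constant here is our choice for illustration, not necessarily fiesta's):

```python
import math

def logit(p, eps=1e-6):
    """Map a score in (0, 1) onto the real line via log(p / (1 - p)).
    Clipping with `eps` guards against scores of exactly 0 or 1,
    where the logit is undefined."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))
```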

Note

All models are evaluated at least three times to ensure that the belief distributions are informed enough to begin testing whether one model is better than the rest at the given confidence level.

Note

This approach has been shown to outperform the standard non-adaptive FC approach, as shown in the paper associated with this code base and in the following notebook tutorial from this code base. Outperform here means that the best model is found in fewer evaluations/runs.

fiesta.fiesta.TTTS(data, model_functions, split_function, p_value, logit_transform=False, samples=100000)[source]
Parameters
  • data (List[Dict[str, Any]]) – A list of dictionaries, that as a whole represents the entire dataset. Each dictionary within the list represents one sample from the dataset.

  • model_functions (List[Callable[[List[Dict[str, Any]], List[Dict[str, Any]]], float]]) – A list of functions that represent different models, e.g. a PyTorch model. Each function takes a train and a test dataset as input and returns a metric score, e.g. accuracy. The model functions should not have random seeds set, as that defeats the point of finding the best model independently of the random seed and data split.

  • split_function (Callable[[List[Dict[str, Any]]], Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]]) – A function that splits the data into train and test splits. This should produce a random split each time it is called. If you would rather use a fixed split, you can hard code this function to return the same split every time.

  • p_value (float) – The significance level for declaring the best model truly the best, e.g. 0.05 if you want to be at least 95% confident.

  • logit_transform (bool) – Whether to transform the model function’s returned metric score by the logit function.

  • samples (int) – Number of samples to generate from our belief distribution for each model. This argument is passed directly to fiesta.util.belief_calc() within this function. It should be large, e.g. at least 10,000.
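A minimal sketch of how these arguments fit together. The dataset fields, the 80/20 split, and the stand-in "model" functions (which just return noisy metric scores rather than training anything) are all hypothetical, chosen only to match the shapes the parameters above describe:

```python
import random

# Hypothetical dataset: each dict is one sample (the field names are
# illustrative; fiesta does not require any particular keys).
data = [{"text": f"sample {i}", "label": i % 2} for i in range(100)]

def split_function(samples):
    """Random 80/20 train/test split, re-shuffled on every call."""
    shuffled = random.sample(samples, len(samples))
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def model_a(train, test):
    """Stand-in model function: returns a noisy 'accuracy' score."""
    return 0.80 + random.gauss(0, 0.02)

def model_b(train, test):
    return 0.75 + random.gauss(0, 0.02)

# With the pieces above, the call would look like:
# confidences, proportions, total_evals, scores = fiesta.TTTS(
#     data, [model_a, model_b], split_function, p_value=0.05)
```

Note that neither model function sets a random seed, in line with the model_functions description above.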

Return type

Tuple[List[float], List[float], int, List[List[float]]]

Returns

Tuple containing 4 values:

  1. The confidence scores for each model; the best model should have the highest confidence

  2. The number of times each model was evaluated, as a proportion of the total number of evaluations

  3. The total number of model evaluations

  4. The scores that each model generated when evaluated.

Note

If logit_transform is True, the last item in the tuple contains the scores after they have been transformed by the logit function.

Raises

ValueError – If the p_value is not between 0 and 1.