from codeE.methods import ...
State-of-the-art methods proposed in the crowdsourcing scenario, also called learning from crowds.

For the problem notation, see the documentation.
The available methods:
- Majority Voting: soft, hard, weighted.
- Dawid and Skene: ground truth (GT) inference based on confusion matrices (CM) of annotators.
- Global Label - Label Noise: GT inference based on global behavior of annotations, as a label noise problem (only one confusion matrix).
- Raykar et al: predictive model over GT inference based on CM of annotators.
- Crowd Mixture Model: inference of model and groups on annotations of the data.
- Crowd - Mixture of Annotators: inference of model and groups on annotations of the annotators.
- Global Model - Label Noise: inference of model and global behavior of annotations, as a label noise problem (only one confusion matrix).
- Multiple-Annotator Logistic Regression: predictive model over GT inference based on annotators reliability.
- Rodrigues & Pereira - CrowdLayer: The Raykar et al. model trained only with backpropagation.
- Goldberger & Ben-Reuven - NoiseLayer: The Global Model - Label Noise trained only with backpropagation.
A comparison of the methods can be found in Comparison.
class codeE.methods.LabelAgg(scenario="global")
Methods that reduce the multiple annotations y to a single target for each input pattern x. The goal is to define a ground truth z as a function of the annotations, for example a summary statistic such as the mean, the median or the mode.
The simplest and most widely used technique is Majority Voting (MV) [1], which can handle both scenario settings, individual or global.
It can handle both representations: individual and dense (for further details see the representation documentation).
Parameters
- scenario: string, {'global','individual'}, default='global'
The scenario in which the annotations will be aggregated. Subject to the representation format, for further details see the representation documentation.
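The two scenarios correspond to two array layouts. The following is only a rough NumPy illustration of those layouts, not the codeE implementation: the helper names `to_individual_onehot` and `to_global` and the toy `y_obs` are hypothetical, and -1 is assumed to mark a missing annotation.

```python
import numpy as np

# Toy annotations: 3 samples, 2 annotators, class ids in {0, 1, 2}.
# Assumption: -1 marks "this annotator did not label this sample".
y_obs = np.array([[0, 0],
                  [1, -1],
                  [2, 1]])
n_classes = 3

def to_individual_onehot(y, n_classes):
    """Individual layout: (n_samples, n_annotators, n_classes); missing labels stay all-zero."""
    onehot = np.zeros(y.shape + (n_classes,), dtype="float32")
    rows, cols = np.nonzero(y >= 0)
    onehot[rows, cols, y[rows, cols]] = 1.0
    return onehot

def to_global(onehot):
    """Global layout: (n_samples, n_classes); votes per class summed over annotators."""
    return onehot.sum(axis=1)

y_individual = to_individual_onehot(y_obs, n_classes)
r_global = to_global(y_individual)
```

In this sketch the global layout discards who gave each label and keeps only how many votes each class received per sample.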
[1] Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008, August). Get another label? improving data quality and data mining using multiple, noisy labelers.
[2] Rodrigues, F., Pereira, F., & Ribeiro, B. (2013). Learning from multiple annotators: distinguishing good from random labelers.
... #read some data
from codeE.representation import set_representation
y_obs_categorical = set_representation(y_obs,'onehot')
y_cat_var, _ = set_representation(y_obs,"onehotvar")
r_obs = set_representation(y_obs,"global")
Infer over global scenario
from codeE.methods import LabelAgg
label_A = LabelAgg(scenario="global")
mv_soft = label_A.infer(r_obs, 'softMV')
mv_hard = label_A.predict(r_obs, 'hardMV')
Infer over individual - sparse scenario
from codeE.methods import LabelAgg
label_A = LabelAgg(scenario="individual", sparse=True)
mv_soft = label_A.infer(y_cat_var, 'softMV')
mv_hard = label_A.predict(y_cat_var, 'hardMV')
Infer over individual - dense scenario
from codeE.methods import LabelAgg
label_A = LabelAgg(scenario="individual")
mv_soft = label_A.infer(y_obs_categorical, 'softMV')
mv_hard = label_A.predict(y_obs_categorical, 'hardMV')
Weighted (individual - dense scenario)
import numpy as np
from codeE.methods import LabelAgg
label_A = LabelAgg(scenario="individual")
# a one-hot row sums to 1 when the annotator labeled the sample and to 0 otherwise
T_weights = np.sum(y_obs_categorical.sum(axis=-1) != 0, axis=0) #number of annotations given per annotator
Wmv_soft = label_A.infer(y_obs_categorical, 'softMV', weights=T_weights)
Wmv_soft
| Function | Description |
|---|---|
| infer | Return the inferred label of the data |
| predict | same as infer |
infer(labels, method, weights=[1], onehot=False)
Infer the ground truth over the labels (multiple annotations) based on the indicated method.
Parameters
- labels: array-like of shape
scenario='global': (n_samples, n_classes)
scenario='individual': (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual or global representation.
- method: string, {'softmv','hardmv'}
The method used to infer the ground truth of the data, based on the most used aggregation technique: Majority Voting (MV) [2].
hardmv: The categorical (discrete) value of the ground truth is obtained as the most frequent label among the observed annotations.
softmv: The probability estimation of the ground truth is obtained as the relative frequency of each possible label over all the observed annotations.
- weights: array-like or list of shape (n_annotators,), default=[1]
The weights over annotators to use in the aggregation scheme, if necessary. There is no restriction on this value (the sum does not need to be 1).
- onehot: boolean, default=False
Only used if method='hardmv'. Controls the returned representation: one-hot vectors or class numbers (between 0 and K-1).
Returns
- Z_hat: array-like of shape (n_samples, n_classes) or (n_samples,)
The inferred ground truth of the data based on the selected method. The returned shape will be (n_samples,) if method='hardmv' and onehot=False.
predict(*args)
Same operation as the infer function.
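The soft, hard and weighted variants of majority voting described above can be sketched in plain NumPy. This is an illustrative re-implementation under stated assumptions, not the code behind `LabelAgg`: the function names `soft_mv` and `hard_mv` are hypothetical, ties are broken toward the lowest class index, and every sample is assumed to have at least one annotation.

```python
import numpy as np

def soft_mv(labels, weights=None):
    """Relative (optionally weighted) frequency of each class per sample."""
    labels = np.asarray(labels, dtype=float)
    if labels.ndim == 3:  # individual: (n_samples, n_annotators, n_classes)
        if weights is None:
            weights = np.ones(labels.shape[1])
        votes = np.einsum("itk,t->ik", labels, np.asarray(weights, dtype=float))
    else:                 # global: (n_samples, n_classes)
        votes = labels
    return votes / votes.sum(axis=1, keepdims=True)

def hard_mv(labels, weights=None, onehot=False):
    """Most voted class per sample, as class ids or one-hot vectors."""
    probs = soft_mv(labels, weights)
    winners = probs.argmax(axis=1)  # ties resolve to the lowest class index
    return np.eye(probs.shape[1])[winners] if onehot else winners

# One sample, two annotators that disagree: annotator 0 votes class 0, annotator 1 votes class 1.
y_ann = np.array([[[1, 0], [0, 1]]])
unweighted = soft_mv(y_ann)               # both classes get half of the votes
weighted = soft_mv(y_ann, weights=[1, 3]) # annotator 1 counts three times as much
```

The weights only need to be non-negative; the normalization in `soft_mv` makes their scale irrelevant.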
class codeE.methods.LabelInf_EM(init_Z='softmv', priors=0, fast=False, DTYPE_OP='float32')
Method that infers the ground truth label with a probabilistic framework based on the EM algorithm [1]. It represents each annotator's ability to detect the ground truth as a confusion matrix.
This method was proposed in the framework of Dawid and Skene (D&S) [2].
It works on the individual dense representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv'}, default='softmv'
The method used to initialize the ground truth probabilities in the EM algorithm. The softmv and hardmv possibilities are based on the LabelAgg class.
- priors: different options
The priors to be set on the confusion matrices of the annotators. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the annotators over all the data.
  - array-like of shape (n_annotators,)
  A vector with the number of annotations to be set as prior for every possible annotator on the data.
  - array-like of shape (n_annotators, n_classes)
  A matrix with the number of annotations to be set as priors for every annotator and every ground truth label on the data.
  - array-like of shape (n_annotators, n_classes, n_classes)
  A cube with the number of annotations to be set as priors for every annotator, every ground truth label and every observed label on the data.
Comments on the priors: The Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- fast: boolean, default=False
Whether the fast estimation of the method is used. This corresponds to performing a discrete/hard estimation of the ground truth label after the E-step. According to Sinha et al. [4], it accelerates convergence (reduces the number of iterations needed to converge).
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
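To make the role of the priors concrete, the sketch below treats a prior as pseudo-counts added to the (ground truth, observed label) counts of an annotator before row normalization. This interpretation is an assumption for illustration, codeE may apply its priors differently, and the helper name `confusion_from_counts` is hypothetical.

```python
import numpy as np

def confusion_from_counts(counts, prior=0.0):
    """Row-normalize (ground truth x observed label) counts after adding prior pseudo-counts."""
    smoothed = np.asarray(counts, dtype=float) + prior
    return smoothed / smoothed.sum(axis=-1, keepdims=True)

# One annotator, two classes. When the truth was class 0 this annotator
# was never observed answering class 1.
counts = np.array([[[3, 0],
                    [1, 1]]])
no_prior = confusion_from_counts(counts)            # first row becomes [1.0, 0.0]
laplace = confusion_from_counts(counts, prior=1.0)  # first row becomes [0.8, 0.2]
```

Without a prior the zero count becomes a hard zero probability, which can freeze the EM estimates; the Laplace-style pseudo-count keeps every entry strictly positive at the cost of a small bias.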
[1] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm
[2] Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error‐rates using the EM algorithm
[3] Smyth, P., Fayyad, U. M., Burl, M. C., Perona, P., & Baldi, P. (1995). Inferring ground truth from subjective labelling of venus images.
[4] Sinha, V. B., Rao, S., & Balasubramanian, V. N. (2018). Fast Dawid-Skene: A Fast Vote Aggregation Scheme for Sentiment Classification
... #read some data
from codeE.representation import set_representation
y_obs_categorical = set_representation(y_obs,'onehot')
Train the model
from codeE.methods import LabelInf_EM as DS
DS_model = DS(init_Z='softmv')
hist = DS_model.fit(y_obs_categorical)
Train with different settings
from codeE.methods import LabelInf_EM as DS
DS_model = DS(init_Z='hardmv', priors=1, fast=True)
hist = DS_model.fit(y_obs_categorical)
Get the estimation of the ground truth label on the training set and the ground truth marginal
print("p(z) =", DS_model.get_marginalZ())
ds_labels = DS_model.infer()
| Function | Description |
|---|---|
| get_marginalZ | Returns the parameter used as marginal probability of ground truth |
| get_confusionM | Returns the modeled confusion matrices |
| get_qestimation | Returns the estimation over the auxiliary variable Q |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| C_step | Perform an auxiliary C-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| fit | same as train |
| infer | Get the inferred ground truth on training set |
| predict | same as infer |
| get_ann_confusionM | Returns the estimation of individual confusion matrices |
get_marginalZ()
Returns
- z_marginal: array-like of shape (n_classes,)
The parameter used as marginal probability of the ground truth
get_confusionM()
Returns
get_qestimation()
Returns
- Qi_k: array-like of shape (n_samples, n_classes)
The estimation over the auxiliary variable Q
set_priors(priors)
Parameters
- priors: as stated on init
init_E(y_ann, method="")
Initialization of the E-step based on method.
Parameters
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities in the EM algorithm. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
E_step(y_ann)
Perform the inference on the E-step.
Parameters
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
M_step(y_ann)
Perform the inference on the M-step.
Parameters
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
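For orientation, one iteration of the Dawid and Skene scheme can be sketched in NumPy as follows. This is a textbook-style sketch, not the library's code: the function names `m_step` and `e_step`, the default pseudo-count `prior=1.0` and the toy labels are assumptions made for the example.

```python
import numpy as np

def m_step(y_ann, q, prior=1.0):
    """Update p(z) and the per-annotator confusion matrices from the posterior q."""
    z_marginal = q.mean(axis=0)                            # (n_classes,)
    counts = np.einsum("ik,itj->tkj", q, y_ann) + prior    # (n_annotators, K, K)
    conf = counts / counts.sum(axis=-1, keepdims=True)     # rows: truth, cols: observed
    return z_marginal, conf

def e_step(y_ann, z_marginal, conf, eps=1e-12):
    """Posterior over the ground truth given the annotations (computed in log space)."""
    log_q = np.log(z_marginal + eps) + np.einsum("itj,tkj->ik", y_ann, np.log(conf + eps))
    log_q -= log_q.max(axis=1, keepdims=True)              # numerical stability
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)

# Toy data: 4 samples, 3 annotators, 2 classes. Annotator 2 disagrees on samples 1 and 3.
labels = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0]])
y_ann = np.eye(2)[labels]   # one-hot, shape (4, 3, 2)
q = y_ann.mean(axis=1)      # soft majority voting as initialization
for _ in range(5):
    z_marginal, conf = m_step(y_ann, q)
    q = e_step(y_ann, z_marginal, conf)
```

After a few iterations the confusion matrix of the inconsistent annotator is flatter than those of the two annotators that agree, so its votes weigh less in `q`.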
C_step()
Perform an auxiliary discrete/hard step on the estimation of the ground truth label.
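A discrete/hard step of this kind can be pictured as replacing each row of the soft estimation by a one-hot vector at its most probable class. The sketch below is an assumption about what such a step does, with a hypothetical helper name `c_step`; it is not taken from the codeE source.

```python
import numpy as np

def c_step(q):
    """Replace each soft posterior row by a one-hot vector at its argmax."""
    hard = np.zeros_like(q)
    hard[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return hard

q_soft = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
q_hard = c_step(q_soft)
```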
compute_logL()
Calculate the log-likelihood of the data.
train(y_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference in the parameters and the loss between consecutive iterations below which training stops.
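A stopping rule of this kind could look like the sketch below. How codeE measures the relative difference is not stated here, so the maximum element-wise relative change used in `relative_change`, and both function names, are assumptions for illustration only.

```python
import numpy as np

def relative_change(new, old, eps=1e-12):
    """Largest element-wise relative difference between two iterates."""
    new = np.asarray(new, dtype=float)
    old = np.asarray(old, dtype=float)
    return float(np.max(np.abs(new - old) / (np.abs(old) + eps)))

def has_converged(new_params, old_params, new_logL, old_logL, tolerance=3e-2):
    """Stop when both the parameters and the log-likelihood have stabilized."""
    return (relative_change(new_params, old_params) <= tolerance
            and relative_change(new_logL, old_logL) <= tolerance)

stable = has_converged([0.5, 0.5], [0.5, 0.5], -10.0, -10.1)
moving = has_converged([0.9, 0.1], [0.5, 0.5], -10.0, -10.1)
```

An EM loop would evaluate such a check after every M-step and also stop once `max_iter` iterations have been performed.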
fit(*args)
Same operation as the train function.
infer()
Returns
- Qi_k: array-like of shape (n_samples, n_classes)
The estimation of the ground truth on the training data.
predict(*args)
Same operation as the infer function.
get_ann_confusionM()
Returns
- prob_Y_Zt: array-like of shape (n_annotators, n_classes, n_classes)
The estimation of the individual confusion matrix of each annotator.
class codeE.methods.LabelInf_EM_G(init_Z="softmv", priors=0, DTYPE_OP='float32')
Same idea as Model Inference based on EM - Label Noise. An extension, based on Label Inference, that infers the true label without a predictive model over the ground truth. It is based on a single confusion matrix.
It works on the global representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv'}, default='softmv'
The method used to initialize the ground truth probabilities in the EM algorithm. The softmv and hardmv possibilities are based on the LabelAgg class.
- priors: different options
The priors to be set on the confusion matrices. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the groups over all the data.
  - array-like of shape (n_groups,)
  A vector with the number of annotations to be set as prior for every possible group on the data.
  - array-like of shape (n_groups, n_classes)
  A matrix with the number of annotations to be set as priors for every group and every ground truth label on the data.
  - array-like of shape (n_groups, n_classes, n_classes)
  A cube with the number of annotations to be set as priors for every group, every ground truth label and every observed label on the data.
Comments on the priors: The Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
X_train = ...
... #read some data
from codeE.representation import set_representation
r_obs = set_representation(y_obs, "global")
Train the model
from codeE.methods import LabelInf_EM_G
LI_G_model = LabelInf_EM_G(init_Z='softmv')
hist = LI_G_model.fit(r_obs)
Train with different settings
from codeE.methods import LabelInf_EM_G
LI_G_model = LabelInf_EM_G(init_Z='hardmv', priors=1)
hist = LI_G_model.fit(r_obs)
Get the estimation of the ground truth label on the training set and the ground truth marginal
print("p(z) =", LI_G_model.get_marginalZ())
g_labels = LI_G_model.infer()
| Function | Description |
|---|---|
| get_marginalZ | Returns the parameter used as marginal probability of ground truth |
| get_confusionM | Returns the unique confusion matrix modeled |
| get_qestimation | Returns the estimation over the auxiliary model Q |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| fit | same as train |
| infer | Get the inferred ground truth on training set |
| predict | same as infer |
| get_global_confusionM | Returns the estimation of global confusion matrix |
get_marginalZ()
get_confusionM()
Returns
get_qestimation()
Returns
- Qi_k: array-like of shape (n_samples, n_classes)
The estimation over the auxiliary model Q
set_priors(priors)
Parameters
- priors: as stated on init
init_E(r_ann, method="")
Initialization of the E-step based on method.
Parameters
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities in the EM algorithm. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
E_step(r_ann)
Perform the inference on the E-step.
Parameters
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
M_step(r_ann)
Perform the inference on the M-step.
Parameters
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
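With a single confusion matrix and the global representation, the two steps reduce to a few matrix products. The sketch below is an illustrative NumPy version under stated assumptions (hypothetical function names, a pseudo-count `prior=1.0`, soft majority voting as initialization); it is not the library's implementation.

```python
import numpy as np

def m_step_global(r_ann, q, prior=1.0):
    """Update p(z) and the single confusion matrix from the posterior q."""
    z_marginal = q.mean(axis=0)
    counts = q.T @ r_ann + prior                       # (K, K): rows truth, cols observed
    conf = counts / counts.sum(axis=1, keepdims=True)
    return z_marginal, conf

def e_step_global(r_ann, z_marginal, conf, eps=1e-12):
    """Posterior over the ground truth given the vote counts of each sample."""
    log_q = np.log(z_marginal + eps) + r_ann @ np.log(conf + eps).T
    log_q -= log_q.max(axis=1, keepdims=True)          # numerical stability
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)

# Toy vote counts: 4 samples, 2 classes, 3 annotations per sample.
r_ann = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]])
q = r_ann / r_ann.sum(axis=1, keepdims=True)           # soft majority voting init
for _ in range(5):
    z_marginal, conf = m_step_global(r_ann, q)
    q = e_step_global(r_ann, z_marginal, conf)
```

Because there is only one confusion matrix, the model captures the overall noise of the crowd rather than the behavior of any individual annotator.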
compute_logL()
Calculate the log-likelihood of the data.
train(r_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference in the parameters and the loss between consecutive iterations below which training stops.
fit(R, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the train function.
get_global_confusionM()
Returns
class codeE.methods.ModelInf_EM(init_Z='softmv', n_init_Z=0, priors=0, DTYPE_OP='float32')
This method sets a predictive model of the ground truth inside the inference for joint learning. It also represents each annotator's ability as a confusion matrix and allows any model for f().
The original idea (learning from crowds) was proposed by Raykar et al. [1].
It works on the individual dense representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv','model'}, default='softmv'
The method used to initialize the ground truth probabilities in the EM algorithm. The softmv and hardmv possibilities are based on the LabelAgg class. The 'model' option trains the predictive model on the hardmv labels for n_init_Z epochs and uses its predictions to initialize the ground truth.
- n_init_Z: int, default=0
The number of epochs for which the predictive model is pre-trained; only used if init_Z='model'.
- priors: different options
The priors to be set on the confusion matrices of the annotators. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the annotators over all the data.
  - array-like of shape (n_annotators,)
  A vector with the number of annotations to be set as prior for every possible annotator on the data.
  - array-like of shape (n_annotators, n_classes)
  A matrix with the number of annotations to be set as priors for every annotator and every ground truth label on the data.
  - array-like of shape (n_annotators, n_classes, n_classes)
  A cube with the number of annotations to be set as priors for every annotator, every ground truth label and every observed label on the data.
Comments on the priors: The Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
[1] Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds.
[2] Albarqouni, S., Baur, C., Achilles, F., Belagiannis, V., Demirci, S., & Navab, N. (2016). Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images
[3] Rodrigues, F., & Pereira, F. (2017). Deep learning from crowds
X_train = ...
... #read some data
from codeE.representation import set_representation
y_obs_categorical = set_representation(y_obs,'onehot')
Define predictive model (based on keras)
from keras.models import Sequential #or tensorflow.keras.models, depending on the installation
F_model = Sequential()
... #add layers
Set the model and train the method
from codeE.methods import ModelInf_EM as Raykar
R_model = Raykar(init_Z="softmv")
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
R_model.set_model(F_model, **args)
R_model.fit(X_train, y_obs_categorical)
Train with different settings
from codeE.methods import ModelInf_EM as Raykar
R_model = Raykar(init_Z="softmv", priors='laplace')
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
R_model.set_model(F_model, **args)
R_model.fit(X_train, y_obs_categorical, runs=20)
Get the base model to predict the ground truth on some data
raykar_fx = R_model.get_basemodel()
raykar_fx.predict(new_X)
| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_confusionM | Returns the modeled confusion matrices |
| get_qestimation | Returns the estimation over the auxiliary variable Q |
| set_model | Set the predictive model (base model) class |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| multiples_run | Perform multiple runs of the EM algorithm and save the best |
| fit | same as multiples_run |
| get_ann_confusionM | Returns the estimation of individual confusion matrices |
| get_predictions | Returns the probability predictions of ground truth over some set |
| get_predictions_annot | Returns the probability estimation of labels over some set for each annotator |
get_basemodel()
Returns
get_confusionM()
Returns
get_qestimation()
Returns
- Qi_k: array-like of shape (n_samples, n_classes)
The estimation over the auxiliary variable Q
set_model(model, optimizer="adam", epochs=1, batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it on the M-step of the iterative EM algorithm.
Parameters
- model: function or class of keras model
Predictive model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
String name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/
- epochs: int, default=1
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
set_priors(priors)
Parameters
- priors: as stated on init
init_E(X, y_ann, method="")
Initialization of the E-step based on method.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities in the EM algorithm. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
E_step(X, y_ann, predictions=[])
Perform the inference on the E-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- predictions: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth on the training set. Not needed if X is given (default=[]).
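Compared with the Dawid and Skene E-step, the class marginal is replaced by the per-sample prediction of the base model. A minimal NumPy sketch of that combination, with a hypothetical function name and toy values that are not taken from the library, might be:

```python
import numpy as np

def e_step_with_model(y_ann, predictions, conf, eps=1e-12):
    """Posterior over the ground truth: model prediction times annotator likelihoods."""
    log_q = np.log(predictions + eps) + np.einsum("itj,tkj->ik", y_ann, np.log(conf + eps))
    log_q -= log_q.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)

# One sample, one mildly reliable annotator who answers class 1,
# while the base model strongly predicts class 0.
y_ann = np.array([[[0.0, 1.0]]])
conf = np.array([[[0.6, 0.4],
                  [0.4, 0.6]]])
predictions = np.array([[0.9, 0.1]])
q = e_step_with_model(y_ann, predictions, conf)  # proportional to [0.9*0.4, 0.1*0.6]
```

Here the confident model outweighs the single weak annotation, so the posterior still favors class 0.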
M_step(X, y_ann)
Perform the inference on the M-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
compute_logL()
Calculate the log-likelihood of the data.
train(X_train, y_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference in the parameters and the loss between consecutive iterations below which training stops.
multiples_run(Runs, X, y_ann, max_iter=50, tolerance=3e-2)
Perform multiple runs of the EM algorithm and save the best execution based on the log-likelihood.
Parameters
- Runs: int
The number of times the EM algorithm will be run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference in the parameters and the loss between consecutive iterations below which training stops.
Returns
- found_logL: list of length=Runs
A list with the history of the log-likelihood for each iteration in the different runs.
- best_run: int
The index of the best run among all those executed, a number between 0 and Runs-1.
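The selection logic can be pictured with the generic sketch below. It assumes that "best" means the highest final log-likelihood, which is not spelled out in this text, and the names `multiples_run_sketch`, `run_once` and `histories` are hypothetical stand-ins for a full EM execution.

```python
import numpy as np

def multiples_run_sketch(run_once, n_runs):
    """Run EM several times and pick the run whose final log-likelihood is highest."""
    found_logL = [run_once(run) for run in range(n_runs)]  # one history per run
    best_run = int(np.argmax([history[-1] for history in found_logL]))
    return found_logL, best_run

# Stand-in for EM: each "run" just returns a pre-computed log-likelihood history.
histories = [[-9.0, -5.0], [-8.0, -3.0], [-7.0, -3.5]]
found_logL, best_run = multiples_run_sketch(lambda run: histories[run], n_runs=3)
```

Several runs help because EM only finds a local optimum that depends on the (random) initialization of the base model.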
fit(X, Y, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the multiples_run function.
get_ann_confusionM()
Returns
- prob_Y_Zt: array-like of shape (n_annotators, n_classes, n_classes)
The estimation of the individual confusion matrix of each annotator.
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set.
get_predictions_annot(X, data=[])
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
- data: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set, if they are already available.
Returns
- prob_Y_xt: array-like of shape (n_samples, n_annotators, n_classes)
The probability estimation of labels over some set for each annotator.
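Under the confusion-matrix model, the label distribution of an annotator for an input follows from marginalizing the ground truth: p(y=j | x, t) = sum_k p(z=k | x) p(y=j | z=k, t). The sketch below illustrates that product with hypothetical names and toy values; it is an assumption about the computation, not the library's code.

```python
import numpy as np

def annotator_predictions(prob_Z_hat, conf):
    """Mix each annotator's confusion matrix rows with the ground truth predictions."""
    # prob_Z_hat: (n_samples, n_classes); conf: (n_annotators, n_classes, n_classes)
    return np.einsum("ik,tkj->itj", prob_Z_hat, conf)

prob_Z_hat = np.array([[0.8, 0.2]])
conf = np.array([[[0.9, 0.1], [0.2, 0.8]],    # a fairly reliable annotator
                 [[0.5, 0.5], [0.5, 0.5]]])   # an annotator answering at random
prob_Y_xt = annotator_predictions(prob_Z_hat, conf)
```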
class codeE.methods.ModelInf_EM_CMM(M, init_Z="softmv", n_init_Z=0, priors=0, DTYPE_OP='float32')
This method infers a predictive model of the ground truth jointly with the ground truth inference, based on groups over the data annotations. Contrary to other methods, it does not have an explicit model per annotator. It represents each group's ability as a confusion matrix and allows any model for f().
The original CMM (Crowd Mixture Model) method was proposed by Mena et al. [1].
It works on the global representation (for further details see the representation documentation).
Parameters
- M: int
The number of groups (n_groups) to be found (different types of behaviors) in the annotations.
  - If M=1, returns a ModelInf_EM_G instance (with a global confusion matrix).
- init_Z: string, {'softmv','hardmv','model'}, default='softmv'
The method used to initialize the ground truth probabilities in the EM algorithm. The softmv and hardmv possibilities are based on the LabelAgg class. The 'model' option trains the predictive model on the hardmv labels for n_init_Z epochs and uses its predictions to initialize the ground truth.
- n_init_Z: int, default=0
The number of epochs for which the predictive model is pre-trained; only used if init_Z='model'.
- priors: different options
The priors to be set on the confusion matrices of the groups. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the groups over all the data.
  - array-like of shape (n_groups,)
  A vector with the number of annotations to be set as prior for every possible group on the data.
  - array-like of shape (n_groups, n_classes)
  A matrix with the number of annotations to be set as priors for every group and every ground truth label on the data.
  - array-like of shape (n_groups, n_classes, n_classes)
  A cube with the number of annotations to be set as priors for every group, every ground truth label and every observed label on the data.
Comments on the priors: The Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
[1] Mena, F., & Ñanculef, R. (2019, October). Revisiting Machine Learning from Crowds a Mixture Model for Grouping Annotations
[2] Mena, F., Ñanculef, R., & Valles, C. (2020). Collective Annotation Patterns in Learning from Crowds
X_train = ...
... #read some data
from codeE.representation import set_representation
r_obs = set_representation(y_obs, "global")
Define predictive model (based on keras)
from keras.models import Sequential #or tensorflow.keras.models, depending on the installation
F_model = Sequential()
... #add layers
Set the model and train the method
from codeE.methods import ModelInf_EM_CMM as CMM
CMM_model = CMM(M=3)
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
CMM_model.set_model(F_model, **args)
CMM_model.fit(X_train, r_obs)
Train with different settings
from codeE.methods import ModelInf_EM_CMM as CMM
CMM_model = CMM(M=3, init_Z='model', n_init_Z=3, priors=0)
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
CMM_model.set_model(F_model, **args)
CMM_model.fit(X_train, r_obs, runs=20)
Get the base model to predict the ground truth on some data
cmm_fx = CMM_model.get_basemodel()
cmm_fx.predict(new_X)
Get the presence probability and the confusion matrices of the groups
print("p(g) =",CMM_model.get_alpha())
B = CMM_model.get_confusionM()
from codeE.utils import plot_confusion_matrix
for i in range(len(B)):
    plot_confusion_matrix(B[i])
| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_confusionM | Returns the modeled confusion matrices |
| get_alpha | Returns the presence probability of each group |
| get_qestimation | Returns the estimation over the auxiliary model Q |
| set_model | Set the predictive model (base model) class |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| multiples_run | Perform multiple runs of the EM algorithm and save the best |
| fit | same as multiples_run |
| get_global_confusionM | Returns the estimation of global confusion matrix |
| get_ann_confusionM | Returns the estimation of individual confusion matrix of some annotator based on his annotations on the data |
| get_predictions | Returns the probability predictions of ground truth over some set |
| get_predictions_groups | Returns the probability estimation of labels over the modeled groups |
get_basemodel()
Returns
get_confusionM()
Returns
get_alpha()
Returns
get_qestimation()
Returns
- Qij_mk: array-like of shape (n_samples, n_classes, n_groups, n_classes)
The estimation over the auxiliary model Q
set_model(model, optimizer="adam", epochs=1, batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it on the M-step of the iterative EM algorithm.
Parameters
- model: function or class of keras model
Predictive model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
String name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/
- epochs: int, default=1
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
set_priors(priors)
Parameters
- priors: as stated on init
init_E(X, r_ann, method="")
Initialization of the E-step based on an approximation: the groups g are initialized with a K-means clustering and the ground truth z is initialized according to method.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities in the EM algorithm. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
E_step(X, predictions=[])
Perform the inference on the E-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- predictions: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth on the training set. Not needed if X is given (default=[]).
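To relate this step to the four-dimensional array returned by get_qestimation, the sketch below computes, for every sample i and observed label j, a joint posterior over group m and ground truth k proportional to p(g=m) p(z=k | x_i) p(y=j | z=k, g=m). This follows the mixture formulation as understood here and is an assumption, not the library's code; the names `cmm_e_step`, `alpha` and `beta` are hypothetical.

```python
import numpy as np

def cmm_e_step(predictions, alpha, beta):
    """Joint posterior over (group, ground truth) for each (sample, observed label)."""
    # predictions: (n_samples, n_classes)      p(z=k | x_i)
    # alpha:       (n_groups,)                 p(g=m)
    # beta:        (n_groups, K, K)            p(y=j | z=k, g=m), rows k, cols j
    joint = np.einsum("m,ik,mkj->ijmk", alpha, predictions, beta)
    return joint / joint.sum(axis=(2, 3), keepdims=True)

predictions = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha = np.array([0.6, 0.4])
beta = np.array([[[0.9, 0.1], [0.1, 0.9]],    # a reliable group
                 [[0.5, 0.5], [0.5, 0.5]]])   # a group answering at random
Qij_mk = cmm_e_step(predictions, alpha, beta)
```

The result has shape (n_samples, n_classes, n_groups, n_classes), and each slice `Qij_mk[i, j]` sums to 1 over its group and ground truth axes.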
M_step(X, r_ann)
Perform the inference on the M-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
compute_logL(r_ann)
Calculate the log-likelihood of the data.
train(X_train, r_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
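The role of max_iter and tolerance can be sketched as a generic EM loop. This is a sketch under assumptions, not the library code: `run_em`, `relative_change` and the callback `em_iteration` are hypothetical names, and the exact norm used for the relative difference is assumed.

```python
import numpy as np

def relative_change(new, old, eps=1e-12):
    """Relative L1 difference between two arrays or scalars."""
    new, old = np.asarray(new, dtype=float), np.asarray(old, dtype=float)
    return np.abs(new - old).sum() / (np.abs(old).sum() + eps)

def run_em(em_iteration, params, max_iter=50, tolerance=3e-2):
    """Iterate until parameters and log-likelihood both change less than tolerance.

    em_iteration(params) -> (new_params, logL) is a hypothetical callback
    that performs one E-step followed by one M-step.
    """
    history, old_logL = [], None
    for _ in range(max_iter):
        new_params, logL = em_iteration(params)
        history.append(logL)
        small_params = relative_change(new_params, params) <= tolerance
        small_loss = old_logL is not None and relative_change(logL, old_logL) <= tolerance
        params, old_logL = new_params, logL
        if small_params and small_loss:
            break
    return params, history

# Toy iteration: the parameters move halfway toward 1 at every step.
def toy_iteration(p):
    new_p = p + 0.5 * (1.0 - p)
    return new_p, -float(np.sum((1.0 - new_p) ** 2)) - 1.0

final_params, logL_history = run_em(toy_iteration, np.array([0.0, 0.2]))
```

With the toy iteration the loop stops well before max_iter, because both relative changes shrink geometrically.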
multiples_run(Runs, X, r_ann, max_iter=50, tolerance=3e-2)
Perform multiple runs of the EM algorithm and save the best execution based on the log-likelihood.
Parameters
- Runs: int
The number of times the EM algorithm is run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
Returns
- found_logL: list of length=Runs
A list with the history of the log-likelihood for each iteration in the different runs.
- best_run: int
The index of the best run among all those executed, a number between 0 and Runs-1.
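Selecting the best of several runs can be sketched as follows. The names `multiple_runs` and `run_once` are hypothetical, and the sketch assumes that runs are compared by their final log-likelihood; it also returns the state of the best run, which is an addition for illustration.

```python
import numpy as np

def multiple_runs(run_once, runs, seed=0):
    """Run EM several times and keep the run with the highest final log-likelihood.

    run_once(rng) -> (state, logL_history) is a hypothetical callback that
    performs one full EM execution from a random initialization.
    """
    rng = np.random.default_rng(seed)
    found_logL, best_run, best_state = [], None, None
    for r in range(runs):
        state, logL_history = run_once(rng)
        found_logL.append(logL_history)
        if best_run is None or logL_history[-1] > found_logL[best_run][-1]:
            best_run, best_state = r, state
    return found_logL, best_run, best_state

# Toy run: a random starting log-likelihood that improves within the run.
def toy_run(rng):
    start = rng.uniform(-5.0, -1.0)
    return {"start": start}, [start, start / 2.0]

found_logL, best_run, best_state = multiple_runs(toy_run, runs=4)
```

Because EM only reaches a local optimum, restarting from different initializations and keeping the best run is a common way to reduce the dependence on the initialization.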
fit(X, Y, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the multiples_run function.
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set.
get_global_confusionM()Returns
get_ann_confusionM(X, Y)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
- Y: array-like of shape (n_samples, )
Annotations of a specific annotator t. Samples without a label are marked with -1.
Returns
- ann_prob_Y_Z: array-like of shape (n_classes, n_classes)
The estimation of the individual confusion matrix of annotator t.
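One simple way to estimate a confusion matrix for a single annotator from ground truth predictions is a soft count, sketched below. This is an assumption for illustration and not necessarily how the library computes it: the function `annotator_confusion` is hypothetical and takes the ground truth probabilities directly instead of X.

```python
import numpy as np

def annotator_confusion(prob_Z, y_t, n_classes):
    """Soft-count estimate of p(y | z) for one annotator (illustrative).

    prob_Z: (n_samples, n_classes) ground truth probabilities.
    y_t:    (n_samples,) labels given by the annotator, -1 where no label.
    """
    prob_Z = np.asarray(prob_Z, dtype=float)
    y_t = np.asarray(y_t)
    labeled = y_t != -1                      # ignore samples without a label
    onehot_y = np.eye(n_classes)[y_t[labeled]]
    counts = prob_Z[labeled].T @ onehot_y    # rows: ground truth z, columns: label y
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1.0, row_sums)

prob_Z_example = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_t_example = np.array([0, 1, 1, -1])
conf_t = annotator_confusion(prob_Z_example, y_t_example, n_classes=2)
```

Each row of the result is the distribution of the labels the annotator gives when the ground truth is the class of that row.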
get_predictions_groups(X, data=[])
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
- data: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth over some set, if they are already available (default=[]).
Returns
- prob_Y_xg: array-like of shape (n_samples, n_groups, n_classes)
The probability estimation of the labels over some set for each group.
class codeE.methods.ModelInf_EM_CMOA(M, init_Z="softmv", init_G="", n_init_Z=0, n_init_G=0, priors=0, DTYPE_OP='float32')
This method infers a predictive model of the ground truth jointly with the ground truth itself, based on groups over the annotations of the annotators. Contrary to other methods, it does not have an explicit model per annotator. It represents the ability of each group as a confusion matrix and allows any model on f().
It requires a group model that assigns annotators a to groups g.
The original C-MoA (Crowd - Mixture of Annotators) method was proposed by Mena et al. [2].
It is proposed on the individual sparse representation (for further details see the representation documentation).
Parameters
- M: int
The number of groups (n_groups) to be found (different types of behaviors) in the annotations.
- init_Z: string, {'softmv','hardmv','model'}, default='softmv'
The method used to initialize the ground truth probabilities for the EM algorithm. The 'softmv' and 'hardmv' options are based on the LabelAgg class. The 'model' option trains the predictive model over hardmv for n_init_Z epochs and uses its predictions to initialize the ground truth.
- init_G: string, {'','model'}, default=''
The method used to pre-train the group model. The 'model' option trains the group model for n_init_G epochs. The empty string means no pre-training.
- n_init_Z: int, default=0
The number of epochs that the predictive model is pre-trained, only used if init_Z='model'.
- n_init_G: int, default=0
The number of epochs that the group model is pre-trained, only used if init_G='model'.
- priors: different options
The priors to be set on the confusion matrices of the groups. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the groups over all the data.
  - array-like of shape (n_groups,)
  A vector with the number of annotations to be set as prior for every possible group on the data.
  - array-like of shape (n_groups, n_classes)
  A matrix with the number of annotations to be set as prior for every group and every ground truth label on the data.
  - array-like of shape (n_groups, n_classes, n_classes)
  A cube with the number of annotations to be set as prior for every group, every ground truth label and every observed label on the data.
Comments on the priors: the Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
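How the different prior formats could map onto a full cube of pseudo-counts is sketched below. This is an illustration under assumptions, not the library code: `expand_priors` and `smoothed_confusion` are hypothetical helpers, and 'none' is assumed to mean a prior of 0.

```python
import numpy as np

def expand_priors(priors, n_groups, n_classes):
    """Expand any accepted prior format to shape (n_groups, n_classes, n_classes)."""
    shape = (n_groups, n_classes, n_classes)
    if isinstance(priors, str):
        value = {"laplace": 1.0, "none": 0.0}[priors.lower()]  # 'none' -> 0 is assumed
        return np.full(shape, value)
    priors = np.asarray(priors, dtype=float)
    if priors.ndim == 0:    # one value for everything
        return np.full(shape, float(priors))
    if priors.ndim == 1:    # one value per group
        return np.broadcast_to(priors[:, None, None], shape).copy()
    if priors.ndim == 2:    # one value per group and ground truth label
        return np.broadcast_to(priors[:, :, None], shape).copy()
    if priors.ndim == 3:    # full cube
        return priors.reshape(shape).copy()
    raise ValueError("unsupported priors format")

def smoothed_confusion(counts, priors):
    """Add the pseudo-counts and normalize over the observed label axis."""
    smoothed = counts + priors
    return smoothed / smoothed.sum(axis=-1, keepdims=True)

laplace_cube = expand_priors("laplace", n_groups=2, n_classes=3)
uniform_conf = smoothed_confusion(np.zeros((2, 3, 3)), laplace_cube)
```

With no observed annotations, the Laplace prior yields uniform rows, which shows why it stabilizes the first iterations.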
[1] Mena, F., & Ñanculef, R. (2019, October). Revisiting Machine Learning from Crowds a Mixture Model for Grouping Annotations
[2] Mena, F., Ñanculef, R., & Valles, C. (2020). Collective Annotation Patterns in Learning from Crowds
X_train = ...
... # read some data
from codeE.representation import set_representation
y_cat_var, A_idx_var = set_representation(y_obs, "onehotvar")

Define the predictive model (based on keras)
from keras.models import Sequential
F_model = Sequential()
... # add layers

Define the group model (based on keras)
import numpy as np
from keras.layers import Embedding, Reshape
T = np.concatenate(A_idx_var).max() + 1  # largest annotator identifier + 1
group_model = Sequential()
group_model.add(Embedding(T, 8, input_length=1,
                          trainable=True, weights=[A_rep]))
group_model.add(Reshape([K]))
... # add dense (feed forward) layers

Set the model and train the method
from codeE.methods import ModelInf_EM_CMOA as CMOA
CMOA_model = CMOA(M=3)
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
CMOA_model.set_model(F_model, ann_model=group_model, **args)
CMOA_model.fit(X_train, y_cat_var, A_idx_var)

Train with different settings (you must create the keras model again)
from codeE.methods import ModelInf_EM_CMOA as CMOA
CMOA_model = CMOA(M=3, init_Z='softmv', n_init_Z=0, n_init_G=0, priors=1)
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
CMOA_model.set_model(F_model, ann_model=group_model, **args)
CMOA_model.fit(X_train, y_cat_var, A_idx_var, runs=20)

Get the base model to predict the ground truth on some data
cmoaK_fx = CMOA_model.get_basemodel()
cmoaK_fx.predict(new_X)

Get the individual confusion matrices for every annotator
A = np.unique(np.concatenate(A_idx_var)).reshape(-1,1)
prob_Yzt = CMOA_model.get_ann_confusionM(A)

| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_groupmodel | Returns the group model that assigns annotators to groups |
| get_confusionM | Returns the modeled confusion matrices |
| get_qestimation | Returns the estimation over the auxiliary model Q |
| set_model | Set the predictive model (base model) and group model |
| set_ann_model | Set the group model |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| multiples_run | Perform multiple runs of the EM algorithm and save the best |
| fit | same as multiples_run |
| get_global_confusionM | Returns the estimation of global confusion matrix |
| get_ann_confusionM | Returns the estimation of individual confusion matrix of the annotators |
| get_predictions_z | Returns the probability predictions of ground truth over some set |
| get_predictions_g | Returns the probability predictions of the groups over the annotators |
| get_predictions_groups | Returns the probability estimation of labels over the modeled groups |
get_basemodel()Returns
get_groupmodel()Returns
get_confusionM()Returns
get_qestimation()
Returns
- Qil_mk: array of shape (n_samples,) of arrays of shape (n_annotations(i), n_groups, n_classes)
The estimation over the auxiliary model Q.
set_model(model, optimizer="adam", epochs=1, batch_size=32, ann_model=None)
Set the predictive model (base model) over the ground truth and define how to optimize it in the M-step of the iterative EM algorithm. It also sets the group model that assigns annotators to groups.
Parameters
- model: function or class of keras model
Base model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
Name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/
- epochs: int, default=1
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
- ann_model: function or class of keras model
Group model based on Keras.
set_ann_model(model, optimizer=None, epochs=None)
Set the group model that assigns annotators to groups.
Parameters
- model: function or class of keras model
Group model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default=None
Name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/. If None, the optimizer of the base model is used.
- epochs: int, default=None
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/. If None, the number of epochs of the base model is used.
set_priors(priors)
Parameters
- priors: same formats as the priors parameter of the class constructor (init)
init_E(X, y_ann_var, A_idx_var, method="")Initialization of the E-step, based on the following approximation:
.
The initialization of the groups g is based on K-means clustering, and the initialization of the ground truth z is based on method.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i), n_classes)
Annotations of the data, in a categorical representation of variable length that only includes the annotators who labeled each sample.
- A_idx_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i),)
Identifier of the annotator of each annotation in y_ann_var.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities for the EM algorithm. Both 'softmv' and 'hardmv' are based on the LabelAgg class. The empty string uses the method set in the constructor.
E_step(X, y_ann_flatten, A_idx_flatten, predictions_Z=[], predictions_G=[])
Perform the inference on the E-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann_flatten: array-like of shape (n_annotations, n_classes)
Annotations of the data, the flattened format of the variable-length categorical representation y_ann_var.
- A_idx_flatten: array-like of shape (n_annotations, )
Identifier of the annotators on the data, the flattened format of A_idx_var.
- predictions_Z: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth on the training set. If X is given, this parameter is not necessary (default=[]).
- predictions_G: array-like of shape (n_annotators, n_groups)
Probability predictions of the groups on the annotators. If A_idx_flatten is given, this parameter is not necessary (default=[]).
M_step(X, y_ann_flatten, A_idx_flatten)
Perform the inference on the M-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann_flatten: array-like of shape (n_annotations, n_classes)
Annotations of the data, the flattened format of the variable-length categorical representation y_ann_var.
- A_idx_flatten: array-like of shape (n_annotations, )
Identifier of the annotators on the data, the flattened format of A_idx_var.
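The relation between the variable-length and the flattened formats can be shown with NumPy. The toy data is made up, and `sample_idx` is a hypothetical helper array that is not part of the documented interface.

```python
import numpy as np

# Variable-length representation: one array per sample, one row per annotation.
y_ann_var = [
    np.array([[1, 0], [0, 1]]),           # sample 0: two annotations
    np.array([[0, 1]]),                   # sample 1: one annotation
    np.array([[1, 0], [1, 0], [0, 1]]),   # sample 2: three annotations
]
A_idx_var = [np.array([0, 2]), np.array([1]), np.array([0, 1, 2])]

# Flattened format: all annotations stacked along the first axis.
y_ann_flatten = np.concatenate(y_ann_var, axis=0)   # (n_annotations, n_classes)
A_idx_flatten = np.concatenate(A_idx_var, axis=0)   # (n_annotations,)

# Index of the sample that each flattened annotation belongs to.
sample_idx = np.repeat(np.arange(len(y_ann_var)), [len(a) for a in A_idx_var])
```

Keeping an index such as `sample_idx` makes it possible to map per-annotation quantities back to their samples.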
compute_logL()Calculate the log-likelihood of the data.
train(X_train, y_ann_var, A_idx_var, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i), n_classes)
Annotations of the data, in a categorical representation of variable length that only includes the annotators who labeled each sample.
- A_idx_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i),)
Identifier of the annotator of each annotation in y_ann_var.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
multiples_run(Runs, X, y_ann_var, A_idx_var, max_iter=50, tolerance=3e-2)
Perform multiple runs of the EM algorithm and save the best execution based on the log-likelihood.
Parameters
- Runs: int
The number of times the EM algorithm is run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i), n_classes)
Annotations of the data, in a categorical representation of variable length that only includes the annotators who labeled each sample.
- A_idx_var: array-like of shape (n_samples,) of arrays of shape (n_annotations(i),)
Identifier of the annotator of each annotation in y_ann_var.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
Returns
- found_logL: list of length=Runs
A list with the history of the log-likelihood for each iteration in the different runs.
- best_run: int
The index of the best run among all those executed, a number between 0 and Runs-1.
fit(X, Y, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the multiples_run function.
get_predictions_z(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set.
get_predictions_g(A)
Parameters
- A: array-like of shape (n_annotators_pred, 1)
The identifiers of the n_annotators_pred annotators to be assigned to groups.
Returns
- prob_G_hat: array-like of shape (n_annotators_pred, n_groups)
The probability predictions over the groups for the given annotators.
get_global_confusionM(prob_Gt)
Parameters
- prob_Gt: array-like of shape (n_annotators, n_groups)
Probabilities of the annotators over the groups.
Returns
get_ann_confusionM(A)
Parameters
- A: array-like of shape (n_annotators_pred, 1)
The identifiers of the n_annotators_pred annotators.
Returns
- prob_Y_Zt: array-like of shape (n_annotators_pred, n_classes, n_classes)
The estimation of the individual confusion matrices of the n_annotators_pred annotators.
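Since the method has no explicit model per annotator, one natural reading is that an individual confusion matrix is the mixture of the group confusion matrices, weighted by the group probabilities of that annotator. The sketch below assumes this reading; the function `annotator_confusions` is hypothetical and is not the library implementation.

```python
import numpy as np

def annotator_confusions(prob_G, group_conf):
    """Mix group confusion matrices with the group probabilities of each annotator.

    prob_G:     (n_annotators, n_groups) probabilities of each annotator over groups.
    group_conf: (n_groups, n_classes, n_classes) confusion matrix of each group.
    Returns an array of shape (n_annotators, n_classes, n_classes).
    """
    return np.einsum("tg,gzy->tzy", np.asarray(prob_G), np.asarray(group_conf))

prob_G_example = np.array([[1.0, 0.0], [0.5, 0.5]])
group_conf_example = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # group 0: always reports the ground truth
    [[0.5, 0.5], [0.5, 0.5]],   # group 1: labels at random
])
prob_Y_Zt_example = annotator_confusions(prob_G_example, group_conf_example)
```

An annotator fully assigned to one group inherits the matrix of that group, while a mixed annotator gets an interpolation between the group behaviors.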
get_predictions_groups(X, data=[])
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
- data: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth over some set, if they are already available (default=[]).
Returns
- prob_Y_xg: array-like of shape (n_samples, n_groups, n_classes)
The probability estimation of the labels over some set for each group.
class codeE.methods.ModelInf_EM_G(init_Z="softmv", n_init_Z=0, priors=0, DTYPE_OP='float32')
Inspired by a solution to the label noise problem [2], the NLNN (Noisy Labels Neural-Network) [1] is an EM solution based on a single confusion matrix to infer a predictive model of the ground truth. The NLNN, with some minor modifications, is applied to crowdsourcing in the global scenario, where the noisy channel refers to the global confusion matrix.
This method can be referred to as global label noise, global learning from crowds or learning from global annotations.
It is proposed on the global representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv','model'}, default='softmv'
The method used to initialize the ground truth probabilities for the EM algorithm. The 'softmv' and 'hardmv' options are based on the LabelAgg class. The 'model' option trains the predictive model over hardmv for n_init_Z epochs and uses its predictions to initialize the ground truth.
- n_init_Z: int, default=0
The number of epochs that the predictive model is pre-trained, only used if init_Z='model'.
- priors: different options
The priors to be set on the confusion matrices of annotators. They can be given in different formats:
  - string, {'laplace','none'}
  'laplace' stands for Laplace smoothing, a prior with the value of 1.
  - int
  A number of annotations to be set as prior for all the groups over all the data.
  - array-like of shape (n_groups,)
  A vector with the number of annotations to be set as prior for every possible group on the data.
  - array-like of shape (n_groups, n_classes)
  A matrix with the number of annotations to be set as prior for every group and every ground truth label on the data.
  - array-like of shape (n_groups, n_classes, n_classes)
  A cube with the number of annotations to be set as prior for every group, every ground truth label and every observed label on the data.
Comments on the priors: the Laplace smoothing prior helps to stabilize training and speeds up convergence. The trade-off is a slightly worse estimation of the ground truth.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
[1] Bekker, A. J., & Goldberger, J. (2016, March). Training deep neural-networks based on unreliable labels
[2] Frénay, B., & Kabán, A. (2014, April). A comprehensive introduction to label noise.
X_train = ...
... # read some data
from codeE.representation import set_representation
r_obs = set_representation(y_obs, "global")

Define the predictive model (based on keras)
from keras.models import Sequential
F_model = Sequential()
... # add layers

Set the model and train the method
from codeE.methods import ModelInf_EM_G as G_Noise
GNoise_model = G_Noise()
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
GNoise_model.set_model(F_model, **args)
GNoise_model.fit(X_train, r_obs)

Train with different settings
from codeE.methods import ModelInf_EM_G as G_Noise
GNoise_model = G_Noise(init_Z='model', n_init_Z=3, priors=0)
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}
GNoise_model.set_model(F_model, **args)
GNoise_model.fit(X_train, r_obs, runs=20)

Get the base model to predict the ground truth on some data
G_fx = GNoise_model.get_basemodel()
G_fx.predict(new_X)

| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_confusionM | Returns the unique confusion matrix modeled |
| get_qestimation | Returns the estimation over the auxiliary model Q |
| set_model | Set the predictive model (base model) |
| set_priors | To set the priors used on confusion matrices |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| multiples_run | Perform multiple runs of the EM algorithm and save the best |
| fit | same as multiples_run |
| get_global_confusionM | Returns the estimation of global confusion matrix |
| get_predictions | Returns the probability predictions of ground truth over some set |
| get_predictions_global | Returns the probability estimation of labels over crowdsourcing scenario |
get_basemodel()Returns
get_confusionM()Returns
get_qestimation()
Returns
- Qi_k: array-like of shape (n_samples, n_classes)
The estimation over the auxiliary model Q.
set_model(model, optimizer="adam", epochs=1, batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it in the M-step of the iterative EM algorithm.
Parameters
- model: function or class of keras model
Base model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
Name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/
- epochs: int, default=1
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
set_priors(priors)
Parameters
- priors: same formats as the priors parameter of the class constructor (init)
init_E(X, r_ann, method="")
Initialization of the E-step based on method.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities for the EM algorithm. Both 'softmv' and 'hardmv' are based on the LabelAgg class. The empty string uses the method set in the constructor.
E_step(X, predictions=[])
Perform the inference on the E-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- predictions: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth on the training set. If X is given, this parameter is not necessary (default=[]).
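For a model with a single confusion matrix, an E-step of this kind typically combines the model predictions with the likelihood of the observed annotation counts. The following NumPy sketch assumes that form; `e_step_global` is a hypothetical function, the confusion matrix is passed explicitly, and the exact expression used by the library may differ.

```python
import numpy as np

def e_step_global(prob_Z, r_ann, conf, eps=1e-12):
    """Posterior over the ground truth given predictions and annotation counts.

    prob_Z: (n_samples, n_classes) predictions of the base model.
    r_ann:  (n_samples, n_classes) annotation counts (global representation).
    conf:   (n_classes, n_classes) confusion matrix, rows = ground truth.
    """
    # log q_ik = log p(z=k | x_i) + sum_j r_ij * log conf[k, j]
    log_q = np.log(prob_Z + eps) + r_ann @ np.log(conf + eps).T
    log_q -= log_q.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)

prob_Z_demo = np.array([[0.5, 0.5]])
r_ann_demo = np.array([[3.0, 0.0]])
conf_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
q_demo = e_step_global(prob_Z_demo, r_ann_demo, conf_demo)
```

Working in the log domain avoids underflow when a sample has many annotations.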
M_step(X, r_ann)
Perform the inference on the M-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
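In an M-step of this kind, the confusion matrix usually has a closed-form update from the current ground truth estimation, while the base model is refit with Keras on that estimation. The sketch below covers only the confusion matrix part and is an assumption about its form; `m_step_confusion` is a hypothetical function, and the prior is taken as a scalar pseudo-count.

```python
import numpy as np

def m_step_confusion(q, r_ann, prior=0.0):
    """Update a single confusion matrix from soft ground truth assignments.

    q:     (n_samples, n_classes) current estimation of the ground truth.
    r_ann: (n_samples, n_classes) annotation counts (global representation).
    prior: pseudo-count added to every entry (1.0 gives Laplace smoothing).
    """
    counts = q.T @ r_ann + prior   # rows: ground truth, columns: observed label
    return counts / counts.sum(axis=1, keepdims=True)

q_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
r_ann_toy = np.array([[3.0, 1.0], [0.0, 4.0]])
conf_plain = m_step_confusion(q_toy, r_ann_toy)
conf_laplace = m_step_confusion(q_toy, r_ann_toy, prior=1.0)
```

Comparing the two results shows how the prior pulls the rows toward the uniform distribution.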
compute_logL()Calculate the log-likelihood of the data.
train(X_train, r_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
multiples_run(Runs, X, r_ann, max_iter=50, tolerance=3e-2)
Perform multiple runs of the EM algorithm and save the best execution based on the log-likelihood.
Parameters
- Runs: int
The number of times the EM algorithm is run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
Returns
- found_logL: list of length=Runs
A list with the history of the log-likelihood for each iteration in the different runs.
- best_run: int
The index of the best run among all those executed, a number between 0 and Runs-1.
fit(X, R, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the multiples_run function.
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set.
get_global_confusionM()Returns
get_predictions_global(X, data=[])
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
- data: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth over some set, if they are already available (default=[]).
Returns
- prob_Y_x: array-like of shape (n_samples, n_classes)
The probability estimation of the crowdsourcing labels over some set.
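With a single confusion matrix, the distribution of the crowdsourced labels follows from marginalizing the ground truth. The sketch below assumes this marginalization; `predict_global` is a hypothetical function that takes the confusion matrix explicitly and is not the library method.

```python
import numpy as np

def predict_global(prob_Z, conf):
    """p(y | x) = sum_z p(z | x) * p(y | z) for a single global confusion matrix."""
    return np.asarray(prob_Z) @ np.asarray(conf)

prob_Z_new = np.array([[1.0, 0.0], [0.5, 0.5]])
conf_global = np.array([[0.9, 0.1], [0.2, 0.8]])
prob_Y_x_new = predict_global(prob_Z_new, conf_global)
```

Each output row is still a probability distribution, because both the predictions and the rows of the confusion matrix sum to one.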
class codeE.methods.ModelInf_EM_R(init_R='original', DTYPE_OP='float32')
This method sets a predictive model of the ground truth that every annotator can identify depending on their reliability. It represents the reliability of each annotator over each data point by a fixed probability.
The original MA-LR (Multiple-Annotator Logistic Regression) was proposed by Rodrigues et al. [1].
It is proposed on the individual dense representation (for further details see the representation documentation).
Parameters
- init_R: string, {'softmv','hardmv','original','simple'}, default='original'
The method used to initialize the reliability probabilities for the EM algorithm, for each annotator over each data point. The 'original' and 'simple' options are equivalent: all probabilities are initialized as 1, i.e. every annotator is assumed to be trustworthy. The 'softmv' and 'hardmv' options are based on the LabelAgg class and are soft versions of the above.
- DTYPE_OP: string, default='float32'
dtype of the numpy arrays, restricted to https://numpy.org/devdocs/user/basics.types.html
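The reliability initializations can be sketched as below. Only the 'original'/'simple' case (all ones) is stated in the description; the majority-voting variants are an assumption made for illustration, in which the reliability of an annotation is the majority-voting probability of the class the annotator chose. The function `init_reliability` is hypothetical.

```python
import numpy as np

def init_reliability(y_ann, method="original"):
    """Initial reliability of shape (n_samples, n_annotators, 1) (illustrative).

    y_ann: (n_samples, n_annotators, n_classes) one-hot annotations,
           all zeros where an annotator did not label the sample.
    """
    y_ann = np.asarray(y_ann, dtype="float32")
    if method in ("original", "simple"):
        # Every annotator is assumed trustworthy on every sample.
        return np.ones(y_ann.shape[:2] + (1,), dtype="float32")
    votes = y_ann.sum(axis=1)   # (n_samples, n_classes)
    if method == "softmv":
        mv = votes / votes.sum(axis=1, keepdims=True)
    elif method == "hardmv":
        mv = np.eye(y_ann.shape[2], dtype="float32")[votes.argmax(axis=1)]
    else:
        raise ValueError(f"unknown method: {method}")
    # Assumption: reliability = majority-voting probability of the chosen class.
    return (y_ann * mv[:, None, :]).sum(axis=2, keepdims=True)

# One sample, three annotators, two classes.
y_ann_demo = np.array([[[1, 0], [1, 0], [0, 1]]])
R_original = init_reliability(y_ann_demo, "original")
R_soft = init_reliability(y_ann_demo, "softmv")
R_hard = init_reliability(y_ann_demo, "hardmv")
```

Under this reading, annotators who agree with the majority start with a higher reliability than those who disagree.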
X_train = ...
... # read some data
from codeE.representation import set_representation
y_obs_categorical = set_representation(y_obs, 'onehot')

Define the predictive model (based on keras)
from keras.models import Sequential
F_model = Sequential()
... # add layers
args = {'epochs':1, 'batch_size':BATCH_SIZE, 'optimizer':OPT}

Set the model and train the method
from codeE.methods import ModelInf_EM_R as MA_DL
MA_model = MA_DL()
MA_model.set_model(F_model, **args)
MA_model.fit(X_train, y_obs_categorical)

Train with different settings
from codeE.methods import ModelInf_EM_R as MA_DL
MA_model = MA_DL(init_R="softmv")
MA_model.set_model(F_model, **args)
MA_model.fit(X_train, y_obs_categorical, runs=20)

Get the base model to predict the ground truth on some data
ma_fx = MA_model.get_basemodel()
ma_fx.predict(new_X)

| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_b | Returns the modeled reliability of annotators |
| get_restimation | Returns the estimation over the auxiliary variable R |
| set_model | Set the predictive model (base model) class |
| init_E | Initialization of the E-step |
| E_step | Perform the inference on the E-step |
| M_step | Perform the inference on the M-step |
| compute_logL | Calculate the log-likelihood of the data |
| train | Perform all the inference based on EM algorithm |
| multiples_run | Perform multiple runs of the EM algorithm and save the best |
| fit | same as multiples_run |
| get_ann_rel | Returns the estimation of the probabilistic reliability of each annotator |
| get_predictions | Returns the probability predictions of ground truth over some set |
get_basemodel()Returns
get_b()Returns
get_restimation()
Returns
- Ri_l: array-like of shape (n_samples, n_annotators, 1)
The estimation over the auxiliary variable R.
set_model(model, optimizer="adam", epochs=1, batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it in the M-step of the iterative EM algorithm.
Parameters
- model: function or class of keras model
Predictive model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
Name of the optimizer used for back-propagation, based on https://keras.io/api/optimizers/
- epochs: int, default=1
Number of epochs (iterations over the entire set) to train the model, based on https://keras.io/api/models/model_training_apis/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
init_E(X, y_ann, method="")
Initialization of the E-step based on method.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- method: string, {'softmv','hardmv','original','simple',''}, default=''
The method used to initialize the reliability probabilities for the EM algorithm, for each annotator over each data point. The 'original' and 'simple' options are equivalent: all probabilities are initialized as 1, i.e. every annotator is assumed to be trustworthy. The 'softmv' and 'hardmv' options are based on the LabelAgg class and are soft versions of the above. The empty string uses the method set in the constructor.
E_step(X, y_ann, predictions=[])
Perform the inference on the E-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- predictions: array-like of shape (n_samples, n_classes)
Probability predictions of the ground truth on the training set. If X is given, this parameter is not necessary (default=[]).
M_step(X, y_ann)
Perform the inference on the M-step.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
compute_logL()Calculate the log-likelihood of the data.
train(X_train, y_ann, max_iter=50, tolerance=3e-2)
Perform all the inference based on the EM algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
multiples_run(Runs, X, y_ann, max_iter=50, tolerance=3e-2)
Perform multiple runs of the EM algorithm and save the best execution based on the log-likelihood.
Parameters
- Runs: int
The number of times the EM algorithm is run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of iterations between the E and M steps.
- tolerance: float, default=3e-2
The maximum relative difference of the parameters and the loss between consecutive iterations for the training to be considered converged.
Returns
- found_logL: list of length=Runs
A list with the history of the log-likelihood for each iteration in the different runs.
- best_run: int
The index of the best run among all those executed, a number between 0 and Runs-1.
fit(X, Y, runs=1, max_iter=50, tolerance=3e-2)
Same operation as the multiples_run function.
get_ann_rel()
Returns
- prob_R_t: array-like of shape (n_annotators, 1)
The estimation of the probabilistic reliability of each annotator.
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over some set.
class codeE.methods.ModelInf_BP(init_Z='softmv', n_init_Z=0, prior_lamb=0, init_conf="default")
This method is an extension of Raykar et al. that avoids the use of the EM algorithm by encoding the confusion matrix as weights (based on a CrowdLayer [1]) of a larger neural network, named the crowd model, which combines the predictive model with the confusion weights.
The original method (deep learning from crowds), which trains the crowd model only by backpropagation (BP), was proposed by Rodrigues & Pereira [1].
It is proposed on the global representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv'}, default='softmv'
The method used to initialize the ground truth probabilities for the pre-initialization of the base model. The 'softmv' and 'hardmv' options are based on the LabelAgg class. Only used if n_init_Z!=0.
- n_init_Z: int, default=0
The number of epochs that the base predictive model is pre-trained (pre-initialized).
- prior_lamb: float, default=0
The hyper-parameter used in the loss function to weight the prior on the opinion of the majority.
- init_conf: string, {'default','','model','soft'}, default='default'
The method used to initialize the confusion matrix weights inside the model. 'default' or the empty string uses the initialization originally proposed in [1], the identity matrix. 'soft' is a soft version of the identity matrix based on a 15% noise level. 'model' uses the confusion matrix of the pre-trained base model as initialization (proposed in Goldberger & Ben-Reuven), and is only available if n_init_Z!=0.
[1] Rodrigues, F., & Pereira, F. (2017). Deep learning from crowds.
[2] Github code - fmpr/CrowdLayer
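To make the 'default' and 'soft' options of init_conf concrete, the following is a minimal NumPy sketch, not the library's implementation. It assumes that the 15% noise mass is spread uniformly over the off-diagonal entries and that one matrix is stacked per annotator; the helper names soft_identity and init_confusion_weights are hypothetical.

```python
import numpy as np

def soft_identity(n_classes, noise=0.15):
    """Identity matrix with `noise` mass spread uniformly off the diagonal."""
    eye = np.eye(n_classes)
    off_diag = noise / (n_classes - 1)
    return (1.0 - noise) * eye + off_diag * (1.0 - eye)

def init_confusion_weights(n_classes, n_annotators, kind="default"):
    """Stack one initial matrix per annotator: shape (n_annotators, K, K)."""
    base = soft_identity(n_classes) if kind == "soft" else np.eye(n_classes)
    return np.repeat(base[None, :, :], n_annotators, axis=0)

W_soft = init_confusion_weights(n_classes=3, n_annotators=4, kind="soft")
W_default = init_confusion_weights(n_classes=3, n_annotators=4)
```

With three classes, each row of the soft matrix keeps 0.85 on the diagonal and 0.075 on each off-diagonal entry, so every row still sums to one.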
X_train = ...
... #read some data
from codeE.representation import set_representation
y_obs_categorical = set_representation(y_obs,'onehot')

Define predictive model (based on keras):
F_model = Sequential()
... #add layers

Set the model and train the method:
from codeE.methods import ModelInf_BP as Rodrigues18
Ro_model = Rodrigues18()
args = {'batch_size':BATCH_SIZE, 'optimizer':OPT}
Ro_model.set_model(F_model, **args)
Ro_model.fit(X_train, y_obs_categorical)

Train with different settings:
from codeE.methods import ModelInf_BP as Rodrigues18
Ro_model = Rodrigues18(init_Z='softmv', n_init_Z=3, init_conf="model")
args = {'batch_size':BATCH_SIZE, 'optimizer':OPT}
Ro_model.set_model(F_model, **args)
Ro_model.fit(X_train, y_obs_categorical, runs=10)

Get the base model to predict the ground truth on some data:
learned_fx = Ro_model.get_basemodel()
learned_fx.predict(new_X)

| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_confusionM | Returns the modeled weights of confusion matrices |
| set_model | Sets the predictive model (base model) class |
| set_crowdL_model | Sets the auxiliary crowd model for learning from crowds |
| init_model | Initializes the model weights |
| train | Performs the learning based on the backpropagation algorithm |
| multiples_run | Performs multiple runs of the learning and saves the best weights |
| fit | Same as multiples_run |
| get_ann_confusionM | Returns the estimation of individual confusion matrices |
| get_predictions_annot | Returns the probability estimation of labels over some set for each annotator |

get_basemodel()
Returns
The predictive model of the ground truth.
get_confusionM()
Returns
- betas: array-like of shape (n_annotators, n_classes, n_classes)
The modeled weights of the confusion matrices in the auxiliary neural network, not bounded as probabilities, i.e. they do not sum to one over the observed labels j.
set_model(model, optimizer="adam", batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it with backpropagation.
Parameters
- model: function or class of keras model
Predictive model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
String name of the optimizer used for the backpropagation updates, based on https://keras.io/api/optimizers/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
set_crowdL_model(set_w = False, weights=0)
Set the auxiliary crowd model for learning from crowds, based on neural networks.
Parameters
- set_w: boolean, default=False
Whether a weight matrix is provided to initialize the confusion weights.
- weights: array-like of shape (n_classes, n_classes, n_annotators), default=0
The confusion matrix weights used as initialization values in the auxiliary crowd model. Only used if set_w=True.
init_model(X, y_ann, method="")
Initialization of the neural network weights.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
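The 'softmv' and 'hardmv' initializations can be illustrated with a small NumPy sketch of soft and hard majority voting over one-hot annotations. This is not the LabelAgg implementation: the function name majority_voting is hypothetical, and the sketch assumes that a missing annotation is encoded as an all-zero row and that every sample has at least one annotation.

```python
import numpy as np

def majority_voting(y_ann, mode="softmv"):
    """y_ann: (n_samples, n_annotators, n_classes) one-hot; all-zero row = no label."""
    counts = y_ann.sum(axis=1)  # votes per class, shape (n_samples, n_classes)
    if mode == "softmv":
        # Soft MV: relative frequency of each class among the given labels
        return counts / counts.sum(axis=1, keepdims=True)
    # Hard MV: one-hot vector on the most voted class
    hard = np.zeros_like(counts, dtype=float)
    hard[np.arange(len(counts)), counts.argmax(axis=1)] = 1.0
    return hard

# Toy annotations: 2 samples, 3 annotators, 3 classes (-1 = annotator gave no label)
labels = np.array([[0, 0, 2], [1, -1, 1]])
y_ann = np.zeros((2, 3, 3))
for i, row in enumerate(labels):
    for t, k in enumerate(row):
        if k >= 0:
            y_ann[i, t, k] = 1.0

soft = majority_voting(y_ann, "softmv")
hard = majority_voting(y_ann, "hardmv")
```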
train(X_train, y_ann, max_iter=50,tolerance=1e-2)
Perform the learning of the neural network weights based on the backpropagation algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of training iterations.
- tolerance: float, default=1e-2
Convergence threshold on the relative difference in the loss between consecutive learning iterations.
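One way to read the max_iter and tolerance parameters is as a standard convergence loop. The sketch below assumes that training stops once the relative change in loss between consecutive iterations is at most tolerance; the function train_until_converged and its step callback are hypothetical and not part of codeE.

```python
def train_until_converged(step, max_iter=50, tolerance=1e-2):
    """Call `step()` (one training pass returning the loss) until the loss stabilizes."""
    history = []
    for _ in range(max_iter):
        history.append(step())
        if len(history) > 1:
            prev, curr = history[-2], history[-1]
            # Relative change of the loss with respect to the previous iteration
            rel_diff = abs(curr - prev) / max(abs(prev), 1e-12)
            if rel_diff <= tolerance:
                break
    return history

# Toy loss sequence standing in for real training passes
losses = iter([1.0, 0.5, 0.4, 0.399, 0.398])
history = train_until_converged(lambda: next(losses))
```

In this toy sequence the loop stops after the fourth pass, where the relative change drops to 0.25%.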
multiples_run(Runs, X, y_ann, max_iter=50, tolerance=1e-2)
Performs multiple runs of the neural network learning and saves the best weights.
Parameters
- Runs: int
The number of times the training will be run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- y_ann: array-like of shape (n_samples, n_annotators, n_classes)
Annotations of the data, in the individual one-hot (categorical) representation.
- max_iter: int, default=50
The maximum number of training iterations.
- tolerance: float, default=1e-2
Convergence threshold on the relative difference in the loss between consecutive learning iterations.
Returns
- found_loss: list of length=Runs
A list with the loss history of each iteration in the different runs.
- best_run: int
The index of the best run among all executed, a number between 0 and Runs-1.
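The relation between found_loss and best_run can be sketched as follows. The selection rule used here (lowest final loss) is an assumption for illustration, since the criterion that codeE applies is not stated above, and the loss values are made up.

```python
import numpy as np

# Hypothetical loss histories, one list per run (runs may stop at different iterations)
found_loss = [
    [1.20, 0.80, 0.61],
    [1.10, 0.70, 0.52, 0.50],
    [1.30, 0.90, 0.75],
]

# Assumed criterion: the best run is the one with the lowest final loss
final_losses = [history[-1] for history in found_loss]
best_run = int(np.argmin(final_losses))  # index between 0 and Runs-1
```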
fit(X,Y, runs = 1, max_iter=50, tolerance=1e-2)
Same operation as the multiples_run function.
get_ann_confusionM(norm="")
Parameters
- norm: string, {'softmax', '01', ''}, default=''
The normalization method used to obtain the individual confusion matrix estimations. The empty string applies no normalization step, 'softmax' uses the softmax function, and '01' uses a 0-1 range scaler as proposed in [2].
Returns
- prob_Y_Zt: array-like of shape (n_annotators, n_classes, n_classes)
The estimated individual confusion matrix of each annotator.
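The effect of the norm options can be sketched in NumPy. This is an illustration under assumptions, not the library code: it takes the last axis as the observed label j, and it reads '01' as a per-row min-max scaling followed by dividing each row by its sum, which may differ from the scaler in [2]. The function name normalize_confusion is hypothetical.

```python
import numpy as np

def normalize_confusion(betas, norm=""):
    """betas: (n_annotators, n_classes, n_classes); last axis = observed label."""
    if norm == "softmax":
        shifted = betas - betas.max(axis=-1, keepdims=True)  # numerical stability
        expo = np.exp(shifted)
        return expo / expo.sum(axis=-1, keepdims=True)
    if norm == "01":
        lo = betas.min(axis=-1, keepdims=True)
        hi = betas.max(axis=-1, keepdims=True)
        scaled = (betas - lo) / np.maximum(hi - lo, 1e-12)  # each row mapped to [0, 1]
        return scaled / np.maximum(scaled.sum(axis=-1, keepdims=True), 1e-12)
    return betas  # no normalization: raw, unbounded weights

# Toy unbounded weights for a single annotator and two classes
betas = np.array([[[2.0, 0.0], [-1.0, 1.0]]])
conf_softmax = normalize_confusion(betas, "softmax")
conf_01 = normalize_confusion(betas, "01")
```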
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over the given data.
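As a usage sketch, the probability predictions can be turned into hard labels with an argmax. The arrays below are made-up stand-ins for the output of get_predictions and for true labels used in evaluation.

```python
import numpy as np

# Hypothetical stand-in for get_predictions(X): shape (n_samples, n_classes)
prob_Z_hat = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.4, 0.5, 0.1]])

z_hat = prob_Z_hat.argmax(axis=1)    # hard label per sample
confidence = prob_Z_hat.max(axis=1)  # probability of the chosen class

# If true labels are available for evaluation, accuracy is a simple mean
z_true = np.array([0, 2, 0])
accuracy = float((z_hat == z_true).mean())
```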
get_predictions_annot(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Y_xt: array-like of shape (n_samples, n_annotators, n_classes)
The probability estimation of labels over the given data for each annotator.
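Conceptually, per-annotator label probabilities combine the base model output with each annotator's confusion matrix. The sketch below shows that combination as a marginalization over the ground truth, assuming row-stochastic confusion matrices; the actual crowd layer works with unbounded weights, so this is an illustration rather than the library's computation. The function name annotator_predictions is hypothetical.

```python
import numpy as np

def annotator_predictions(prob_Z, conf_mats):
    """
    prob_Z:    (n_samples, n_classes)               base-model probabilities p(z | x)
    conf_mats: (n_annotators, n_classes, n_classes) row-stochastic p(y = j | z = k, t)
    returns:   (n_samples, n_annotators, n_classes) p(y = j | x, t)
    """
    # Sum over the ground-truth class k for every sample i and annotator t
    return np.einsum("ik,tkj->itj", prob_Z, conf_mats)

prob_Z = np.array([[0.9, 0.1], [0.2, 0.8]])
conf_mats = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # perfect annotator
    [[0.5, 0.5], [0.5, 0.5]],  # random annotator
])
prob_Y_xt = annotator_predictions(prob_Z, conf_mats)
```

The perfect annotator reproduces the base model probabilities, while the random annotator yields a uniform distribution regardless of the input.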
class codeE.methods.ModelInf_BP_G(init_Z='softmv', n_init_Z= 0, init_conf = "default")
This method is an extension of Global Model - Label Noise that avoids the EM algorithm by encoding the noise channel as weights (based on a NoiseLayer) of a larger neural network called the crowd noise model. The crowd noise model is composed of the predictive model and the confusion weights, also called noise weights.
The original method (s-model), which trains the crowd noise model only by backpropagation (BP), was proposed by Goldberger & Ben-Reuven [1].
It is proposed on the global representation (for further details see the representation documentation).
Parameters
- init_Z: string, {'softmv','hardmv'}, default='softmv'
The method used to initialize the ground truth probabilities for pre-initializing the base model. The softmv and hardmv possibilities are based on the LabelAgg class. Only used if n_init_Z!=0.
- n_init_Z: int, default=0
The number of epochs for which the base predictive model is pre-trained (pre-initialized).
- init_conf: string, {'default','', 'model'}, default='default'
The method used to initialize the confusion matrix weights inside the model; both options are proposed in [2]. 'default' or the empty string uses a soft identity matrix based on a 15% noise level. 'model' uses the confusion matrix of the pre-trained base model as initialization and is only available if n_init_Z!=0.
[1] Goldberger, J., & Ben-Reuven, E. (2016). Training deep neural-networks using a noise adaptation layer.
[2] Github code - udibr/noisy_labels
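The crowd noise model described above can be sketched as the base model probabilities passed through a single softmax-normalized noise channel. This NumPy forward pass is only an illustration: the function names are hypothetical, the row/column convention (row k = ground truth k) is an assumption, and storing the soft identity in log space is one possible way to start from the 15% noise level.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def crowd_noise_forward(prob_Z, noise_weights):
    """
    prob_Z:        (n_samples, n_classes) base-model probabilities
    noise_weights: (n_classes, n_classes) unconstrained weights; row k = ground truth k
    returns:       (n_samples, n_classes) distribution over the observed (noisy) label
    """
    channel = softmax(noise_weights, axis=1)  # each row becomes a probability vector
    return prob_Z @ channel

n_classes = 3
# Soft identity start: 15% of the mass spread uniformly off the diagonal, in log space
eye = np.eye(n_classes)
init = np.log(0.85 * eye + 0.15 / (n_classes - 1) * (1 - eye))
prob_Z = np.array([[1.0, 0.0, 0.0], [0.2, 0.5, 0.3]])
prob_Y = crowd_noise_forward(prob_Z, init)
```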
X_train = ...
... #read some data
from codeE.representation import set_representation
r_obs = set_representation(y_obs,"global")

Define predictive model (based on keras):
F_model = Sequential()
... #add layers

Set the model and train the method:
from codeE.methods import ModelInf_BP_G as G_Noise
GNoise_model = G_Noise()
args = {'batch_size':BATCH_SIZE, 'optimizer':OPT}
GNoise_model.set_model(F_model, **args)
GNoise_model.fit(X_train, r_obs)

Train with different settings:
from codeE.methods import ModelInf_BP_G as G_Noise
GNoise_model = G_Noise(init_Z='softmv', n_init_Z=3, init_conf="model")
args = {'batch_size':BATCH_SIZE, 'optimizer':OPT}
GNoise_model.set_model(F_model, **args)
GNoise_model.fit(X_train, r_obs, runs=10)

Get the base model to predict the ground truth on some data:
learned_fx = GNoise_model.get_basemodel()
learned_fx.predict(new_X)

| Function | Description |
|---|---|
| get_basemodel | Returns the predictive model of the ground truth |
| get_confusionM | Returns the modeled weights of confusion matrices |
| set_model | Sets the predictive model (base model) class |
| set_crowdL_model | Sets the auxiliary crowd model for learning from crowds |
| init_model | Initializes the model weights |
| train | Performs the learning based on the backpropagation algorithm |
| multiples_run | Performs multiple runs of the learning and saves the best weights |
| fit | Same as multiples_run |
| get_global_confusionM | Returns the estimation of global confusion matrices |
| get_predictions_global | Returns the probability estimation of labels over crowdsourcing scenario |

get_basemodel()
Returns
The predictive model of the ground truth.
get_confusionM()
Returns
- beta: array-like of shape (n_classes, n_classes)
The modeled weights of the confusion matrix in the auxiliary neural network, not bounded as probabilities, i.e. they do not sum to one over the observed labels j.
set_model(model, optimizer="adam", batch_size=32)
Set the predictive model (base model) over the ground truth and define how to optimize it with backpropagation.
Parameters
- model: function or class of keras model
Predictive model based on Keras.
- optimizer: string, {'sgd','rmsprop','adam','adadelta','adagrad'}, default='adam'
String name of the optimizer used for the backpropagation updates, based on https://keras.io/api/optimizers/
- batch_size: int, default=32
Number of samples per gradient update, based on https://keras.io/api/models/model_training_apis/
set_crowdL_model(set_w = False, weights=0)
Set the auxiliary crowd model for learning from crowds, based on neural networks.
Parameters
- set_w: boolean, default=False
Whether a weight matrix is provided to initialize the confusion weights.
- weights: array-like of shape (n_classes, n_classes), default=0
The confusion matrix weights used as initialization values in the auxiliary crowd model. Only used if set_w=True.
init_model(X, r_ann, method="")
Initialization of the neural network weights.
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of the data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- method: string, {'softmv','hardmv',''}, default=''
The method used to initialize the ground truth probabilities. Both possibilities are based on the LabelAgg class. The empty string uses the method set on init.
train(X_train, r_ann, max_iter=50,tolerance=1e-2)
Perform the learning of the neural network weights based on the backpropagation algorithm.
Parameters
- X_train: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of training iterations.
- tolerance: float, default=1e-2
Convergence threshold on the relative difference in the loss between consecutive learning iterations.
multiples_run(Runs, X, y_ann, max_iter=50, tolerance=1e-2)
Performs multiple runs of the neural network learning and saves the best weights.
Parameters
- Runs: int
The number of times the training will be run to obtain different results.
- X: array-like of shape (n_samples, ...)
Input patterns of the training data.
- r_ann: array-like of shape (n_samples, n_classes)
Annotations of the data, in the global representation.
- max_iter: int, default=50
The maximum number of training iterations.
- tolerance: float, default=1e-2
Convergence threshold on the relative difference in the loss between consecutive learning iterations.
Returns
- found_loss: list of length=Runs
A list with the loss history of each iteration in the different runs.
- best_run: int
The index of the best run among all executed, a number between 0 and Runs-1.
fit(X,Y, runs = 1, max_iter=50, tolerance=1e-2)
Same operation as the multiples_run function.
get_global_confusionM(norm="softmax")
Parameters
- norm: string, {'softmax', ''}, default='softmax'
The normalization method used to obtain the global confusion matrix estimation. The empty string applies no normalization step; 'softmax' uses the softmax function.
Returns
The estimation of the global confusion matrix.
get_predictions(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns
- prob_Z_hat: array-like of shape (n_samples, n_classes)
The probability predictions of the ground truth over the given data.
get_predictions_global(X)
Parameters
- X: array-like of shape (n_samples, ...)
Input patterns of some data.
Returns