
Problem Notation

Supervised Scenario

Consider an input pattern $x \in \mathbb{X}$ drawn from a probability distribution $p(x)$ and a ground-truth label $z \in \mathbb{Z}$ drawn from the conditional probability distribution $p(z|x)$.
We are given a finite sample $S=\{\left(x_{i},z_{i}\right)\}_{i=1}^N$, where $\left(x_{i},z_{i}\right) \sim p(x,z)=p(z|x)p(x)$ for all $i \in [N]$.
Objective: estimate a predictive model $f(x)$ that maps $x \mapsto z$, or that learns statistics of $p(z|x)$, in which case $f_k(x) = p(z=k|x)$.
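As a minimal sketch of the two readings of the objective, the snippet below turns per-class scores into $f_k(x) = p(z=k|x)$ and then into a hard prediction $x \mapsto z$. The softmax parameterization, the 0-based class indices, and the `scores` array are assumptions made for illustration; they are not prescribed by the notation.

```python
import numpy as np

def softmax(scores):
    """Map real-valued scores of shape (N, K) to probabilities f_k(x)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical model scores for N=2 inputs and K=3 classes.
scores = np.array([[2.0, 0.0, 0.0],
                   [0.0, 1.0, 3.0]])

f = softmax(scores)        # f[i, k] plays the role of p(z=k | x_i)
z_hat = f.argmax(axis=1)   # hard prediction: the most probable class
```

Each row of `f` sums to one, and `z_hat` picks the class with the largest estimated probability.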

Crowdsourcing Scenario

The objective is the same as in the supervised scenario, but the ground-truth labels $z_{i}$ corresponding to the input patterns $x_{i}$ are not directly observed.
Instead, we observe labels $y \in \mathbb{Z}$ that do not follow the ground-truth distribution $p(z|x)$. They are generated by an unknown process $p(y^{(\ell)}|x,z)$ that represents the ability of the $\ell$-th annotator to detect the ground truth.

Individual

Consider multiple noisy labels $\mathcal{L}_i = \{y_i^{(1)},\ldots, y_i^{(T_i)}\}$ given by $T_i$ annotators for the input pattern $x_i$.
These annotations come from a subset $\mathcal{A}_i$ of the set $\mathcal{A}$ of all annotators participating in the labelling process, with $T = |\mathcal{A}|$ and $T_i = |\mathcal{A}_i|$.
The annotator identity can be defined as an input variable: $a_{i}^{(\ell)} \in \mathcal{A}$, with $\mathcal{A} = \{ 1, \ldots, T\}$.
- Then $p(y^{(\ell)}|x,z)=p(y|x,z, a=\ell)$
We are given a sample $\{(x_i, \mathcal{L}_i )\}_{i=1}^N$ or $\{(x_i, (\mathcal{L}_i, \mathcal{A}_i) )\}_{i=1}^N$, depending on whether the annotator identities are recorded.

Global

Consider that we do not know, or do not care, which annotators provided the labels: we know $|\mathcal{A}_i|$ but not $\mathcal{A}_i$.
Consider instead the number of times that the annotators gave each possible label $j$ to the input pattern $x_i$: $r_{ij} \in \{0,1,\ldots,T_i\}$, so that $\sum_j r_{ij} = T_i$.
We are given a sample $\{ (x_i,r_i) \}_{i=1}^N$, where $r_i$ collects the counts $r_{ij}$ over all labels $j$.
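The global representation can be derived from the individual one by discarding annotator identities and counting labels. The sketch below assumes classes are encoded as integers $0, \ldots, K-1$; the function name `individual_to_global` and the toy data are hypothetical and not part of the library.

```python
import numpy as np

def individual_to_global(labels_per_point, n_classes):
    """Build r, where r[i, j] counts how often label j was given to point i.

    labels_per_point: list of N lists; the i-th list holds the T_i labels
    collected for data point i (the set L_i, without annotator identities).
    """
    r = np.zeros((len(labels_per_point), n_classes), dtype=int)
    for i, labels in enumerate(labels_per_point):
        labels = np.asarray(labels, dtype=int)
        r[i] = np.bincount(labels, minlength=n_classes)
    return r

# Toy example: N=3 data points, K=3 classes, T_i = 3, 1, 4.
L = [[0, 0, 1], [2], [1, 1, 1, 0]]
r = individual_to_global(L, n_classes=3)
T_i = r.sum(axis=1)  # each row of r sums to the number of annotations T_i
```

Here `r` is `[[2, 1, 0], [0, 0, 1], [1, 3, 0]]`, and its row sums recover $T_i$.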

Focus

In this implementation, we study the pattern recognition case, that is, we let $\mathbb{Z}$ be a small set of $K$ categories or classes $\{c_1,c_2,\ldots,c_K\}$.

One can also define two scenarios based on the annotation density and assumptions:

Dense:
- Every annotator labels every data point: $\mathcal{A}_i = \mathcal{A}$
- The implementation is simpler since fixed-size matrices are assumed.
Sparse:
- The number of labels collected per data point and per annotator varies: in general $|\mathcal{A}_i| \neq |\mathcal{A}_j|$ and $|\mathcal{A}_i| < |\mathcal{A}| = T$
- An appropriate implementation leads to computational efficiency.
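To make the two layouts concrete, the sketch below stores annotations in a fixed-size $N \times T$ matrix and converts it to per-point lists $(\mathcal{L}_i, \mathcal{A}_i)$. The `MISSING = -1` sentinel, the 0-based annotator indices, and the helper `to_sparse` are assumptions chosen for illustration, not the library's storage format.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel: annotator t did not label point i

# Fixed-size layout: Y[i, t] is the label given by annotator t to point i.
Y = np.array([[0, 1, MISSING],
              [MISSING, 2, MISSING],
              [1, 1, 0]])

def to_sparse(Y, missing=MISSING):
    """Convert an (N, T) matrix into per-point label and annotator lists."""
    labels, annotators = [], []
    for row in Y:
        idx = np.flatnonzero(row != missing)  # annotators who labeled this point
        annotators.append(idx.tolist())       # A_i
        labels.append(row[idx].tolist())      # L_i, aligned with A_i
    return labels, annotators

L, A = to_sparse(Y)
T_i = [len(a) for a in A]                        # |A_i| for each data point
is_dense = all(t == Y.shape[1] for t in T_i)     # True only if A_i = A for all i
```

In this toy example only the last data point is labeled by all three annotators, so the data set is sparse and the list layout stores 6 labels instead of 9 matrix cells.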

Confusion Matrices

Individual confusion matrix (for an annotator $t$):

$\beta_{k,j}^{(t)} = p(y=j | z=k, a=t)$
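A minimal sketch of the definition: when the ground truth is available (for example on synthetic data), $\beta_{k,j}^{(t)}$ can be estimated by normalized counts. In the crowdsourcing scenario $z$ is not observed, so this is only an illustration of what the matrix represents; the function name and the flat `(z, y, a)` encoding with 0-based indices are assumptions.

```python
import numpy as np

def individual_confusion(z, y, a, n_classes, n_annotators):
    """Estimate beta[t, k, j] ~ p(y=j | z=k, a=t) from one entry per annotation.

    z: ground-truth class of the annotated point, y: label given,
    a: identity of the annotator who gave it.
    """
    z, y, a = (np.asarray(v, dtype=int) for v in (z, y, a))
    counts = np.zeros((n_annotators, n_classes, n_classes))
    np.add.at(counts, (a, z, y), 1)              # accumulate repeated indices
    totals = counts.sum(axis=2, keepdims=True)   # annotations per (t, k)
    # Rows with no observations are left as zeros instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy data: 6 annotations, K=2 classes, T=2 annotators.
z = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 1, 0, 0]
a = [0, 0, 0, 0, 1, 1]
beta = individual_confusion(z, y, a, n_classes=2, n_annotators=2)
```

Annotator 0 is right half of the time on class 0 and always right on class 1, while annotator 1 always answers class 0, which shows up as a second row `[1, 0]` in `beta[1]`.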

Global confusion matrix (for all the annotations):

$\beta_{k,j} = p(y=j | z=k)$
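The global matrix pools all annotations regardless of who gave them, so it can be sketched directly from the counts $r_{ij}$ of the global representation. As before, this assumes a known ground truth $z$ for illustration and 0-based class indices; `global_confusion` is a hypothetical helper.

```python
import numpy as np

def global_confusion(z, r):
    """Estimate beta[k, j] ~ p(y=j | z=k) from ground truth z and counts r.

    z: (N,) ground-truth classes, r: (N, K) with r[i, j] the number of
    times label j was given to data point i.
    """
    z = np.asarray(z, dtype=int)
    r = np.asarray(r)
    n_classes = r.shape[1]
    onehot = np.eye(n_classes)[z]       # (N, K), onehot[i, k] = 1 if z_i = k
    counts = onehot.T @ r               # counts[k, j] = sum_i 1[z_i = k] * r[i, j]
    totals = counts.sum(axis=1, keepdims=True)
    # Classes that never occur in z keep an all-zero row.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy data: N=3 points, K=2 classes.
z = [0, 1, 1]
r = [[2, 1],
     [0, 3],
     [1, 1]]
beta_global = global_confusion(z, r)
```

Each non-empty row of `beta_global` is a probability distribution over the observed labels given the true class.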
