Weak Supervision with Incremental Source Accuracy Estimation

Richard Correro
Abstract

Motivated by the desire to generate labels for real-time data we develop a method to estimate the dependency structure and accuracy of weak supervision sources incrementally. Our method first estimates the dependency structure associated with the supervision sources and then uses this to iteratively update the estimated source accuracies as new data is received. Using both off-the-shelf classification models trained using publicly-available datasets and heuristic functions as supervision sources we show that our method generates probabilistic labels with an accuracy matching that of existing off-line methods.

Index Terms:
Weak Supervision, Transfer Learning, On-line Algorithms.

I Introduction

Weak supervision approaches obtain labels for unlabeled training data using noisier or higher-level sources than traditional supervision [1]. These sources may be heuristic functions, off-the-shelf models, knowledge-base lookups, etc. [2]. By combining multiple supervision sources and modeling their dependency structure, we may infer the true labels from the outputs of the supervision sources.

Problem Setup

In the weak supervision setting we have access to a dataset $X=\{x_{1},\dots,x_{n}\}$ associated with unobserved labels $Y=\{y_{1},\dots,y_{n}\},\ y_{i}\in\{1,\dots,k\}$ and a set of weak supervision sources $p_{i}(y|x),\ i=1,\dots,m$.

We denote the outputs of the supervision sources by $\lambda_{1},\dots,\lambda_{m}$ and let $\boldsymbol{\lambda}_{j}=[\lambda_{1}\ \lambda_{2}\ \dots\ \lambda_{m}]^{T}$ denote the vector of labels associated with example $x_{j}$. The objective is to learn the joint density

$$f(y,\boldsymbol{\lambda})$$

over the sources and the latent label. Using this we may estimate the conditional density

$$f_{Y\mid\Lambda}(y|\boldsymbol{\lambda})=\frac{f_{Y,\Lambda}(y,\boldsymbol{\lambda})}{f_{\Lambda}(\boldsymbol{\lambda})},\quad f_{\Lambda}(\boldsymbol{\lambda})>0. \tag{1}$$
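As a small illustration of (1), the sketch below normalizes a joint table over the latent label and the source outputs into the conditional density. The table values and the function name `posterior_from_joint` are hypothetical and serve only to show the arithmetic; they are not taken from the experiments in this paper.

```python
import numpy as np

def posterior_from_joint(joint):
    """Turn a joint table f(y, lambda) into the conditional f(y | lambda).

    Rows index the latent label y; columns index distinct configurations
    of the source-output vector lambda (a hypothetical encoding).
    """
    joint = np.asarray(joint, dtype=float)
    marginal = joint.sum(axis=0)          # f_Lambda(lambda)
    if np.any(marginal <= 0):
        raise ValueError("equation (1) requires f_Lambda(lambda) > 0")
    return joint / marginal               # divide each column by its marginal

# Made-up joint over 2 classes and 3 source-output configurations.
joint = np.array([[0.30, 0.10, 0.05],
                  [0.10, 0.20, 0.25]])
posterior = posterior_from_joint(joint)
```

Each column of `posterior` is a probability distribution over the latent label for one configuration of source outputs.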

These sources may take many forms but we restrict ourselves to the case in which $\lambda_{i}\in\{0,\dots,k\}$, so that the label functions generate labels belonging to the same domain as $Y$. Here $\lambda_{i}=0$ indicates that the $i^{th}$ source has not generated a label for this example. Such supervision sources may include heuristics such as knowledge base lookups, or pre-trained models.

II Related Work

Varma et al. [3] and Ratner et al. [4] model the joint distribution of $\lambda_{1},\dots,\lambda_{m},Y$ in the classification setting as a Markov Random Field

$$f_{G}(\lambda_{1},\dots,\lambda_{m},y)=\frac{1}{Z}\exp\left(\sum_{\lambda_{i}\in V}\theta_{i}\lambda_{i}+\sum_{(\lambda_{i},\lambda_{j})\in E}\theta_{i,j}\lambda_{i}\lambda_{j}+\theta_{Y}y+\sum_{\lambda_{i}\in V}\theta_{Y,i}y\lambda_{i}\right)$$

associated with graph $G=(V,E)$, where $\theta_{i,j},\ 1\leq i,j\leq m+1$ denote the canonical parameters associated with the supervision sources and $Y$, and $Z$ is a partition function [here $V=\{\lambda_{1},\dots,\lambda_{m}\}\cup\{Y\}$]. If $\lambda_{i}$ is not independent of $\lambda_{j}$ conditional on $Y$ and all sources $\lambda_{k},\ k\in\{1,\dots,m\}\setminus\{i,j\}$, then $(\lambda_{i},\lambda_{j})$ is an edge in $E$.

Let $\Sigma$ denote the covariance matrix of the supervision sources and $Y$. To learn $G$ from the labels

$$O=\{\boldsymbol{\lambda}_{i}:\boldsymbol{\lambda}_{i}=[\lambda_{1},\dots,\lambda_{m}]^{T};\ i=1,\dots,n\}$$

and without the ground truth labels, Varma et al. assume that $G$ is sparse and therefore that the inverse covariance matrix $\Sigma^{-1}$ associated with $\lambda_{1},\dots,\lambda_{m},Y$ is graph-structured. Since $Y$ is a latent variable the full covariance matrix $\Sigma$ is unobserved. We may write the covariance matrix in block-matrix form as follows:

$$\mathrm{Cov}[O\cup S]:=\Sigma=\begin{bmatrix}\Sigma_{O}&\Sigma_{OS}\\ \Sigma_{OS}^{T}&\Sigma_{S}\end{bmatrix}$$

Inverting $\Sigma$, we write

$$\Sigma^{-1}=\begin{bmatrix}K_{O}&K_{OS}\\ K_{OS}^{T}&K_{S}\end{bmatrix}$$

$\Sigma_{O}$ may be estimated empirically:

$$\hat{\Sigma}_{O}=\frac{\Lambda\Lambda^{T}}{n}-\nu\nu^{T}$$

where $\Lambda=[\boldsymbol{\lambda}_{1}\ \boldsymbol{\lambda}_{2}\ \dots\ \boldsymbol{\lambda}_{n}]$ denotes the $m\times n$ matrix of labels generated by the sources and $\nu=\hat{E}[O]\in\mathbb{R}^{m}$ denotes the observed labeling rates.
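The empirical estimate above takes only a few lines of NumPy. The sketch below assumes the labels are stored as an $m\times n$ integer array with one column per example; the function name `empirical_source_moments` and the random label matrix are illustrative assumptions.

```python
import numpy as np

def empirical_source_moments(Lam):
    """Return (nu, Sigma_O_hat) for an m x n label matrix Lam.

    nu is the vector of observed labeling rates E_hat[O] and
    Sigma_O_hat = Lam Lam^T / n - nu nu^T is the empirical covariance.
    """
    Lam = np.asarray(Lam, dtype=float)
    n = Lam.shape[1]
    nu = Lam.mean(axis=1)
    sigma_O = Lam @ Lam.T / n - np.outer(nu, nu)
    return nu, sigma_O

# Hypothetical outputs of m = 3 sources on n = 200 examples, labels in {0, ..., 3}.
rng = np.random.default_rng(0)
Lam = rng.integers(0, 4, size=(3, 200))
nu, sigma_O = empirical_source_moments(Lam)
```

This is the biased (divide by $n$) covariance estimator, so it agrees with `np.cov(Lam, bias=True)`.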

Using the block-matrix inversion formula, Varma et al. show that

$$K_{O}=\Sigma_{O}^{-1}+c\,\Sigma_{O}^{-1}\Sigma_{OS}\Sigma_{OS}^{T}\Sigma_{O}^{-1}$$

where $c=(\Sigma_{S}-\Sigma_{OS}^{T}\Sigma_{O}^{-1}\Sigma_{OS})^{-1}\in\mathbb{R}^{+}$. Letting $z=\sqrt{c}\,\Sigma_{O}^{-1}\Sigma_{OS}$, they write

$$\Sigma_{O}^{-1}=K_{O}-zz^{T}$$

where $K_{O}$ is sparse and $zz^{T}$ is low-rank positive semidefinite. Because $\Sigma_{O}^{-1}$ is the sum of a sparse matrix and a low-rank matrix we may use Robust Principal Components Analysis [5] to solve the following:

$$(\hat{S},\hat{L})=\operatorname{argmin}_{(S,L)}\ \|L\|_{*}+\gamma\|S\|_{1}\quad\text{s.t.}\quad S-L=\hat{\Sigma}_{O}^{-1}$$
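The identity $\Sigma_{O}^{-1}=K_{O}-zz^{T}$ that motivates this decomposition can be checked numerically. The sketch below does so on a random positive definite matrix, assuming a single latent variable so that $\Sigma_{S}$ is a scalar. The matrix is synthetic and dense, so $K_{O}$ is not sparse here; only the algebra is being verified.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                        # number of observed sources
A = rng.normal(size=(m + 1, m + 1))
Sigma = A @ A.T + (m + 1) * np.eye(m + 1)    # synthetic positive definite covariance

Sigma_O = Sigma[:m, :m]                      # observed block
Sigma_OS = Sigma[:m, m:]                     # m x 1 cross-covariance with Y
Sigma_S = Sigma[m, m]                        # scalar variance of Y

K_O = np.linalg.inv(Sigma)[:m, :m]           # observed block of the inverse
Sigma_O_inv = np.linalg.inv(Sigma_O)

# c is the inverse of the Schur complement of Sigma_O in Sigma.
c = 1.0 / (Sigma_S - (Sigma_OS.T @ Sigma_O_inv @ Sigma_OS).item())
z = np.sqrt(c) * Sigma_O_inv @ Sigma_OS      # m x 1

identity_holds = np.allclose(Sigma_O_inv, K_O - z @ z.T)
```

Because $\Sigma$ is positive definite, the Schur complement is positive, which is why $c\in\mathbb{R}^{+}$ and $\sqrt{c}$ is well defined.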

Varma et al. then show that we may learn the structure of $G$ from $K_{O}$ and the accuracies of the sources from $z$ using the following algorithm:

Result: $\hat{G}=(V,\hat{E}),\ \hat{L}$
Inputs: Estimate of covariance matrix $\hat{\Sigma}_{O}$, parameter $\gamma$, threshold $T$
Solve: $(\hat{S},\hat{L})=\operatorname{argmin}_{(S,L)}\ \|L\|_{*}+\gamma\|S\|_{1}$
       s.t. $S-L=\hat{\Sigma}_{O}^{-1}$
$\hat{E}\leftarrow\{(i,j):i<j,\ \hat{S}_{i,j}>T\}$
Algorithm 1 Weak Supervision Structure Learning and Source Estimation Using Robust PCA (From [3])
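To make the two steps of algorithm 1 concrete, the following is a minimal sketch of a generic ADMM-style principal component pursuit iteration followed by the edge-thresholding step. It is not the solver used in [3] or [5]: the default $\gamma$, the step size `mu`, the stopping rule, and all function names are assumptions chosen for illustration. The low-rank iterate `N` plays the role of $-L$, so that the returned pair satisfies $S-L\approx M$.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage, the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink singular values, the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def sparse_plus_low_rank(M, gamma=None, tol=1e-7, max_iter=5000):
    """Approximately solve min ||L||_* + gamma ||S||_1  s.t.  S - L = M."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if gamma is None:
        gamma = 1.0 / np.sqrt(max(n1, n2))      # common default (assumption)
    mu = n1 * n2 / (4.0 * np.abs(M).sum())      # heuristic step size (assumption)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                        # dual variable
    for _ in range(max_iter):
        N = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - N + Y / mu, gamma / mu)
        residual = M - N - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return S, -N                                # L = -N, so S - L = S + N

def edges_from_sparse(S, T):
    """Edge set {(i, j): i < j, S[i, j] > T} from the sparse component."""
    m = S.shape[0]
    return {(i, j) for i in range(m) for j in range(i + 1, m) if S[i, j] > T}

# Synthetic input with the assumed structure: sparse K minus rank-one z z^T.
K = 2.0 * np.eye(5)
K[0, 1] = K[1, 0] = 0.8
z = np.full((5, 1), 0.3)
M = K - z @ z.T
S_hat, L_hat = sparse_plus_low_rank(M)
E_hat = edges_from_sparse(S_hat, T=0.5)
```

How well the recovered components match the true sparse and low-rank parts depends on $\gamma$, the threshold $T$, and the structure of the input, so the edge set returned for this toy matrix should not be read as a guaranteed recovery.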

Note that $\hat{L}=zz^{T}$.

Ratner et al. [4] show that we may estimate the source accuracies $\hat{\mu}$ from $z$, and they propose a simpler algorithm for estimating $z$ when the graph structure is already known: if $E$ is known we may construct a dependency mask $\Omega=\{(i,j):(\lambda_{i},\lambda_{j})\not\in E\}$. They use this in the following algorithm:

Result: $\hat{\mu}$
Inputs: Observed labeling rates $\hat{\mathbb{E}}[O]$ and covariance $\hat{\Sigma}_{O}$; class balance $\hat{\mathbb{E}}[Y]$ and variance $\hat{\Sigma}_{S}$; dependency mask $\Omega$
$\hat{z}\leftarrow\operatorname{argmin}_{z}\|\hat{\Sigma}_{O}^{-1}+zz^{T}\|_{\Omega}$
$\hat{c}\leftarrow\hat{\Sigma}_{S}^{-1}(1+\hat{z}^{T}\hat{\Sigma}_{O}\hat{z})$
$\hat{\Sigma}_{OS}\leftarrow\hat{\Sigma}_{O}\hat{z}/\sqrt{\hat{c}}$
$\hat{\mu}\leftarrow\hat{\Sigma}_{OS}+\hat{\mathbb{E}}[Y]\hat{\mathbb{E}}[O]$
Algorithm 2 Source Estimation for Weak Supervision (From [4])
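The last three steps of algorithm 2 are closed-form once $\hat{z}$ is available. The sketch below implements only those steps, assuming the single-task case in which $\hat{\Sigma}_{S}$ is a scalar; the masked minimization that produces $\hat{z}$ is not shown. The function name `accuracies_from_z` and the synthetic covariance used for the round-trip check are illustrative assumptions.

```python
import numpy as np

def accuracies_from_z(z, sigma_O, sigma_S, E_Y, E_O):
    """Closed-form steps of algorithm 2 given an estimate of z.

    Assumes a single latent label, so sigma_S and E_Y are scalars.
    Returns (mu_hat, sigma_OS_hat, c_hat).
    """
    z = np.asarray(z, dtype=float).reshape(-1)
    c = (1.0 + z @ sigma_O @ z) / sigma_S
    sigma_OS = sigma_O @ z / np.sqrt(c)
    mu = sigma_OS + E_Y * np.asarray(E_O, dtype=float)
    return mu, sigma_OS, c

# Round-trip check on a synthetic covariance: build the exact z from a known
# Sigma, then confirm the steps above recover the cross-covariance block.
rng = np.random.default_rng(1)
m = 3
A = rng.normal(size=(m + 1, m + 1))
Sigma = A @ A.T + (m + 1) * np.eye(m + 1)
sigma_O, sigma_OS_true, sigma_S = Sigma[:m, :m], Sigma[:m, m], Sigma[m, m]
sigma_O_inv = np.linalg.inv(sigma_O)
c_true = 1.0 / (sigma_S - sigma_OS_true @ sigma_O_inv @ sigma_OS_true)
z_true = np.sqrt(c_true) * sigma_O_inv @ sigma_OS_true

mu, sigma_OS, c = accuracies_from_z(z_true, sigma_O, sigma_S,
                                    E_Y=0.0, E_O=np.zeros(m))
```

The round trip works because $z^{T}\Sigma_{O}z=c\,\Sigma_{S}-1$ when $z=\sqrt{c}\,\Sigma_{O}^{-1}\Sigma_{OS}$, so the second step returns the same $c$ and the third step returns $\Sigma_{OS}$.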

Snorkel, an open-source Python package, provides an implementation of algorithm 2 [6].

III Motivating Our Approach

Although the algorithm proposed by Varma et al. may be used to determine the source dependency structure and source accuracy, it requires a robust principal components decomposition of the matrix $\hat{\Sigma}_{O}^{-1}$, which is equivalent to a convex Principal Components Pursuit (PCP) problem [5]. Using the current state-of-the-art solvers such problems have time complexity $O(\epsilon^{-2})$ where $\epsilon$ denotes the solver convergence tolerance [5]. For reasonable choices of $\epsilon$ this may be a very expensive calculation.

In the single-task classification setting, algorithm 2 may be solved by least-squares and is therefore much less expensive to compute than algorithm 1. Both algorithms, however, require the observed labeling rates and covariance estimates of the supervision sources over the entire dataset and therefore cannot be used in an on-line setting.

We therefore develop an on-line approach which estimates the structure of $G$ using algorithm 1 on an initial "minibatch" of unlabeled examples and then iteratively updates the source accuracy estimate $\hat{\mu}$ using a modified implementation of algorithm 2.

IV Methods

Given an initial batch $b_{1}$ of unlabeled examples $X_{b_{1}}=\{x_{1},\dots,x_{k}\}$ we estimate $G$ by first soliciting labels $\boldsymbol{\lambda}_{1},\dots,\boldsymbol{\lambda}_{k}$ for $X_{b_{1}}$ from the sources. We then calculate estimated labeling rates $\hat{E}[O]$ and covariances $\hat{\Sigma}_{Ob_{1}}$, which we input to algorithm 1, yielding $\hat{G}=(V,\hat{E})$ and $\hat{L}$. From $\hat{E}$ we create the dependency mask $\hat{\Omega}=\{(i,j):(\lambda_{i},\lambda_{j})\not\in\hat{E}\}$ which we will use with future data batches. Using the fact that $\hat{L}=zz^{T}$ we recover $\hat{z}$ by first calculating

$$|\hat{z}|=\sqrt{\mathrm{diag}(\hat{L})}$$

We then break the symmetry using the method in [4]. Note that if a source $\lambda_{i}$ is conditionally independent of the others then the sign of $z_{i}$ determines the sign of all other elements of $z$.
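The magnitude and sign recovery can be sketched as follows for an exactly rank-one $\hat{L}$. The sign rule used here is a simple illustrative choice, not necessarily the symmetry-breaking method of [4]: one reference entry is assumed to be positive, and every other sign is read from the reference row, since $\hat{L}_{ij}=z_{i}z_{j}$. The function name `recover_z` and the choice of the largest-magnitude entry as reference are assumptions.

```python
import numpy as np

def recover_z(L_hat, reference=None):
    """Recover z from L_hat = z z^T up to the sign of a reference entry.

    The reference entry is assumed to be positive (an assumption, e.g. a
    source believed to be better than chance); the remaining signs follow
    from sign(L_hat[reference, j]) = sign(z_reference * z_j).
    """
    L_hat = np.asarray(L_hat, dtype=float)
    z_abs = np.sqrt(np.clip(np.diag(L_hat), 0.0, None))   # |z| = sqrt(diag(L))
    if reference is None:
        reference = int(np.argmax(z_abs))                 # most informative row
    signs = np.sign(L_hat[reference])
    signs[signs == 0] = 1.0                               # leave zero entries unsigned
    return z_abs * signs

z_true = np.array([0.5, -0.2, 0.3])
z_rec = recover_z(np.outer(z_true, z_true))
```

Since $zz^{T}=(-z)(-z)^{T}$, the input alone cannot distinguish $z$ from $-z$; the assumption about the reference entry is what fixes the global sign.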

Using $\hat{z},\ \hat{E}[O],\ \hat{\Sigma}_{Ob_{1}}$, class balance prior $\hat{E}[Y]$ and class variance prior $\hat{\Sigma}_{S}$ we calculate $\hat{\mu}$, an estimate of the source accuracies [if we have no prior beliefs about the class distribution then we simply substitute uninformative priors for $\hat{E}[Y]$ and $\hat{\Sigma}_{S}$].

For each following batch $b_{p}$ of unlabeled examples $X_{b_{p}}$ we estimate $\Sigma_{Ob_{p}}$ and $E[O]_{b_{p}}$. Using these along with $\hat{E}[O]$ and $\hat{\Sigma}_{Ob_{1}}$ we calculate $\hat{\mu}_{b_{p}}$, an estimate of the source accuracies over the batch. We then update $\hat{\mu}$ using the following update rule:

$$\hat{\mu}:=(1-\alpha)\hat{\mu}+\alpha\hat{\mu}_{b_{p}}$$

where $\alpha\in[0,1]$ denotes the mixing parameter. Our method thus models the source accuracies using an exponentially-weighted moving average of the estimated per-batch source accuracies.
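The update rule is a one-line computation. The sketch below applies it to two made-up per-batch estimates; the numbers, the value $\alpha=0.5$, and the function name `update_accuracy` are illustrative only.

```python
import numpy as np

def update_accuracy(mu_hat, mu_batch, alpha):
    """Exponentially weighted update: (1 - alpha) * mu_hat + alpha * mu_batch."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mu_hat = np.asarray(mu_hat, dtype=float)
    mu_batch = np.asarray(mu_batch, dtype=float)
    return (1.0 - alpha) * mu_hat + alpha * mu_batch

# Running estimate for three sources, updated with two hypothetical batches.
mu_hat = np.array([0.60, 0.70, 0.55])
for mu_batch in (np.array([0.70, 0.70, 0.45]), np.array([0.80, 0.60, 0.55])):
    mu_hat = update_accuracy(mu_hat, mu_batch, alpha=0.5)
```

Unrolling the recursion shows that a batch seen $j$ updates ago carries weight $\alpha(1-\alpha)^{j}$, so larger values of $\alpha$ emphasize recent batches and $\alpha=0$ freezes the initial estimate.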

Using the estimated source accuracies and dependency structure we may estimate $p(y,\boldsymbol{\lambda})$ which we may then use to estimate $p(y|\boldsymbol{\lambda})$ by (1).

Result: $\hat{\mu}$
Inputs: Observed labeling rates $\hat{\mathbb{E}}[O]_{b}$ and covariance $\hat{\Sigma}_{Ob}$; class balance $\hat{\mathbb{E}}[Y]$ and variance $\hat{\Sigma}_{S}$
for each batch $b$ do
       if $b$ is the initial batch then
             Use algorithm 1 to calculate $\hat{G}$ and $\hat{L}$
             $|\hat{z}|\leftarrow\sqrt{\mathrm{diag}(\hat{L})}$
             Determine the sign of the entries of $\hat{z}$ using the method from [4]
       else
             $\hat{z}\leftarrow\operatorname{argmin}_{z}\|\hat{\Sigma}_{Ob}^{-1}+zz^{T}\|_{\Omega}$
       end if
       $\hat{c}\leftarrow\hat{\Sigma}_{S}^{-1}(1+\hat{z}^{T}\hat{\Sigma}_{Ob}\hat{z})$
       $\hat{\Sigma}_{OS}\leftarrow\hat{\Sigma}_{Ob}\hat{z}/\sqrt{\hat{c}}$
       $\hat{\mu}_{b}\leftarrow\hat{\Sigma}_{OS}+\hat{\mathbb{E}}[Y]\hat{\mathbb{E}}[O]_{b}$
       if $b$ is the initial batch then
             $\hat{\mu}\leftarrow\hat{\mu}_{b}$
       else
             $\hat{\mu}\leftarrow(1-\alpha)\hat{\mu}+\alpha\hat{\mu}_{b}$
       end if
end for
Algorithm 3 Incremental Source Accuracy Estimation

V Tests

Supervision Sources

We test our model in an on-line setting using three supervision sources. Two of the sources are off-the-shelf implementations of Naïve Bayes classifiers trained to classify text by sentiment. Each was trained on an openly available dataset. The first model was trained using a subset of the IMDB movie reviews dataset, which consists of a corpus of texts labeled by perceived sentiment [either "positive" or "negative"]. Because the labels associated with this dataset are binary the classifier generates binary labels.

The second classifier was trained using another openly available dataset, this one consisting of a corpus of text extracted from tweets associated with air carriers in the United States and labeled according to sentiment. The labels in this dataset belong to three separate classes ["positive", "neutral", and "negative"] and therefore the model trained using this dataset classifies examples according to these classes.

The final supervision source is the TextBlob Pattern Analyzer. This is a heuristic function which classifies text by polarity and subjectivity using a lookup table consisting of strings mapped to polarity/subjectivity estimates. To generate discrete labels for an example using this model we threshold the polarity/subjectivity estimates associated with the example as follows:

  • If polarity is greater than 0.33 we generate a positive label

  • If polarity is less than or equal to 0.33 but greater than -0.33 we generate a neutral label

  • If polarity is less than or equal to -0.33 we generate a negative label
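The thresholding rule can be written as a small function. The sketch below assumes the polarity score is a float in $[-1, 1]$ and returns string labels; the function name `polarity_to_label` and the string encoding are illustrative, and the call to the analyzer itself is omitted.

```python
def polarity_to_label(polarity, threshold=0.33):
    """Map a polarity score to a discrete sentiment label.

    polarity >  threshold              -> "positive"
    -threshold < polarity <= threshold -> "neutral"
    polarity <= -threshold             -> "negative"
    """
    if polarity > threshold:
        return "positive"
    if polarity > -threshold:
        return "neutral"
    return "negative"

labels = [polarity_to_label(p) for p in (0.8, 0.33, 0.0, -0.33, -0.9)]
```

The boundary values follow the inequalities above: a polarity of exactly 0.33 is neutral and a polarity of exactly -0.33 is negative.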

Test Data

We test our incremental model using a set of temporally-ordered text data extracted from tweets associated with a 2016 GOP primary debate labeled by sentiment ["positive", "neutral", or "negative"]. We do so by soliciting labels $\boldsymbol{\lambda}_{1},\dots,\boldsymbol{\lambda}_{n}$ associated with the $n$ examples from the three supervision sources.

Weak Supervision as Transfer Learning

Note that this setting is an example of a transfer learning problem [7]. Specifically, since we are using models pre-trained on datasets similar to the target dataset we may view the Naïve Bayes models as transferring knowledge from those two domains [movie reviews and tweets associated with airlines, respectively] to provide supervision signal in the target domain [7]. The Pattern Analyzer may be viewed through the same lens as it uses domain knowledge gained through input from subject-matter experts.

Test Setup

Because our model is generative we cannot use a standard train-validation-test split of the dataset to determine model performance. Instead, we compare the labels generated by the model with the ground-truth labels over separate folds of the dataset.

Data Folding Procedure

We split the text corpus into five folds. The examples are not shuffled, in order to preserve temporal order within folds. Using these folds we perform 5 separate tests, each using four of the five folds in order. For example, the fifth test uses fold 5 and folds 1 to 3, in that order.
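One way to realize this folding procedure is sketched below. The wrap-around rule (test $t$ starts at fold $t$ and takes the next three folds cyclically) is an assumption inferred from the fifth-test example; the function name `ordered_fold_tests` and the ten-example toy corpus are illustrative.

```python
import numpy as np

def ordered_fold_tests(n_examples, n_folds=5, folds_per_test=4):
    """Return one index array per test, built from contiguous, unshuffled folds.

    Test t starts at fold t and takes the next folds cyclically (an assumed
    rule consistent with the fifth test using fold 5, then folds 1 to 3).
    """
    folds = np.array_split(np.arange(n_examples), n_folds)   # preserves order
    tests = []
    for start in range(n_folds):
        order = [(start + offset) % n_folds for offset in range(folds_per_test)]
        tests.append(np.concatenate([folds[f] for f in order]))
    return tests

tests = ordered_fold_tests(10)   # toy corpus of 10 temporally ordered examples
```

With ten examples the folds are pairs of consecutive indices, so the fifth test visits indices 8, 9 and then 0 through 5, which keeps temporal order within each fold.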

Partition Tests

For each set of folds we further partition the data into $k=100$ batches of size $q$ which we refer to as "minibatches" [as they are subsets of the folds]. For each minibatch we solicit labels $\boldsymbol{\lambda}_{1},\dots,\boldsymbol{\lambda}_{q},\ \boldsymbol{\lambda}_{i}\in\mathbb{R}^{3}$ from the two pretrained models and the Pattern Analyzer. Note that both pretrained classifiers first transform the text by tokenizing the strings and then calculating the term frequency-inverse document frequency (Tf-idf) weight for each token. We store these labels in an array $\mathbf{L}$ for future use. We then calculate $\hat{E}[O]_{b}$ and $\hat{\Sigma}_{Ob}$ for the minibatch, which we use with algorithm 3 to generate $\hat{\mu}_{b}$ and the dependency graph $\hat{G}$. Using these we generate labels corresponding to the examples contained within the minibatch.

Using the ground-truth labels associated with the examples contained within the minibatch we calculate the accuracy of our method by comparing the generated labels $\mathbf{\hat{y}}$ with the ground-truth labels $\mathbf{y}$:

$$\texttt{accuracy}(\mathbf{y},\mathbf{\hat{y}})=\frac{1}{q}\sum_{i=0}^{q-1}\mathbf{1}(\mathbf{\hat{y}}_{i}=\mathbf{y}_{i})$$
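The accuracy metric is a direct transcription of the formula. The sketch below also includes a helper that averages the per-minibatch scores; the helper name `average_accuracy` and the toy labels are illustrative assumptions.

```python
import numpy as np

def accuracy(y, y_hat):
    """Fraction of examples whose generated label equals the ground truth."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    if y.shape != y_hat.shape:
        raise ValueError("y and y_hat must have the same shape")
    return float(np.mean(y_hat == y))

def average_accuracy(batches):
    """Mean of the per-minibatch accuracies over (y, y_hat) pairs."""
    return float(np.mean([accuracy(y, y_hat) for y, y_hat in batches]))

# Two toy minibatches of ground-truth and generated labels.
batches = [([1, 2, 3, 1], [1, 2, 1, 1]),
           ([1, 1], [1, 2])]
per_batch = [accuracy(y, y_hat) for y, y_hat in batches]
overall = average_accuracy(batches)
```

Averaging per-minibatch scores weights every minibatch equally, which matches pooling all examples only when the minibatches have the same size $q$.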

We then average the accuracy scores associated with each minibatch over the number of minibatches used in each test to calculate the average per-test accuracy [calculated using four of the five folds of the overall dataset].

We then compare the average accuracies of the labels produced using our incremental method to the accuracies of the labels produced by an existing off-line source accuracy estimation method based on algorithm 2 [6]. Since this method works in an off-line manner it requires access to the entire set $\mathbf{L}$ of labels generated by the supervision sources. Using these, this method generates its own set of labels $\mathbf{\hat{y}}_{baseline}$, with which we then calculate the baseline accuracy using the accuracy metric above.

Finally, we compare the accuracy of the labels generated by our method with the accuracy of the labels generated by each of the supervision sources.

Comparing Values of $\alpha$

We then follow the same procedure as above to generate labels for our method, except this time we use different values of $\alpha$.

VI Results

Our tests demonstrate the following:

  1. Our model generates labels which are more accurate than those generated by the baseline [when averaged over all 5 tests].

  2. Both our method and the baseline generate labels which are more accurate than those generated by each of the supervision sources.

  3. Our tests of the accuracy of labels generated by our method using different values of $\alpha$ yield an optimal value of $\alpha=0.05$ and show convexity over the values tested.

These tests show that the average accuracy of the incremental model qualitatively appears to increase as the number of samples seen grows. This result is not surprising, as we would expect our source accuracy estimate to approach the true accuracy, $\hat{\mu}\rightarrow\mu$, as the number of examples seen increases. This suggests that the incremental approach we propose generates more accurate labels as a function of the number of examples seen, unlike the supervision sources, which are pre-trained and therefore do not generate more accurate labels as the number of labeled examples grows.

These tests also suggest that an optimal value for $\alpha$ for this problem is approximately $0.05$, which is in the interior of the set of values tested for $\alpha$. Since we used $100$ minibatches in each test of the incremental model this implies that choosing an $\alpha$ which places greater weight on more recent examples yields better performance, although more tests are necessary to make any stronger claims.

Finally, we note that none of the models tested here are in themselves highly accurate as classification models. This is not unexpected, as the supervision sources were intentionally chosen to be "off-the-shelf" models and no feature engineering was performed on the underlying text data, either for the datasets used in pre-training the two classifier supervision sources or for the test set [besides Tf-idf vectorization]. The intention in this test was to compare the relative accuracies of the two generative methods, not to design an accurate discriminative model.

VII Conclusion

We develop an incremental approach for estimating weak supervision source accuracies. We show that our method generates labels for unlabeled data which are more accurate than those generated by pre-existing non-incremental approaches. We frame our specific test case in which we use pre-trained models and heuristic functions as supervision sources as a transfer learning problem and we show that our method generates labels which are more accurate than those generated by the supervision sources themselves.

Figure 1: Comparison of incremental and non-incremental model accuracy over minibatches.
Figure 2: Average model accuracy over minibatches.
Figure 3: Average per-batch accuracies for different values of $\alpha$.
TABLE I: Average Accuracy of Incremental Model For Different Alpha Values
Alpha     0.001    0.01     0.025    0.05     0.1      0.25
Accuracy  0.61245  0.61287  0.61832  0.61709  0.61498  0.61473

References

  • [1] Alexander Ratner, Stephen Bach, Paroma Varma, Chris Ré. (2017) "Weak Supervision: The New Programming Paradigm for Machine Learning". Snorkel Blog.
  • [2] Mayee Chen, Frederic Sala, Chris Ré. "Lecture Notes on Weak Supervision". CS 229 Lecture Notes. Stanford University.
  • [3] Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, Christopher Ré. (2019) "Learning Dependency Structures for Weak Supervision Models". Preprint.
  • [4] Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré. (2018) "Training Complex Models with Multi-Task Weak Supervision". Preprint.
  • [5] Emmanuel J. Candès, Xiaodong Li, Yi Ma, John Wright. "Robust principal component analysis?" Journal of the ACM. Vol 58. Issue 11.
  • [6] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré. "Snorkel: Rapid Training Data Creation with Weak Supervision". Preprint.
  • [7] Sinno Jialin Pan, Qiang Yang. (2009) "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering. Vol 22. Issue 10.