
Exploring Demonstration Ensembling for In-context Learning

Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee,
Honglak Lee, Lu Wang
University of Michigan, LG AI Research, University of Illinois at Chicago
Correspondence to khalifam@umich.edu
Abstract

In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from two issues. First, concatenation offers almost no control over the contribution of each demo to the model prediction, which can be sub-optimal when some demonstrations are irrelevant to the test example. Second, due to the input length limit of transformer models, it might be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-j (Wang & Komatsuzaki, 2021) and experiment on 12 language tasks. Our experiments show weighted max ensembling to outperform vanilla concatenation by as much as 2.4 average points. (Code available at https://github.com/mukhal/icl-ensembling.)

1 Introduction

Language model (LM) pre-training on large corpora currently dominates the NLP scene. One impressive aspect of such large language models (LLMs) is their capability to perform in-context learning (Brown et al., 2020): conditioning the model on a few examples (i.e., demonstrations) of the desired task and then asking the LM to predict the label for a given input.

The standard approach for feeding in-context demonstrations (demos, for short) to the LM is to concatenate the task examples (Brown et al., 2020; Min et al., 2022c; Lu et al., 2022). While simple, concatenation suffers from a few drawbacks. First, it provides no control over each demo’s contribution to the model’s output, which is left to the attention weights to decide. Second, the concatenation of demos can easily use up the context window of transformer-based models, especially when we have access to many demonstrations or when dealing with lengthy inputs. Lastly, it has been shown that LLMs are sensitive to the ordering of the demonstrations (Zhao et al., 2021; Lu et al., 2022), and a long chain of concatenated demos can exacerbate this problem.

In this work, we explore an alternative to the concatenation approach: prompting the model with demonstrations in an ensembling fashion. In particular, we partition the examples into non-empty subsets, or buckets, and then combine the predictions obtained from each bucket into a final prediction. We investigate three ensembling methods for combining the per-bucket predictions, as well as clustering-based approaches to partitioning the examples. Experiments on 12 different language tasks show that ensembling can outperform the standard concatenation approach.

2 Related Work

This work is related to work that aims to improve few-shot learning with LLMs (Min et al., 2022b; Rubin et al., 2022; Lu et al., 2022). For instance, Perez et al. (2021) try to find optimal prompts using techniques such as cross-validation and minimum description length. Min et al. (2022a) applied demonstration ensembling for text classification in a limited setting. This paper, on the other hand, explores the more general ensembling setting with different bucket sizes and different types of tasks. Wang et al. (2022) explore rationale-augmented ensembles, where different in-context demonstrations are augmented with LM-generated rationales. Different from our work, their ensembling is done over the rationales in the examples, while we ensemble the examples themselves. Qin & Eisner (2021) trained mixtures of soft prompts for knowledge extraction from language models. Our prior work (Khalifa et al., 2022) also explored demonstration ensembling for in-context reranking of document paths in multi-hop question answering.

3 Demonstration Ensembling

Figure 1: In-context learning with six demonstrations. Left: The standard concat-based approach for feeding the examples (Brown et al., 2020). Right: Ensembling with three buckets of size two each. For a given label $y$, the probability $\tilde{p}_i(y|x)$ is computed using the $i$-th bucket. All probabilities are ensembled to give the final probability $p(y|x)$.

We assume a list of $n$ demonstrations $\mathcal{D} = \langle (x_1, y_1), \ldots, (x_n, y_n) \rangle$, where $x_i$ and $y_i$ are the demonstration input and ground-truth output or label, respectively. We now formalize our approach for demonstration ensembling.

3.1 Bucket Allocation

DENSE allocates the $n$ demos in $\mathcal{D}$ to $b$ non-empty buckets $\{\mathcal{B}_0, \mathcal{B}_1, \ldots, \mathcal{B}_{b-1}\}$. More precisely, if each bucket has $\gamma$ demos, then $\mathcal{B}_i$ is assigned the demos $\mathcal{D}_{\gamma i : \gamma(i+1)-1}$. We predict a set of probabilities of a label $y$ by separately conditioning the LM on the different buckets along with the test input $x$. Formally, for bucket $\mathcal{B}_i$, we predict $\tilde{p}_i(y|x)$ as:

$$\tilde{p}_i(y|x) = P_{LM}(y \mid \mathcal{B}_i, x)$$

The aggregate probability of the label $y$ is proportional to the output of an ensembling operator $\Phi$ that combines the different bucket probabilities:

$$P(y|x) \propto \Phi\big(y \mid \tilde{p}_0(y|x), \ldots, \tilde{p}_{b-1}(y|x), x\big) \qquad (1)$$

where $\Phi$ is a function that takes in the probabilities $\tilde{p}_0(y|x), \ldots, \tilde{p}_{b-1}(y|x)$ and the test example $x$, and computes a (possibly unnormalized) probability of the output label $y$. For brevity, we will just write $\Phi(y|x)$ from now on.
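To make the formulation concrete, the following is a minimal sketch of DENSE in Python. The helper lm_label_prob is a hypothetical stand-in for querying the LM for $\tilde{p}_i(y|x)$, and phi stands for any of the ensembling operators defined in Section 3.2; neither name comes from our released code.

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # an (input, label) demonstration pair


def allocate_buckets(demos: List[Demo], b: int) -> List[List[Demo]]:
    """Split n demos into b contiguous buckets of size gamma = n // b (Section 3.1)."""
    gamma = len(demos) // b
    return [demos[gamma * i: gamma * (i + 1)] for i in range(b)]


def dense_predict(
    demos: List[Demo],
    x: str,
    labels: List[str],
    b: int,
    lm_label_prob: Callable[[List[Demo], str, str], float],  # hypothetical: returns p~_i(y|x)
    phi: Callable[[List[float]], float],                      # ensembling operator (Eq. 1)
) -> str:
    """Return the label with the highest ensembled score."""
    buckets = allocate_buckets(demos, b)
    scores = {}
    for y in labels:
        bucket_probs = [lm_label_prob(bucket, x, y) for bucket in buckets]  # p~_i(y|x)
        scores[y] = phi(bucket_probs)
    return max(scores, key=scores.get)
```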

3.2 Ensembling Method

We assume each bucket $\mathcal{B}_i$ has a normalized importance weight $w_i$ assigned to it, where $\sum_{i=0}^{b-1} w_i = 1$. One form of $\Phi$ is the product operator, in which $P(y|x)$ corresponds to a product-of-experts (Hinton, 2002):

$$\Phi^{\operatorname{PoE}}(y|x) = \prod_{i=0}^{b-1} \tilde{p}_i(y|x)^{w_i} \qquad (2)$$

In addition, we can explore a mixture-of-experts formulation:

$$\Phi^{\operatorname{MoE}}(y|x) = \sum_{i=0}^{b-1} w_i\, \tilde{p}_i(y|x) \qquad (3)$$

We also explore max ensembling, which uses the most confident prediction probability across different buckets:

$$\Phi^{\operatorname{max}}(y|x) = \max_{j} w_j\, \tilde{p}_j(y|x) \qquad (4)$$
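As a sketch under the same assumptions, the three operators follow directly from Equations 2–4. Weights default to uniform ($w_i = 1/b$); since Equation 1 only requires scores proportional to probabilities, no normalization is applied.

```python
import numpy as np


def _uniform_if_none(w, k):
    return np.full(k, 1.0 / k) if w is None else np.asarray(w, dtype=float)


def phi_poe(probs, w=None):
    """Product-of-experts: weighted geometric combination of bucket probabilities (Eq. 2)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.prod(probs ** _uniform_if_none(w, len(probs))))


def phi_moe(probs, w=None):
    """Mixture-of-experts: weighted average of bucket probabilities (Eq. 3)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(_uniform_if_none(w, len(probs)) * probs))


def phi_max(probs, w=None):
    """Max ensembling: most confident (weighted) bucket prediction (Eq. 4)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.max(_uniform_if_none(w, len(probs)) * probs))
```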

3.3 Bucket Weighting

Inspired by recent work (Gao et al., 2021; Liu et al., 2022) showing that demonstrations that are more similar to the input perform better than distant ones, we weight each bucket by the average similarity of its examples to the input $x$:

$$w_i = \frac{1}{|\mathcal{B}_i|} \sum_{(x_j, y_j) \in \mathcal{B}_i} \cos(x_j^e, x^e) \qquad (5)$$

where $\cos$ is the cosine similarity and $x^e$ is the embedding of $x$.
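A minimal sketch of this weighting, assuming the MiniLM sentence encoder from Section 4.1 provides the embeddings and renormalizing the weights so they sum to one (matching the assumption in Section 3.2):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def bucket_weights(buckets, x: str) -> np.ndarray:
    """Eq. 5: average cosine similarity between each bucket's demo inputs and the test input x."""
    x_emb = encoder.encode(x)
    raw = []
    for bucket in buckets:
        demo_embs = encoder.encode([x_j for x_j, _ in bucket])
        raw.append(np.mean([cosine(e, x_emb) for e in demo_embs]))
    raw = np.asarray(raw)
    return raw / raw.sum()  # renormalize so the weights sum to 1 (assumption)
```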

3.4 Clustering Demonstrations

While the bucket construction approach explained in Section 3.1 constructs buckets arbitrarily based on the order of the demos in $\mathcal{D}$, one heuristic is to use similarity information between demos to construct buckets. We experiment with k-means clustering (Hartigan & Wong, 1979) to construct buckets. More precisely, we apply k-means over vector representations of the demonstrations to obtain $b$ clusters and then use each cluster as a bucket (note that in this case, not all buckets will have the same number of demos). Each bucket can then operate as a semantically coherent expert. We refer to this approach as similar-together bucket allocation.
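A sketch of similar-together allocation using scikit-learn's k-means over demo-input embeddings (each cluster becomes one bucket, so bucket sizes may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def similar_together_buckets(demos, b: int, seed: int = 0):
    """Cluster demo inputs into b clusters; each cluster is used as one bucket."""
    embs = encoder.encode([x for x, _ in demos])
    cluster_ids = KMeans(n_clusters=b, n_init=10, random_state=seed).fit_predict(embs)
    return [[demos[j] for j in np.where(cluster_ids == i)[0]] for i in range(b)]
```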

Figure 2: 6-shot (Left) and 10-shot (Right) performance of different ensembling methods and concatenation. Metrics are averaged over three seeds of demos, 12 datasets, and different numbers of buckets. (Un)weighted indicates whether we use similarity with the input example to weigh the contribution of each bucket. Similar-together and Diverse buckets are obtained through k-means clustering as explained in Section 3.4.

As opposed to maximizing the similarity between the demos within a given bucket, we can instead maximize dissimilarity to achieve diverse buckets. To do that, we use k-means to cluster demos into $\lfloor n/b \rfloor$ clusters, each with $b$ demos (we assume $n$ is always divisible by $b$ for simplicity). Then, we construct $b$ buckets by picking a unique demo from each cluster (here, we use a constrained version of k-means (Bradley et al., 2000) to make sure we get exactly $b$ demos in each k-means cluster). Having diverse buckets might result in a prediction that is less biased towards a certain category of demonstrations. We refer to this approach as diverse bucket allocation.
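Diverse allocation can be sketched with the constrained k-means package referenced in Appendix B; the KMeansConstrained class and its size_min/size_max arguments are our assumption about that package's interface rather than something specified in this paper.

```python
from k_means_constrained import KMeansConstrained  # https://github.com/joshlk/k-means-constrained
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def diverse_buckets(demos, b: int, seed: int = 0):
    """Cluster into n/b clusters of exactly b demos, then deal one demo per cluster into each bucket."""
    n = len(demos)
    k = n // b  # number of clusters; n is assumed divisible by b
    embs = encoder.encode([x for x, _ in demos])
    cluster_ids = KMeansConstrained(
        n_clusters=k, size_min=b, size_max=b, random_state=seed
    ).fit_predict(embs)
    clusters = [[j for j in range(n) if cluster_ids[j] == c] for c in range(k)]
    # bucket i takes the i-th member of every cluster, so each bucket mixes all clusters
    return [[demos[cluster[i]] for cluster in clusters] for i in range(b)]
```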

Besides yielding better bucket allocation, clustering makes bucket assignment less sensitive to the demonstration order in $\mathcal{D}$. As a result, it can greatly reduce the sensitivity to demo order studied in previous work (Zhao et al., 2021; Lu et al., 2022).

4 Experiments and Results

4.1 Experimental Setup

Data.

We experiment with 12 datasets in total. Details on the datasets, metrics used, and the number of evaluation examples can be found in Appendix A.

Baselines.

We compare ensembling to two concatenation-based settings. Concat is the standard approach for ICL, which concatenates the demonstrations in an arbitrary order; Concat-sort sorts the demos based on similarity with the input (i.e., more similar demos come later). Since Concat-sort uses similarity with the input, it allows for a fair comparison with the weighted ensembling methods.
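For illustration, a minimal prompt builder covering both baselines is sketched below; the Input/Output template and the similarity function are placeholders, not necessarily what our experiments use.

```python
def build_concat_prompt(demos, x, similarity=None):
    """Concat baseline; if a similarity function is given, more similar demos come later (Concat-sort)."""
    if similarity is not None:
        demos = sorted(demos, key=lambda d: similarity(d[0], x))  # ascending: most similar demo last
    blocks = [f"Input: {x_i}\nOutput: {y_i}" for x_i, y_i in demos]
    blocks.append(f"Input: {x}\nOutput:")
    return "\n\n".join(blocks)
```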

Model.

For all the experiments, we use GPT-j (6B) (Wang & Komatsuzaki, 2021). To compute embeddings of examples for similarity calculations, we use a fine-tuned 6-layer MiniLM (Wang et al., 2020) (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Our experimental setup is detailed in Appendix B.
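As a rough sketch of how the per-bucket probability $\tilde{p}_i(y|x)$ from Section 3.1 could be computed with this checkpoint, the snippet below scores the label tokens conditioned on a bucket prompt via Huggingface transformers; the prompt template is a placeholder rather than the exact one used in our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=dtype
).to(device).eval()


@torch.no_grad()
def lm_label_prob(bucket, x: str, y: str) -> float:
    """p~_i(y|x): probability of label y given one bucket of demos and the test input x."""
    prompt = "\n\n".join(f"Input: {xi}\nOutput: {yi}" for xi, yi in bucket)
    prompt += f"\n\nInput: {x}\nOutput:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    label_ids = tok(" " + y, return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prompt_ids, label_ids], dim=-1)
    log_probs = model(input_ids).logits.log_softmax(dim=-1)
    # the logit at position t predicts token t+1, so the label tokens are predicted
    # by positions prompt_len-1 ... prompt_len+len(label)-2
    pred = log_probs[0, prompt_ids.shape[-1] - 1 : -1]
    label_logp = pred.gather(-1, label_ids[0].unsqueeze(-1)).squeeze(-1).sum()
    return float(label_logp.exp())
```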

Number of demonstrations and bucket count.

We experiment with number of examples $n = 6, 10$. For $n = 6$, we use ensembling with bucket counts $b = 2, 3, 6$, and for $n = 10$, we set $b = 2, 5, 10$. We note that the concat method in Brown et al. (2020) is a special case of ensembling with $b = 1$.

Figure 3: 6-shot (Left) and 10-shot (Right) performance with different bucket counts $b$. We show performance with weighted MoE, PoE, and Max ensembling. Performance is averaged over 7 tasks and over 3 different seeds of demos.

4.2 Comparing Ensembling Methods

Figure 2 compares the performance of the concat approach against various ensembling methods in the 6- and 10-shot cases. We observe that unweighted (i.e., $w_i = \frac{1}{b}$ for all $i$) PoE, MoE, and max ensembling outperform the concat baseline. Precisely, max ensembling outperforms concat by 0.8 average points in the 6-shot setting, while MoE outperforms it by 0.6 in the 10-shot setting. However, unweighted ensembling underperforms the concat-sort baseline.

Furthermore, weighting the buckets based on similarity with the input boosts ensembling performance in all cases. In the 6-shot case, weighted max ensembling outperforms concat and concat-sort by 2.4 and 1.2 points, respectively. In the 10-shot setting, weighted max ensembling outperforms concat and concat-sort by 1.9 and 0.75 points, respectively.

Lastly, we study the effect of bucket allocation based on clustering the demonstrations, where similar-together clustering gives a consistent boost to weighted ensembling methods in the 6-shot case. However, we do not have conclusive evidence as to which clustering strategy is best. We find the performance of the clustering strategy to vary depending on the task and the total number of demonstrations. For instance, diverse always outperforms similar-together allocation in the 10-shot case, which is not the case for the 6-shot setting. This is likely because having more demos allows for more diverse buckets. We leave it to future work to explore different methods of bucket allocation. Figure 4 in Appendix C shows per-task improvement obtained by ensembling.

4.3 Bucket Count

Here we study what role the bucket count $b$ plays in the performance of ensembling. Figure 3 shows the effect of changing the bucket count on performance. Using a small $b$ seems to perform worse across the board. Interestingly, performance improves as $b$ increases for all ensembling methods in the 6-shot setting.

5 Conclusion

In this work, we explore an alternative to the popular in-context learning paradigm where examples are concatenated and provided to a language model. We show through experiments on 12 language tasks that ensembling, where examples are partitioned into buckets and a final prediction is made by combining predictions from each bucket, yields better performance than concatenation. In particular, we find that max ensembling performs best compared to product-of-experts and mixture-of-experts. In addition, we analyze the effect of varying different aspects of ensembling, such as the number of buckets and the bucket construction strategy.

References

  • Barbieri et al. (2020) Francesco Barbieri, José Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pp. 1644–1650. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.findings-emnlp.148. URL https://doi.org/10.18653/v1/2020.findings-emnlp.148.
  • Bradley et al. (2000) Paul S Bradley, Kristin P Bennett, and Ayhan Demiriz. Constrained k-means clustering. Microsoft Research, Redmond, 20(0):0, 2000.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. (2019) Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pp.  63–69, Minneapolis, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-2008. URL https://aclanthology.org/W19-2008.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • De Gibert et al. (2018) Ona De Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. Hate speech dataset from a white supremacy forum. arXiv preprint arXiv:1809.04444, 2018.
  • Diggelmann et al. (2020) Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. Climate-fever: A dataset for verification of real-world climate claims. arXiv preprint arXiv:2012.00614, 2020.
  • Dolan & Brockett (2005) Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005.
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  3816–3830, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.
  • Hartigan & Wong (1979) John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the royal statistical society. series c (applied statistics), 28(1):100–108, 1979.
  • Hinton (2002) Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
  • Khalifa et al. (2022) Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. Lepus: Prompt-based unsupervised multi-hop reranking for open-domain qa. arXiv preprint arXiv:2205.12650, 2022.
  • Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  • Liu et al. (2022) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Eneko Agirre, Marianna Apidianaki, and Ivan Vulic (eds.), Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022, pp.  100–114. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.deelio-1.10. URL https://doi.org/10.18653/v1/2022.deelio-1.10.
  • Lu et al. (2022) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  8086–8098. Association for Computational Linguistics, 2022. URL https://aclanthology.org/2022.acl-long.556.
  • Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. A sick cure for the evaluation of compositional distributional semantic models. In Lrec, pp.  216–223. Reykjavik, 2014.
  • McCreery et al. (2020) Clara H McCreery, Namit Katariya, Anitha Kannan, Manish Chablani, and Xavier Amatriain. Effective transfer learning for identifying similar questions: matching user questions to covid-19 faqs. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  3458–3465, 2020.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  2381–2391. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1260. URL https://doi.org/10.18653/v1/d18-1260.
  • Min et al. (2022a) Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  5316–5330. Association for Computational Linguistics, 2022a. URL https://aclanthology.org/2022.acl-long.365.
  • Min et al. (2022b) Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp.  2791–2809. Association for Computational Linguistics, 2022b. doi: 10.18653/v1/2022.naacl-main.201. URL https://doi.org/10.18653/v1/2022.naacl-main.201.
  • Min et al. (2022c) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In EMNLP, 2022c.
  • Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Advances in neural information processing systems, 34:11054–11070, 2021.
  • Qin & Eisner (2021) Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp.  5203–5212. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.410. URL https://doi.org/10.18653/v1/2021.naacl-main.410.
  • Rubin et al. (2022) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp.  2655–2671. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.naacl-main.191. URL https://doi.org/10.18653/v1/2022.naacl-main.191.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp.  1631–1642, 2013.
  • Tafjord et al. (2019) Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. Quarel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  7063–7071, 2019.
  • Wang & Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. Gpt-j-6b: A 6 billion parameter autoregressive language model, 2021.
  • Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022.
  • Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  12697–12706. PMLR, 2021. URL http://proceedings.mlr.press/v139/zhao21c.html.

Appendix A Datasets

Dataset Task Metric # Eval
Glue-SST2 (Socher et al., 2013) sentiment analysis macro F1 872
Medical questions pairs (McCreery et al., 2020) paraphrase detection macro F1 610
Glue-MRPC (Dolan & Brockett, 2005) paraphrase detection macro F1 408
Climate Fever (Diggelmann et al., 2020) fact verification macro F1 307
SICK (Marelli et al., 2014) NLI macro F1 495
Glue-WNLI (Levesque et al., 2012) NLI macro F1 71
Hate speech18 (De Gibert et al., 2018) hate speech detection macro F1 2141
TweetEval-stance (feminism) (Barbieri et al., 2020) stance detection macro F1 67
OpenbookQA (Mihaylov et al., 2018) question answering accuracy 500
ARC (Clark et al., 2018) question answering accuracy 299
QUAREL (Tafjord et al., 2019) question answering accuracy 278
CODAH (Chen et al., 2019) sentence completion accuracy 556
Table 1: Datasets, tasks, metrics, and the number of evaluation examples for each dataset.

Appendix B Experimental Setup

We run few-shot inference using fp16 half-precision. All experiments are run on a workstation with 4 Nvidia A100 GPUs with a batch size of 16. We use the GPT-j checkpoint provided by Huggingface (https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16). For clustering, we use the k-means implementation provided by scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). For constrained k-means, we use the implementation at https://github.com/joshlk/k-means-constrained.

Appendix C Detailed Results

Figure 4 shows relative improvement obtained by different weighted ensembling approaches over the concatenation approach.

Figure 4: Relative performance improvement resulting from different ensembling methods, shown per task. The improvement is aggregated over different numbers of examples (6 and 10), different numbers of buckets, and different seeds.