Active Learning for Direct Preference Optimization
Abstract
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline. We propose efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback. We show the effectiveness of our algorithms empirically in the setting that matches our theory and also on large language models.
1 Introduction
Reinforcement learning from human feedback (RLHF) has been effective in aligning and fine-tuning large language models (LLMs) (Ouyang et al., 2022; Rafailov et al., 2023). The main difference from classic reinforcement learning (RL) (Sutton and Barto, 1998) is that the agent learns from human feedback, which is expressed as preferences for different potential choices (Christiano et al., 2017). The human feedback allows LLMs to be adapted beyond the distribution of data that was used for their pre-training and generate more human-like responses. The feedback can be incorporated by learning a reward model (Ouyang et al., 2022) from preferences over two (Bradley and Terry, 1952) or multiple (Plackett, 1975; Luce, 2005) choices. Proximal policy optimization (PPO) (Schulman et al., 2017) is then used to maximize the expected reward of the LLM policy under the reward model. Learning of reward models can be avoided by directly optimizing the policy with preferential feedback, known as direct preference optimization (DPO) (Rafailov et al., 2023).
Learning of human preferences for LLM optimization has two main components: preference modeling (Rafailov et al., 2023; Ethayarajh et al., 2024) and how the preferences are elicited (Lightman et al., 2024). We focus on the latter and note that this problem is analogous to classic active learning (Bishop, 2006). Prior works formulated this problem as identifying a subset of prompts with candidate responses, either online or offline, where preferential feedback would improve policy learning by RLHF, either through a reward model or DPO. These works differ in how the prompts are selected: Mehta et al. (2023); Ji et al. (2024); Muldrew et al. (2024) choose prompts based on differences of estimated rewards to their responses; Mukherjee et al. (2024); Scheid et al. (2024); Thekumparampil et al. (2024) derive optimal policies for offline exploration using D-optimal designs (Pukelsheim, 2006); and Das et al. (2024); Liu et al. (2024) solve D-optimal designs online using a greedy algorithm. Most works prove that the errors in learned reward models diminish with more feedback. Interestingly, many works propose two kinds of algorithms (Mehta et al., 2023; Das et al., 2024; Ji et al., 2024), which are either analyzable or practical. We present the first analysis of active learning in DPO and our algorithms are practical.
We study active learning in direct preference optimization. At a high level, we collect preferential feedback to improve DPO policies learned from it. We study two settings: online and offline. In the online setting, the input is a dataset of prompts with two candidate responses per prompt. The human feedback is unknown in advance and we elicit it online. This setting is motivated by statistical efficiency; we elicit the most informative feedback within a fixed budget on human labor. In the offline setting, the input is a dataset of prompts with two candidate responses per prompt, and logged preferential feedback for the responses. This setting is motivated by computational efficiency; even if the human feedback is known in advance, we may not have computational resources to learn from all of it. We solve both settings in a unified way. The key idea in our work is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and identify the most informative subset of prompts using a D-optimal design (Pukelsheim, 2006). D-optimal designs are a well-established tool in adaptive learning (Lattimore and Szepesvari, 2019) for near-optimal information gathering. Several recent papers applied them to learning reward models in RLHF (Das et al., 2024; Mukherjee et al., 2024; Liu et al., 2024; Scheid et al., 2024).
We make the following contributions:
1. We formalize active learning for DPO as choosing a subset of data points such that the error in the DPO logits, the log odds of preferring one response to the other, is minimized (Section 3).
2. This is the first work that derives a D-optimal design for DPO (Section 4). The key idea is to assume log-linear policies, which linearize the DPO objective at the last layer of the neural network policy representation. The derived D-optimal design resembles that of logistic regression, with additional terms due to the reference policy and regularization by it. We propose two computationally efficient algorithms that select the most informative data points for DPO: one elicits preferential feedback online and the other leverages previously logged preferential feedback to compute a better design.
3. We analyze both algorithms and show that their logit errors decrease with the budget on preferential human feedback and scale with the number of features in the linearized DPO policies. This is the first analysis for DPO and it has several novel technical aspects. The main technical trick is relating the feedback model and the policy parameter under the assumption of log-linear policies, which lets us argue for concentration of the policy parameter with more feedback. The analysis is also under the practical assumption that preferential feedback can be elicited at most once per prompt. To attain a decreasing rate in this setting, we introduce a novel assumption on the sufficient diversity of prompts and candidate responses.
4. We evaluate both algorithms empirically. We experiment with log-linear DPO policies, which match our theory, and with LLMs. Our methods perform well empirically, despite being the first with an analysis for active learning in DPO.
The paper is structured as follows. In Section 2, we introduce classic methods for training LLMs. In Section 3, we introduce active learning for DPO. We introduce our algorithms in Section 4 and analyze them in Section 5. In Section 6, we evaluate our algorithms empirically. We review related work in detail in Appendix C and conclude in Section 7.
2 Background
We start by introducing our notation. The prompt $x \in \mathcal{X}$ is a string, where $\mathcal{X}$ is the space of all strings. The response $y$ is also a string. A large language model (LLM) is a policy that maps prompts $x$ to responses $y$. We denote the probability of generating response $y$ to prompt $x$ by a policy parameterized by $\theta \in \Theta$ by $\pi_\theta(y \mid x)$, where $\Theta$ is the space of policy parameters. To simplify terminology, we call $\theta$ a policy when it is clear that we refer to $\pi_\theta$. Pre-trained LLMs can be optimized by supervised fine-tuning (Mangrulkar et al., 2022; Hu et al., 2022) and reinforcement learning from human feedback, which may require learning of a reward model (Ouyang et al., 2022) or not (Rafailov et al., 2023). These methods are introduced next.
2.1 Supervised Fine-Tuning
Supervised fine-tuning (SFT) (Mangrulkar et al., 2022; Hu et al., 2022) is a direct application of supervised learning to LLMs. The objective of SFT is to minimize the negative log-likelihood (loglik) of response $y$ given prompt $x$,
\[-\log \pi_\theta(y \mid x)\,, \tag{1}\]
in expectation over prompt-response pairs sampled from a training set. One limitation of SFT is that we learn only from positive examples. Therefore, it is hard to learn not to generate certain given . This motivates learning of policies through rewards in Section 2.2.
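To make (1) concrete, the following is a minimal sketch of the per-example SFT loss, assuming the policy exposes token-level log-probabilities. The function and variable names are illustrative and not taken from the paper.

```python
import math

def sft_loss(token_logprobs):
    """Negative log-likelihood of a response given a prompt, as in (1).

    token_logprobs: per-token values of log pi_theta(token_t | x, tokens_<t),
    whose sum is log pi_theta(y | x) for the whole response y.
    """
    return -sum(token_logprobs)

# Example: a three-token response.
print(sft_loss([math.log(0.5), math.log(0.25), math.log(0.9)]))
```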
2.2 Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) has two stages: reward model learning and policy optimization. The reward model is learned from human feedback (Ouyang et al., 2022). The LLM policy is then optimized to maximize the expected reward under the reward model using proximal policy optimization (PPO) (Schulman et al., 2017). The objective is
\[\max_{\theta} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\,, \tag{2}\]
where $x$ is a prompt sampled from a training set. The first term is the expected reward of responses generated for prompt $x$. The second term penalizes deviations of policy $\pi_\theta$ from a reference policy $\pi_{\mathrm{ref}}$, usually obtained by SFT (Section 2.1). The regularization is needed because the reward model is usually learned from data collected by $\pi_{\mathrm{ref}}$ and thus cannot estimate the value of significantly different policies well. The parameter $\beta > 0$ trades off the two terms. We define the optimal RLHF policy as the maximizer $\pi_*$ of (2).
2.3 Direct Preference Optimization
Direct preference optimization (DPO) (Rafailov et al., 2023) recasts RLHF as follows. Under the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 2005) of human feedback, a response $y_1$ with reward $r(x, y_1)$ is preferred to a response $y_2$ with reward $r(x, y_2)$ with probability
\[p(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)\,,\]
where $\sigma(v) = 1 / (1 + \exp(-v))$ is the sigmoid function. The key observation in DPO is that the policy that maximizes (2) has a closed form
\[\pi_*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\big(r(x, y) / \beta\big)\,,\]
where $Z(x)$ is the normalizer (Rafailov et al., 2023). This holds for any prompt $x$ and response $y$, under the assumption that the space of optimized policies can represent each conditional distribution exactly. This can be rearranged as $r(x, y) = \beta \log \frac{\pi_*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$, and thus
\[p(y_1 \succ y_2 \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\right) \tag{3}\]
holds when $\pi_\theta = \pi_*$. A nice property of this substitution is that the normalizers $Z(x)$, which are difficult to estimate when the space of responses is infinite, cancel out.
Therefore, instead of learning a reward model and optimizing (2), we can directly optimize the policy in (3). Specifically, the preferential feedback is a random variable that indicates whether $y_1$ is preferred to $y_2$ given $x$ or vice versa. This problem can be viewed as fitting (3) to the distribution of this feedback and written as minimizing the negative loglik
(4)
where the expectation is over prompt-candidate response pairs sampled from a training set, and stochastic preferential feedback . We define the optimal DPO policy as
(5)
and note that it is the maximum likelihood estimate (MLE) for (4). Note that (4) is equivalent to the more classic DPO objective stated in terms of the winning and the losing response. We use the reparameterized objective in (4) because it clearly separates the random preferential feedback from the rest of the objective.
We also note that (3) can be rewritten so that the reference policy enters only through an additive bias in the logit,
which depends on the reference policy but not on the optimized policy. We use this algebraic form because it separates the optimized part of the objective from what are essentially constants.
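To illustrate (3) and (4), the sketch below computes the preference probability and a per-example DPO loss from sequence log-probabilities under the optimized and reference policies. It is a minimal sketch under the standard DPO parameterization; the 0/1 coding of the feedback and names such as `dpo_logit` are our illustrative choices, not the paper's notation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dpo_logit(logp_1, logp_2, ref_logp_1, ref_logp_2, beta):
    """The argument of the sigmoid in (3): beta times the difference of log-ratios."""
    return beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))

def dpo_nll(logp_1, logp_2, ref_logp_1, ref_logp_2, beta, z):
    """Per-example DPO negative loglik; z = 1 if the first response is preferred, else 0."""
    p = sigmoid(dpo_logit(logp_1, logp_2, ref_logp_1, ref_logp_2, beta))
    return -(z * math.log(p) + (1 - z) * math.log(1.0 - p))

# Example: the optimized policy favors the first response more than the reference does.
print(dpo_nll(logp_1=-3.0, logp_2=-4.0, ref_logp_1=-3.5, ref_logp_2=-3.5, beta=0.1, z=1))
```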
3 Setting
We study active learning in DPO (Section 2.3). Simply put, instead of assuming that (4) is approximated using a fixed dataset, we choose the dataset actively with the objective of learning policies that are close to the optimal DPO policy in (5). We study two variants of this problem, offline and online, which we present next.
Offline feedback. The input to this setting is a dataset with preferential human feedback for all data points. Each data point consists of a prompt, two candidate responses, and the preferential feedback, which indicates which of the two responses is preferred. Our goal is to select a subset of the dataset of a given size so that the DPO policy learned on this subset is close to the optimal DPO policy. This setting is motivated by computational efficiency. In particular, even if preferential feedback is known, we may not have computational resources to learn from all of it. Choosing the most informative subset of a fixed size is a natural way of maximizing the information gain within the computational cost constraint.
Online feedback. The input to this setting is a dataset without preferential human feedback. Each data point consists of a prompt and two candidate responses. The human feedback is elicited online. This setting is motivated by statistical efficiency. We want to collect the most informative feedback using only the prompts and their candidate responses.
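The two settings differ only in whether preferential feedback is logged. A minimal sketch of the two input formats, with illustrative field names and an illustrative 1/0 coding of the feedback:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceExample:
    """One data point: a prompt with two candidate responses.

    feedback is 1 if the first response is preferred, 0 if the second one is,
    and None in the online setting, where it is elicited only for chosen points.
    """
    prompt: str
    response_1: str
    response_2: str
    feedback: Optional[int] = None

offline_point = PreferenceExample("Explain DPO.", "Answer A", "Answer B", feedback=1)
online_point = PreferenceExample("Explain DPO.", "Answer A", "Answer B")  # feedback elicited later
```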
Let be a subset of data point indices from , either collected online or offline. After the algorithm selects , we minimize an empirical approximation to (4) on . Before we define it, we introduce a more compact notation. Let
be the probability that response is preferred to given under policy , where is the bias due to the reference policy . Let
(6)
be the DPO negative loglik on . Then (4) can be approximated on by . We propose algorithms for choosing in Section 4.
Objective. Now we are ready to state our objective. Let be the optimal DPO policy in (5). Let
(7)
be the maximum logit error under a policy, the maximum difference of the DPO logits under that policy and the optimal DPO policy. Note that the biases cancel. We also consider the optimal DPO policy on the chosen subset, which is the policy that our algorithms learn. We want it to be close to the optimal DPO policy on the full dataset in terms of (7). Specifically, we want the error to decrease with the budget with a high probability. The motivation for (7) is that it can bound many other errors. For instance, since the sigmoid is 1/4-Lipschitz, the gap between the preference probabilities under two policies is at most a quarter of the corresponding logit gap.
Therefore, when the maximum logit error is small, the estimated probability that is preferred to under policy , for any data point , is close to that under .
4 Algorithms
The key idea in our paper is to linearize the policy at the last layer of its neural network representation and use linear algebra for active learning. Active learning on linearized neural networks was popularized in regret minimization by Riquelme et al. (2018). Das et al. (2024); Mukherjee et al. (2024); Thekumparampil et al. (2024); Liu et al. (2024); Scheid et al. (2024) applied it recently to learning reward models. In our work, we linearize policies and formalize it as follows.
Assumption 1.
All policies are log-linear,
\[\pi_\theta(y \mid x) = \frac{\exp\big(\phi(x, y)^\top \theta\big)}{\sum_{y'} \exp\big(\phi(x, y')^\top \theta\big)}\,, \tag{8}\]
where $\phi(x, y)$ is the feature vector for the pair $(x, y)$ and $\theta$ is a policy parameter.
We make this assumption for the rest of the paper. Under this assumption, in (6) becomes
(9)
where is the difference of the feature vectors of responses and given . We note that the normalizers of cancel out. We also note that when (9) is substituted into (6), we obtain a similar expression to the negative loglik of logistic regression, except for the bias and . The key idea in our algorithms is to optimize the Hessian of the DPO negative loglik.
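Under Assumption 1, each data point therefore contributes a logistic-regression-like term with a fixed offset due to the reference policy. A minimal NumPy sketch of this per-example probability; the names of the feature difference, bias, and regularizer are our illustrative choices.

```python
import numpy as np

def preference_prob(theta, phi_diff, bias, beta):
    """sigma(beta * phi_diff . theta + bias): probability that the first response is preferred.

    phi_diff is the difference of the feature vectors of the two responses and
    bias collects the reference-policy terms, which do not depend on theta.
    """
    logit = beta * phi_diff @ theta + bias
    return 1.0 / (1.0 + np.exp(-logit))

theta = np.zeros(4)
phi_diff = np.array([0.2, -0.1, 0.4, 0.0])
print(preference_prob(theta, phi_diff, bias=0.3, beta=0.1))
```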
Lemma 1.
Let be a log-linear policy. Then the Hessian of in (6) with respect to is
It is also positive semi-definite.
Proof.
The proof is in Section A.1. ∎
The Hessian can be used to derive the covariance matrix of the MLE of and is also known as the Fisher information matrix (Fisher, 1922). Therefore, it can be used for both uncertainty quantification and information gathering (Lattimore and Szepesvari, 2019). Since the MLE of is a policy, we can use the Hessian to select a subset of data points to learn better policies.
Specifically, let be a subset of data point indices and be the corresponding MLE. We show in Theorem 2 that the error in the logit estimate at data point is bounded with a high probability as
up to logarithmic factors. To minimize it, we want to maximize all eigenvalues of . We achieve this by maximizing over .
This optimization problem is challenging for two reasons. First, it is a discrete optimization problem over subsets of data points. In our work, we maximize the objective greedily. An informal justification for this approach is that the log-determinant objective is monotone and concave, and thus a greedy algorithm should be near optimal (Nemhauser et al., 1978). We prove this formally in Section 5. Second, the optimal policy parameter is unknown. We overcome this by using its plug-in estimates (Stufken and Yang, 2012).
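The greedy step is cheap to evaluate because adding one rank-one term to the design matrix changes its log-determinant in closed form. With notation chosen here for illustration, writing $H$ for the current design matrix, $w$ for the per-example weight, and $v$ for the feature difference of a candidate data point, the matrix determinant lemma gives
\[\log\det\big(H + w\, v v^\top\big) = \log\det(H) + \log\big(1 + w\, v^\top H^{-1} v\big)\,,\]
so maximizing the log-determinant over candidate data points is equivalent to maximizing the quadratic form $w\, v^\top H^{-1} v$, which is the incremental-gain form discussed around (10).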
4.1 Active DPO with Online Preferential Feedback
Our first algorithm does not have access to any preferential feedback initially. It collects it online, re-estimates , and approximately maximizes .
The pseudo-code of the algorithm is in Algorithm 1 and we call it active DPO. The algorithm chooses data points in rounds. The indices of the data points chosen in the first rounds, together with the corresponding Hessian, are maintained throughout. We refer to the Hessian as the design matrix since it is used to select the next data points. The design matrix is initialized to a scaled identity matrix, where the scaling constant guarantees that all quantities are well defined. In each round, the algorithm selects the index that greedily maximizes the information gain given the current design matrix and the empirical estimate of the policy parameter (line 6). This is because
the maximized quantity can be viewed as the incremental gain due to the chosen data point in Lemma 1. After the data point is chosen, we observe preferential feedback on it (line 7) and update all statistics (lines 8-9). Finally, after all rounds, the algorithm outputs the chosen indices (line 10) and an LLM policy is optimized on them using DPO.
The time complexity of is . The former term is due to training on all past feedback in each round (line 4) and the latter is due to maximizing exactly in line 6. In experiments, we reduce the former to by estimating only a logarithmic number of times, when for some integer . We reduce the latter to by replacing with its random subset of a fixed size . Finally, note that in line can be equivalently expressed (Section A.3) as
(10)
Therefore, the determinant does not need to be computed. The inverse can be computed incrementally using the Sherman-Morrison formula, whose update time is quadratic in the number of features. The statistical efficiency of the algorithm is analyzed in Section 5.
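The sketch below illustrates one possible implementation of the greedy online selection loop described above: in each round it scores the remaining data points by their information gain under the current design matrix, elicits feedback for the winner, and updates the inverse design matrix with the Sherman-Morrison formula. This is a simplified sketch rather than the paper's reference code; the plug-in weight is the sigmoid derivative at the current parameter estimate, policy re-fitting is abstracted behind `fit_policy`, and all names are our own.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def greedy_active_dpo(phi, bias, beta, budget, elicit_feedback, fit_policy,
                      gamma=1.0, theta_init=None):
    """Select `budget` data points for DPO by greedily maximizing information gain.

    phi: (N, d) array of feature differences between the two responses of each point.
    bias: (N,) array of reference-policy offsets.
    elicit_feedback(i) -> 0/1 preferential feedback for data point i.
    fit_policy(chosen, feedback) -> current parameter estimate of shape (d,).
    """
    n, d = phi.shape
    design_inv = np.eye(d) / gamma  # inverse of the regularized design matrix
    theta = np.zeros(d) if theta_init is None else np.asarray(theta_init, dtype=float)
    chosen, feedback = [], []
    remaining = set(range(n))
    for _ in range(budget):
        # Plug-in weights: derivative of the sigmoid at the current estimate.
        probs = sigmoid(beta * phi @ theta + bias)
        weights = probs * (1.0 - probs) * beta ** 2
        # Incremental information gain of each remaining point (matrix determinant lemma).
        gains = {i: weights[i] * phi[i] @ design_inv @ phi[i] for i in remaining}
        best = max(gains, key=gains.get)
        chosen.append(best)
        feedback.append(elicit_feedback(best))
        remaining.discard(best)
        # Sherman-Morrison rank-one update of the inverse design matrix.
        v = design_inv @ phi[best]
        design_inv -= weights[best] * np.outer(v, v) / (1.0 + weights[best] * phi[best] @ v)
        theta = fit_policy(chosen, feedback)
    return chosen, feedback

# Usage with simulated feedback and a placeholder policy fit.
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 8))
idx, fb = greedy_active_dpo(
    phi, bias=np.zeros(100), beta=0.1, budget=10,
    elicit_feedback=lambda i: int(rng.random() < 0.5),
    fit_policy=lambda chosen, feedback: np.zeros(8),
)
print(idx)
```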
4.2 Active DPO with Offline Preferential Feedback
Our second algorithm has access to preferential feedback initially. All feedback is used to estimate , which is then used to approximately maximize .
The pseudo-code of our algorithm is in Algorithm 2; its name indicates that it has access to more information than the online algorithm. It differs from the online algorithm in two steps. First, the policy parameter is estimated initially (line 3) from all preferential feedback. Second, no preferential feedback is collected online. Similarly to the online algorithm, the time complexity is dominated by the exact maximization in line 7. We reduce it in experiments as in Section 4.1.
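The offline variant changes only two things relative to the previous sketch: the plug-in parameter estimate is fitted once from all logged feedback before selection starts and is then kept fixed, and no feedback is elicited during the loop. A hedged sketch of the difference, reusing the illustrative names from the previous snippet:

```python
def greedy_active_dpo_offline(phi, bias, beta, budget, logged_feedback, fit_policy, gamma=1.0):
    """Offline variant: the plug-in estimate comes from all logged feedback and is not updated."""
    theta = fit_policy(list(range(len(phi))), list(logged_feedback))
    chosen, _ = greedy_active_dpo(
        phi, bias, beta, budget,
        elicit_feedback=lambda i: logged_feedback[i],  # feedback is already logged
        fit_policy=lambda chosen, feedback: theta,     # no re-estimation during selection
        gamma=gamma,
        theta_init=theta,
    )
    return chosen
```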
5 Analysis
In this section, we provide a unified analysis for both of our algorithms. This is possible because they only differ in how the instance-specific factors in the design matrix are estimated. In the offline algorithm, they are estimated from all preferential feedback; in the online algorithm, only the feedback elicited up to the current round is used. We state our assumptions first.
We assume that all policies are log-linear (Assumption 1) and that the collected feedback is conditionally independent given all feedback up to round , for all . Under this assumption, the negative loglik in (6) is similar to that of logistic regression and we can use existing concentration inequalities (Abbasi-Yadkori et al., 2011).
Assumption 2.
[Boundedness] For any , and . We assume that is a unit sphere, and hence and .
Assumptions on feature vectors, comprising and , are standard in the analyses of generalized linear models (Li et al., 2017; Kveton et al., 2020; Mukherjee et al., 2024). Our assumption on and can be guaranteed by applying DPO to a unit sphere . The assumption can be weakened to using initial exploration (Li et al., 2017; Kveton et al., 2020).
We can analyze and in a unified way because the instance-specific factors in their design matrices can be bounded from below by and above by .
Assumption 3.
[Design matrix] For any and , we have .
These constants obviously exist and can be easily derived. For instance, since , we get . Moreover, under Assumption 2, we have for any that
The argument for is similar. The constants and appear in our bounds.
The last assumption is that the dataset is sufficiently diverse.
Assumption 4.
[Diverse dataset] There exists a constant such that holds for any and .
This assumption says that the maximizer in (10) is an approximate upper bound, up to a multiplicative , on the information gain at each data point, including those previously chosen that cannot be chosen again. We note that the assumption holds for when repeated independent observations of the data points are allowed, as in all prior works (Appendix C). In this case, the maximization in (10) would be over .
5.1 Main Result
We state our main claim below.
Theorem 2.
Let . Then the maximum logit error under and is
with probability at least , where hides all logarithmic factors but those in .
We prove the claim as follows. For log-linear policies, (7) reduces to . By the Cauchy-Schwarz inequality, for any data point ,
(11)
where the weighting matrix is a regularized Hessian at the optimal DPO policy. To bound the first term, we note that the feedback at any data point is distributed as
(12)
This assumption follows from the definition of DPO in (3), which says that is the probability that response is preferred to given . Thus we can build on existing concentration results for sub-Gaussian random variables to prove the following.
Theorem 3.
For any set of indices ,
holds with probability at least .
To bound the second term in (11), we use the fact that the standard errors of the logit estimates do not increase over time and decrease at a desired rate if Assumption 4 holds for some constant .
Theorem 4.
For any data point ,
All proofs are in Appendix A.
5.2 Discussion
The bound in Theorem 2 is and holds with probability at least . As a result, the maximum logit error decreases with more feedback and increases with the number of learned policy parameters . The bound is not directly comparable to prior works in Appendix C because they bound reward model errors, while we bound a policy learning error. That being said, the dependence on and is similar. The linear dependence on arises because Theorem 4 is proved through a self-normalizing bound in Theorem 3 that would apply even to infinitely-large datasets. We would get an bound, where is the dataset size, if we followed the analysis of Kveton et al. (2020) and applied a union bound over all data points.
6 Experiments


We experiment with both log-linear (Section 6.1) and LLM (Section 6.2) policies. The log-linear experiments validate that and work as analyzed. The LLM experiments show that and perform well in practice when applied to LLMs. We conduct more experiments with log-linear policies in Appendix B.
6.1 Log-Linear Policies
This experiment is designed as follows. First, we take an existing multi-class classification dataset and turn it into a preferential feedback dataset. More specifically, we choose a random positive label and generate vectors that are differences of feature vectors of random positive and negative examples. Second, we label all of these vectors as positive and learn a logistic regression model to simulate preferential feedback. Third, we generate preferential feedback for all data points from the learned model. Fourth, we generate a reference policy by perturbing the learned model parameter using its estimated covariance, and set the bias accordingly. Simply put, the reference policy is close to the learned model, as measured by the uncertainty of its estimate. Finally, we compute the optimal DPO policy on the full dataset. All compared methods apply DPO to their selected subset of the dataset and learn a policy.
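A condensed sketch of this simulation pipeline, using scikit-learn's logistic regression as the preference simulator. The specific choices below, such as the feature dimension, the symmetrized fit, the Gaussian perturbation of the reference policy, and the form of the bias, are our illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, beta = 2000, 32, 0.1

# Step 1: feature differences between random positive and negative examples.
pos = rng.normal(loc=0.3, size=(n, d))  # stand-ins for embeddings of positive examples
neg = rng.normal(loc=0.0, size=(n, d))  # stand-ins for embeddings of negative examples
phi = pos - neg

# Step 2: fit a logistic regression model to simulate preferential feedback.
# All phi are labeled positive; we add their negations as negatives so the fit is well posed.
X = np.vstack([phi, -phi])
y = np.concatenate([np.ones(n), np.zeros(n)])
theta_sim = LogisticRegression().fit(X, y).coef_.ravel()

# Step 3: sample preferential feedback for every data point.
probs = 1.0 / (1.0 + np.exp(-phi @ theta_sim))
feedback = (rng.random(n) < probs).astype(int)

# Step 4: a reference policy near the simulator and the resulting bias terms.
theta_ref = theta_sim + 0.1 * rng.normal(size=d)
bias = -beta * phi @ theta_ref  # offset due to the reference policy in the DPO logit
```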
We compare the learned policy to the optimal DPO policy in three metrics. The first metric is the maximum logit error, which we bound in Theorem 2. The second metric is the mean logit error. Although we do not analyze it, our methods minimize it indirectly through the maximum error. The last metric is the error rate,
which is the fraction of responses ordered incorrectly by the learned policy when the optimal DPO policy is the ground truth.
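A sketch of the three evaluation metrics under Assumption 1, where the DPO logit of a data point is the scaled inner product of its feature difference with the policy parameter plus a shared bias; the bias cancels in the logit differences but matters for how each policy orders the responses. Names are illustrative.

```python
import numpy as np

def evaluation_metrics(phi, bias, theta_hat, theta_star, beta):
    """Maximum logit error, mean logit error, and error rate of the learned policy."""
    logits_hat = beta * phi @ theta_hat + bias
    logits_star = beta * phi @ theta_star + bias
    diff = np.abs(logits_hat - logits_star)  # the shared bias cancels here
    max_logit_error = diff.max()
    mean_logit_error = diff.mean()
    # Fraction of data points where the two policies order the responses differently.
    error_rate = np.mean((logits_hat > 0) != (logits_star > 0))
    return max_logit_error, mean_logit_error, error_rate
```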


We compare five algorithms. The first two are our online and offline algorithms; we expect the offline one to perform better because it has access to more information. We consider three baselines. The first baseline selects data points uniformly at random. While simple, it is known to be competitive in real-world problems where feature vectors may cover the feature space close to uniformly (Ash et al., 2020, 2021; Mukherjee et al., 2024; Muldrew et al., 2024). The second baseline is the practical incremental D-optimal design for linear models proposed by Das et al. (2024). The main difference from our methods is that it neglects the logistic model factors in the design matrix (Lemma 1). Therefore, while it selects diverse feature vectors, they do not necessarily maximize the information gain in DPO. The last baseline is that of Muldrew et al. (2024), which selects data points with the highest differences between the estimated rewards of their responses.
We experiment with the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). The features are a random subset of ResNet-50 embeddings (He et al., 2016). We fix the DPO regularizer and experiment with other values in Appendix B. Our CIFAR-10 results are reported in the first row of Figure 1. The offline algorithm is the best-performing method in all metrics, and many of the improvements are major. The online algorithm is the second-best method in the maximum logit error and is never worse than the baselines. It improves over all baselines in all metrics at larger sample sizes. Our CIFAR-100 results are reported in the second row of Figure 1 and we observe the same trends as on the CIFAR-10 dataset.
6.2 LLM Policies
We also experiment with a real-world preference dataset Nectar (Zhu et al., 2023) and two LLM policies: Llama-3.2 (3B parameters) (Dubey et al., 2024) and Phi-3 (Abdin et al., 2024). We sample prompts from the dataset, each with two responses. The accepted and rejected responses are determined based on the ground truth in the dataset. The feature vector is the embedding of the concatenated prompt and response from the last hidden layer of the LLM, of size . The bias term is , where is the initial LLM reference policy.
We report three metrics. The accuracy measures how well we distinguish between positive and negative responses. This metric is one minus the error rate in Figure 1 and thus identical, up to how we plot it. We could not use the two other metrics from Figure 1 because they require knowing the optimal DPO policy. Therefore, we decided to plot two other metrics that reflect the confidence in distinguishing the responses. The margin is the advantage of a positive response over a negative one. The negative loglik is the logistic regression loss of the preferential feedback.
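A sketch of the three reported quantities, computed from the DPO logit of the accepted response over the rejected one for each evaluated pair (positive means the accepted response is preferred). Treating the margin as the average logit and the negative loglik as the logistic loss of always preferring the accepted response are our illustrative readings of the metrics.

```python
import numpy as np

def llm_metrics(logits):
    """logits[i]: DPO logit of the accepted response over the rejected one for pair i."""
    logits = np.asarray(logits, dtype=float)
    accuracy = np.mean(logits > 0)              # accepted response ranked first
    margin = np.mean(logits)                    # average advantage of the accepted response
    nll = np.mean(np.logaddexp(0.0, -logits))   # -log sigmoid(logit), averaged over pairs
    return accuracy, margin, nll

print(llm_metrics([2.1, -0.3, 0.8]))
```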
Our results with the Llama-3.2 and Phi-3 models are reported in Figure 2. We observe similar trends to Figure 1. The offline algorithm is clearly the best-performing method in both the margin and the negative loglik. The online algorithm is among the best three methods at larger sample sizes. The least clear trend is in accuracy. We believe that this is because many responses are of a similar quality. Therefore, they cannot be easily distinguished and lie close to the decision boundary, which can be affected by even minor changes in the LLM.
7 Conclusions
We propose an active learning framework for DPO. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We propose two algorithms. One is for the online setting, where the human feedback is elicited online, and the other is for the offline setting, where the feedback has already been collected and we choose its subset to improve the computation efficiency of DPO. We analyze both algorithms and also evaluate them empirically, in the setting that matches our theory and on LLMs.
This is the first work that applies optimal designs to DPO. The main difference from prior works is that the optimal design is applied to policy optimization. A natural direction for future work is other policy optimization frameworks, such as KTO (Ethayarajh et al., 2024). Our analysis could also be improved in several aspects. For instance, it is for log-linear policies, and we have not derived an upper bound on the constant in Assumption 4. In the setting of prior works, where multiple independent observations of preferential feedback for the same prompt are possible, this constant equals one.
References
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
- Abdin et al. [2024] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- Ash et al. [2020] Jordan Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations, 2020.
- Ash et al. [2021] Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Gone fishing: Neural active learning with Fisher embeddings. In Advances in Neural Information Processing Systems 34, 2021.
- Audibert et al. [2010] Jean-Yves Audibert, Sebastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 41–53, 2010.
- Azizi et al. [2022] Mohammad Javad Azizi, Branislav Kveton, and Mohammad Ghavamzadeh. Fixed-budget best-arm identification in structured bandits. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, 2022.
- Bayer and Reuter [2024] Markus Bayer and Christian Reuter. ActiveLLM: Large language model-based active learning for textual few-shot scenarios. arXiv preprint arXiv:2405.10808, 2024.
- Bishop [2006] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
- Bradley and Terry [1952] Ralph Allan Bradley and Milton Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952.
- Bubeck et al. [2009] Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 23–37, 2009.
- Chen et al. [2024] Yifang Chen, Shuohang Wang, Ziyi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, and Yelong Shen. Cost-effective proxy reward model construction with on-policy and active learning. arXiv preprint arXiv:2407.02119, 2024.
- Christiano et al. [2017] Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30, 2017.
- Das et al. [2024] Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimization for sample efficient RLHF. CoRR, abs/2402.10500, 2024. URL https://arxiv.org/abs/2402.10500.
- Doucet et al. [2024] Paul Doucet, Benjamin Estermann, Till Aczel, and Roger Wattenhofer. Bridging diversity and uncertainty in active learning with self-supervised pre-training. arXiv preprint arXiv:2403.03728, 2024.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Ethayarajh et al. [2024] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. In Proceedings of the 41st International Conference on Machine Learning, 2024.
- Fisher [1922] Ronald Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London: Series A, 222:309–368, 1922.
- Guo et al. [2024] Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Hu et al. [2022] Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022.
- Ji et al. [2024] Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401, 2024.
- Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Kveton et al. [2015] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- Kveton et al. [2020] Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
- Lagree et al. [2016] Paul Lagree, Claire Vernade, and Olivier Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, pages 1597–1605, 2016.
- Lattimore and Szepesvari [2019] Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019.
- Li et al. [2017] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 2071–2080, 2017.
- Li et al. [2016] Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016.
- Lightman et al. [2024] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations, 2024.
- Liu et al. [2024] Pangpang Liu, Chengchun Shi, and Will Wei Sun. Dual active learning for reinforcement learning from human feedback. CoRR, abs/2410.02504, 2024. URL https://arxiv.org/abs/2410.02504.
- Luce [2005] Robert Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Dover Publications, 2005.
- Mangrulkar et al. [2022] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Margatina et al. [2023] Katerina Margatina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu. Active learning principles for in-context learning with large language models. arXiv preprint arXiv:2305.14264, 2023.
- Mehta et al. [2023] Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. CoRR, abs/2312.00267, 2023. URL https://arxiv.org/abs/2312.00267.
- Mukherjee et al. [2024] Subhojyoti Mukherjee, Anusha Lalitha, Kousha Kalantari, Aniket Deshmukh, Ge Liu, Yifei Ma, and Branislav Kveton. Optimal design for human preference elicitation. In Advances in Neural Information Processing Systems 37, 2024.
- Muldrew et al. [2024] William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. arXiv preprint arXiv:2402.08114, 2024.
- Nemhauser et al. [1978] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35, 2022.
- Plackett [1975] Robin Lewis Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
- Pukelsheim [2006] Friedrich Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006.
- Radlinski et al. [2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36, 2023.
- Riquelme et al. [2018] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- Scheid et al. [2024] Antoine Scheid, Etienne Boursier, Alain Durmus, Michael Jordan, Pierre Menard, Eric Moulines, and Michal Valko. Optimal design for reward modeling in RLHF. CoRR, abs/2410.17055, 2024. URL https://arxiv.org/abs/2410.17055.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.
- Stufken and Yang [2012] John Stufken and Min Yang. Optimal designs for generalized linear models. In Design and Analysis of Experiments, pages 137–164. John Wiley & Sons, 2012.
- Sutton and Barto [1998] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- Thekumparampil et al. [2024] Kiran Thekumparampil, Gaurush Hiranandani, Kousha Kalantari, Shoham Sabach, and Branislav Kveton. Comparing few to rank many: Active human preference learning using randomized Frank-Wolfe. CoRR, abs/2412.19396, 2024. URL https://arxiv.org/abs/2412.19396.
- Wang et al. [2024] Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, and Dianhui Chu. A survey on data selection for llm instruction tuning. arXiv preprint arXiv:2402.05123, 2024.
- Yang and Tan [2022] Junwen Yang and Vincent Tan. Minimax optimal fixed-budget best arm identification in linear bandits. In Advances in Neural Information Processing Systems 35, 2022.
- Zhang et al. [2022] Yiming Zhang, Shi Feng, and Chenhao Tan. Active example selection for in-context learning. arXiv preprint arXiv:2211.04486, 2022.
- Zhu et al. [2023] Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF, November 2023.
- Zong et al. [2016] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
Appendix A Proofs and Supporting Lemmas
This section contains proofs of our main claims and supporting lemmas.
A.1 Proof of Lemma 1
Let and . Then
We start with computing the gradient of (6),
It follows that the Hessian is
The term is an outer product, which is positive semi-definite. Because , the Hessian is a weighted sum of positive semi-definite matrices, and thus a positive semi-definite matrix.
A.2 Proof of Theorem 3
Let . We start by noting that is a positive semi-definite matrix (Lemma 1). Therefore, is strongly convex in and
holds. Now we use that and that , rearrange the inequality, and get
Then we apply the Cauchy–Schwarz inequality to the right-hand side and get
Now we divide both sides by and get
The last inequality follows from
which is proved using , and that .
Therefore, to bound , it suffices to show that is small with a high probability. We show this next. We start by recalling from Lemma 1 that
where is a binary random variable with mean , as described in (12). Let . Since
we get
for . Finally, since are conditionally independent given the history and their variance proxy is , we can use Theorem 1 of Abbasi-Yadkori et al. [2011] and get that
holds with probability at least . Finally, we collect all inequalities and get that
holds with probability at least .
A.3 Proof of Theorem 4
First, we introduce , and note that in and can be redefined as
Now note that
because . Next we utilize the fact that the standard errors of the estimates decrease with more observations.
Lemma 5.
For any and ,
Proof.
The proof follows from the Sherman–Morrison formula. Specifically, since
we get for any vector . This completes the proof. ∎
Lemma 5 implies that
holds for any . This allows us to attribute the quality of the solution to individual greedy steps in and . The next step is to relate to . The key observation is that
The second equality holds because is fixed when is selected. The last equality holds because the logarithm is a monotone function. It follows that is the index of the feature vector with the maximum variance.
If the scope of the maximization was , the inequality would hold for any . Since the scope is , we make Assumption 4, which equates to assuming that are sufficiently diverse. We also use the following logarithmic transformation.
Lemma 6.
For any and ,
Proof.
We start with an upper bound on . By Weyl’s inequalities, we have
Thus, under the assumption that , we have . Now note that for ,
Finally, we set and , and get our claim. ∎
Now we apply Assumption 4 and Lemma 6, use the telescoping property of the sum, and get
where . Furthermore,
Finally, we combine all claims and get
This completes the proof.
Appendix B Ablation Study




In Section 6.1, we experiment with . There is nothing specific about this choice. In Figure 3, we report results for and observe improvements in both settings.
To increase the stability of our algorithms at small sample sizes, we replace with a high probability upper confidence bound (UCB). Let be the covariance matrix for . Then the UCB is computed as
(13)
for some . We set in Section 6. In Figure 4, we set and observe that this has no major impact on our trends as the number of data points increases.
Appendix C Related Work
The closest related works are on active learning with preferential feedback, and we review them first (Section C.1). Then we review active learning for fine-tuning (Section C.2) and other related works (Section C.3).
C.1 Active Learning for Preferential Feedback
Mehta et al. [2023] applied active learning to DPO in Section 5 of their paper. Their acquisition function is
where is the UCB and is the LCB of . The analysis is for dueling the UCB response with a random response. Their optimized metric is the maximum gap
(14)
where is the estimated reward model and is the best response given . They prove that the maximum gap is for sampling with replacement.
Das et al. [2024] proposed two algorithms for active RLHF. The acquisition function in APO is
where the design matrix is a logistic regression Hessian that is re-estimated in each round. They prove a bound on (14) for sampling with replacement. APO is not evaluated. This is the closest algorithm design to ours. The main difference is that we maximize the information gain (line 6) instead of their acquisition function. Das et al. [2024] also proposed a practical APO,
where is a linear regression Hessian in round . Practical APO is not analyzed. We use it as a baseline in Section 6.
Mukherjee et al. [2024] studied active learning with absolute and ranking feedback with responses. For , their algorithm Dope is , where is a distribution over prompts with responses obtained by the D-optimal design. They prove that
for sampling with replacement, where is the true model parameter and is its estimate from observations. Dope is evaluated on RLHF datasets. Thekumparampil et al. [2024] extended Mukherjee et al. [2024] to ranking items from responses.
Liu et al. [2024] extended APO of Das et al. [2024] to selecting both the prompt and teacher model. They prove that (14) is for sampling with replacement. The proposed algorithm is empirically evaluated.
Scheid et al. [2024] proposed offline and online algorithms for active learning of reward models in RLHF. The offline algorithm, which is in the same setting as our work, computes the D-optimal design, similarly to Mukherjee et al. [2024] for , and explores by sampling with replacement. They prove a bound on (14). The paper does not contain any experiments.
Ji et al. [2024] proposed two active learning algorithms: APPO and ADPO. APPO is a regret minimizing algorithm similar to those in dueling bandits. In round , APPO is given a prompt as an input and proposes two responses to duel. APPO is analyzed. ADPO is a heuristic that queries responses on prompts where the agent is uncertain. The response is uncertain if in the DPO objective is high.
Muldrew et al. [2024] proposed an active learning algorithm for DPO that repeatedly acquires labels and fine-tunes on them. The data are acquired in batches until a budget is met. The acquisition function is
where is the estimated reward model. We use it as a baseline in Section 6.
Guo et al. [2024] proposed online DPO from AI feedback. The key is to elicit AI feedback instead of human feedback and then use it in DPO. This is an empirical paper.
Chen et al. [2024] proposed active learning with coresets for reward models. They learn cluster centroids in the space of prompt embeddings that minimize the maximum distance of the prompt to its closest centroid. This is an empirical paper.
C.2 Active Learning for Fine-Tuning
There are many related works on active learning in LLMs [Margatina et al., 2023, Bayer and Reuter, 2024, Zhang et al., 2022]. A recent survey by Wang et al. [2024] categorizes existing methods for data selection in instruction tuning. Most of these methods rely on heuristic approaches, such as uncertainty sampling, clustering, or diversity-based strategies, which often lack theoretical grounding. Doucet et al. [2024] proposed a method that bridges diversity and uncertainty in active learning by leveraging self-supervised pre-training to address the cold-start problem and enhance data efficiency. However, these approaches do not align data selection directly with the task-specific objective, limiting their effectiveness in optimizing downstream performance. Zhang et al. [2022] used LLMs for selecting instances for in-context learning. More recently, Bayer and Reuter [2024] proposed ActiveLLM, which is a pool-based sampling method that leverages LLMs to select batches of instances for humans to label. Despite this fundamental difference, they also study two variants of their approach, one that incorporates feedback and another one that does not.
C.3 Multi-Armed Bandits
Our setting is also related to multi-armed bandits. Due to the budget , it is reminiscent of fixed-budget best arm identification (BAI) [Bubeck et al., 2009, Audibert et al., 2010, Azizi et al., 2022, Yang and Tan, 2022]. The main difference is that we do not want to identify the best arm. We want to get a good estimate for a set of arms, essentially pairs of items, in the worst case. Online learning to rank has also been studied extensively [Radlinski et al., 2008, Kveton et al., 2015, Zong et al., 2016, Li et al., 2016, Lagree et al., 2016]. We do not minimize cumulative regret or try to identify the best arm.