PluralLLM: Pluralistic Alignment in LLMs via Federated Learning
Abstract.
Ensuring Large Language Models (LLMs) align with diverse human preferences while preserving privacy and fairness remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized data collection, making them computationally expensive and privacy-invasive. We introduce PluralLLM (the code will be released), a federated learning-based approach that enables multiple user groups to collaboratively train a transformer-based preference predictor without sharing sensitive data, and which can also serve as a reward model for aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate preference updates efficiently, achieving faster convergence, improved alignment scores, and nearly the same group fairness as centralized training. Evaluated on a Q/A preference alignment task, PluralLLM demonstrates that federated preference learning offers a scalable and privacy-preserving alternative for aligning LLMs with diverse human values.
1. Introduction
LLMs have rapidly emerged as a cornerstone of modern artificial intelligence, powering applications ranging from conversational agents to content generation and decision support systems (Bubeck et al., 2023). Their ability to generate human-like text has revolutionized various industries, but their effectiveness depends on their ability to align with human values and societal expectations (Zhang et al., 2023). However, achieving robust human alignment remains a significant challenge, particularly in ensuring that these models fairly represent diverse perspectives, a concept known as Pluralistic Alignment (Sorensen et al., 2024).
Existing LLM alignment methods fall into two categories: prompt engineering and gradient-based alignment (Sahoo et al., 2024; Zhang et al., 2023). While prompt engineering guides model behavior through carefully crafted prompts and in-context examples without modifying model parameters, it often struggles with complex behaviors (Zhou et al., 2022). Gradient-based alignment fine-tunes models using reward mechanisms, such as Reinforcement Learning from Human Feedback (RLHF), improving traits like honesty and helpfulness but requiring extensive supervision and high computational costs (Ouyang et al., 2022). However, existing approaches do not scale efficiently for aligning LLMs across multiple user groups with limited supervision, making pluralistic alignment challenging (Chakraborty et al., 2024; Casper et al., 2023).
Despite recent advancements in alignment techniques, preference learning for LLMs faces three significant challenges: (1) privacy risks, (2) collection of preference data, and (3) computational overhead (Kuang et al., 2024; Casper et al., 2023). Centralized approaches, such as RLHF and gradient-based alignment, require collecting and processing user interactions, raising concerns about data security and user confidentiality. Federated Learning (FL), on the other hand, enables privacy-preserving model training by keeping data decentralized, but it incurs high communication and processing costs due to frequent updates between clients and the global server (Kuang et al., 2024). These challenges are even greater for pre-trained LLMs, which require significant computational resources to incorporate preference data. Aligning LLMs with FL adds further complexity, demanding a balance between alignment quality, efficiency, and privacy. In preference learning, this burden is particularly heavy, making efficient aggregation strategies or alternative models essential (Wu et al., 2024). In this paper, we address the following research question:
How can we align LLMs to capture the preferences and varied perspectives of different communities while preserving privacy, maintaining fairness, and ensuring computational efficiency?
We introduce PluralLLM, a framework for pluralistic alignment in LLMs via FL. Our approach leverages FL to train a transformer-based preference predictor (Zhao et al., 2023) that captures group-specific preferences in a distributed, privacy-preserving manner. This preference predictor serves as a lightweight alternative to conventional alignment methods that require training a reward model, as in RLHF, significantly reducing computational overhead while also adapting to new, unseen groups. Unlike the centralized Group Preference Optimization (GPO) training approach proposed by Zhao et al. (Zhao et al., 2023), our FL-based preference predictor ensures privacy-preserving preference learning by allowing different groups to collaboratively train the model without exposing their sensitive preference data. In addition, it enables diverse groups to participate in training while maintaining the fairness properties of the centralized approach. Our results demonstrate that PluralLLM achieves higher alignment scores and faster convergence than centralized methods, making it a scalable and efficient solution for capturing diverse group preferences.

2. Related Work
Prompt Engineering: Prompt engineering provides a mechanism for steering model outputs by modifying the input to the LLM, thereby aligning with user preferences without altering the parameters of the core model. Prompt engineering approaches are computationally efficient because they require no training (Sahoo et al., 2024). However, designing the prompt itself can be a laborious task that relies on heuristics, and the efficacy of these heuristics is not guaranteed to transfer across different LLMs (Zhao et al., 2023). Recent work has shown that prompt engineering has limited success in aligning LLMs to complex groups on challenging survey datasets such as GlobalOpinionQA (Durmus et al., 2023a).
Pluralistic Alignment in LLMs: A growing body of pluralistic alignment studies shows that it is important to design LLM systems that can accommodate and represent diverse human values, perspectives, and preferences. Unlike traditional alignment approaches that aim to align models to a single, averaged set of human preferences, pluralistic alignment seeks to reflect the complexity and plurality of human societies. For example, Cao et al. introduced an age fairness reward in LLMs to reduce response-quality disparities across distinct age groups during training (Cao et al., 2024). Traditional reinforcement learning alignment approaches, such as RLHF, often reinforce majority viewpoints while marginalizing minority perspectives. The question of balancing openness to diverse values with ethical constraints, such as the avoidance of harmful ideologies, remains largely unaddressed.
Group Preference Alignment: Group preference alignment refers to techniques designed to adapt LLM outputs to reflect the distinct preferences, values, or judgments of different groups or demographics. Group Preference Optimization (GPO) (Zhao et al., 2023) was introduced as a few-shot alignment framework that steers LLMs toward group-specific preferences. GPO augments the base LLM with an independent transformer module, trained via in-context supervised learning with only a handful of samples to predict group preferences and refine model outputs. This module acts as a preference model for different groups, learning distinct alignment patterns across diverse communities. By leveraging an in-context autoregressive transformer, GPO enables flexible and efficient alignment, allowing LLMs to adapt dynamically to varying user preferences.
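To make this design concrete, below is a minimal sketch of such an in-context preference predictor operating on pre-computed LLM embeddings. The module name, layer sizes, and scalar-preference output head are illustrative assumptions, not the released GPO architecture.

```python
import torch
import torch.nn as nn

class InContextPreferencePredictor(nn.Module):
    """Illustrative in-context preference predictor in the spirit of GPO.

    Context pairs (x_i, y_i) and target queries x_j are projected to tokens,
    a transformer encoder attends over all of them, and the target tokens
    are decoded into predicted preference scores.
    """

    def __init__(self, embed_dim: int, hidden_dim: int = 256, num_layers: int = 4):
        super().__init__()
        self.ctx_proj = nn.Linear(embed_dim + 1, hidden_dim)  # embedding + observed preference
        self.tgt_proj = nn.Linear(embed_dim, hidden_dim)      # embedding only
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, context_x, context_y, target_x):
        # context_x: (m, d), context_y: (m,), target_x: (n_target, d)
        ctx_tokens = self.ctx_proj(torch.cat([context_x, context_y.unsqueeze(-1)], dim=-1))
        tgt_tokens = self.tgt_proj(target_x)
        tokens = torch.cat([ctx_tokens, tgt_tokens], dim=0).unsqueeze(0)  # add batch dim
        encoded = self.encoder(tokens).squeeze(0)
        # Predicted preference probability for each target query.
        return torch.sigmoid(self.head(encoded[ctx_tokens.shape[0]:])).squeeze(-1)
```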
3. Pluralistic Alignment in LLMs via Federated Learning
We chose the Q/A preference alignment task, which involves aligning LLM responses based on group-specific preferences. This task is particularly well-suited for evaluating pluralistic alignment, as it requires the model to adapt to diverse user opinions while maintaining coherence and fairness. Unlike standard classification tasks, preference-based Q/A alignment provides a richer evaluation metric, allowing us to measure not only the correctness of responses but also how well the model captures nuanced group preferences. This setup also reflects real-world applications, where LLMs must personalize responses based on collective user preferences. We adopt an FL setup for pluralistic alignment in LLMs. The framework consists of three main actors, as described in Figure 1:
• Training Clients (Groups): The training set comprises distinct groups, each representing a client in the FL setup. Each client performs local training to develop a transformer-based preference model (Zhao et al., 2023). This model aims to learn group-specific preference patterns and generalize to unseen data. Each client trains its transformer independently using its respective group's preference dataset.
• Aggregation Server: A central server coordinates FL by collecting, aggregating, and redistributing model updates from training clients. This process enables learning from diverse data while preserving privacy and preventing direct data sharing.
• Evaluation Clients (Groups): A separate set of groups is introduced, where each group acts as a client in the FL setup to assess the alignment performance of the trained model. Unlike training clients, these groups do not participate in model updates. Instead, they represent new, unseen groups that serve as an independent benchmark to evaluate the generalizability of the trained model. Their feedback helps determine how well the aggregated model aligns with unseen groups.
3.1. Group Local Training
Each training group $g$ has a preference dataset $D_g = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = E(q_i \oplus r_i)$ is the LLM embedding of the concatenated prompt-response pair $(q_i, r_i)$ and $E$ is a language-model embedding function. The preference $y_i$ represents the group's preference probability for the generated response $r_i$ to the query $q_i$. For instance, in Figure 1, the query represents the question, while the response corresponds to the aggregated probability distribution of answers within a group. This preference data is processed by the Alpaca model (an LLM) to generate embeddings, following the approach in (Zhao et al., 2023).
The dataset of each group is randomly divided into context samples $(x^g_{1:m}, y^g_{1:m})$ and target samples $(x^g_{m+1:n}, y^g_{m+1:n})$. The model is trained to predict the target preferences given the context examples, optimizing the following loss function, as introduced in (Zhao et al., 2023):
(1) $\mathcal{L}(\theta) = \mathbb{E}_{g,\, m}\left[ -\log p_\theta\!\left( y^g_{m+1:n} \,\middle|\, x^g_{1:n},\, y^g_{1:m} \right) \right]$
where $p_\theta$ denotes the predicted preference distribution over the target points conditioned on the context points. At the end of local training, each client transmits its updated model parameters to the central aggregation server, which combines the received updates to train the global model.
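As a concrete illustration, the following is a minimal sketch of one local update under this scheme, assuming a predictor with the interface sketched in Section 2. The function name, keyword arguments, and the use of a soft binary cross-entropy as a stand-in for the negative log-likelihood in Equation 1 are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def local_training_step(preference_model, embeddings, preferences, num_context, optimizer):
    """One local update on a group's preference data (illustrative sketch).

    embeddings:  (n, d) LLM embeddings of concatenated prompt-response pairs
    preferences: (n,)   group preference probabilities for each pair
    """
    n = embeddings.shape[0]
    perm = torch.randperm(n)
    ctx, tgt = perm[:num_context], perm[num_context:]

    # Predict target preferences conditioned on the sampled context pairs.
    predicted = preference_model(
        context_x=embeddings[ctx], context_y=preferences[ctx], target_x=embeddings[tgt]
    )

    # Soft binary cross-entropy as a stand-in for the target NLL in Equation 1.
    loss = F.binary_cross_entropy(predicted, preferences[tgt])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```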
3.2. Model Aggregation
We employ the FedAvg technique (McMahan et al., 2017) to aggregate updates from multiple groups. The aggregation process is designed to minimize the global FL objective:
(2) $\min_{w} F(w) = \sum_{k=1}^{K} p_k F_k(w)$
where $p_k$ is the weight assigned to each training group $k$, defined as $p_k = |D_k| / \sum_{j=1}^{K} |D_j|$. Here, $|D_k|$ represents the size of the preference dataset of group $k$, so each group's contribution to the global objective is weighted by the proportion of its dataset size relative to the total dataset across all training groups. The local objective function for group $k$ is defined as $F_k(w) = \frac{1}{|D_k|} \sum_{(x, y) \in D_k} \mathcal{L}(f(x; w), y)$, where $\mathcal{L}$ is the loss function defined in Equation 1 and $f(x; w)$ is the model output for input $x$. To approximate the global objective, model updates from the training groups are aggregated as follows:
(3) $w^{t+1} = \sum_{k=1}^{K} p_k\, w_k^{t+1}$
where $w_k^{t+1}$ represents the locally updated model parameters of training group $k$ in round $t$. The aggregated model is then redistributed to the training and evaluation clients.
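A minimal sketch of this weighted aggregation (Equation 3) over the clients' parameter dictionaries could look as follows; the helper name and argument layout are illustrative.

```python
from collections import OrderedDict

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client model parameters, as in Equation 3.

    client_states: list of state_dicts w_k^{t+1} from the K training groups
    client_sizes:  list of |D_k|, the preference-dataset size of each group
    """
    total = float(sum(client_sizes))
    weights = [size / total for size in client_sizes]  # p_k = |D_k| / sum_j |D_j|

    global_state = OrderedDict()
    for name in client_states[0]:
        global_state[name] = sum(
            p_k * state[name].float() for p_k, state in zip(weights, client_states)
        )
    return global_state
```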
4. Experiments


4.1. Experimental Setup
Our experiments use a transformer-based preference predictor model (GPO (Zhao et al., 2023)), originally designed to train over groups sequentially in an ordered manner. In PluralLLM, we adapt it to an FL setting and employ FedAvg, a fundamentally different learning paradigm. The primary goal of our evaluation is to determine whether PluralLLM affects the alignment score and group fairness across different groups compared to the original centralized training.
Experiments are conducted on machines equipped with one NVIDIA A30 GPU, an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz, and 256GB RAM. All results reported in this study are averaged over four different runs with varying random seeds to ensure stability.
4.2. Dataset
We used the dataset from the Pew Research Center's Global Attitudes Surveys (PewResearch), which collects public opinions on a wide range of social, political, and economic issues (Durmus et al., 2023b). These surveys capture diverse perspectives from various demographic and geographic groups, providing a rich foundation for analyzing preference alignment. To ensure a fair comparison with GPO (Zhao et al., 2023), we use the same subset of groups as in the original GPO training setup. In both FL and centralized learning, the groups are partitioned into separate training and evaluation sets.
4.3. Implementation
• Federated Learning: We train the transformer-based preference predictor (GPO) for a fixed number of communication rounds in the FL setup, assuming that all clients participate in each communication round. Each local training step consists of 6 local epochs; in each epoch, we randomly sample context questions with their corresponding preferences, along with the target questions whose preferences the model must predict. We use the Adam optimizer for local training.
• Centralized Learning: We train the transformer-based preference predictor (GPO) for a fixed number of epochs, iterating over all training groups in each epoch. Within a training epoch, each group samples a random selection of context questions with their corresponding preferences, along with target questions. Unlike FL, where model updates are aggregated after each communication round, the centralized approach updates the model sequentially for each group within a single epoch.
• Preference Embedding: We use Alpaca-7B, a fine-tuned version of LLaMA-7B, as the embedding model for the preference data, which is then passed as input to the transformer (Zhao et al., 2023). The embedding step is performed once over all the preference data of each group at the beginning of training.
To assess alignment performance, evaluation is conducted periodically, every few communication rounds in the FL setting and every few epochs in the centralized setting. The alignment score is computed over randomly sampled data from all the evaluation groups to measure how well the trained model adapts to new group preferences over time. A sketch of this training and evaluation schedule is shown below.
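The following is a minimal sketch of this schedule. It reuses the `fedavg_aggregate` helper sketched in Section 3.2 and assumes hypothetical `train_locally` and `alignment_score` helpers on each group object; `eval_every` is a placeholder for the evaluation frequency.

```python
import copy

def run_federated_training(global_model, train_groups, eval_groups, rounds,
                           local_epochs=6, eval_every=5):
    """Federated training loop with periodic evaluation (illustrative sketch)."""
    eval_history = []
    for t in range(rounds):
        states, sizes = [], []
        for group in train_groups:  # every training client participates each round
            local_model = copy.deepcopy(global_model)
            state, size = group.train_locally(local_model, epochs=local_epochs)
            states.append(state)
            sizes.append(size)

        # FedAvg aggregation of the local updates, then redistribution.
        global_model.load_state_dict(fedavg_aggregate(states, sizes))

        if (t + 1) % eval_every == 0:  # periodic evaluation on unseen groups
            scores = [g.alignment_score(global_model) for g in eval_groups]
            eval_history.append(sum(scores) / len(scores))
    return eval_history
```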
4.4. Evaluation Metrics
To quantify the impact of PluralLLM on alignment performance, we evaluate:
• Alignment Score (AS): We assess the degree of alignment between the group opinion distribution $P_g$ and the model's predicted distribution $\hat{P}_g$ by calculating the alignment score over a set of questions $Q$, as used in (Zhao et al., 2023). The Jensen-Shannon distance (JSD), a symmetric measure of similarity between two probability distributions that is always non-negative and equals 0 only for identical distributions, is used to assess shifts in preference-distribution similarity:
(4) $\mathrm{AS}(P_g, \hat{P}_g) = \frac{1}{|Q|} \sum_{q \in Q} \left( 1 - \mathrm{JSD}\!\left(P_g(q), \hat{P}_g(q)\right) \right)$
• Convergence Speed of Loss Function: We measure how quickly the model optimizes preference alignment by tracking the average loss across all clients at each communication round and comparing it to the centralized training loss per epoch. Convergence is defined as the point at which the loss first settles within a small margin of its final value.
• Fairness Metrics: We analyze the effect of PluralLLM on group fairness in preference alignment across different groups compared to centralized learning. Numerous studies have demonstrated that FL can inadvertently introduce unfairness into the trained models (Shi et al., 2023). This unfairness arises primarily from data heterogeneity across clients, which leads to disparate performance and challenges in achieving equitable model accuracy across all participants. Furthermore, these fairness issues can also create disparities in privacy-leakage risk, as adversaries can potentially exploit shared model parameters to infer sensitive information (Zhao et al., 2025). We assess fairness by adopting the Coefficient of Variation (CoV) and the Fairness Index (FI) to measure the disparity of alignment scores across distinct groups; these metrics capture the relative perception of fairness in human-centered systems (Zhao and Elmalaki, 2024; Zhao et al., 2024). For $K$ groups, we denote the alignment score of group $k$ as $\mathrm{AS}_k$ and the average alignment score across groups as $\mu_{\mathrm{AS}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{AS}_k$. The CoV of alignment scores is calculated as:
(5) $\mathrm{CoV} = \sigma_{\mathrm{AS}} / \mu_{\mathrm{AS}}$
where $\sigma_{\mathrm{AS}}$ is the standard deviation of the alignment scores. A lower CoV indicates a more equitable distribution of alignment scores among groups, suggesting better fairness in aligning distinct groups. We apply the Fairness Index (FI) transformation to interpret fairness as a percentage between 0% and 100%, where a higher FI indicates greater fairness and 100% represents perfect fairness:
(6) $\mathrm{FI} = \frac{1}{1 + \mathrm{CoV}^2}$
In the context of group fairness in a Machine Learning (ML) classification task, the principle of equal opportunity implies that individuals who are eligible for a favorable outcome have an equal chance of being correctly classified by the prediction model, regardless of their group membership. Analogously to the classification task, we investigate equal opportunity in the LLM alignment task. To adapt the definition of equal opportunity to the probabilistic setting of LLM alignment, we use our definition of CoV (Equation 5): equal opportunity implies that the variability of alignment scores between groups is minimal, or equivalently that the FI (Equation 6) is close to 100%. A minimal sketch of these metrics is shown after this list.
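The following is a minimal sketch of these metrics, using SciPy's Jensen-Shannon distance (base 2, so the distance lies in [0, 1]) for the alignment score and the Jain-style index for the FI; the function names and array layout are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def alignment_score(group_dists, model_dists):
    """Equation (4): mean (1 - JSD) over a set of questions.

    group_dists, model_dists: arrays of shape (num_questions, num_options)
    holding the group's and the model's answer distributions per question.
    """
    jsd = np.array([jensenshannon(p, q, base=2) for p, q in zip(group_dists, model_dists)])
    return float(np.mean(1.0 - jsd))

def fairness_metrics(group_scores):
    """Equations (5) and (6): CoV of per-group alignment scores and the FI."""
    scores = np.asarray(group_scores, dtype=float)
    cov = scores.std() / scores.mean()
    fi = 1.0 / (1.0 + cov ** 2)  # 1.0 (i.e., 100%) corresponds to perfect fairness
    return cov, fi
```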
4.5. Analysis of Convergence Speed
As seen in Figure 2, PluralLLM converges within fewer communication rounds, whereas the centralized approach requires significantly more training steps to converge. Hence, PluralLLM achieves convergence 46% faster than the centralized learning approach, highlighting its efficiency in accelerating model training. Additionally, PluralLLM maintains a lower loss throughout training, as observed in Figure 2, demonstrating improved stability compared to the centralized approach. The faster convergence of PluralLLM suggests that it is well-suited for distributed learning scenarios where reducing communication rounds is critical for efficiency.
4.6. Analysis of Alignment Performance

Figure 4 demonstrates that PluralLLM improves the mean evaluation alignment score compared to the centralized approach. While the centralized method improves more slowly and plateaus at a lower score, PluralLLM maintains a more stable progression with smaller fluctuations, indicating that the distributed training approach enhances robustness in achieving better alignment and improves generalization across diverse data distributions. Additionally, Figure 3 highlights that, for two evaluation groups, PluralLLM represents the baseline distribution for a given Q/A task more accurately than the centralized approach.
4.7. Analysis of Fairness in Alignment
As observed in Figure 5, PluralLLM improves the FI on average over the rounds before convergence compared to the centralized GPO, and it maintains a comparable FI across the subsequent training rounds, achieving near-equal opportunity with an FI close to 100%.

5. Discussion & Conclusion
PluralLLM introduces a federated learning-based approach for pluralistic alignment in LLMs, addressing privacy, efficiency, and scalability challenges. Decentralizing the training of a transformer-based preference predictor preserves user privacy while capturing diverse group preferences more effectively than centralized methods. Our evaluation employs FedAvg for efficient aggregation of preference updates, resulting in faster convergence, improved alignment scores, and group fairness comparable to centralized training.
In addition, this predictor can serve as a lightweight reward function for RLHF, reducing computational costs, or it can generate high-quality preference datasets for Direct Preference Optimization (DPO), improving efficiency. While effective on Q/A tasks, its applicability to other domains such as summarization and translation remains unexplored. Future work will focus on integrating the learned preferences into LLM fine-tuning methods, exploring alternative aggregation strategies to enhance fairness across tasks, and extending PluralLLM beyond Q/A to diverse learning tasks.
Acknowledgements.
This work is supported by the U.S. National Science Foundation (NSF) under grant number 2339266.
References
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023).
- Cao et al. (2024) Shuirong Cao, Ruoxi Cheng, and Zhiqiang Wang. 2024. AGR: Age Group fairness Reward for Bias Mitigation in LLMs. arXiv preprint arXiv:2409.04340 (2024).
- Casper et al. (2023) Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217 (2023).
- Chakraborty et al. (2024) Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang. 2024. MaxMin-RLHF: Alignment with diverse human preferences. arXiv preprint arXiv:2402.08925 (2024).
- Durmus et al. (2023a) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023a. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388 (2023).
- Durmus et al. (2023b) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023b. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388 (2023).
- Kuang et al. (2024) Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, and Jingren Zhou. 2024. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5260–5271.
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
- Sahoo et al. (2024) Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927 (2024).
- Shi et al. (2023) Yuxin Shi, Han Yu, and Cyril Leung. 2023. Towards fairness-aware federated learning. IEEE Transactions on Neural Networks and Learning Systems (2023).
- Sorensen et al. (2024) Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, et al. 2024. A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070 (2024).
- Wu et al. (2024) Feijie Wu, Zitao Li, Yaliang Li, Bolin Ding, and Jing Gao. 2024. Fedbiot: Llm local fine-tuning in federated learning without full model. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3345–3355.
- Zhang et al. (2023) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
- Zhao et al. (2023) Siyan Zhao, John Dang, and Aditya Grover. 2023. Group preference optimization: Few-shot alignment of large language models. arXiv preprint arXiv:2310.11523 (2023).
- Zhao and Elmalaki (2024) Tianyu Zhao and Salma Elmalaki. 2024. FinA: Fairness of adverse effects in decision-making of human-cyber-physical-system. In 2024 ACM/IEEE 15th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 202–211.
- Zhao et al. (2025) Tianyu Zhao, Mahmoud Srewa, and Salma Elmalaki. 2025. FinP: Fairness-in-Privacy in Federated Learning by Addressing Disparities in Privacy Risk. arXiv preprint arXiv:2502.17748 (2025).
- Zhao et al. (2024) Tianyu Zhao, Mojtaba Taherisadr, and Salma Elmalaki. 2024. Fairo: Fairness-aware sequential decision making for human-in-the-loop cps. In 2024 ACM/IEEE 15th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 87–98.
- Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.