PPO-ABR: Proximal Policy Optimization based Deep Reinforcement Learning for Adaptive BitRate streaming
Abstract
Providing a high Quality of Experience (QoE) for video streaming in 5G and beyond-5G (B5G) networks is challenging due to the dynamic nature of the underlying network conditions. Several Adaptive Bit Rate (ABR) algorithms have been developed to improve QoE, but most of them are designed around fixed rules and are unsuitable for a wide range of network conditions. Recently, Deep Reinforcement Learning (DRL) methods based on the Asynchronous Advantage Actor-Critic (A3C) algorithm have demonstrated promise in their ability to generalise to diverse network conditions, but they still have limitations. One specific issue with A3C methods is the lag between each actor's behavior policy and the central learner's target policy; suboptimal updates emerge when the behavior and target policies fall out of synchronization. In this paper, we address the problems faced by vanilla-A3C by integrating an on-policy, multi-agent DRL method into the existing video streaming framework. Specifically, we propose a novel system for ABR generation, Proximal Policy Optimization-based DRL for Adaptive Bit Rate streaming (PPO-ABR). Our proposed method improves the overall video QoE by maximizing sample efficiency using a clipped probability ratio between the new and the old policies over multiple epochs of minibatch updates. Experiments on real network traces demonstrate that PPO-ABR outperforms state-of-the-art methods for different QoE variants.
Index Terms:
Reinforcement learning, video streaming, policy optimization, adaptive bit rate.
I Introduction
Due to the widespread use of the Internet, the volume of multimedia traffic, including video streaming, has increased. The Cisco Annual Internet Report projects that by 2023, 69% of the world's population will have access to the Internet, with Internet video traffic significantly exceeding other types of Internet traffic. To ensure seamless video streaming, Dynamic Adaptive Streaming over HTTP (DASH) [1] uses an adaptive bit rate (ABR) algorithm to send the video encoded at a specific bitrate based on the network conditions. Several ABR algorithms, such as RB [2], BB [3], BOLA [4], and Robust-MPC [5], use network conditions, including throughput estimates, playback buffer occupancy, or a combination of both, for bitrate selection with the aim of enhancing the QoE for end users. However, traditional ABR algorithms are designed under specific assumptions about network conditions and traffic patterns. As a result, they may not perform optimally in networks where conditions and traffic patterns change rapidly and unpredictably. Recently, several data-driven deep reinforcement learning (DRL) approaches, including Pensieve [6], [7], VSiM [8], NANCY [9], AL-FFEA3C [10], AL-AvgA3C [10], MARL-A3C [11], SAC-ABR [12], and ALISA [13], have been proposed to improve ABR algorithms. DRL combines deep learning with reinforcement learning to learn how an agent should act depending on the state of the environment. In DRL, a policy is learned to maximize the expected cumulative reward, where the policy is the mapping from states of the environment to actions. Pensieve [6], one of the first DRL-based methods for ABR generation, is built upon the basic vanilla-A3C algorithm, whereas ALISA [13], the latest DRL-based ABR method, utilizes soft updates with an A3C algorithm. Both Pensieve and ALISA update the ABR control policy based on the current network conditions and past decisions, and they are able to identify policies that outperform traditional ABR algorithms.
However, these state-of-the-art DRL-based methods suffer from two key drawbacks: (i) there is a lag between each actor's behavior policy and the central learner's target policy, so suboptimal updates emerge when the behavior and target policies fall out of synchronization, and (ii) there is no explicit constraint on the divergence between the new and the old policies. Due to these issues, these algorithms may produce imprecise throughput predictions when the network fluctuates, cause re-buffering at the client's device, and select bitrates inaccurately, impacting the overall QoE for end users. To resolve the above issues, we propose Proximal Policy Optimization-based DRL for ABR (PPO-ABR), which uses a clipped probability ratio to constrain the divergence between the new and the old policy parameters. Our experimental results show that PPO-ABR improves overall video QoE compared to other state-of-the-art methods.
The rest of the paper is organized as follows: Section II presents the relevant background on reinforcement learning and on-policy RL methods. Section III presents the design of the proposed PPO-ABR algorithm. Section IV presents the experimental setup along with the training and testing results. Finally, we conclude our work in Section V.
II Background
RL [14] is a learning process that adapts to dynamic environments, even when little or no prior information is available. By learning from its mistakes, an agent seeks to maximize its long-term return. The agent's interactions with the environment are described using a Markov decision process (MDP), where at each time step $t$ the agent is situated in a specific state $s_t$, chooses an action $a_t$ from a set of available actions $\mathcal{A}$, and then receives a reward $r_t$ based on its action. The goal of the agent is to find a policy $\pi$ that maps states to actions. The state-value function is given by $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s\right]$ and the action-value function by $Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a\right]$, where $\gamma \in [0,1)$ is a discount factor. The basic on-policy RL method is the vanilla policy gradient method [15], where the policy parameters are updated using the total reward computed at the end of an episode instead of at every single step. The policy gradient is given by
$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)\right]$   (1)
where $A(s_t, a_t)$ is the advantage function, $\nabla_{\theta}$ is the gradient operator used for policy optimization, $T$ is the number of steps in the episode, and $\theta$ denotes the current policy parameters. However, the vanilla policy gradient suffers from high variance and long training times because value estimates are computed only at the end of an episode instead of at every time step. To address these issues, actor-critic methods [15] have been proposed. These methods have two components: an actor, represented by a policy $\pi_{\theta}$, and a critic, represented by an estimate of the value function. Neural network function approximators are typically used to represent both of them. With parameters $\theta_v$, the critic estimates the value function $V(s_t; \theta_v)$ of the current policy. The main goal of this method is to reduce variance by using single-step state-value estimates. The single-step state-value estimates are derived using the temporal-difference error $\delta_t$, which is given by:
$\delta_t = r_t + \gamma V(s_{t+1}; \theta_v) - V(s_t; \theta_v)$   (2)
The gradient operator is used to define the policy and critic updates with respect to their parameters $\theta$ and $\theta_v$, respectively:
$\theta \leftarrow \theta + \alpha\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \delta_t$   (3)
$\theta_v \leftarrow \theta_v - \alpha'\, \nabla_{\theta_v}\, \delta_t^{2}$   (4)
where $\alpha$ and $\alpha'$ are the actor and critic learning rates, respectively. Furthermore, as an improvement, vanilla-A3C [15] was proposed, which uses several copies of the same agent with asynchronous updates. It is more efficient than standard actor-critic methods because data collection can be parallelized across the copies of the agent, resulting in an even shorter training time. In the vanilla-A3C algorithm, the current policy parameters $\theta$ are updated based on experience previously collected with the old policy parameters $\theta_{old}$ after every $n$ steps, i.e., after every $n$ state-action pairs. The value function update for vanilla-A3C can be written as:
$J(\pi_{\theta}) = J(\pi_{\theta_{old}}) + \mathbb{E}_{(s,a) \sim \rho_{\pi_{\theta}}}\left[A^{\pi_{\theta_{old}}}(s,a)\right]$   (5)
where $\rho_{\pi_{\theta}}$ denotes the distribution of state-action pairs, $\pi_{\theta_{old}}$ represents the old policy, and $\pi_{\theta}$ represents the current policy. Note that a positive expected advantage, $\mathbb{E}\left[A^{\pi_{\theta_{old}}}(s,a)\right] \geq 0$, increases the value function; however, an unconstrained update can result in a decrease in the value function and in an increase of the divergence between the old and the new policies.
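For concreteness, the following minimal sketch illustrates the one-step actor-critic update of Equations (2)-(4) on a toy tabular problem. The problem size and learning rates are illustrative only and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of the one-step actor-critic update in Eqs. (2)-(4),
# using a tabular softmax policy and a tabular value function.
# Sizes and learning rates below are illustrative, not from the paper.
n_states, n_actions = 5, 3
gamma, alpha_actor, alpha_critic = 0.99, 1e-2, 1e-1

theta = np.zeros((n_states, n_actions))   # actor parameters
v = np.zeros(n_states)                    # critic parameters (state values)

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def actor_critic_step(s, a, r, s_next, done):
    # Eq. (2): temporal-difference error delta_t
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]
    # Eq. (3): actor update along grad log pi(a|s), scaled by delta_t
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                 # one-hot(a) - pi(.|s)
    theta[s] += alpha_actor * delta * grad_log_pi
    # Eq. (4): critic update that reduces the squared TD error
    v[s] += alpha_critic * delta

# Example transition: state 0, action 1, reward 1.0, next state 2
actor_critic_step(0, 1, 1.0, 2, done=False)
```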
To alleviate this issue, the on-policy trust region policy optimization (TRPO) method [16] introduces a Kullback–Leibler (KL) divergence constraint on the policy update. Equation (5) can then be rewritten with the KL-divergence constraint as follows:
$J(\pi_{\theta}) = J(\pi_{\theta_{old}}) + \mathbb{E}_{(s,a) \sim \rho_{\pi_{\theta_{old}}}}\left[r_t(\theta)\, A^{\pi_{\theta_{old}}}(s,a)\right], \quad \text{s.t. } \mathbb{E}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta_{KL}$   (6)
where $r_t(\theta) = \pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)$ is the importance sampling ratio, and the KL-divergence term constrains the divergence between the new and old policies, with $\delta_{KL}$ as the KL-divergence limit. We can rewrite Equation (6) to maximize only its second part, also known as the surrogate advantage objective, which in sample-based form (with $\hat{\mathbb{E}}_t$ denoting the empirical average over a batch of samples and $\hat{A}_t$ the advantage estimate at time step $t$) is given as follows:
$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$   (7)
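The quantities in Equations (6) and (7) can be made concrete with a small numerical check. The sketch below computes the importance sampling ratio, the KL divergence, and the surrogate objective for two hypothetical categorical policies over three bitrate actions; the numbers are arbitrary examples, not values from the paper.

```python
import numpy as np

# Illustrative check of the quantities in Eqs. (6)-(7): the importance
# sampling ratio r(theta) and the KL divergence used as TRPO's trust region.
pi_old = np.array([0.5, 0.3, 0.2])   # old policy over 3 bitrate actions
pi_new = np.array([0.4, 0.4, 0.2])   # new policy after an update

a = 1                                 # action actually taken
ratio = pi_new[a] / pi_old[a]         # importance sampling ratio r(theta)
kl = np.sum(pi_old * np.log(pi_old / pi_new))  # KL(pi_old || pi_new)

advantage = 2.0                       # example advantage estimate
surrogate = ratio * advantage         # surrogate objective of Eq. (7)
print(f"ratio={ratio:.3f}, KL={kl:.4f}, surrogate={surrogate:.3f}")
```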
Although TRPO constrains the divergence between the new and the old policies, it can still lead to instability in the policy updates. To address this issue, the on-policy PPO algorithm [17] was proposed, which uses a clipped probability ratio to constrain the divergence between the old and the new policy parameters. The objective function in PPO is derived from Equation (7), and the resulting maximization problem is given as:
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t\right)\right]$   (8)
where $\epsilon$ is the clipping hyperparameter and the superscript CPI refers to conservative policy iteration. In Equation (8), the first term, $r_t(\theta)\hat{A}_t$, is the unclipped TRPO surrogate objective $L^{CPI}$, while the second term is a modification of this surrogate objective obtained by clipping the probability ratio $r_t(\theta)$, which ensures that the ratio remains within the range $[1-\epsilon, 1+\epsilon]$. The PPO maximization considers the minimum of the clipped and unclipped objectives, resulting in a smaller divergence between the new and the old policy parameters.
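The clipped surrogate objective of Equation (8) reduces to a few lines of code. The sketch below is a generic NumPy illustration rather than the authors' implementation; the default clipping value 0.2 matches Table I, and the toy batch is arbitrary.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of Eq. (8) for a batch of samples.

    `ratio` and `advantage` are NumPy arrays of the same shape; `eps`
    is the clipping hyperparameter (0.2 in Table I).
    """
    unclipped = ratio * advantage                       # L^CPI term
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))      # L^CLIP

# Toy batch: the second sample has a large ratio, so the clipped term
# caps its contribution and limits the policy divergence.
ratios = np.array([1.05, 1.8, 0.7])
advs   = np.array([0.5,  1.0, -0.4])
print(ppo_clip_objective(ratios, advs))
```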
III Proposed on-policy ABR Method: PPO-ABR
In this paper, we focus on an HTTP-based video distribution system, as shown in Figure 1, that utilizes the DASH framework for multimedia streaming. In such systems, the videos are stored on the server in separate chunks, where each chunk is encoded at a specific bitrate. The client then requests each chunk at the appropriate bitrate from the server using an ABR algorithm, which generates the bitrate based on factors such as the prevailing network conditions and the capabilities of the client device. Specifically, the ABR algorithm selects the bitrate for each video chunk based on the chunk processor's input observations, including the number of chunks, the chunk size, the chunk bitrate, the playback buffer occupancy, the measured throughput, and the chunk download time. Additionally, the ABR controller takes network statistics such as bandwidth and delay into account.
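As a rough illustration of how these observations could form the agent's state input $s_t$, the sketch below packs them into a simple structure. The field layout, history length, and numbers in the example call are our own assumptions for illustration; the paper does not specify the exact encoding here.

```python
import numpy as np

# Hypothetical sketch of a state input built from the chunk-processor
# observations listed above; field names and history length k are
# illustrative assumptions, not the authors' encoding.
def build_state(last_bitrate_idx, buffer_s, throughput_hist,
                download_time_hist, next_chunk_sizes, chunks_remaining, k=8):
    return {
        "last_bitrate": last_bitrate_idx,                   # index of previous chunk's bitrate
        "buffer": buffer_s,                                 # playback buffer occupancy (s)
        "throughput": np.asarray(throughput_hist[-k:]),     # past k measured throughputs
        "download_time": np.asarray(download_time_hist[-k:]),  # past k chunk download times
        "next_sizes": np.asarray(next_chunk_sizes),         # next chunk's size at each bitrate
        "remaining": chunks_remaining,                      # chunks left in the video
    }

# Example with made-up values (throughput in Mbps, sizes in MB).
s_t = build_state(last_bitrate_idx=2, buffer_s=12.5,
                  throughput_hist=[1.8, 2.1, 1.6],
                  download_time_hist=[1.9, 1.7, 2.2],
                  next_chunk_sizes=[0.3, 0.7, 1.2, 1.8, 2.9, 4.4],
                  chunks_remaining=40)
```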
For the state-of-the-art vanilla-A3C, the ABR controller uses multi-agent training with multiple actor and critic neural networks. Each agent is trained in parallel within its own environment based on several state inputs $s_t$. Moreover, each agent is trained locally and sends its local gradients to the central agent. Once the central agent has collected the experience from the local agents, it updates its model parameters. The central agent then decides the bitrate at which the chunk is played and passes this decision to the chunk handler. The chunk handler sends the information about the chunk to the buffer, and finally the client plays the chunk at a quality based on the buffer occupancy.
In addition to being less sample efficient, vanilla-A3C also exhibits a high divergence between the target policy of the central learner and every actor's behavior policy, and suboptimal updates emerge when the behavior and target policies fall out of synchronization. To address these issues, PPO-ABR uses a clipped probability ratio to constrain the KL-divergence between the new and the old policy parameters over several epochs of minibatch updates, instead of the single epoch used in vanilla-A3C.
Algorithm 1 presents the PPO-ABR algorithm and outlines its critical steps. The input to the algorithm is the video samples, together with the hyperparameter settings for the actor and critic networks and the state input $s_t$. The first step is to divide the video file into chunks; each chunk is played at a specified bitrate by selecting an action based on the current state and the policy, and the corresponding reward is stored at Line 12. The actor network produces the policy $\pi_{\theta}$, and the critic network estimates the state-value function. The second step is to compute the advantage function using the current policy at Line 15. The third step is to compute the policy divergence between the new and the old policies using the importance sampling ratio at Line 17. The fourth step is to update the actor parameters at Line 18 using PPO-clip, where the ratio is clipped at $1+\epsilon$ when the advantage estimate is positive and at $1-\epsilon$ otherwise, as handled in Lines 19 to 23; in both cases, PPO-clip imposes a penalty on the ratio. The final step is to update the critic parameters $\theta_v$ at Line 24.
The output of the algorithm is the actor network that decides the bitrate at which each chunk is played, chunk by chunk, at Line 29; the critic network evaluates the state value of the policy with PPO-clip to maximize rewards at Line 30, and the actor and critic parameters are updated based on the actor and critic loss functions at Line 31. PPO-ABR trains multiple agents in parallel, so the agents interact with their own environments during each batch iteration. Moreover, the actor and critic parameters are updated using PPO-clip for each batch iteration, and the value function parameters are updated over multiple epochs instead of a single epoch. Further, the central agent collects the mini-batch samples and carries the updated gradients over to the next batch iteration. Overall, PPO-ABR results in stable updates and provides the bitrate at which to encode the next chunk.
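The sketch below outlines one batch iteration of this update loop in PyTorch: compute returns and advantages under the old policy, then run several epochs of clipped actor and critic updates. It is a simplified illustration rather than the authors' implementation; the network sizes, the number of epochs, and the use of Monte Carlo returns for the advantage estimates are assumptions, while $\gamma$, $\epsilon$, and the learning rates follow Table I.

```python
import torch
from torch.distributions import Categorical

# Hyperparameters: gamma, eps_clip, and learning rates follow Table I;
# n_epochs and the network sizes are illustrative assumptions.
gamma, eps_clip, n_epochs, lr_actor, lr_critic = 0.99, 0.2, 4, 1e-4, 1e-3
state_dim, n_bitrates = 48, 6
actor = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, n_bitrates))
critic = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=lr_actor)
opt_critic = torch.optim.Adam(critic.parameters(), lr=lr_critic)

def ppo_abr_update(states, actions, rewards):
    """One batch iteration: advantages, then several clipped-update epochs."""
    with torch.no_grad():
        values = critic(states).squeeze(-1)
        returns = torch.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):          # discounted returns
            running = rewards[t] + gamma * running
            returns[t] = running
        advantages = returns - values                    # advantage estimates
        old_log_probs = Categorical(logits=actor(states)).log_prob(actions)

    for _ in range(n_epochs):                            # multiple epochs per batch
        dist = Categorical(logits=actor(states))
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
        clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip)
        actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        critic_loss = (critic(states).squeeze(-1) - returns).pow(2).mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Example call with a random toy batch of 32 transitions.
ppo_abr_update(torch.randn(32, state_dim),
               torch.randint(0, n_bitrates, (32,)),
               torch.rand(32))
```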

IV Experimental details and Results
This section describes the experimental methodology used in this study, including the datasets, the training procedure, the algorithms used for comparison, and the performance metrics used to assess their efficacy.
IV-A Datasets
We utilised multiple datasets, FCC [18], Norway [19], LIVE [20], and OBOE [21], for our experimentation, covering both broadband and mobile networks. First, we utilised the FCC [18] and Norway [19] datasets, which cover fixed broadband technologies and Telenor's 3G/HSDPA mobile wireless network, respectively. We used 59 and 68 traces from the FCC and Norway throughput traces, respectively, in our experiments; the throughput range for both datasets is 0 to 6 Mbps. Secondly, we used the live video streaming dataset [20], which consists of data from wireless networks such as WiFi and 4G. The throughput of these traces ranges between 0.2 Mbps and 4 Mbps, and 100 traces are utilised in our experiments. Lastly, we utilised the OBOE dataset [21], which includes 428 traces from 500 video streaming sessions. Each OBOE trace stores bandwidth measurements collected from wired, wireless, and cellular connections, and the throughput range is between 0 and 3 Mbps.
IV-B Methodologies for Training, Comparative Algorithms, and Performance Metrics
Table I: Hyperparameter settings used for training.

| Hyperparameter | Description | Value | Actor-critic algorithms |
| --- | --- | --- | --- |
| $\gamma$ | Discount factor | 0.99 | Pensieve, SAC-ABR, PPO-ABR |
| $\alpha$ | Actor network's learning rate | 0.0001 | Pensieve, SAC-ABR, PPO-ABR |
| $\alpha'$ | Critic network's learning rate | 0.001 | Pensieve, SAC-ABR, PPO-ABR |
| – | Entropy regularization factor range | 6 to 0.01 | Pensieve, SAC-ABR, PPO-ABR |
| – | Interpolation factor | 0.995 | SAC-ABR |
| $\epsilon$ | Clipping parameter | 0.2 | PPO-ABR |
| R | Random seed | 42 | PPO-ABR |
| – | Total number of agents | 16 | Pensieve, SAC-ABR, PPO-ABR |
We train PPO-ABR on the aforementioned datasets for 100,000 iterations and then choose the model with the highest average reward. Table I summarizes the hyperparameters utilized for PPO-ABR training. Specifically, the clipping hyperparameter $\epsilon$ determines how much the new policy may deviate from the old policy. These values have been selected based on the previous works [6], [21], and [20]. We use 16 agents for all our experiments. Finally, the performance of the proposed PPO-ABR is compared with that of the following state-of-the-art DRL-based and non-DRL-based ABR algorithms: SAC-ABR [12], Pensieve [6], BB [3], RB [2], BOLA [4], and Robust-MPC [5].
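For reference, the Table I settings can be collected into a small configuration dictionary; the key names below are our own and do not come from the authors' code.

```python
# Training configuration of Table I as a plain Python dict; key names are
# descriptive labels of our own choosing, not identifiers from the paper.
PPO_ABR_CONFIG = {
    "discount_factor": 0.99,
    "actor_learning_rate": 1e-4,
    "critic_learning_rate": 1e-3,
    "entropy_regularization": (6.0, 0.01),   # annealed from 6 down to 0.01
    "clip_epsilon": 0.2,
    "random_seed": 42,
    "num_agents": 16,
    "training_iterations": 100_000,
}
```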
We compare the performance of all ABR algorithms using QoE [12] as a metric. The QoE is expressed as:
$\mathrm{QoE} = \sum_{n=1}^{N} q(R_n) \;-\; \mu \sum_{n=1}^{N} T_n \;-\; \sum_{n=1}^{N-1} \left| q(R_{n+1}) - q(R_n) \right|$   (9)
The QoE is composed of three elements: (i) the total quality of all video chunks, where $q(R_n)$ maps the bitrate $R_n$ of chunk $n$ to its perceived quality and $N$ is the number of chunks; (ii) the penalty incurred by re-buffering, where $T_n$ is the re-buffering time experienced while downloading chunk $n$ and $\mu$ is the re-buffering penalty weight; and (iii) the video's smoothness, which is assessed by calculating the difference in the bitrates used to encode consecutive chunks. Two variants of this QoE metric are examined in this work, denoted QoE$_1$ and QoE$_2$, which differ in the re-buffering penalty $\mu$ used in Equation (9).
Note that we have utilized the above QoE metric formulation since it is commonly used in several other works including Robust-MPC [5], [6], [21], [22], [23] and [12]. There also exist other QoE metric formulations, for example in [7] and [8], that can also be used for the performance evaluation. In this work, we focus only on the QoE metric defined in Equation 9.
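A direct implementation of Equation (9) is straightforward. The sketch below computes the session QoE from per-chunk qualities and stall times, with the penalty weight $\mu$ passed in explicitly; the linear quality mapping and the penalty value in the example call are only illustrative choices.

```python
def qoe(chunk_qualities, rebuffer_times, mu):
    """QoE of Eq. (9) for one streaming session.

    `chunk_qualities` holds q(R_n) for each downloaded chunk,
    `rebuffer_times` the stall time T_n incurred for each chunk, and
    `mu` the re-buffering penalty weight of the chosen QoE variant.
    """
    quality = sum(chunk_qualities)
    rebuffer_penalty = mu * sum(rebuffer_times)
    smoothness_penalty = sum(abs(chunk_qualities[n + 1] - chunk_qualities[n])
                             for n in range(len(chunk_qualities) - 1))
    return quality - rebuffer_penalty - smoothness_penalty

# Toy session: three chunks with a linear quality mapping (bitrates in Mbps)
# and one 0.5 s stall; mu = 4.3 is an illustrative penalty weight.
print(qoe([1.2, 2.85, 2.85], [0.0, 0.5, 0.0], mu=4.3))
```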



Table II: QoE values obtained during training.

| RL algorithm | FCC and Norway (QoE$_1$) | FCC and Norway (QoE$_2$) | OBOE (QoE$_1$) | OBOE (QoE$_2$) | Live (QoE$_1$) | Live (QoE$_2$) |
| --- | --- | --- | --- | --- | --- | --- |
| PPO-ABR | 45.48 | 45.40 | 45.79 | 46.36 | 44.84 | 45.89 |
| SAC-ABR | 42.60 | 45.20 | 41.33 | 43.88 | 41.70 | 43.46 |
| Pensieve | 37.45 | 37.84 | 37.05 | 36.30 | 37.20 | 37.59 |
IV-C Training results
We trained PPO-ABR, SAC-ABR, and Pensieve using the three datasets described in the preceding section. Furthermore, to encourage exploration, we used an entropy regularization factor ranging from 6 to 0.01 for a better exploration-exploitation tradeoff; initially, an entropy weight of 6 is used for a few iterations, and it is then gradually decreased to 0.01. It takes approximately eight hours to generate the training model for each algorithm on each dataset. Table II summarizes the QoE metrics obtained during training for the three datasets. The findings indicate that across all three datasets and for both the QoE$_1$ and QoE$_2$ metrics, PPO-ABR consistently outperforms SAC-ABR and Pensieve, achieving higher QoE values.


Figure 2 presents the average QoE value achieved by the PPO-ABR, SAC-ABR, and Pensieve algorithms at each training epoch. We observe that SAC-ABR performs poorly in the initial stages of training due to high exploration. Our results show different behavior for each of these algorithms as the number of training epochs increases, with PPO-ABR achieving a high QoE value right from the start of training. Similar improvements are observed with the OBOE traces in Figure 3 and the Live traces in Figure 4, and Table II presents the corresponding QoE values obtained by the different ABR algorithms.
IV-D Testing results
The trained models are evaluated using the Mahimahi network emulator [24]. We used 250 traces from the Norway test dataset and 205 traces from the FCC test dataset to test the models, as stated in [6]. Throughout the testing phase, the link is emulated with a bit rate of 12 Mbps and a latency of 30 ms, and we assessed how well each ABR algorithm performed. Figure 5 displays the average total reward obtained by the various ABR algorithms with the QoE$_1$ metric for each trace when the network is emulated with no packet loss. According to our findings, PPO-ABR achieves a higher average QoE of 46.61 than the other ABR algorithms.
Table III: Average QoE values obtained during testing on the emulated network with no packet loss.

| ABR algorithm | FCC and Norway (QoE$_1$) | FCC and Norway (QoE$_2$) | OBOE (QoE$_1$) | OBOE (QoE$_2$) | Live (QoE$_1$) | Live (QoE$_2$) |
| --- | --- | --- | --- | --- | --- | --- |
| PPO-ABR | 46.61 | 44.93 | 45.09 | 46.25 | 46.91 | 45.68 |
| SAC-ABR | 42.77 | 43.68 | 39.72 | 45.41 | 42.59 | 43.90 |
| Pensieve | 39.63 | 35.26 | 37.96 | 37.01 | 39.12 | 41.68 |
| BB | 12.03 | 12.78 | 14.08 | 20 | 13.81 | 20.26 |
| RB | 35.62 | 36.45 | 36.22 | 37.31 | 37.45 | 37.35 |
| BOLA | 34.26 | 35.30 | 35.04 | 37.09 | 35.82 | 36.05 |
| Robust-MPC | 39.93 | 40.44 | 40.18 | 38.29 | 40.59 | 38.99 |
In Figure 6, to understand and illustrate the better performance of PPO-ABR, we compare the various ABR algorithms using the average playback bitrate, rebuffering penalty, and smoothness penalty for the QoE$_1$ metric under emulation with no packet loss during testing. Our findings indicate that, with the exception of BOLA and RB, most ABR algorithms attain higher bitrates. Several of these algorithms incur rebuffering penalties due to their higher bitrate choices, with BB and SAC-ABR having the largest rebuffering penalties; BB likewise incurs a significant smoothness penalty. PPO-ABR delivers a higher average bitrate together with comparatively lower smoothness and rebuffering penalties. Due to the combined effect of these individual components, PPO-ABR achieves a higher average QoE than the other ABR algorithms. Table III reports the average QoE values attained by the ABR algorithms for the different QoE metrics when evaluated on the network emulated with no packet loss.
V Conclusion
In this study, we have shown the advantages of adopting the on-policy DRL-based PPO-ABR to increase QoE for video streaming. Our proposed method overcomes the limitations currently faced by state-of-the-art DRL-based methods and consistently achieves a higher average QoE than SAC-ABR and Pensieve, with an even larger margin over conventional fixed-rule-based ABR algorithms. Future studies will examine PPO-ABR for edge-driven video distribution services and evaluate it using other versions of the QoE metric.
Acknowledgment
This work has been supported by TCS foundation under the TCS research scholar program, 2019-2023, India.
References
- [1] “ISO/IEC 23009-1:2014: Dynamic adaptive streaming over HTTP(DASH) – Part 1: Media presentation description and segment formats,” May 2014.
- [2] Y. Sun, X. Yin, J. Jiang, V. Sekar, F. Lin, N. Wang, T. Liu, and B. Sinopoli, “Cs2p: Improving video bitrate selection and adaptation with data-driven throughput prediction,” Proceedings of the 2016 ACM SIGCOMM Conference, 2016.
- [3] T.-Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, “A buffer-based approach to rate adaptation: Evidence from a large video streaming service,” in Proceedings of the 2014 ACM Conference on SIGCOMM, ser. SIGCOMM ’14. New York, NY, USA: Association for Computing Machinery, 2014, p. 187–198.
- [4] K. Spiteri, R. Urgaonkar, and R. K. Sitaraman, “Bola: Near-optimal bitrate adaptation for online videos,” in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, 2016, pp. 1–9.
- [5] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli, “A control-theoretic approach for dynamic adaptive video streaming over http,” in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, ser. SIGCOMM ’15. New York, NY, USA: Association for Computing Machinery, 2015, p. 325–338.
- [6] H. Mao, R. Netravali, and M. Alizadeh, “Neural adaptive video streaming with pensieve,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, ser. SIGCOMM ’17. New York, NY, USA: Association for Computing Machinery, 2017, p. 197–210.
- [7] T. Huang, C. Zhou, R.-X. Zhang, C. Wu, and L. Sun, “Learning tailored adaptive bitrate algorithms to heterogeneous network conditions: A domain-specific priors and meta-reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2485–2503, 2022.
- [8] Y. Yuan, W. Wang, Y. Wang, S. S. Adhatarao, B. Ren, K. Zheng, and X. Fu, “Vsim: Improving qoe fairness for video streaming in mobile environments,” in IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, 2022, pp. 1309–1318.
- [9] P. Saxena, M. Naresh, M. Gupta, A. Achanta, S. Kota, and S. Gupta, “Nancy: Neural adaptive network coding methodology for video distribution over wireless networks,” in GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020, pp. 1–6.
- [10] M. Naresh, V. Das, P. Saxena, and M. Gupta, “Deep reinforcement learning based qoe-aware actor-learner architectures for video streaming in iot environments,” Computing, vol. 104, 07 2022.
- [11] H. Jin, Q. Wang, S. Li, and J. Chen, “Joint qos control and bitrate selection for video streaming based on multi-agent reinforcement learning,” in 2020 IEEE 16th International Conference on Control & Automation (ICCA), 2020, pp. 1360–1365.
- [12] M. Naresh, N. Gireesh, P. Saxena, and M. Gupta, “Sac-abr: Soft actor-critic based deep reinforcement learning for adaptive bitrate streaming,” in 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), 2022, pp. 353–361.
- [13] M. Naresh, P. Saxena, and M. Gupta, “Deep reinforcement learning with importance weighted a3c for qoe enhancement in video delivery services,” arXiv preprint arXiv:2304.04527, 2023.
- [14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
- [15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016.
- [16] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477
- [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
- [18] Federal Communications Commission, "Raw data – Measuring Broadband America," 2016. [Online]. Available: https://www.fcc.gov/reports-research/reports/measuring-broadband-america/raw-data-measuring-broadband-america-2016
- [19] H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen, “Commute path bandwidth traces from 3g networks: Analysis and applications,” ser. MMSys ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 114–118.
- [20] G. Yi, “The acm multimedia 2019 live video streaming grand challenge,” The ACM Multimedia 2019 Live Video Streaming Grand Challenge, October 21–25, 2019, Nice, France.
- [21] Z. Akhtar, Y. S. Nam, R. Govindan, S. Rao, J. Chen, E. Katz-Bassett, B. Ribeiro, J. Zhan, and H. Zhang, “Oboe: Auto-tuning video abr algorithms to network conditions,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, ser. SIGCOMM ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 44–58.
- [22] S. Sengupta, N. Ganguly, S. Chakraborty, and P. De, “Hotdash: Hotspot aware adaptive video streaming using deep reinforcement learning,” 2018 IEEE 26th International Conference on Network Protocols (ICNP), pp. 165–175, 2018.
- [23] T. Huang, C. Zhou, R.-X. Zhang, C. Wu, X. Yao, and L. Sun, “Stick: A harmonious fusion of buffer-based and learning-based approach for adaptive streaming,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2020, pp. 1967–1976.
- [24] R. Netravali, A. Sivaraman, S. Das, A. Goyal, K. Winstein, J. Mickens, and H. Balakrishnan, “Mahimahi: Accurate record-and-replay for http.” USA: USENIX Association, 2015.