# No-Regret Reinforcement Learning with Heavy-Tailed Rewards

@inproceedings{Zhuang2021NoRegretRL, title={No-Regret Reinforcement Learning with Heavy-Tailed Rewards}, author={Vincent Zhuang and Yanan Sui}, booktitle={AISTATS}, year={2021} }

Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded distributions. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging…
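The abstract's premise can be illustrated with a small sketch (my own example, not code from the paper): on a heavy-tailed distribution such as Pareto with shape 1.5, which has a finite mean but infinite variance, a truncated empirical mean is a standard robust estimator. The threshold value here is an arbitrary choice for illustration.

```python
import random

def truncated_mean(samples, threshold):
    """Truncated empirical mean: zero out observations whose magnitude
    exceeds the threshold, then average. A standard robust estimator
    when rewards only have bounded (1 + eps)-th moments."""
    return sum(x if abs(x) <= threshold else 0.0 for x in samples) / len(samples)

random.seed(0)
# Pareto(shape=1.5): finite mean (= 3.0) but infinite variance, i.e. heavy-tailed.
samples = [random.paretovariate(1.5) for _ in range(10_000)]
est = truncated_mean(samples, threshold=50.0)
```

Truncation introduces a small downward bias but keeps the estimate concentrated near the true mean, whereas the plain sample mean can be dragged arbitrarily far by a single extreme draw.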

#### References

Showing 1–10 of 37 references.

Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards

- Mathematics, Computer Science
- ICML
- 2019

The key idea is to exploit the Lipschitz property of the expected reward function by adaptively discretizing the arm set, and to employ upper confidence bound policies with robust mean estimators designed for heavy-tailed distributions.
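The general recipe this entry describes, a UCB index built on a robust mean estimate instead of the plain sample mean, can be sketched for a finite arm set (the constants, threshold, and helper names are my own illustrative choices, not the paper's):

```python
import math
import random

def truncated_mean(samples, threshold):
    # Robust mean: zero out observations whose magnitude exceeds the threshold.
    return sum(x if abs(x) <= threshold else 0.0 for x in samples) / len(samples)

def robust_ucb_pick(history, t, threshold=25.0, c=2.0):
    """Pick the arm with the highest robust-UCB index.
    `history` maps arm -> list of observed rewards; every arm has been
    pulled at least once. The sqrt(c * log t / n) bonus is the usual
    UCB exploration term; only the mean estimate is swapped out."""
    def index(arm):
        pulls = history[arm]
        bonus = math.sqrt(c * math.log(t) / len(pulls))
        return truncated_mean(pulls, threshold) + bonus
    return max(history, key=index)

random.seed(2)
# Two heavy-tailed arms: Pareto(3.0) has mean 1.5; Pareto(1.5) has mean 3.0.
history = {0: [], 1: []}
for t in range(1, 2001):
    arm = t % 2 if t <= 2 else robust_ucb_pick(history, t)
    alpha = 3.0 if arm == 0 else 1.5
    history[arm].append(random.paretovariate(alpha))
```

After 2000 rounds the robust index concentrates on the higher-mean arm despite its infinite variance, so that arm accumulates most of the pulls.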

Reinforcement Learning with Perturbed Rewards

- Computer Science, Mathematics
- AAAI
- 2020

This work develops a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and shows that policies trained on the estimated surrogate reward achieve higher expected rewards and converge faster than existing baselines.

Is Q-learning Provably Efficient?

- Computer Science, Mathematics
- NeurIPS
- 2018

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…

Reinforcement Learning with a Corrupted Reward Channel

- Computer Science, Mathematics
- IJCAI
- 2017

This work formalises this problem as a generalised Markov Decision Problem called the Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.

Pure Exploration of Multi-Armed Bandits with Heavy-Tailed Payoffs

- Computer Science
- UAI
- 2018

This paper derives theoretical guarantees for the two proposed bandit algorithms, and demonstrates their effectiveness for pure exploration of MABs with heavy-tailed payoffs on synthetic data and real-world financial data.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

- Mathematics, Computer Science
- ICML
- 2017

A Bayesian expected regret bound of $\tilde{O}(H \sqrt{SAT})$ for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.

Distributionally Robust Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

This work considers risk-averse exploration in the approximate RL setting and proposes a distributionally robust policy iteration scheme that provides lower-bound guarantees on state values, together with a practical algorithm implementing an exploration strategy that acts conservatively in the short term and explores optimistically in the long run.

Bandits With Heavy Tail

- Mathematics, Computer Science
- IEEE Transactions on Information Theory
- 2013

This paper examines the bandit problem under the weaker assumption that the reward distributions have finite moments of order 1 + ε, and derives matching lower bounds showing that the best achievable regret deteriorates when ε < 1.
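Alongside truncation, a classic robust estimator studied in this line of work is the median of means; a minimal sketch (my own illustration, not the paper's code) on a heavy-tailed Pareto sample:

```python
import random

def median_of_means(samples, num_blocks):
    """Median-of-means: split the samples into blocks, average each block,
    and return the median of the block means. Gives much tighter
    concentration than the sample mean for heavy-tailed distributions."""
    block_size = len(samples) // num_blocks
    means = sorted(
        sum(samples[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(num_blocks)
    )
    mid = num_blocks // 2
    if num_blocks % 2 == 1:
        return means[mid]
    return 0.5 * (means[mid - 1] + means[mid])

random.seed(1)
# Pareto(shape=2.5): true mean = 2.5 / 1.5 ≈ 1.667, heavy right tail.
samples = [random.paretovariate(2.5) for _ in range(9_000)]
est = median_of_means(samples, num_blocks=9)
```

A rare extreme draw can only corrupt one block mean, and the median discards it, which is why this estimator concentrates well even with only low-order moments.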

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2015

The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which had a time-horizon dependency of at least $H^3$.

Robust Adversarial Reinforcement Learning

- Computer Science
- ICML
- 2017

RARL is proposed, where an agent is trained to operate in the presence of a destabilizing adversary that applies disturbance forces to the system, and the jointly trained adversary is reinforced, that is, it learns an optimal destabilization policy.