Publications

Safe Policy Improvement by Minimizing Robust Baseline Regret

Thirtieth Annual Conference on Neural Information Processing Systems (NIPS 2016), 2016.

Publication date: December 1, 2016

Marek Petrik, Mohammad Ghavamzadeh, Yinlam Chow

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, which is guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system’s dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret w.r.t. the baseline policy. In contrast to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.
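As a reader's sketch of the objective described above (the notation here is assumed for illustration, not quoted from the paper): let \pi_B denote the baseline policy, \Pi the set of candidate policies, \Xi the set of dynamics models consistent with the accuracy guarantees, and \rho(\pi, \xi) the return of policy \pi under model \xi. The robust baseline-regret formulation can then be written as

    \pi_S \in \arg\max_{\pi \in \Pi} \; \min_{\xi \in \Xi} \big( \rho(\pi, \xi) - \rho(\pi_B, \xi) \big)

Because the inner minimization ranges over every plausible model, a candidate policy is preferred to \pi_B only where it improves on the baseline under all models in \Xi, which is what allows the method to fall back to the baseline policy in states where the dynamics estimate is too uncertain.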


Research Area: AI & Machine Learning