In this paper, we propose a framework for using reinforcement learning (RL) algorithms to learn good policies for personalized ad recommendation (PAR) systems. RL algorithms take into account the long-term effect of an action and could therefore be more suitable than myopic techniques, such as supervised learning and contextual bandits, for modern PAR systems in which the number of returning visitors is rapidly growing. However, while myopic techniques have been well-studied in PAR systems, the RL approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good RL strategy and how to evaluate a solution using historical data to ensure its “safety” before deployment. In this paper, we propose to use a family of off-policy evaluation techniques with statistical guarantees to tackle both of these challenges. We apply these methods to a real PAR problem, both for evaluating the final performance and for optimizing the parameters of the RL algorithm. Our results show that an RL algorithm equipped with these off-policy evaluation techniques outperforms the myopic approaches. Our results also provide fundamental insights into the difference between the click-through rate (CTR) and life-time value (LTV) metrics for evaluating the performance of a PAR algorithm.
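The abstract does not specify which off-policy evaluation techniques are used; as one illustration of the general idea, the sketch below implements trajectory-wise importance sampling with a Hoeffding-style lower confidence bound, a standard building block for high-confidence off-policy evaluation. The names `pi_e`, `pi_b`, and `b_max` are illustrative assumptions, not the authors' actual estimators or API.

```python
import numpy as np

def importance_sampling_estimates(trajectories, pi_e, pi_b, gamma=1.0):
    """Trajectory-wise importance sampling (IS) estimates of the expected
    return of an evaluation policy pi_e, using data logged under a
    behavior (deployed) policy pi_b.

    trajectories: iterable of trajectories, each a list of (state, action,
    reward) tuples. pi_e(a, s) and pi_b(a, s) return action probabilities;
    both are assumed known (pi_b is the logging policy).
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)  # cumulative likelihood ratio
            ret += (gamma ** t) * r            # discounted return
        estimates.append(weight * ret)         # unbiased IS estimate
    return np.array(estimates)

def hoeffding_lower_bound(estimates, b_max, delta=0.05):
    """A (1 - delta)-confidence lower bound on the expected return via
    Hoeffding's inequality, assuming each IS estimate lies in [0, b_max].
    IS weights are unbounded in general, so bounds of this kind require
    clipping or tighter concentration inequalities in practice."""
    n = len(estimates)
    return estimates.mean() - b_max * np.sqrt(np.log(1.0 / delta) / (2 * n))
```

A lower confidence bound of this form supports the "safety" check described above: a candidate policy is deployed only if the bound on its estimated return exceeds the performance of the current baseline. Variants such as per-decision or weighted importance sampling are commonly used to reduce the high variance of the plain trajectory-wise estimator.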