Reinforcement Learning (RL) algorithms often perform poorly early in the learning process and require many episodes to learn a good policy, which is undesirable for many real-world systems such as recommendation systems. However, using RL to learn to sequence recommendations is especially appealing because RL can account for the sequential nature of the problem. Additionally, many real-world systems come with a large body of HCI- or psychology-focused prior work studying factors important to users, which can be leveraged to learn faster. In recommendation systems, such factors include novelty, diversity, and accuracy. One way to incorporate prior findings is to place constraints on the policy, avoiding exploration of policies that prior work suggests may be suboptimal. However, it is often hard to translate prior findings into a single constraint that an algorithm designer is very confident will work well. Therefore, in this work we consider constructing a set of constraints, each of which an algorithm designer hypothesizes could be helpful without needing to be certain of it. We propose a novel UCRL-based algorithm, Constraint Sampling Reinforcement Learning, which takes such a set of constraints as input and tracks optimistic estimates of the value of following each constraint. The algorithm then optimistically samples over the constraints to decide which to apply during learning. The main idea behind the method is that, if some of the constraints in the set are good, our method can learn the best constraint to follow faster than it can learn the full dynamics, resulting in faster learning and reducing the number of users who have a poor experience. We empirically evaluate our algorithm in a domain fit with real-world data and show it is capable of learning faster than strong baselines.
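As a rough illustration of the sampling idea described above (not the paper's actual algorithm or experimental setup), the sketch below treats each constraint as a restriction on the allowed actions, keeps a UCB-style optimistic estimate of the return obtained when following each constraint, and samples the most optimistic constraint each episode while a simple tabular learner runs underneath. The environment, constraint set, and all parameter names here are hypothetical.

```python
import math
import random
from collections import defaultdict

# Toy episodic environment: a short chain where the agent can move
# left/stay/right; reaching the right end yields reward 1.
class ChainEnv:
    def __init__(self, length=6, horizon=10):
        self.length, self.horizon = length, horizon

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):  # actions: 0=left, 1=stay, 2=right
        self.state = max(0, min(self.length - 1, self.state + action - 1))
        self.t += 1
        reward = 1.0 if self.state == self.length - 1 else 0.0
        done = self.t >= self.horizon or reward > 0
        return self.state, reward, done

# Hypothetical constraints: each maps a state to the set of allowed actions.
CONSTRAINTS = [
    lambda s: [2],        # "always move right" (a good constraint here)
    lambda s: [0, 1],     # "never move right"  (a bad constraint)
    lambda s: [0, 1, 2],  # unconstrained fallback
]

def run_episode(env, Q, allowed, eps=0.1, alpha=0.5):
    """One episode of epsilon-greedy tabular Q-learning restricted to allowed actions."""
    s, total, done = env.reset(), 0.0, False
    while not done:
        acts = allowed(s)
        if random.random() < eps:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda act: Q[(s, act)])
        s2, r, done = env.step(a)
        target = r + (0.0 if done else max(Q[(s2, a2)] for a2 in allowed(s2)))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, total = s2, total + r
    return total

def constraint_sampling(env, constraints, episodes=200, bonus=1.0):
    """Each episode, optimistically sample a constraint via a UCB over its observed returns."""
    Q = defaultdict(float)
    counts = [0] * len(constraints)
    means = [0.0] * len(constraints)
    for t in range(1, episodes + 1):
        # Optimistic index: untried constraints first, then mean return plus exploration bonus.
        ucb = [float("inf") if counts[i] == 0
               else means[i] + bonus * math.sqrt(math.log(t) / counts[i])
               for i in range(len(constraints))]
        i = max(range(len(constraints)), key=lambda k: ucb[k])
        ret = run_episode(env, Q, constraints[i])
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]
    return means, counts

if __name__ == "__main__":
    random.seed(0)
    means, counts = constraint_sampling(ChainEnv(), CONSTRAINTS)
    for i, (m, n) in enumerate(zip(means, counts)):
        print(f"constraint {i}: mean return {m:.2f} over {n} episodes")
```

In this simplified sketch, a helpful constraint shrinks the exploration problem so the learner concentrates its episodes on the good constraint quickly; the unconstrained fallback is kept in the set so a poorly chosen constraint set does not prevent learning altogether.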