StaTeS-SQL: Soft Q Learning with State-Dependent Temperature Scheduling
- Hu, Dailin
- Advisor(s): Fox, Roy
Abstract
Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature). This runs contrary to the intuition that the temperature should be high early in training, to avoid overfitting to noisy value estimates, and should decrease later in training, as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent: it increases each time more evidence is used to update a state's value estimate. In this paper, we present a simple state-based temperature scheduling approach and instantiate it for SQL as StaTeS-SQL. We prove the convergence of this method in the tabular case, describe how to use pseudo-counts generated by a density model to schedule the state-dependent temperature in large state spaces, and propose a combination of our method with the advanced techniques collectively known as Rainbow. We evaluate our approach on the Arcade Learning Environment (Atari) benchmark and outperform Rainbow in 18 of 20 domains.
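To make the idea of a state-dependent temperature concrete, the following is a minimal tabular sketch and not the thesis's implementation. The abstract does not state the schedule, so the inverse-square-root decay in `temperature`, the class name `TabularSoftQ`, and all parameter names and defaults (`tau0`, `lr`, `gamma`) are assumptions chosen for illustration. The sketch only shows the general pattern: each state keeps an update count, and the temperature used in that state's soft value shrinks as the count grows.

```python
import math
from collections import defaultdict


def soft_value(q_values, temperature):
    """Soft (log-sum-exp) state value: tau * log(sum_a exp(Q(s, a) / tau)).

    The maximum is subtracted first for numerical stability.
    """
    m = max(q_values)
    return m + temperature * math.log(
        sum(math.exp((q - m) / temperature) for q in q_values)
    )


def temperature(count, tau0=1.0):
    """Hypothetical schedule (an assumption, not taken from the thesis):
    start at tau0 and decay as the state's update count grows."""
    return tau0 / math.sqrt(1.0 + count)


class TabularSoftQ:
    """Tabular soft Q-learning with a per-state, count-based temperature."""

    def __init__(self, n_actions, gamma=0.99, lr=0.1, tau0=1.0):
        self.gamma = gamma
        self.lr = lr
        self.tau0 = tau0
        self.q = defaultdict(lambda: [0.0] * n_actions)  # Q(s, a) table
        self.counts = defaultdict(int)  # number of updates applied per state

    def update(self, s, a, r, s_next, done):
        # The temperature depends on how much evidence the next state has.
        tau_next = temperature(self.counts[s_next], self.tau0)
        v_next = 0.0 if done else soft_value(self.q[s_next], tau_next)
        target = r + self.gamma * v_next
        self.q[s][a] += self.lr * (target - self.q[s][a])
        self.counts[s] += 1
```

In this sketch, a rarely updated state has a high temperature, so its soft value stays close to an entropy-smoothed average over actions; a frequently updated state has a low temperature, so its soft value approaches the maximum action value, as in standard Q-learning. In large state spaces the abstract indicates that the counts would be replaced by pseudo-counts from a density model, which this sketch does not attempt.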