Deterministic Bellman Residual Minimization

Made for a reading group at the Center for Safe AGI.

Ehsan Saleh, Nan Jiang

Abstract

Bellman Residual Minimization (BRM) algorithms are alternatives to Approximate Dynamic Programming for learning value functions in reinforcement learning.

Although some versions of BRM have superior theoretical properties, this superiority relies on the double sampling trick, which limits their applicability to simulator environments with state resetting functionality.
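For context, the standard reason double sampling is needed (a reminder of the usual argument, not a quote from the paper): with stochastic transitions, a single sample of the next state gives a biased estimate of the squared Bellman error, because

    \mathbb{E}\big[(Q(s,a) - (r + \gamma \max_{a'} Q(s',a')))^2\big]
      = \big(Q(s,a) - \mathbb{E}[r + \gamma \max_{a'} Q(s',a')]\big)^2
        + \mathrm{Var}\big[r + \gamma \max_{a'} Q(s',a')\big].

The first term is the squared Bellman residual one wants to minimize; the extra variance term can be cancelled by forming the product of residuals from two independent draws of the next state, which is the double sampling trick and requires resetting the simulator to the same state.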

In this paper, we make the simple observation that BRM can be applied without state resetting if the environment is deterministic, a possibility Baird [1995] hinted at in his original paper.

The resulting algorithm is simple, convergent, and works well in benchmark control problems.
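As a rough illustration (a minimal sketch, not the paper's implementation; the q_net, batch tensors, and terminal-state handling are assumptions), the difference from semi-gradient Q-learning is that the gradient is allowed to flow through the bootstrapped target, which is justified when each observed transition is deterministic:

    import torch

    def q_learning_loss(q_net, s, a, r, s_next, gamma=0.99):
        # Semi-gradient Q-learning: the bootstrap target is treated as a constant.
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * q_net(s_next).max(dim=1).values
        return ((q_sa - target) ** 2).mean()

    def dbrm_loss(q_net, s, a, r, s_next, gamma=0.99):
        # Deterministic BRM: minimize the squared Bellman residual itself, so
        # gradients also flow through Q at the next state. With deterministic
        # transitions, one sample of s_next gives an unbiased residual.
        # (Terminal-state masking omitted for brevity.)
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        target = r + gamma * q_net(s_next).max(dim=1).values
        return ((q_sa - target) ** 2).mean()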

We compare Q-learning to its DBRM version theoretically, confirm the theoretical predictions in experiments, and also discover some surprising empirical behaviors, for which we provide explanations.

Download link