Batch Value Function Tournament
Presenter
August 3, 2021
Abstract
Offline RL has attracted significant attention from the community because it offers the possibility of applying RL when active data collection is difficult. A key missing ingredient, however, is a reliable model-selection procedure that enables hyperparameter tuning: reductions to off-policy evaluation either suffer from exponential variance or rely on additional hyperparameters, creating a chicken-and-egg problem. In this talk I will discuss our recent progress on a version of this problem, where we need to identify Q* from a large set of candidate functions using a polynomial-sized exploratory dataset. The question is also a long-standing open problem about the information-theoretic nature of batch RL, and many suspected that the task was simply impossible. In our recent work, we provide a solution to this seemingly impossible task via (1) a tournament procedure that performs pairwise comparisons, and (2) a clever trick that partitions the large state-action space adaptively according to the compared functions. The resulting algorithm, BVFT, is very simple and can be readily applied to cross-validation, with preliminary empirical results showing promising performance.
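To make the two ingredients above concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions and not the reference implementation. The abstract does not spell out the details, so the sketch assumes that (a) the partition for a pair of candidates groups samples whose two candidate values fall into the same discretization bins, (b) a candidate is scored against an opponent by its Bellman error after projecting the backup onto that partition, and (c) the winner is the candidate with the smallest worst-case score. The function name `bvft_select`, its arguments, and the `resolution` parameter are hypothetical.

```python
import numpy as np


def bvft_select(q_sa, q_next_max, rewards, gamma=0.99, resolution=0.1):
    """Tournament-style selection among K candidate Q-functions (sketch).

    q_sa[i, n]       : candidate i evaluated at the n-th logged (s, a)
    q_next_max[i, n] : max over a' of candidate i at the n-th next state
                       (assumed to be 0 for terminal transitions)
    rewards[n]       : logged reward of the n-th transition
    Returns (index of the selected candidate, per-candidate worst-case loss).
    """
    q_sa = np.asarray(q_sa, dtype=float)
    q_next_max = np.asarray(q_next_max, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    num_candidates = q_sa.shape[0]

    # Assumption: discretize each candidate's values into bins of width `resolution`.
    bins = np.floor(q_sa / resolution).astype(np.int64)
    # Empirical Bellman backup targets r + gamma * max_a' f_i(s', a').
    targets = rewards[None, :] + gamma * q_next_max

    scores = np.zeros(num_candidates)
    for i in range(num_candidates):
        for j in range(num_candidates):
            # Partition induced by the compared pair: samples share a cell
            # when both candidates place them in the same bin.
            keys = np.stack([bins[i], bins[j]], axis=1)
            _, cell = np.unique(keys, axis=0, return_inverse=True)
            cell = cell.ravel()  # inverse shape differs across NumPy versions
            counts = np.bincount(cell)
            # Project the backup onto piecewise-constant functions:
            # average the targets within each cell.
            projected = np.bincount(cell, weights=targets[i]) / counts
            loss = np.sqrt(np.mean((q_sa[i] - projected[cell]) ** 2))
            # Keep the worst loss of candidate i over all opponents j.
            scores[i] = max(scores[i], loss)

    return int(np.argmin(scores)), scores
```

In this sketch the partition is never built over the raw state-action space, only over pairs of candidate values, so its size depends on `resolution` and the value range, not on the number of states. For cross-validation, each row of `q_sa` would come from a Q-function trained under a different hyperparameter setting and evaluated on a held-out batch of transitions.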