Reinforcement Learning (RL) algorithms train an agent to learn a desired behavior through interaction with an environment whose dynamics are unknown to the agent. Learning the value function of a given policy (the target policy) from data samples obtained under a different policy (the behavior policy) is called "off-policy prediction". In the first part of the talk, I will discuss a convergent online off-policy prediction algorithm under linear function approximation. Subsequently, I will discuss the "off-policy control" setup, where the agent's objective is to compute an optimal policy from data obtained under a behavior policy. To solve this problem, we propose a deep off-policy natural actor-critic algorithm that uses state-action distribution correction to handle the off-policy data and the natural policy gradient for sample efficiency. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the Euclidean gradient actor-critic algorithm on benchmark RL tasks.
Raghuram Bharadwaj is currently a senior data scientist at Myntra. He received his Ph.D. from the Department of Computer Science and Automation in 2021. His research interests include Reinforcement Learning, Multi-agent Learning, and Stochastic Approximation.