A Constrained Markov Decision Process (CMDP) is a framework in which a learner interacts with an unknown environment to maximize the discounted sum of expected rewards while ensuring that the discounted sum of expected costs lies within a pre-specified margin. This talk will discuss our recent work on designing a sample-efficient algorithm for CMDPs with general parameterized policies. General parameterization allows the state space to be infinite and subsumes tabular and linear CMDPs as special cases. It also allows the policies to be represented as neural networks of arbitrary shape and size. I will show that our proposed Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm simultaneously ensures an epsilon optimality gap and an epsilon constraint violation with a sample complexity that grows at most quadratically in 1/epsilon. This is a significant improvement over the previous state of the art and closes the gap between the theoretical lower and upper bounds.
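For readers who prefer symbols, the problem in the abstract can be sketched as follows. This is one common way to write a CMDP and its primal-dual form, not necessarily the notation used in the talk. The discount factor \gamma, reward r, cost c, threshold b, policy parameter \theta, dual variable \lambda, and output parameter \theta_out are all assumed names, and the direction of the cost constraint is also an assumption.

```latex
% Assumed notation: none of these symbols appear in the abstract itself.
\begin{align*}
  J_g(\theta) &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, g(s_t, a_t)\Big],
      \qquad g \in \{r, c\}, \quad \gamma \in (0, 1), \\
  \max_{\theta}\; & J_r(\theta) \quad \text{subject to} \quad J_c(\theta) \le b, \\
  \mathcal{L}(\theta, \lambda) &= J_r(\theta) - \lambda\,\big(J_c(\theta) - b\big),
      \qquad \lambda \ge 0 .
\end{align*}
% Reading of the stated guarantee, with \theta^{*} an optimal feasible parameter:
\begin{align*}
  J_r(\theta^{*}) - \mathbb{E}\big[J_r(\theta_{\mathrm{out}})\big] \le \epsilon,
  \qquad
  \mathbb{E}\big[J_c(\theta_{\mathrm{out}})\big] - b \le \epsilon .
\end{align*}
```

Under this reading, "at most quadratically" corresponds to a sample complexity on the order of 1/\epsilon^2. The abstract does not say whether logarithmic factors or approximation-error terms are included, so the sketch leaves them out.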
Dr. Washim Uddin Mondal is an assistant professor in the Department of Electrical Engineering at IIT Kanpur. He was a postdoctoral researcher at Purdue University, USA, from April 2021 to December 2023. He obtained both his Ph.D. (2017-2021) and his B.Tech.-M.Tech. dual degree (2011-2016) from IIT Kharagpur. For a brief period in 2016, he worked as a quantitative financial analyst at WorldQuant, Mumbai. During his Ph.D., he was awarded the Prime Minister's Research Fellowship (PMRF), and his thesis was selected as one of the national winners of the Graduate Thesis Evaluation in 7 Minutes (GraTE-7) competition organized by the IEEE Communications Society. He has coauthored multiple papers in, and served as a reviewer or TPC member for, reputed venues such as IEEE Transactions, IEEE ICC, AAAI, NeurIPS, and AISTATS. He has also worked as a consultant to the Indian Army on problems of national interest.