Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms

Shalabh Bhatnagar · Deepak H R


Abstract

We revisit the REINFORCE policy gradient algorithm from the literature, which works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm in which the policy gradient is estimated using a smoothed functional (random perturbation) gradient estimator obtained from direct function measurements. To handle the high variance that is typical of REINFORCE, we propose two independent enhancements to the basic scheme: (i) using the sign of the increment instead of the original (full) increment, which results in smoother convergence, and (ii) using clipped gradient estimates as in the Proximal Policy Optimization (PPO) based scheme. We prove the asymptotic convergence of all algorithms and report the results of several experiments on various MuJoCo locomotion tasks, in which we compare the performance of our algorithms with the recently proposed ARS algorithms as well as other well-known algorithms, namely A2C, PPO, and TRPO. Our algorithms are competitive against all of these and in fact show the best results on a majority of the experiments.
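To make the ideas in the abstract concrete, the sketch below illustrates one parameter update using a two-sided smoothed functional (random perturbation) gradient estimate built from direct episode-return measurements, together with the two variance-reduction variants the abstract mentions. This is only a minimal illustration based on the abstract's description, not the paper's actual algorithms: the names `sf_reinforce_step` and `episode_return`, the perturbation scale `delta`, the step size `lr`, and the component-wise clipping used for variant (ii) are all assumptions introduced here (the paper's clipping follows a PPO-style scheme whose details are not given in the abstract).

```python
import numpy as np

def sf_reinforce_step(theta, episode_return, delta=0.05, lr=1e-2,
                      variant="plain", clip=1.0, rng=None):
    """One smoothed-functional (random-perturbation) gradient ascent step.

    `episode_return(params)` is assumed to roll out one episode with the
    policy parameterised by `params` and return its total reward.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gaussian perturbation direction for the smoothed functional estimator.
    perturb = rng.standard_normal(theta.shape)
    # Two-sided return measurements at the perturbed parameters.
    j_plus = episode_return(theta + delta * perturb)
    j_minus = episode_return(theta - delta * perturb)
    # Smoothed functional estimate of the policy gradient.
    grad_est = perturb * (j_plus - j_minus) / (2.0 * delta)
    increment = lr * grad_est
    if variant == "sign":
        # Variant (i): use only the sign of the increment components.
        increment = lr * np.sign(grad_est)
    elif variant == "clip":
        # Variant (ii): bound the increment (illustrative component-wise clip,
        # standing in for the paper's PPO-style clipping).
        increment = np.clip(increment, -clip, clip)
    return theta + increment  # gradient ascent on the expected return
```

For example, `theta = sf_reinforce_step(theta, episode_return, variant="sign")` would apply one sign-based update, assuming `episode_return` wraps a MuJoCo rollout under the current policy parameterisation.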