
Guide to Guided Policy Search

It took me a long time to understand guided policy search (GPS), which is a popular algorithm in reinforcement learning. I picked up a surface-level understanding from research talks and conversations, but it didn’t really click until I needed to learn it for my own research. I’ll try to distill and present the intuition here, writing in such a way that this post would have been useful to me had I read it a few months ago.

Part of the reason I wrote about model-based reinforcement learning was to set up for this guide. If you are not familiar with optimal control or dual gradient descent, I recommend reading that post first.

As inspiration, GPS in action!

Learning a policy

In our previous discussion on optimal control, we focused only on learning a controller. This controller can be useful in robotic situations where we have a specific starting point and target, but we cannot expect it to generalize. The controller is defined with respect to some specific trajectory and is unlikely to work if we deviate significantly from that trajectory (such as by moving the initial position of the robot arm 10 centimeters). In contrast, a policy, such as one parametrized by a neural network, can generalize better and does not require replanning.

Guided policy search is an efficient way of learning such a policy. We will first discuss the general high level view of GPS, then dive into the details that make the algorithm work.

The only change we have to make to our previous formulation of trajectory optimization is to add a constraint so that the policy we train takes actions that match the actions given by the feedback controller. To make this difference clear, recall the previous learning objective for LQR:

\[\min_{u_1, \ldots, u_T} \sum_{t=1}^T c(x_t, u_t) \text{ s.t. } x_t = f(x_{t-1}, u_{t-1})\]

We previously discussed how to solve this problem when the model \(f\) is given and when it is not (one way is to learn locally linear models and then to use these models to solve for a time-varying linear Gaussian controller). Succinctly, all we are doing is finding the trajectory that minimizes the cost.

\[\min_{\tau} c(\tau)\]
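To make the notation concrete, here is a minimal sketch of what a trajectory and its total cost might look like in code. The per-step cost, the dynamics, and the name rollout_cost are toy placeholders of my own, not pieces of any particular GPS implementation.

```python
import numpy as np

# Toy per-step cost and dynamics (placeholders for the task cost c and model f).
def cost(x, u):
    return float(x @ x + 0.1 * u @ u)

def dynamics(x, u):
    return x + 0.1 * u

def rollout_cost(x0, us):
    """c(tau): roll the actions out from x0 under the model and sum the per-step costs."""
    x, total = x0, 0.0
    for u in us:
        total += cost(x, u)
        x = dynamics(x, u)
    return total

x0 = np.zeros(2)
us = np.zeros((10, 2))  # a length-10 sequence of 2-D actions
print(rollout_cost(x0, us))
```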

Our objective in the GPS formulation is still to find this optimal trajectory, but we also want the actions taken to come from our policy, parametrized by \(\theta\). Hence our new problem formulation is:

\[\min_{\tau, \theta} c(\tau) \text{ s.t. } u_t = \pi_{\theta}(x_t)\]

Now we have two primal variables to optimize over, \(\tau\) and \(\theta\). We again use dual gradient descent, forming the Lagrangian and turning this into an unconstrained optimization problem.

\[\mathcal{L}(\tau, \theta, \lambda) = c(\tau) + \sum_{t=1}^T \lambda_t(\pi_{\theta}(x_t) - u_t)\]

Running dual gradient descent then corresponds to an iterative three-step process.

  1. Start with some initial choice of \(\lambda\) (here \(\lambda\) denotes the collection of multipliers \(\lambda_t\), one for each time step)
  2. Assign \(\tau \leftarrow \arg\min_{\tau} \mathcal{L}(\tau, \theta, \lambda)\).
  3. Assign \(\theta \leftarrow \arg\min_{\theta} \mathcal{L}(\tau, \theta, \lambda)\).
  4. Compute \(\frac{dg}{d \lambda} = \frac{d \mathcal{L}}{d \lambda} (\tau, \theta, \lambda)\). Take a gradient step \(\lambda \leftarrow \lambda + \alpha \frac{dg}{d \lambda}\)
  5. Repeat steps 2-4.

Step 2 can be solved by running a trajectory optimization method (such as iLQR) with an augmented cost function that penalizes deviating from the actions output by the policy.
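Concretely, the per-step cost handed to the trajectory optimizer just picks up the Lagrangian term. A minimal sketch, where cost, policy, and lam are hypothetical stand-ins for the task cost, the current policy, and the per-time-step multipliers:

```python
def augmented_cost(t, x, u, cost, policy, lam):
    """Per-step cost for step 2: c(x_t, u_t) + lambda_t^T (pi_theta(x_t) - u_t).

    `cost`, `policy`, and `lam` (one multiplier vector per time step) are
    placeholders for whatever the surrounding trajectory optimizer uses.
    """
    return cost(x, u) + lam[t] @ (policy(x) - u)
```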

Step 3 is just supervised learning! Note in particular that \(\theta\) does not affect the first term \(c(\tau)\), so we get a nice decoupling where our only objective is to minimize \(\sum_{t=1}^T \lambda_t(\pi_{\theta}(x_t) - u_t)\). Training the policy this way also avoids the ill-conditioning that comes from states within the same trajectory being coupled together.
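Here is a rough sketch of that supervised step in PyTorch. All of the data, dimensions, and hyperparameters below are made-up placeholders, and I have included the quadratic penalty discussed in the implementation notes at the end, since minimizing only the linear \(\lambda\) term with an unconstrained neural network would be ill-posed.

```python
import torch
import torch.nn as nn

T, x_dim, u_dim = 50, 4, 2

# Dummy stand-ins for the data produced by step 2:
xs   = torch.randn(T, x_dim)   # states visited by the controller
us   = torch.randn(T, u_dim)   # actions the controller took at those states
lams = torch.zeros(T, u_dim)   # Lagrange multipliers, one vector per time step
rho  = 1.0                     # a single quadratic-penalty weight, for simplicity

policy = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, u_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(200):
    diff = policy(xs) - us                                # pi_theta(x_t) - u_t
    loss = (lams * diff).sum() + rho * (diff ** 2).sum()  # lambda term + penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```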

Step 4 is easy, as is usual with DGD: the gradient with respect to each \(\lambda_t\) is just the constraint violation \(\pi_{\theta}(x_t) - u_t\), so this step amounts to measuring how much the policy and the controller disagree.
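Putting the three steps together, the outer loop looks roughly like the sketch below. The trajectory optimizer and policy-fitting routines are toy stubs standing in for iLQR on the augmented cost (step 2) and the supervised regression above (step 3); only the loop structure and the dual update are meant to carry over.

```python
import numpy as np

T, x_dim, u_dim = 50, 4, 2
alpha = 0.01                                  # dual step size

# Toy stubs so the loop runs; a real implementation would put iLQR with the
# augmented cost here (step 2) and neural-network regression here (step 3).
def trajectory_optimizer(lams, policy):
    return np.zeros((T, x_dim)), np.zeros((T, u_dim))

def fit_policy(xs, us, lams):
    return lambda x: np.zeros(u_dim)

lams = np.zeros((T, u_dim))                   # step 1: initial multipliers
policy = lambda x: np.zeros(u_dim)

for _ in range(10):
    xs, us = trajectory_optimizer(lams, policy)   # step 2
    policy = fit_policy(xs, us, lams)             # step 3
    # Step 4: dL/dlambda_t is the constraint violation pi_theta(x_t) - u_t.
    for t in range(T):
        lams[t] += alpha * (policy(xs[t]) - us[t])
```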

To build intuition on why this algorithm works, I’ll quote this excellent description by Zhang et al.:

Since each trajectory-centric teacher only needs to solve the task from a single initial state, it is faced with a much easier problem. The final policy is trained with supervised learning, which allows us to use a nonlinear, high-dimensional representation for this final policy, such as a multilayer neural network, in order to learn complex behaviors with good generalization. A key component in guided policy search is adaptation between the trajectories produced by the teacher and the final policy. This adaptation ensures that, at convergence, the teacher does not take actions that the final policy cannot reproduce. This is realized by an alternating optimization procedure, which iteratively optimizes the policy to match each teacher, while the teachers adapt to gradually match the behavior of the final policy.

That’s the gist of GPS. This procedure is neat in that we can view the policy as imitating a supervised learning teacher, but the teacher adapts to produce actions that the learner can also execute (because of the alternating optimization). We’ll end by describing some additional implementation details from End-to-End Training of Deep Visuomotor Policies.

  • When we form the Lagrangian, we add in a quadratic term that penalizes constraint violations: \(\mathcal{L}(\tau, \theta, \lambda) = c(\tau) + \sum_{t=1}^T \lambda_t(\pi_{\theta}(x_t) - u_t) + \sum_{t=1}^T \rho_t \|\pi_{\theta}(x_t) - u_t\|^2\)

This tweak is based on BADMM, a variant of ADMM (the alternating direction method of multipliers). The augmented Lagrangian converges to the same solution, but can improve stability when the constraint is heavily violated. A small code sketch of this augmented objective appears after these bullets.

  • (observation / action remapping trick)

  • (reward function description)
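As mentioned above, here is a small sketch that simply evaluates the augmented objective from the first bullet; every name in it is a hypothetical placeholder.

```python
def augmented_lagrangian(traj_cost, xs, us, policy, lams, rhos):
    """c(tau) + sum_t lambda_t^T (pi_theta(x_t) - u_t) + sum_t rho_t ||pi_theta(x_t) - u_t||^2."""
    value = traj_cost
    for x, u, lam, rho in zip(xs, us, lams, rhos):
        diff = policy(x) - u
        value += lam @ diff + rho * (diff @ diff)
    return value
```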