This post focuses on understanding Proximal Policy Optimization (PPO), an on-policy policy-gradient reinforcement learning algorithm.
Clipping
L^{\mathrm{CPI}}(\theta) = \hat{\mathbb{E}}_{t}\!\Biggl[ \textcolor{blue}{\frac{\pi_{\theta}(a_{t}\mid s_{t})} {\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})}} \,\hat{A}_{t} \Biggr] \;=\; \hat{\mathbb{E}}_{t}\!\bigl[\textcolor{blue}{r_{t}(\theta)}\,\hat{A}_{t}\bigr]
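In practice the ratio is usually computed from log-probabilities for numerical stability. Below is a minimal sketch, assuming a PyTorch setup where the per-action log-probabilities under both policies are already available (the values here are arbitrary placeholders):

```python
import torch

log_prob_new = torch.tensor([-1.1, -0.4, -2.3])   # log pi_theta(a_t | s_t)
log_prob_old = torch.tensor([-1.0, -0.5, -2.0])   # log pi_theta_old(a_t | s_t)

# r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) = exp(difference of log-probs)
ratio = torch.exp(log_prob_new - log_prob_old)
```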
Figure: clipped surrogate objective plots (Schulman et al., 2017).
The plot has two subfigures, illustrating the two conditions \(A\gt 0\) and \(A\lt 0\). The case \(A=0\) is omitted, since both \(r_{t}(\theta)\,\hat{A}_{t}\) and \(\mathrm{clip}\bigl(r_{t}(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_{t}\) would then be \(0\).
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_{t}\Bigl[\, \min\bigl( r_{t}(\theta)\,\hat{A}_{t},\; \mathrm{clip}\bigl(r_{t}(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_{t} \bigr) \Bigr]
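As a rough sketch of this objective (PyTorch assumed; `ratio` and `advantage` are the quantities defined above, and the batch average stands in for the empirical expectation \(\hat{\mathbb{E}}_{t}\)):

```python
import torch

def clipped_surrogate_objective(ratio, advantage, epsilon=0.2):
    """L^CLIP: elementwise min of the unclipped and clipped terms, averaged over the batch."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return torch.min(unclipped, clipped).mean()
```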
The clipped surrogate objective is the objective we want to maximize, so we update the network parameters \(\theta\) with gradient ascent rather than gradient descent. In addition, \(\hat{A}_{t}\) depends only on \(\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})\), because \(\pi_{\theta}(a_{t}\mid s_{t})\) is not the policy used to interact with the environment.
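To make the "maximize via gradient ascent" point concrete, here is a toy update step, assuming a categorical policy parameterized directly by logits and reusing the `clipped_surrogate_objective` helper sketched above; all numbers are placeholders, not results from a real rollout:

```python
import torch

# Toy categorical policy over 4 actions, parameterized by logits (stand-in for a network).
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

actions = torch.tensor([0, 2, 1])                  # actions sampled by the old policy
log_prob_old = torch.tensor([-1.4, -1.4, -1.4])    # log pi_theta_old(a_t | s_t), fixed
advantage = torch.tensor([1.0, -0.5, 2.0])         # advantage estimates from rollouts

log_prob_new = torch.distributions.Categorical(logits=logits).log_prob(actions)
ratio = torch.exp(log_prob_new - log_prob_old)

# Gradient *ascent* on L^CLIP is done by gradient descent on its negative.
loss = -clipped_surrogate_objective(ratio, advantage, epsilon=0.2)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```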
When \(A\gt 0\), which corresponds to the left subfigure, \(L^{\mathrm{CLIP}}(\theta)\) lies in the first quadrant of the coordinate axes, since \(A\gt 0\) and the ratio \(r_{t}(\theta)\) is always positive.
At first, this plot is confusing: a \(\min(\cdot)\) is taken over the clipped and unclipped terms, and in the region where \(r\leqslant 1-\epsilon\) the objective is not clipped the way it is where \(r\geqslant 1+\epsilon\). So how does this clipping mechanism keep the policy from moving too far away? Suppose \(\pi_{\theta}(a_{t}\mid s_{t})\) and \(\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})\) assign two probabilities to the same action \(a\) for the current state \(s\), and after the calculation we get \(r\leqslant 1-\epsilon\) with \(\hat{A}\gt 0\). This means the new policy is not as good as the old policy for this action: the surrogate objective value is lower, and gradient ascent adjusts the policy back toward the old one, even when \(\hat{A}\gt 0\). Note that the surrogate objective is not a loss function; its purpose is to provide stable and reliable policy updates. A loss function is something whose value we want to minimize, whereas here we maximize.
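A small numeric check of this case, with arbitrary illustrative values (\(\epsilon=0.2\), \(\hat{A}=1\)): for \(r\leqslant 1-\epsilon\) the min keeps the lower, unclipped term, so gradient ascent still pushes \(r\) back up toward \(1\); for \(r\geqslant 1+\epsilon\) the clipped term takes over and removes any incentive to push \(r\) further.

```python
epsilon, advantage = 0.2, 1.0   # the A_hat > 0 case

def l_clip(r):
    clipped_r = max(1 - epsilon, min(r, 1 + epsilon))
    return min(r * advantage, clipped_r * advantage)

print(l_clip(0.7))   # 0.7 -> unclipped term is active; the objective still depends on r
print(l_clip(1.5))   # 1.2 -> clipped at (1 + epsilon) * A_hat; increasing r gains nothing
```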
- Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017).