| Variants Implemented | Description |
| -------------------- | ----------- |
| :material-github: [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py), :material-file-document: [docs](/rl-algorithms/ddpg/#ddpg_continuous_actionpy) | For continuous action space |


Below is our single-file implementation of DDPG:

## `ddpg_continuous_action.py`

`python cleanrl/ddpg_continuous_action.py --env-id Hopper-v3`

### Explanation of the logged metrics

Running `python cleanrl/ddpg_continuous_action.py` will automatically record various metrics, such as the actor and value losses, in Tensorboard. Below is the documentation for these metrics:

* `charts/episodic_return`: episodic return of the game
* `charts/SPS`: number of steps per second
* `losses/qf1_loss`: the mean squared error (MSE) between the Q values at timestep $t$ and the Bellman update target estimated using the reward $r_t$ and the Q values at timestep $t+1$, thus minimizing the *one-step* temporal difference (see the critic-loss sketch after this list). Formally, it can be expressed by the equation below.
$$
J(\theta^{Q}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ (Q(s, a) - y)^2 \big],
$$
with the Bellman update target $y = r + \gamma \, Q^{'}(s', a')$, where $a' \sim \mu^{'}(s')$, and the replay buffer $\mathcal{D}$.

* `losses/actor_loss`: implemented as `-qf1(data.observations, actor(data.observations)).mean()`; it is the *negative* average Q value computed from 1) the sampled observations and 2) the actions the actor computes for those observations. By minimizing `actor_loss`, the optimizer updates the actor's parameters using the following gradient (Lillicrap et al., 2016, Algorithm 1)[^1], made concrete in the actor-update sketch after this list:

$$ \nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i\left.\left.\nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\right|_{s=s_{i}, a=\mu\left(s_{i}\right)} \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\right|_{s_{i}} $$

* `losses/qf1_values`: implemented as `qf1(data.observations, data.actions).view(-1)`; it is the average Q value of the sampled data in the replay buffer, useful for gauging whether under- or over-estimation happens.
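
For concreteness, here is a minimal PyTorch sketch of how the critic loss above is computed, modeled on the logic in `ddpg_continuous_action.py`. The helper name `compute_qf1_loss`, its standalone signature, and the `gamma` default are illustrative stand-ins rather than the exact training loop; `data` is assumed to be a replay-buffer batch with CleanRL's field names (`observations`, `actions`, `next_observations`, `rewards`, `dones`).

```python
import torch
import torch.nn.functional as F

def compute_qf1_loss(data, qf1, qf1_target, target_actor, gamma=0.99):
    # Bellman update target y = r + gamma * (1 - done) * Q'(s', a'),
    # with a' = mu'(s') from the target actor; no gradients flow
    # through the target networks.
    with torch.no_grad():
        next_actions = target_actor(data.next_observations)
        next_q = qf1_target(data.next_observations, next_actions).view(-1)
        y = data.rewards.flatten() + gamma * (1 - data.dones.flatten()) * next_q
    # The mean of these values is what `losses/qf1_values` reports.
    qf1_values = qf1(data.observations, data.actions).view(-1)
    # `losses/qf1_loss`: the MSE between Q(s, a) and the one-step target y.
    return F.mse_loss(qf1_values, y)
```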
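
A matching sketch for the actor update follows; `actor_optimizer` is a hypothetical optimizer over the actor's parameters, and the helper name `update_actor` is likewise illustrative.

```python
import torch

def update_actor(data, qf1, actor, actor_optimizer):
    # `losses/actor_loss`: the negative average Q value of the actions
    # the actor produces for the sampled observations.
    actor_loss = -qf1(data.observations, actor(data.observations)).mean()
    # Minimizing this loss backpropagates through the critic into the
    # actor, realizing the deterministic policy gradient of
    # Lillicrap et al. (2016); only the actor's parameters are stepped.
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss
```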


### Implementation details
Below are the average episodic returns for [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py).

Note that [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses the gym MuJoCo v2 environments while [`OurDDPG.py`](https://github.com/sfujim/TD3/blob/master/OurDDPG.py) (Fujimoto et al., 2018)[^2] uses the gym MuJoCo v1 environments. According to :material-github: [openai/gym#834](https://github.com/openai/gym/pull/834), the gym MuJoCo v2 environments should be equivalent to the gym MuJoCo v1 environments.

Also note that our `ddpg_continuous_action.py` scores lower than the reference implementation on Walker2d and Hopper. This is likely due to :material-github: [openai/baselines#938](https://github.com/openai/baselines/issues/938). We would have a hard time reproducing the gym MuJoCo v1 environments because they have long been deprecated.

Learning curves:
