|
9 | 9 |
|
10 | 10 |
|
11 | 11 | class AntEnv(mujoco_env.MujocoEnv, utils.EzPickle):
|
12 |
| - """ |
13 |
| - ### Description |
14 |
| -
|
15 |
| - This environment is based on the environment introduced by Schulman, |
16 |
| - Moritz, Levine, Jordan and Abbeel in ["High-Dimensional Continuous Control |
17 |
| - Using Generalized Advantage Estimation"](https://arxiv.org/abs/1506.02438). |
18 |
| - The ant is a 3D robot consisting of one torso (free rotational body) with |
19 |
| - four legs attached to it with each leg having two links. The goal is to |
20 |
| - coordinate the four legs to move in the forward (right) direction by applying |
21 |
| - torques on the eight hinges connecting the two links of each leg and the torso |
22 |
| - (nine parts and eight hinges). |
23 |
| -
|
24 |
| - ### Action Space |
25 |
| - The action space is a `Box(-1, 1, (8,), float32)`. An action represents the torques applied at the hinge joints. |
26 |
| -
|
27 |
| - | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit | |
28 |
| - |-----|----------------------|---------------|----------------|---------------------------------------|-------|------| |
29 |
| - | 0 | Torque applied on the rotor between the torso and front left hip | -1 | 1 | hip_1 (front_left_leg) | hinge | torque (N m) | |
30 |
| - | 1 | Torque applied on the rotor between the front left two links | -1 | 1 | ankle_1 (front_left_leg) | hinge | torque (N m) | |
31 |
| - | 2 | Torque applied on the rotor between the torso and front right hip | -1 | 1 | hip_2 (front_right_leg) | hinge | torque (N m) | |
32 |
| - | 3 | Torque applied on the rotor between the front right two links | -1 | 1 | ankle_2 (front_right_leg) | hinge | torque (N m) | |
33 |
| - | 4 | Torque applied on the rotor between the torso and back left hip | -1 | 1 | hip_3 (back_leg) | hinge | torque (N m) | |
34 |
| - | 5 | Torque applied on the rotor between the back left two links | -1 | 1 | ankle_3 (back_leg) | hinge | torque (N m) | |
35 |
| - | 6 | Torque applied on the rotor between the torso and back right hip | -1 | 1 | hip_4 (right_back_leg) | hinge | torque (N m) | |
36 |
| - | 7 | Torque applied on the rotor between the back right two links | -1 | 1 | ankle_4 (right_back_leg) | hinge | torque (N m) | |
37 |
| -
|
38 |
| - ### Observation Space |
39 |
| -
|
40 |
| - Observations consist of positional values of different body parts of the ant, |
41 |
| - followed by the velocities of those individual parts (their derivatives) with all |
42 |
| - the positions ordered before all the velocities. |
43 |
| -
|
44 |
| - By default, observations do not include the x- and y-coordinates of the ant's torso. These may |
45 |
| - be included by passing `exclude_current_positions_from_observation=False` during construction. |
46 |
| - In that case, the observation space will have 113 dimensions where the first two dimensions |
47 |
| - represent the x- and y- coordinates of the ant's torso. |
48 |
| - Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x- and y-coordinates |
49 |
| - of the torso will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively. |
50 |
| -
|
51 |
| - However, by default, an observation is an `ndarray` with shape `(111,)` |
52 |
| - where the elements correspond to the following: |
53 |
| -
|
54 |
| - | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit | |
55 |
| - |-----|-------------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------| |
56 |
| - | 0 | z-coordinate of the torso (centre) | -Inf | Inf | torso | free | position (m) | |
57 |
| - | 1 | x-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) | |
58 |
| - | 2 | y-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) | |
59 |
| - | 3 | z-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) | |
60 |
| - | 4 | w-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) | |
61 |
| - | 5 | angle between torso and first link on front left | -Inf | Inf | hip_1 (front_left_leg) | hinge | angle (rad) | |
62 |
| - | 6 | angle between the two links on the front left | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angle (rad) | |
63 |
| - | 7 | angle between torso and first link on front right | -Inf | Inf | hip_2 (front_right_leg) | hinge | angle (rad) | |
64 |
| - | 8 | angle between the two links on the front right | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angle (rad) | |
65 |
| - | 9 | angle between torso and first link on back left | -Inf | Inf | hip_3 (back_leg) | hinge | angle (rad) | |
66 |
| - | 10 | angle between the two links on the back left | -Inf | Inf | ankle_3 (back_leg) | hinge | angle (rad) | |
67 |
| - | 11 | angle between torso and first link on back right | -Inf | Inf | hip_4 (right_back_leg) | hinge | angle (rad) | |
68 |
| - | 12 | angle between the two links on the back right | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angle (rad) | |
69 |
| - | 13 | x-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) | |
70 |
| - | 14 | y-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) | |
71 |
| - | 15 | z-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) | |
72 |
| - | 16 | x-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) | |
73 |
| - | 17 | y-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) | |
74 |
| - | 18 | z-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) | |
75 |
| - | 19 | angular velocity of angle between torso and front left link | -Inf | Inf | hip_1 (front_left_leg) | hinge | angular velocity (rad/s) | |
76 |
| - | 20 | angular velocity of the angle between front left links | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angular velocity (rad/s) | |
77 |
| - | 21 | angular velocity of angle between torso and front right link | -Inf | Inf | hip_2 (front_right_leg) | hinge | angular velocity (rad/s) | |
78 |
| - | 22 | angular velocity of the angle between front right links | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angular velocity (rad/s) | |
79 |
| - | 23 | angular velocity of angle between torso and back left link | -Inf | Inf | hip_3 (back_leg) | hinge | angular velocity (rad/s) | |
80 |
| - | 24 | angular velocity of the angle between back left links | -Inf | Inf | ankle_3 (back_leg) | hinge | angular velocity (rad/s) | |
81 |
| - | 25 | angular velocity of angle between torso and back right link | -Inf | Inf | hip_4 (right_back_leg) | hinge | angular velocity (rad/s) | |
82 |
| - | 26 | angular velocity of the angle between back right links | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angular velocity (rad/s) | |
83 |
| -
|
84 |
| -
|
85 |
| - The remaining 14*6 = 84 elements of the observation are contact forces |
86 |
| - (external forces - force x, y, z and torque x, y, z) applied to the |
87 |
| - center of mass of each of the links. The 14 links are: the ground link, |
88 |
| - the torso link, and 3 links for each leg (1 + 1 + 12) with the 6 external forces. |
89 |
| -
|
90 |
| - The (x,y,z) coordinates are translational DOFs while the orientations are rotational |
91 |
| - DOFs expressed as quaternions. One can read more about free joints on the [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html). |
92 |
| -
|
93 |
| -
|
94 |
| - **Note:** There have been reported issues that using a Mujoco-Py version > 2.0 results |
95 |
| - in the contact forces always being 0. As such, we recommend using a Mujoco-Py version < 2.0 |
96 |
| - when using the Ant environment if you would like to report results with contact forces (if |
97 |
| - contact forces are not used in your experiments, you can use version > 2.0). |
98 |
| -
|
99 |
| - ### Rewards |
100 |
| - The reward consists of four parts: |
101 |
| - - *healthy_reward*: Every timestep that the ant is healthy (see definition in section "Episode Termination"), it gets a reward of fixed value `healthy_reward` |
102 |
| - - *forward_reward*: A reward of moving forward which is measured as |
103 |
| - *(x-coordinate after action - x-coordinate before action)/dt*. *dt* is the time |
104 |
| - between actions and is dependent on the `frame_skip` parameter (default is 5), |
105 |
| - where the frametime is 0.01 - making the default *dt = 5 * 0.01 = 0.05*. |
106 |
| - This reward would be positive if the ant moves forward (in positive x direction). |
107 |
| - - *ctrl_cost*: A negative reward for penalising the ant if it takes actions |
108 |
| - that are too large. It is measured as *`ctrl_cost_weight` * sum(action<sup>2</sup>)* |
109 |
| - where *`ctrl_cost_weight`* is a parameter set for the control and has a default value of 0.5. |
110 |
| - - *contact_cost*: A negative reward for penalising the ant if the external contact |
111 |
| - force is too large. It is calculated *`contact_cost_weight` * sum(clip(external contact |
112 |
| - force to `contact_force_range`)<sup>2</sup>)*. |
113 |
| -
|
114 |
| - The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost - contact_cost* and `info` will also contain the individual reward terms. |
115 |
| -
|
116 |
| - ### Starting State |
117 |
| - All observations start in state |
118 |
| - (0.0, 0.0, 0.75, 1.0, 0.0 ... 0.0) with a uniform noise in the range |
119 |
| - of [-`reset_noise_scale`, `reset_noise_scale`] added to the positional values and Gaussian noise |
120 |
| - with mean 0 and standard deviation `reset_noise_scale` added to the velocity values for |
121 |
| - stochasticity. Note that the initial z coordinate is intentionally selected |
122 |
| - to be slightly high, thereby indicating a standing up ant. The initial orientation |
123 |
| - is designed to make it face forward as well. |
124 |
| -
|
125 |
| - ### Episode Termination |
126 |
| - The ant is said to be unhealthy if any of the following happens: |
127 |
| -
|
128 |
| - 1. Any of the state space values is no longer finite |
129 |
| - 2. The z-coordinate of the torso is **not** in the closed interval given by `healthy_z_range` (defaults to [0.2, 1.0]) |
130 |
| -
|
131 |
| - If `terminate_when_unhealthy=True` is passed during construction (which is the default), |
132 |
| - the episode terminates when any of the following happens: |
133 |
| -
|
134 |
| - 1. The episode duration reaches 1000 timesteps |
135 |
| - 2. The ant is unhealthy |
136 |
| -
|
137 |
| - If `terminate_when_unhealthy=False` is passed, the episode is terminated only when the episode duration reaches 1000 timesteps. |
138 |
| -
|
139 |
| - ### Arguments |
140 |
| -
|
141 |
| - No additional arguments are currently supported in v2 and lower. |
142 |
| -
|
143 |
| - ``` |
144 |
| - env = gym.make('Ant-v2') |
145 |
| - ``` |
146 |
| -
|
147 |
| - v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. |
148 |
| -
|
149 |
| - ``` |
150 |
| - env = gym.make('Ant-v3', ctrl_cost_weight=0.1, ...) |
151 |
| - ``` |
152 |
| -
|
153 |
| - | Parameter | Type | Default |Description | |
154 |
| - |-------------------------|------------|--------------|-------------------------------| |
155 |
| - | `xml_file` | **str** | `"ant.xml"` | Path to a MuJoCo model | |
156 |
| - | `ctrl_cost_weight` | **float** | `0.5` | Weight for *ctrl_cost* term (see section on reward) | |
157 |
| - | `contact_cost_weight` | **float** | `5e-4` | Weight for *contact_cost* term (see section on reward) | |
158 |
| - | `healthy_reward` | **float** | `1` | Constant reward given if the ant is "healthy" after timestep | |
159 |
| - | `terminate_when_unhealthy` | **bool**| `True` | If true, issue a done signal if the z-coordinate of the torso is no longer in the `healthy_z_range` | |
160 |
| - | `healthy_z_range` | **tuple** | `(0.2, 1)` | The ant is considered healthy if the z-coordinate of the torso is in this range | |
161 |
| - | `contact_force_range` | **tuple** | `(-1, 1)` | Contact forces are clipped to this range in the computation of *contact_cost* | |
162 |
| - | `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) | |
163 |
| - | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies | |
164 |
| -
|
165 |
| - ### Version History |
166 |
| -
|
167 |
| - * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen) |
168 |
| - * v2: All continuous control environments now use mujoco_py >= 1.50 |
169 |
| - * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments. |
170 |
| - * v0: Initial versions release (1.0.0) |
171 |
| - """ |
172 |
| - |
173 | 12 | def __init__(
|
174 | 13 | self,
|
175 | 14 | xml_file="ant.xml",
|
@@ -199,7 +38,7 @@ def __init__(
|
199 | 38 | exclude_current_positions_from_observation
|
200 | 39 | )
|
201 | 40 |
|
202 |
| - mujoco_env.MujocoEnv.__init__(self, xml_file, 5) |
| 41 | + mujoco_env.MujocoEnv.__init__(self, xml_file, 5, mujoco_bindings="mujoco_py") |
203 | 42 |
|
204 | 43 | @property
|
205 | 44 | def healthy_reward(self):
|
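
The reward and health rules described in the removed docstring can be sketched in plain Python. This is only an illustration of the documented formulas, not the environment's actual implementation: the function names `ant_reward` and `is_healthy` and their signatures are hypothetical, the defaults are copied from the docstring's parameter table, and the forward reward is taken as positive for motion in the positive x direction, as the docstring states.

```python
import math


def is_healthy(state, z, healthy_z_range=(0.2, 1.0)):
    """Hypothetical helper: healthy means every state value is finite and
    the torso z-coordinate lies in the closed interval healthy_z_range."""
    z_min, z_max = healthy_z_range
    return all(math.isfinite(v) for v in state) and z_min <= z <= z_max


def ant_reward(x_before, x_after, action, contact_forces, healthy=True,
               dt=0.05, healthy_reward=1.0, ctrl_cost_weight=0.5,
               contact_cost_weight=5e-4, contact_force_range=(-1.0, 1.0)):
    """Hypothetical helper following the docstring's reward description:
    reward = healthy_reward + forward_reward - ctrl_cost - contact_cost."""
    # Forward progress along x per unit time (dt = frame_skip * 0.01 by default).
    forward_reward = (x_after - x_before) / dt
    # Quadratic penalty on the eight joint torques.
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    # Contact forces are clipped to contact_force_range before being squared.
    f_min, f_max = contact_force_range
    clipped = [min(max(f, f_min), f_max) for f in contact_forces]
    contact_cost = contact_cost_weight * sum(f * f for f in clipped)
    # Assumption: the healthy bonus is paid only while the ant is healthy.
    bonus = healthy_reward if healthy else 0.0
    return bonus + forward_reward - ctrl_cost - contact_cost


# Moving 0.05 m forward in one step with zero torque and no contact forces.
reward = ant_reward(0.0, 0.05, [0.0] * 8, [0.0] * 84)
```

With the defaults above, that step earns the healthy bonus plus a forward reward of one, and applying full torque on all eight joints would subtract a control cost of 4.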