
Commit a913229

EnliteAI Bot committed
RL-2049: Remove leftovers; Update gymnasium urls; Update seed and reset method to use a flag determining if seed or None is used in the reset; Update way of keeping track of termination and truncation in the step method
(Issue RL-2049 - maze-rl update gym to gymnasium)
1 parent f48afe3 commit a913229

5 files changed (+27, −10 lines)
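For context, here is a minimal sketch of the gymnasium API that this commit migrates to. It is not part of the commit itself, and CartPole-v1 is only a stand-in environment: reset() now takes the seed and returns (observation, info), and step() returns separate terminated and truncated flags instead of a single done flag.

# Minimal sketch of the gymnasium (v0.26+) API, for context only;
# CartPole-v1 is a stand-in environment, not prescribed by the commit.
import gymnasium as gym

env = gym.make("CartPole-v1")

# Seeding happens through reset(); the old env.seed() call is gone.
obs, info = env.reset(seed=1234)

done = False
while not done:
    action = env.action_space.sample()
    # step() now returns five values; terminated/truncated replace the old done flag.
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()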

docs/source/logging/action_distribution_visualization.rst

Lines changed: 2 additions & 2 deletions
@@ -40,7 +40,7 @@ Discrete and Multi Binary Actions
 Each :ref:`action space <action_spaces_and_distributions>` has a dedicated visualization assigned.
 Discrete and multi-binary action spaces are visualized via histograms.
 The example below shows an action sampling distribution for the discrete version of
-`LunarLander-v3 <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_.
+`LunarLander-v3 <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_.
 The indices on the x-axis correspond to the available actions:
 
 - Action :math:`a_0` - do nothing
@@ -58,7 +58,7 @@ Continuous Actions
 
 Continuous actions (Box spaces) are visualized via violin plots.
 The example below shows an action sampling distribution for
-`LunarLanderContinuous-v3 <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_.
+`LunarLanderContinuous-v3 <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_.
 The indices on the x-axis correspond to the available actions:
 
 - Action :math:`a_1` - controls the main engine:

docs/source/policy_and_value_networks/perception_custom_models.rst

Lines changed: 2 additions & 2 deletions
@@ -72,7 +72,7 @@ Even though designed for more complex models that process multiple observations
 same time you can also compose models for simpler use cases, of course.
 
 In this example we utilize the custom model composer in combination with the perception blocks to compose an
-actor-critic model for OpenAI Gym's `CartPole <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_
+actor-critic model for OpenAI Gym's `CartPole <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_
 using a single dense block in each network.
 CartPole has an observation space with dimensionality four and a discrete action space with two options.
 
@@ -162,7 +162,7 @@ but not necessarily need to use them.
 
 **Important**: Your models have to use dictionaries with torch.Tensors as values for both inputs and outputs.
 
-For Gym's `CartPole <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_ the policy model could be defined like this:
+For Gym's `CartPole <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_ the policy model could be defined like this:
 
 .. literalinclude:: code_snippets/custom_plain_cartpole_policy_net.py
     :language: PYTHON

docs/source/policy_and_value_networks/perception_template_models.rst

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ which process multiple observations and prediction multiple actions at the same
 you can of course also compose models for simpler use cases.
 
 In this example we utilize the :ref:`ConcatModelBuilder <concat_model_builder>`
-to compose an actor-critic model for OpenAI Gym's `CartPole Env <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_.
+to compose an actor-critic model for OpenAI Gym's `CartPole Env <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_.
 CartPole has an observation space with dimensionality four and a discrete action spaces with two options.
 
 The model config is defined as:

docs/source/workflow/imitation_and_fine_tuning.rst

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ As the training trajectories might be already available (e.g., collected in prac
 this step is optional.
 
 As an example environment we pick the discrete version of the
-`LunarLander environment <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_
+`LunarLander environment <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_
 as it already provides a heuristic policy which we can use to collect or training trajectories for imitation learning.
 
 .. image:: lunar_lander.png

maze/core/wrappers/maze_gym_env_wrapper.py

Lines changed: 21 additions & 4 deletions
@@ -165,16 +165,20 @@ def __init__(self, env: gym.Env):
         self._maze_state: Optional[Dict] = None
 
         self._current_seed = None
+        self._need_seeding = True
 
     def step(self, maze_action: MazeActionType) -> Tuple[MazeStateType, Union[float, np.ndarray, Any], bool, Dict[Any, Any]]:
         """Intercept ``CoreEnv.step``"""
         maze_state, rew, terminated, truncated, info = self.env.step(maze_action)
         self._maze_state = maze_state
 
-        info['step-terminated'] = terminated
-        info['step-truncated'] = truncated
-        done = np.logical_or(terminated, truncated)
+        if terminated:
+            info['TimeLimit.terminated'] = True
+
+        if truncated:
+            info['TimeLimit.truncated'] = True
 
+        done = np.logical_or(terminated, truncated)
         return maze_state, rew, done, info
 
     @override(CoreEnv)
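To make the new step() behaviour explicit, here is a small stand-alone sketch of the logic added above, with made-up flag values in place of a real env.step() result. The two gymnasium flags are recorded in the info dict under the TimeLimit.* keys and then collapsed back into the single done flag that the Maze step() signature still returns.

# Stand-alone illustration of the terminated/truncated handling above;
# the flag values are made up instead of coming from env.step().
import numpy as np

terminated, truncated = False, True   # e.g. the episode hit a time limit
info = {}

if terminated:
    info['TimeLimit.terminated'] = True

if truncated:
    info['TimeLimit.truncated'] = True

# Collapse the two flags back into the single done flag the Maze API expects.
done = np.logical_or(terminated, truncated)
print(done, info)   # True {'TimeLimit.truncated': True}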
@@ -200,8 +204,20 @@ def close(self) -> None:
     @override(CoreEnv)
     def reset(self) -> MazeStateType:
         """Intercept ``CoreEnv.reset``"""
-        maze_state, _ = self.env.reset(seed=self._current_seed)
+        # Newer versions of gymnasium (v0.26+) require setting the seed with env.reset(seed) the first time this seed is
+        # applied. Subsequent resets using the same seed only need an env.reset(seed=None).
+        # The previous workflow, where env.seed(seed) was followed by env.reset(), is not possible to use right out of
+        # the box anymore. Added the _need_seeding flag to keep track of the need to apply a seed and to allow/enable
+        # the old workflow.
+        seed = None
+        if self._need_seeding:
+            seed = self._current_seed
+
+        maze_state, _ = self.env.reset(seed=seed)
+
         self._maze_state = maze_state
+        self._need_seeding = False
+
         return maze_state
 
     def get_current_seed(self) -> int:
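The same seeding logic, pulled out of the wrapper for illustration (the _current_seed and _need_seeding names come from the diff; the loop and CartPole-v1 are only assumptions for the sake of a runnable snippet): the stored seed is forwarded to env.reset() exactly once, and every later reset passes seed=None.

# Illustrative stand-alone version of the reset()/seeding logic above.
import gymnasium as gym

env = gym.make("CartPole-v1")   # stand-in environment, not prescribed by the commit
_current_seed = 42
_need_seeding = True

for episode in range(3):
    # Forward the stored seed only on the first reset after seeding was requested.
    seed = _current_seed if _need_seeding else None
    obs, info = env.reset(seed=seed)
    _need_seeding = False

env.close()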
@@ -212,6 +228,7 @@ def get_current_seed(self) -> int:
     def seed(self, seed: int) -> None:
         """Intercept ``CoreEnv.seed``"""
         self._current_seed = seed
+        self._need_seeding = True
 
     @override(CoreEnv)
     def is_actor_done(self) -> bool:
