
Commit a913229

EnliteAI Bot committed
RL-2049: Remove leftovers; Update gymnasium urls; Update seed and reset method to use a flag determining if seed or None is used in the reset; Update way of keeping track of termination and truncation in the step method
(Issue RL-2049 - maze-rl update gym to gymnasium)
1 parent f48afe3 commit a913229

5 files changed (+27, −10 lines)
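For context, here is a minimal sketch of the gymnasium API that this commit migrates to. It is not part of the commit itself, and CartPole-v1 is only a stand-in environment: reset() now takes the seed and returns (observation, info), and step() returns separate terminated and truncated flags instead of a single done flag.

# Minimal sketch of the gymnasium (v0.26+) API, for context only;
# CartPole-v1 is a stand-in environment, not prescribed by the commit.
import gymnasium as gym

env = gym.make("CartPole-v1")

# Seeding happens through reset(); the old env.seed() call is gone.
obs, info = env.reset(seed=1234)

done = False
while not done:
    action = env.action_space.sample()
    # step() now returns five values; terminated/truncated replace the old done flag.
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()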

docs/source/logging/action_distribution_visualization.rst

Lines changed: 2 additions & 2 deletions
@@ -40,7 +40,7 @@ Discrete and Multi Binary Actions
 Each :ref:`action space <action_spaces_and_distributions>` has a dedicated visualization assigned.
 Discrete and multi-binary action spaces are visualized via histograms.
 The example below shows an action sampling distribution for the discrete version of
-`LunarLander-v3 <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_.
+`LunarLander-v3 <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_.
 The indices on the x-axis correspond to the available actions:
 
 - Action :math:`a_0` - do nothing
@@ -58,7 +58,7 @@ Continuous Actions
 
 Continuous actions (Box spaces) are visualized via violin plots.
 The example below shows an action sampling distribution for
-`LunarLanderContinuous-v3 <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_.
+`LunarLanderContinuous-v3 <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_.
 The indices on the x-axis correspond to the available actions:
 
 - Action :math:`a_1` - controls the main engine:

docs/source/policy_and_value_networks/perception_custom_models.rst

Lines changed: 2 additions & 2 deletions
@@ -72,7 +72,7 @@ Even though designed for more complex models that process multiple observations
 same time you can also compose models for simpler use cases, of course.
 
 In this example we utilize the custom model composer in combination with the perception blocks to compose an
-actor-critic model for OpenAI Gym's `CartPole <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_
+actor-critic model for OpenAI Gym's `CartPole <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_
 using a single dense block in each network.
 CartPole has an observation space with dimensionality four and a discrete action space with two options.
 
@@ -162,7 +162,7 @@ but not necessarily need to use them.
 
 **Important**: Your models have to use dictionaries with torch.Tensors as values for both inputs and outputs.
 
-For Gym's `CartPole <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_ the policy model could be defined like this:
+For Gym's `CartPole <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_ the policy model could be defined like this:
 
 .. literalinclude:: code_snippets/custom_plain_cartpole_policy_net.py
     :language: PYTHON

docs/source/policy_and_value_networks/perception_template_models.rst

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ which process multiple observations and prediction multiple actions at the same
 you can of course also compose models for simpler use cases.
 
 In this example we utilize the :ref:`ConcatModelBuilder <concat_model_builder>`
-to compose an actor-critic model for OpenAI Gym's `CartPole Env <https://www.gymlibrary.dev/environments/classic_control/cart_pole/#cart-pole>`_.
+to compose an actor-critic model for OpenAI Gym's `CartPole Env <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`_.
 CartPole has an observation space with dimensionality four and a discrete action spaces with two options.
 
 The model config is defined as:

docs/source/workflow/imitation_and_fine_tuning.rst

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ As the training trajectories might be already available (e.g., collected in prac
 this step is optional.
 
 As an example environment we pick the discrete version of the
-`LunarLander environment <https://www.gymlibrary.dev/environments/box2d/lunar_lander/>`_
+`LunarLander environment <https://gymnasium.farama.org/environments/box2d/lunar_lander/>`_
 as it already provides a heuristic policy which we can use to collect or training trajectories for imitation learning.
 
 .. image:: lunar_lander.png

maze/core/wrappers/maze_gym_env_wrapper.py

Lines changed: 21 additions & 4 deletions
@@ -165,16 +165,20 @@ def __init__(self, env: gym.Env):
         self._maze_state: Optional[Dict] = None
 
         self._current_seed = None
+        self._need_seeding = True
 
     def step(self, maze_action: MazeActionType) -> Tuple[MazeStateType, Union[float, np.ndarray, Any], bool, Dict[Any, Any]]:
         """Intercept ``CoreEnv.step``"""
         maze_state, rew, terminated, truncated, info = self.env.step(maze_action)
         self._maze_state = maze_state
 
-        info['step-terminated'] = terminated
-        info['step-truncated'] = truncated
-        done = np.logical_or(terminated, truncated)
+        if terminated:
+            info['TimeLimit.terminated'] = True
+
+        if truncated:
+            info['TimeLimit.truncated'] = True
 
+        done = np.logical_or(terminated, truncated)
         return maze_state, rew, done, info
 
     @override(CoreEnv)
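To make the new step() behaviour explicit, here is a small stand-alone sketch of the logic added above, with made-up flag values in place of a real env.step() result. The two gymnasium flags are recorded in the info dict under the TimeLimit.* keys and then collapsed back into the single done flag that the Maze step() signature still returns.

# Stand-alone illustration of the terminated/truncated handling above;
# the flag values are made up instead of coming from env.step().
import numpy as np

terminated, truncated = False, True   # e.g. the episode hit a time limit
info = {}

if terminated:
    info['TimeLimit.terminated'] = True

if truncated:
    info['TimeLimit.truncated'] = True

# Collapse the two flags back into the single done flag the Maze API expects.
done = np.logical_or(terminated, truncated)
print(done, info)   # True {'TimeLimit.truncated': True}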
@@ -200,8 +204,20 @@ def close(self) -> None:
     @override(CoreEnv)
     def reset(self) -> MazeStateType:
         """Intercept ``CoreEnv.reset``"""
-        maze_state, _ = self.env.reset(seed=self._current_seed)
+        # Newer versions of gymnasium (v0.26+) require setting the seed with env.reset(seed) the first time this seed is
+        # applied. Subsequent resets using the same seed only need an env.reset(seed=None).
+        # The previous workflow, where env.seed(seed) was followed by env.reset(), is not possible to use right out of
+        # the box anymore. Added the _need_seeding flag to keep track of the need to apply a seed and to allow/enable
+        # the old workflow.
+        seed = None
+        if self._need_seeding:
+            seed = self._current_seed
+
+        maze_state, _ = self.env.reset(seed=seed)
+
         self._maze_state = maze_state
+        self._need_seeding = False
+
         return maze_state
 
     def get_current_seed(self) -> int:
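The same seeding logic, pulled out of the wrapper for illustration (the _current_seed and _need_seeding names come from the diff; the loop and CartPole-v1 are only assumptions for the sake of a runnable snippet): the stored seed is forwarded to env.reset() exactly once, and every later reset passes seed=None.

# Illustrative stand-alone version of the reset()/seeding logic above.
import gymnasium as gym

env = gym.make("CartPole-v1")   # stand-in environment, not prescribed by the commit
_current_seed = 42
_need_seeding = True

for episode in range(3):
    # Forward the stored seed only on the first reset after seeding was requested.
    seed = _current_seed if _need_seeding else None
    obs, info = env.reset(seed=seed)
    _need_seeding = False

env.close()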
@@ -212,6 +228,7 @@ def get_current_seed(self) -> int:
     def seed(self, seed: int) -> None:
         """Intercept ``CoreEnv.seed``"""
         self._current_seed = seed
+        self._need_seeding = True
 
     @override(CoreEnv)
     def is_actor_done(self) -> bool:
