Hard Maze

| Action Space | Box(0, 1, (3,), float32) |
| Observation Space | Box(0, 1, (9,), float32) |
| Creation | gymnasium.make("HardMaze-v0") |
A classic deceptive-reward maze environment for neuroevolution benchmarks.
Description
The Hard Maze environment is a classic deceptive-reward navigation task, originally introduced as a benchmark for neuroevolution and exploration algorithms. This environment is a reimplementation of the canonical ‘hard maze’ used in seminal research on Novelty Search and Quality-Diversity.
The agent, a differential-drive robot, starts at the bottom of a maze and must navigate to a goal location at the top. The maze is “deceptive” because a purely goal-seeking (greedy) agent will be led into a dead-end, as the shortest path is blocked. To succeed, the agent must explore a much longer, seemingly suboptimal path that bypasses the trap. This makes the environment an excellent benchmark for evaluating an algorithm’s ability to handle deception and perform robust exploration.
Action Space

The action space is a Box(0, 1, (3,), float32). The 3-element vector corresponds to motor control signals: [left_motor, forward_thrust, right_motor].

- forward_thrust: Controls the robot's forward velocity.
- left_motor and right_motor: Control turning. The turning speed is proportional to the difference (left_motor - right_motor).
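As a minimal sketch, a single step with a hand-built action might look like the following (the motor values are arbitrary illustrative numbers, not recommended settings):

import numpy as np
import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()

# Action layout: [left_motor, forward_thrust, right_motor].
# Full forward thrust; the unequal motor values make the robot turn at a
# speed proportional to (left_motor - right_motor).
action = np.array([0.2, 1.0, 0.8], dtype=np.float32)
obs, reward, terminated, truncated, info = env.step(action)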
Observation Space

The observation space is a Box(0, 1, (9,), float32), which is a concatenation of the robot's sensor readings:

- 5 Rangefinders: These sensors are distributed symmetrically across the robot's front, from -90 to +90 degrees. They return the normalized distance to the nearest wall in their line of sight. A value of 1.0 means no wall is detected within max range, while 0.0 indicates a wall is very close.
- 4 Radar "Pie-Slices": These sensors detect the goal. They divide the space around the robot into four 90-degree arcs. Each sensor returns a binary value (0.0 or 1.0) indicating whether the goal is within its angular range and maximum detection distance.
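The split below assumes the five rangefinder readings come first in the observation vector, followed by the four radar slices, matching the order of the description above; that ordering is an assumption about this reimplementation:

import numpy as np
import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()

rangefinders = obs[:5]   # normalized wall distances, -90 to +90 degrees (assumed order)
radar_slices = obs[5:]   # binary goal detectors (assumed order)

# Example: which pie slice (if any) currently sees the goal.
visible = np.flatnonzero(radar_slices > 0.5)
goal_slice = int(visible[0]) if visible.size else None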
Rewards
The reward function is sparse and designed to guide the agent through a series of waypoints, or Points of Interest (POIs), before reaching the final goal. This structure is what makes the environment a “hard” maze.
- The agent receives a reward of +1.0 for each POI visited in the correct sequence.
- If the agent fails to reach the next POI in the sequence, its reward for the rest of the episode is based on its proximity to that unreached POI, calculated as 1.0 - (distance / max_distance).
- A large reward of 10.0 is given upon reaching the final goal.
A simple distance-to-goal reward function would fail in this environment, as it would reinforce moving towards the deceptive trap.
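As an illustrative sketch only (not the environment's actual code; the function and argument names, and the use of max_distance for normalization, are assumptions), the documented reward terms combine roughly like this:

def poi_reward(num_pois_reached, dist_to_next_poi, max_distance, reached_goal):
    # +1.0 for every Point of Interest visited in the correct sequence
    reward = 1.0 * num_pois_reached
    # Proximity credit toward the first unreached POI
    reward += 1.0 - (dist_to_next_poi / max_distance)
    # Large bonus for reaching the final goal
    if reached_goal:
        reward += 10.0
    return reward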
Starting State
The robot starts at a fixed position (205, 387)
at the bottom of the maze, with a
fixed heading of 90 degrees (facing upwards).
Episode End

The episode is considered terminated if the robot reaches the goal location (i.e., its distance to the goal is less than 35.0 units). There are no other termination conditions; truncated is always False.
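A minimal rollout loop consistent with this termination rule might look like the following (the random policy and the step cap are placeholders, since a policy that never reaches the goal would otherwise loop forever):

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()
for _ in range(10_000):  # safety cap; truncated is always False
    action = env.action_space.sample()  # placeholder policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:  # goal reached (within 35.0 units)
        break
env.close()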
Arguments
import gymnasium as gym
import gymnasium_hardmaze
env = gym.make("HardMaze-v0", render_mode="human")
- env_file: The XML file to load the maze layout from. Defaults to hardmaze_env.xml.
- render_mode: The rendering mode, either "human" to display the environment or "rgb_array" to return frames as numpy arrays.
- time_step: The duration of each simulation step in seconds. Defaults to 0.099.
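For example, the documented defaults can be spelled out explicitly at creation time (keyword arguments are forwarded to the environment by gym.make):

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make(
    "HardMaze-v0",
    render_mode="rgb_array",          # return frames as numpy arrays
    env_file="hardmaze_env.xml",      # documented default maze layout
    time_step=0.099,                  # documented default step duration (seconds)
)
obs, info = env.reset()
frame = env.render()                  # an RGB array, since render_mode="rgb_array"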