Hard Maze

Action Space

Box(0.0, 1.0, (3,), float32)

Observation Space

Box(0.0, 1.0, (9,), float32)

Creation

gymnasium.make("HardMaze-v0")

A classic deceptive-reward maze environment for neuroevolution benchmarks.

Description

The Hard Maze environment is a classic deceptive-reward navigation task, originally introduced as a benchmark for neuroevolution and exploration algorithms. This environment is a reimplementation of the canonical ‘hard maze’ used in seminal research on Novelty Search and Quality-Diversity.

The agent, a differential-drive robot, starts at the bottom of a maze and must navigate to a goal location at the top. The maze is “deceptive” because a purely goal-seeking (greedy) agent will be led into a dead-end, as the shortest path is blocked. To succeed, the agent must explore a much longer, seemingly suboptimal path that bypasses the trap. This makes the environment an excellent benchmark for evaluating an algorithm’s ability to handle deception and perform robust exploration.

Action Space

The action space is a Box(0.0, 1.0, (3,), float32). The 3-element vector corresponds to motor control signals: [left_motor, forward_thrust, right_motor].

  • forward_thrust: Controls the robot’s forward velocity.

  • left_motor and right_motor: Control turning. The turning speed is proportional to the difference (left_motor - right_motor); see the example after this list.
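
A minimal sketch of stepping the environment with a hand-built action, assuming the standard Gymnasium step API; the specific motor values are arbitrary:

import numpy as np
import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()

# [left_motor, forward_thrust, right_motor]: full forward thrust with
# unequal motor signals, so the turn rate follows left_motor - right_motor.
action = np.array([0.8, 1.0, 0.2], dtype=np.float32)
obs, reward, terminated, truncated, info = env.step(action)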

Observation Space

The observation space is a Box(0.0, 1.0, (9,), float32), which is a concatenation of the robot’s sensor readings (see the slicing sketch after this list):

  • 5 Rangefinders: These sensors are distributed symmetrically across the robot’s front, from -90 to +90 degrees. They return the normalized distance to the nearest wall in their line of sight. A value of 1.0 means no wall is detected within max range, while 0.0 indicates a wall is very close.

  • 4 Radar “Pie-Slices”: These sensors detect the goal. They divide the area around the robot into four 90-degree arcs. Each sensor returns a binary value (0.0 or 1.0) indicating whether the goal lies within its angular range and maximum detection distance.
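
Assuming the rangefinder readings come first in the concatenation, as in the list above, the observation can be split into its two sensor groups like this:

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()

rangefinders = obs[:5]  # normalized wall distances, 1.0 = no wall within max range
goal_radar = obs[5:9]   # binary pie-slice detectors, 1.0 = goal inside that slice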

Rewards

The reward function is sparse and designed to guide the agent through a series of waypoints, or Points of Interest (POIs), before reaching the final goal. This structure is what makes the environment a “hard” maze.

  • The agent receives a reward of +1.0 for each POI visited in the correct sequence.

  • If the agent fails to reach the next POI in the sequence, its reward for the rest of the episode is based on its proximity to that unreached POI, calculated as 1.0 - (distance / max_distance).

  • A large reward of 10.0 is given upon reaching the final goal.

A simple distance-to-goal reward function would fail in this environment, as it would reinforce moving towards the deceptive trap.
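
The scheme above can be pictured with a short sketch. This is an illustration of the described components only, not the environment's actual code; poi_positions, reached_count, and max_distance are hypothetical names:

import numpy as np

def sketch_reward(agent_xy, poi_positions, reached_count, goal_xy, max_distance):
    # Illustrative only: +1.0 per POI already reached in the correct order.
    reward = float(reached_count)
    if reached_count < len(poi_positions):
        # Shaped credit toward the next unreached POI: 1.0 - (distance / max_distance).
        distance = np.linalg.norm(np.asarray(agent_xy) - np.asarray(poi_positions[reached_count]))
        reward += 1.0 - distance / max_distance
    # Large bonus once the final goal is reached (within 35.0 units).
    if np.linalg.norm(np.asarray(agent_xy) - np.asarray(goal_xy)) < 35.0:
        reward += 10.0
    return reward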

Starting State

The robot starts at a fixed position (205, 387) at the bottom of the maze, with a fixed heading of 90 degrees (facing upwards).

Episode End

The episode terminates when the robot reaches the goal location (i.e., its distance to the goal is less than 35.0 units). There are no other termination conditions, and truncated is always False.
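
A minimal rollout loop under these rules; because truncated never becomes True, the loop caps its own step count (random actions are only a placeholder and will not solve the maze):

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0")
obs, info = env.reset()
for _ in range(10_000):  # self-imposed cap, since the environment never truncates
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated:       # robot is within 35.0 units of the goal
        break
env.close()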

Arguments

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make("HardMaze-v0", render_mode="human")

  • env_file: The XML file to load the maze layout from. Defaults to hardmaze_env.xml.

  • render_mode: The rendering mode, either "human" to display the environment or "rgb_array" to return frames as numpy arrays.

  • time_step: The duration of each simulation step in seconds. Defaults to 0.099.
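
For example, a sketch passing the documented keyword arguments through gym.make (the values here are illustrative, not recommended settings; env_file can be passed the same way):

import gymnasium as gym
import gymnasium_hardmaze

env = gym.make(
    "HardMaze-v0",
    render_mode="rgb_array",  # return frames instead of opening a window
    time_step=0.05,           # finer simulation step than the 0.099 default
)
obs, info = env.reset()
frame = env.render()          # numpy RGB array in "rgb_array" mode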