SingleStepEnv

class SingleStepEnv:

A lightweight environment for single-step RL with LLMs as policy.

This environment models a single interaction between an LLM-based agent and a problem drawn from a dataset—such as a question-answering or math problem—where the agent produces one response and receives feedback.

Core Flow:

  • A question is sampled from a (possibly infinitely long) dataset.
  • The LLM generates a single-step response (the action).
  • The response is verified against the ground truth.
  • A reward is computed based on correctness and optional custom logic.

Key Features:

  • Batched evaluation with per-sample state tracking.
  • Async setup and teardown for verifiers and related resources.
  • Supports deterministic sampling via local RNG (optional seed).
  • Extensible reward computation via subclassing.

init

def __init__(
    self,
    dataset: Union[StaticDataset, BaseGenerator],
    verifier: BaseVerifier,
    timeout: Optional[float] = 180.0,
    **kwargs
):

Initialize the SingleStepEnv.

Parameters:

  • dataset (Union[StaticDataset, BaseGenerator]): Dataset to sample problems from.
  • verifier (BaseVerifier): Verifier used to evaluate LLM responses against ground-truth answers.
  • timeout (Optional[float], optional): The execution timeout in seconds. (default: :obj:180.0) **kwargs: Optional metadata or configuration values.
  • Notes: This class assumes all interactions are single-step: one question, one LLM response, one reward.

_normalize_actions

def _normalize_actions(self, action: Union[Action, List[Action], str, Dict[int, str]]):

Normalize the user-provided action(s) into a validated list of Action objects.

This method handles flexibility in input format by converting raw strings (only allowed when batch size is 1) and dictionaries, ensuring all necessary structure and integrity checks on actions (e.g., index bounds, duplicates).

Parameters:

  • action (Union[Action, List[Action], str]): The raw input action(s) provided by the agent. Can be: - A single Action object. - A list of Action objects. - A raw string (if batch_size == 1), auto-wrapped in an Action. - A dict mapping int indices to str responses

Returns:

List[Action]: A list of validated Action instances ready for evaluation.

_batch_done

def _batch_done(self):

Returns:

bool: True if all states are marked as done, False otherwise.

_batch_started

def _batch_started(self):

Returns:

bool: True if at least one state is marked as done, False otherwise.

metadata

def metadata(self):

Returns:

Dict[str, Any]: A copy of the environment’s metadata.