SingleStepEnv
- A question is sampled from a (possibly infinitely long) dataset.
- The LLM generates a single-step response (the action).
- The response is verified against the ground truth.
- A reward is computed based on correctness and optional custom logic.
- Batched evaluation with per-sample state tracking.
- Async setup and teardown for verifiers and related resources.
- Supports deterministic sampling via local RNG (optional seed).
- Extensible reward computation via subclassing.
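The flow and features listed above can be sketched with a small toy environment. This is a minimal illustration, not the library's implementation: ToySingleStepEnv, Sample, PartialCreditEnv, and every method name below are hypothetical stand-ins, and the exact-match check is an assumed verifier.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    question: str
    answer: str  # ground truth


class ToySingleStepEnv:
    """Hypothetical toy environment; not the real SingleStepEnv class."""

    def __init__(self, samples: List[Sample], seed: Optional[int] = None):
        self.samples = samples
        # Local RNG: seeding it leaves the global `random` state untouched.
        self.rng = random.Random(seed)
        self.current: Optional[Sample] = None

    def reset(self) -> str:
        # Step 1: sample a question from the dataset.
        self.current = self.rng.choice(self.samples)
        return self.current.question

    def compute_reward(self, response: str, correct: bool) -> float:
        # Step 4: base reward from correctness; subclasses may override.
        return 1.0 if correct else 0.0

    def step(self, response: str) -> float:
        # Step 2 happens outside the environment: the LLM produces `response`.
        if self.current is None:
            raise RuntimeError("Call reset() before step().")
        # Step 3: verify the response against the ground truth.
        correct = response.strip() == self.current.answer
        reward = self.compute_reward(response, correct)
        self.current = None  # the episode ends after a single step
        return reward


class PartialCreditEnv(ToySingleStepEnv):
    """Custom reward logic added by subclassing."""

    def compute_reward(self, response: str, correct: bool) -> float:
        base = super().compute_reward(response, correct)
        # Assumed custom rule: small credit for any non-empty wrong attempt.
        if not correct and response.strip():
            return 0.1
        return base
```

Two environments built with the same seed draw the same sequence of questions, which is the point of keeping the RNG local to the instance.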
__init__
- dataset (Union[StaticDataset, BaseGenerator]): Dataset to sample problems from.
- verifier (BaseVerifier): Verifier used to evaluate LLM responses against ground-truth answers.
- timeout (Optional[float], optional): The execution timeout in seconds. (default: 180.0)
- **kwargs: Optional metadata or configuration values.
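One way an execution timeout in seconds could be enforced around an asynchronous verification call is with asyncio.wait_for from the standard library. The sketch below is an assumption about usage, not the environment's actual code: slow_verify and verify_with_timeout are hypothetical helpers, and treating a timeout as a failed verification is a choice made only for this example.

```python
import asyncio
from typing import Optional


async def slow_verify(response: str, truth: str, delay: float) -> bool:
    # Hypothetical verifier that needs `delay` seconds to finish.
    await asyncio.sleep(delay)
    return response == truth


async def verify_with_timeout(
    response: str,
    truth: str,
    delay: float,
    timeout: Optional[float] = 180.0,
) -> bool:
    # timeout=None disables the limit; asyncio.wait_for accepts None.
    try:
        return await asyncio.wait_for(
            slow_verify(response, truth, delay), timeout=timeout
        )
    except asyncio.TimeoutError:
        # Assumed policy for this sketch: a timed-out check counts as failed.
        return False
```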
_normalize_actions
Normalizes the input action(s) into a list of Action objects.
This method accepts flexible input formats: it converts raw strings
(only allowed when the batch size is 1) and dictionaries into Action
objects, and runs the necessary structure and integrity checks on the
actions (e.g., index bounds, duplicates).
Parameters:
- action (Union[Action, List[Action], str]): The raw input action(s) provided by the agent. Can be:
  - A single Action object.
  - A list of Action objects.
  - A raw string (if batch_size == 1), auto-wrapped in an Action.
  - A dict mapping int indices to str responses.
Returns:
- A list of Action instances ready for evaluation.
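The normalization described above can be sketched as a standalone function. This is a hedged illustration, not the library's method: the Action dataclass here is a simplified stand-in whose field names (index, llm_response) are assumptions, and the error types and messages are chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Action:
    # Simplified stand-in; field names are assumptions for this sketch.
    index: int
    llm_response: str


def normalize_actions(
    action: Union[Action, List[Action], str, Dict[int, str]],
    batch_size: int,
) -> List[Action]:
    if isinstance(action, str):
        # A raw string is only unambiguous when there is a single sample.
        if batch_size != 1:
            raise ValueError("A raw string requires batch_size == 1.")
        actions = [Action(index=0, llm_response=action)]
    elif isinstance(action, dict):
        # Map each int index to its str response.
        actions = [Action(index=i, llm_response=r) for i, r in action.items()]
    elif isinstance(action, Action):
        actions = [action]
    elif isinstance(action, list):
        actions = list(action)
    else:
        raise TypeError(f"Unsupported action type: {type(action).__name__}")

    # Integrity checks: index bounds and duplicates.
    seen = set()
    for a in actions:
        if not 0 <= a.index < batch_size:
            raise ValueError(f"Index {a.index} is out of bounds.")
        if a.index in seen:
            raise ValueError(f"Duplicate action for index {a.index}.")
        seen.add(a.index)
    return actions
```

Whatever the input format, callers downstream only ever deal with a list of Action objects whose indices are unique and in range.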