camel.environments package#

Submodules#

camel.environments.base module#

Module contents#

class camel.environments.Action(*, index: int = 0, llm_response: str, metadata: ~typing.Dict[str, ~typing.Any] = <factory>, timestamp: ~datetime.datetime = <factory>)[source]#

Bases: BaseModel

Represents an action taken in an environment.

This class defines the input context, the LLM-generated output, and metadata required for verification and tracking within an RL framework.

llm_response#

The response generated by the LLM.

Type:

str

metadata#

Additional metadata such as model parameters, prompt details, or response confidence scores.

Type:

Dict[str, Any]

timestamp#

The timestamp when the action was generated (UTC).

Type:

datetime

index: int#
llm_response: str#
metadata: Dict[str, Any]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

timestamp: datetime#
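
Example (a minimal construction sketch; the metadata keys shown are illustrative, not defined by the library):

    from camel.environments import Action

    # Only llm_response is required; index defaults to 0, and metadata and
    # timestamp are filled in by their default factories.
    action = Action(
        llm_response="The answer is 42.",
        metadata={"model": "example-model", "temperature": 0.2},
    )
    print(action.index, action.timestamp)
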
class camel.environments.Environment(*args, **kwargs)[source]#

Bases: Protocol

async close() None[source]#

Perform a full cleanup of all environment resources.

async reset() Observation[source]#

Reset the environment to an initial state.

Returns:

Initial observation for the episode

async step(action: Action) StepResult[source]#

Take a step in the environment.

Parameters:

action – Action containing everything that is needed to progress in the environment.

Returns:

StepResult containing next observation, reward, done flag, and info
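
Example (a minimal sketch of a class that satisfies this protocol; EchoEnv and its reward rule are hypothetical, not part of the library):

    from camel.environments import Action, Observation, StepResult

    class EchoEnv:
        """Toy environment conforming to the Environment protocol."""

        async def reset(self) -> Observation:
            # Return the initial observation for a new episode.
            return Observation(question="Repeat the word 'hello'.")

        async def step(self, action: Action) -> StepResult:
            # Reward 1.0 if the response contains the target word.
            reward = 1.0 if "hello" in action.llm_response.lower() else 0.0
            return StepResult(
                observation=Observation(question="Episode finished."),
                reward=reward,
                done=True,
                info={"response": action.llm_response},
            )

        async def close(self) -> None:
            # No external resources to release in this toy example.
            pass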

class camel.environments.MultiStepEnv(extractor: BaseExtractor, max_steps: int | None = None, **kwargs)[source]#

Bases: ABC

A multi-step environment for reinforcement learning with LLMs.

async close() None[source]#

Clean up and close all resources used by the environment. This method shuts down the verifier, calls the internal close hook implemented by the concrete MultiStepEnv subclass, and ensures that the environment is properly closed.

Raises:

Exception – If an error occurs while closing the environment.

abstract async compute_reward() Tuple[float, Dict[str, float]][source]#
property current_step: int#

Get the current step number.

Returns:

The number of the step we are currently in.

Return type:

int

is_done() bool[source]#

Check if the episode should terminate.

This function terminates the episode if the maximum number of steps is reached or if any other terminating criterion is met.

Returns:

True if the episode should terminate, False otherwise.

Return type:

bool

property metadata: Dict[str, Any]#

Retrieve the metadata of the environment.

This provides additional parameters and configuration details.

Returns:

A copy of the environment’s metadata.

Return type:

Dict[str, Any]

async reset() Observation[source]#

Reset the environment to an initial state.

Returns:

The initial observation for the episode.

Return type:

Observation

Raises:

RuntimeError – If we fail to get the initial observation.

async setup() None[source]#

Set up the environment by initializing the verifier and extractor.

This method ensures that the environment is ready for interaction. It sets up necessary components, including the verifier and extractor.

Raises:

Exception – If setup fails due to an internal error.

async step(action: Action) Tuple[Observation, float, bool, Dict[str, Any]][source]#

Take a step in the environment using the given action.

This method updates the environment state based on the LLM’s response, computes rewards, checks whether the episode is done, and returns the next or final observation accordingly.

Parameters:

action (Action) – The action containing the LLM response.

Returns:

A tuple containing the next observation, the total reward, the done flag, and an info dictionary (which includes the per-component rewards).

Raises:

RuntimeError – If the environment is not set up, the episode has ended, or there is no valid current observation.
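
Example (a typical interaction loop with a concrete MultiStepEnv subclass; policy stands in for your own LLM call and is not part of the library):

    from camel.environments import Action

    async def run_episode(env, policy) -> float:
        # env: any concrete MultiStepEnv subclass (e.g. TicTacToeEnv).
        # policy: any callable mapping a question string to a response string.
        await env.setup()
        obs = await env.reset()
        total_reward = 0.0
        while not env.is_done():
            action = Action(llm_response=policy(obs.question))
            obs, reward, done, info = await env.step(action)
            total_reward += reward
        await env.close()
        return total_reward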

class camel.environments.Observation(*, question: str, context: ~typing.Dict[str, ~typing.Any] = <factory>, metadata: ~typing.Dict[str, ~typing.Any] | None = None)[source]#

Bases: BaseModel

Environment observation.

question#

The question posed to the LLM.

Type:

str

context#

Additional context for the question.

Type:

Dict[str, Any]

metadata#

Optional metadata about the observation.

Type:

Dict[str, Any] | None

context: Dict[str, Any]#
metadata: Dict[str, Any] | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

question: str#
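
Example (field values are illustrative):

    from camel.environments import Observation

    obs = Observation(
        question="What is 2 + 2?",
        context={"difficulty": "easy"},  # hypothetical context keys
    )
    print(obs.question, obs.metadata)  # metadata defaults to None
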
class camel.environments.Opponent(play_style: Literal['optimal', 'random'] = 'optimal')[source]#

Bases: object

AI opponent for the Tic Tac Toe game.

This class implements different playing strategies for the AI opponent, including an optimal strategy using the minimax algorithm with alpha-beta pruning, and a random strategy.

get_optimal_move(board: List[str]) int | None[source]#

Get the optimal move using the minimax algorithm.

Parameters:

board (List[str]) – The current game board as a list of strings.

Returns:

The index of the optimal move, or None if no move is available.

Return type:

Optional[int]

minimax(board: List[str], is_maximizing: bool, depth: int = 0, alpha: float = -inf, beta: float = inf) Tuple[float, int | None][source]#

Minimax algorithm with alpha-beta pruning for optimal move selection.

Recursively evaluates all possible moves to find the best one. Uses alpha-beta pruning to reduce the search space.

Parameters:
  • board (List[str]) – The current game board as a list of strings.

  • is_maximizing (bool) – True if maximizing player (O), False if minimizing (X).

  • depth (int) – Current depth in the search tree. (default: 0)

  • alpha (float) – Alpha value for pruning. (default: -math.inf)

  • beta (float) – Beta value for pruning. (default: math.inf)

Returns:

A tuple containing:
  • float: The score of the best move (1 for O win, -1 for X win, 0 for draw)

  • Optional[int]: The index of the best move, or None if terminal state

Return type:

Tuple[float, Optional[int]]

select_move(board: List[str]) int | None[source]#

Select a move based on the opponent’s play style.

Parameters:

board (List[str]) – The current game board as a list of strings.

Returns:

The index of the selected move, or None if no move is available.

Return type:

Optional[int]
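
Example (a usage sketch; the board is a flat list of nine cell strings, and the single-space marker for empty cells is an assumption of this sketch):

    from camel.environments import Opponent

    # X has just moved; it is O's turn.
    board = ["X", "O", "X",
             " ", "X", " ",
             "O", " ", " "]

    optimal = Opponent(play_style="optimal")
    print(optimal.select_move(board))  # O's move under minimax (expected to block at index 8)

    random_player = Opponent(play_style="random")
    print(random_player.select_move(board))  # any currently available cell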

class camel.environments.SingleStepEnv(dataset: StaticDataset | BaseGenerator, verifier: BaseVerifier, **kwargs)[source]#

Bases: object

A lightweight environment for single-step RL with LLMs as policy.

This environment models a single interaction between an LLM-based agent and a problem drawn from a dataset—such as a question-answering or math problem—where the agent produces one response and receives feedback.

Core Flow:
  • A question is sampled from a (possibly infinitely long) dataset.

  • The LLM generates a single-step response (the action).

  • The response is verified against the ground truth.

  • A reward is computed based on correctness and optional custom logic.

Key Features:
  • Batched evaluation with per-sample state tracking.

  • Async setup and teardown for verifiers and related resources.

  • Supports deterministic sampling via local RNG (optional seed).

  • Extensible reward computation via subclassing.
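
Example (a single-sample rollout following the flow above; dataset, verifier, and agent are placeholders for your own objects):

    from camel.environments import SingleStepEnv

    async def run_one(dataset, verifier, agent) -> float:
        # agent: any callable mapping a question string to a response string.
        env = SingleStepEnv(dataset, verifier)
        await env.setup()
        obs = await env.reset(batch_size=1, seed=42)
        # A raw string action is accepted when the batch size is 1.
        next_obs, reward, done, info = await env.step(agent(obs.question))
        await env.close()
        return reward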

ACCURACY_REWARD = 1#
PLACEHOLDER_OBS = Observation(question='Episode ended. This is just a placeholder.', context={}, metadata=None)#
async close() None[source]#

Clean up and close all resources used by the environment.

This method shuts down the verifier, resets the internal state, and ensures that the environment is properly closed.

Raises:

Exception – If an error occurs while closing the environment.

property metadata: Dict[str, Any]#

Retrieve the metadata of the environment.

This provides additional parameters and configuration details.

Returns:

A copy of the environment’s metadata.

Return type:

Dict[str, Any]

async reset(batch_size: int = 1, seed: int | None = None) Observation | List[Observation][source]#

Resets the environment and starts a new episode.

This method samples a new batch of data points from the dataset and returns the corresponding initial observations.

If a seed is provided, a local random number generator is initialized for deterministic sampling. The global random state is not affected.

Parameters:
  • batch_size (int) – Number of data points to sample. (default: 1)

  • seed (Optional[int]) – Seed for deterministic sampling. If None, sampling is non-deterministic. (default: None)

Returns:

Initial observation(s) for the episode.

Return type:

Observation or List[Observation]

Raises:
  • RuntimeError – If called before all previous states are processed.

  • ValueError – If batch size exceeds dataset size.

  • TypeError – If the dataset is of an unsupported type.

async setup() None[source]#

Set up the environment by initializing the verifier.

This method ensures that the environment is ready for interaction. It sets up necessary components, including the verifier.

Raises:

Exception – If setup fails due to an internal error.

async step(action: Action | List[Action] | str | Dict[int, str]) Tuple[Observation, float, bool, Dict[str, Any]] | List[Tuple[Observation, float, bool, Dict[str, Any]]][source]#

Execute one interaction step in the environment using the proposed solution.

This method processes the agent’s response(s) to the current observation(s), verifies the correctness of the responses using the verifier, computes rewards, and returns the resulting state transition(s).

The environment is strictly single-step. Once an action is submitted for a state, that state is marked as done, and the observation will not change.

Parameters:

action (Union[Action, List[Action], str, Dict[int, str]]) – The action(s) taken by the agent, which should contain the response(s) to the observation(s). Can be:
  • A single Action object (for batch size 1),

  • A list of Action objects (for batched evaluation),

  • A raw string (only allowed when batch size is 1),

  • A dict that maps indices to their llm_response (for batched evaluation).

Returns:

A tuple or list of tuples containing:
  • Observation: Placeholder indicating episode end.

  • float: The reward for the response.

  • bool: Whether the episode is done (always True in this case).

  • dict: Additional info including the proposed solution, verification result, and original data point.

Return type:

Union[Tuple[Observation, float, bool, Dict[str, Any]], List[…]]

Raises:
  • RuntimeError – If the environment has not been set up, or if reset() has not been called.

  • ValueError – If invalid action format, duplicate indices, or out-of-bounds indices are detected.

class camel.environments.StepResult(*, observation: ~camel.environments.models.Observation, reward: float, rewards_dict: ~typing.Dict[str, float] = <factory>, done: bool, info: ~typing.Dict[str, ~typing.Any] = <factory>)[source]#

Bases: BaseModel

Result of an environment step.

observation#

The next observation.

Type:

camel.environments.models.Observation

reward#

The total reward for the step.

Type:

float

rewards_dict#

Dictionary of reward scores for different aspects.

Type:

Dict[str, float]

done#

Whether the episode is complete.

Type:

bool

info#

Additional information about the step.

Type:

Dict[str, Any]

as_tuple() Tuple[Observation, float, bool, Dict[str, Any]][source]#

Returns all fields of the model as a tuple, in declaration order

done: bool#
info: Dict[str, Any]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

observation: Observation#
reward: float#
rewards_dict: Dict[str, float]#
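
Example (field values are illustrative):

    from camel.environments import Observation, StepResult

    result = StepResult(
        observation=Observation(question="Next question?"),
        reward=1.0,
        rewards_dict={"accuracy": 1.0},
        done=False,
        info={"note": "illustrative"},
    )
    step_tuple = result.as_tuple()  # (observation, reward, done, info) per the annotated return type
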
class camel.environments.TicTacToeEnv(extractor: BaseExtractor | None = None, max_steps: int | None = None, play_style: Literal['optimal', 'random'] = 'optimal', **kwargs)[source]#

Bases: MultiStepEnv

A Tic Tac Toe environment for reinforcement learning with LLMs.

This environment implements a standard Tic Tac Toe game where the LLM agent plays as ‘X’ against an AI opponent that plays as ‘O’. The opponent can use either an optimal strategy (minimax with alpha-beta pruning) or a random strategy.

WIN_COMBINATIONS: ClassVar = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]#
static available_moves(board: List[str]) List[int][source]#

Get all available moves on the board.

Parameters:

board (List[str]) – The current game board as a list of strings.

Returns:

A list of indices representing empty cells on the board.

Return type:

List[int]

static check_winner(board: List[str]) Literal['X', 'O', 'draw'] | None[source]#

Check if there is a winner or a draw on the board.

Parameters:

board (List[str]) – The current game board as a list of strings.

Returns:

“X” if X has won, “O” if O has won, “draw” if the game is a draw, or None if the game is still ongoing.

Return type:

Optional[Literal[“X”, “O”, “draw”]]

async compute_reward() Tuple[float, Dict[str, float]][source]#

Compute the reward for the current state.

Returns:

A tuple containing the total reward and a dictionary of reward components:
  • 1.0 for a win

  • 0.0 for a loss or illegal move

  • 0.5 for a draw

  • For ongoing games, returns an evaluation of the position

Return type:

Tuple[float, Dict[str, float]]

static evaluate_position_for_x(board: List[str], is_x_turn: bool, depth: int = 0, max_depth: int = 10) float[source]#

Evaluate the current board position from X’s perspective.

Uses minimax to determine the value of the position.

Parameters:
  • board (List[str]) – The current game board as a list of strings.

  • is_x_turn (bool) – True if it’s X’s turn to move, False otherwise.

  • depth (int) – Current depth in the search tree. (default: 0)

  • max_depth (int) – Maximum search depth. (default: 10)

Returns:

A float value representing the position evaluation:
  • 1.0 if X has a winning position

  • 0.0 if O has a winning position

  • 0.5 for a draw

  • For ongoing positions, returns the expected outcome with perfect play

Return type:

float

render_board(board: List[str]) str[source]#

Render the board as a string for display.

Parameters:

board (List[str]) – The current game board as a list of strings.

Returns:

A formatted string representation of the board.

Return type:

str
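
Example (using the static helpers on a standalone board; the single-space marker for empty cells is an assumption of this sketch):

    from camel.environments import TicTacToeEnv

    board = ["X", "X", "X",
             "O", "O", " ",
             " ", " ", " "]

    print(TicTacToeEnv.check_winner(board))     # "X"
    print(TicTacToeEnv.available_moves(board))  # indices of the empty cells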