A dataset can either be a `StaticDataset`, which is finite and whose length is known at runtime, or a `BaseGenerator`, which is an infinite supply of question-answer pairs, synthetically generated in some way (depending on the implementation).

For the sake of simplicity, we will start by loading the MATH dataset, removing unnecessary columns and renaming the remaining ones, so that we can easily turn it into a `StaticDataset`, which `SingleStepEnv` can deal with.
First, install the CAMEL package with all its dependencies:
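For example (notebook-style; the `[all]` extra pulls in the optional dependencies, and whether you pin a specific version is up to you):

```python
# Install CAMEL with all optional extras (notebook magic; use plain `pip install` in a shell).
%pip install "camel-ai[all]"
```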
Answers in the MATH dataset are wrapped in `\boxed{...}`, hence we should use the pre-built `BoxedStrategy` to extract them.
Sadly, MATH answers are rather complicated, and a more general math verifier that can compare, for example, equations has not yet been implemented. Hence, we shall prune the dataset to only those rows where the content of `\boxed{...}` is an int. For the sake of simplicity, we shall also prune the ground truths to the direct answer (such that they are Python expressions). That way, we can do simple verification using the vanilla `PythonVerifier`!
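A sketch of this preprocessing step is shown below. It assumes the MATH dataset is pulled from the Hugging Face Hub and that `StaticDataset` accepts records with `question`/`final_answer` fields; the dataset identifier, column names, and field names are assumptions, so adapt them to your setup.

```python
import re

from datasets import load_dataset

from camel.datasets import StaticDataset


def extract_boxed(solution: str) -> str | None:
    # Return the content of the last \boxed{...} span in a solution, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None


# Dataset identifier and column names are assumptions; any copy of MATH
# with "problem"/"solution" columns works the same way.
raw = load_dataset("EleutherAI/hendrycks_math", "algebra", split="train")

records = []
for row in raw:
    answer = extract_boxed(row["solution"])
    # Keep only rows whose boxed answer is a plain int, so that the
    # vanilla PythonVerifier can evaluate and compare it directly.
    if answer is not None and re.fullmatch(r"-?\d+", answer.strip()):
        records.append({"question": row["problem"], "final_answer": answer.strip()})

dataset = StaticDataset(records)
print(f"Kept {len(dataset)} problems with integer answers.")
```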
We hand the extractor to the verifier, so that the final answer is pulled out of `\boxed{...}` on the response side, too.
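Putting the pieces together might look roughly like this; the exact import paths and constructor arguments are assumptions based on recent CAMEL versions, so double-check them against the API you have installed.

```python
from camel.environments import SingleStepEnv
from camel.extractors import BaseExtractor, BoxedStrategy
from camel.verifiers import PythonVerifier

# Extractor that pulls the final answer out of \boxed{...}.
extractor = BaseExtractor([[BoxedStrategy()]])

# Verifier that evaluates extracted answers as Python expressions and
# compares them against the ground truth.
verifier = PythonVerifier(extractor=extractor)

# The environment samples questions from the dataset and scores answers
# with the verifier.
env = SingleStepEnv(dataset, verifier)

await env.setup()  # notebook-style await; use asyncio.run(...) in a plain script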
We then call `env.reset(batch_size=4)` to draw from the initial state distribution (the dataset) and return `batch_size` many observations, which can then be fed into the agent.
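For example (again notebook-style `await`; that each observation exposes the question text via a `question` field is an assumption):

```python
# Draw a batch of four questions from the dataset.
observations = await env.reset(batch_size=4)

for i, obs in enumerate(observations):
    print(f"Observation {i}:\n{obs.question}\n")
```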
The agent's answers are then fed back into the environment via the `step` function. An action in this case is simply the answer to the question, wrapped in `\boxed{}` (since we initialized our verifier with an extractor that extracts from `\boxed{...}`).
Since we are dealing with batches here, each action has to carry the index of the question it answers, so that it matches up with the observation it came from. This way, we support microbatching and out-of-order execution!
Let’s suppose we deal with microbatches in reverse order:
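A sketch of that flow follows; it assumes `Action` carries an `index` and an `llm_response` field and that `step` accepts a list of actions per micro-batch and returns one result per action. The hard-coded answers stand in for real agent output.

```python
from camel.environments import Action

# Second half of the batch first, then the first half: the index ties each
# answer to the observation it belongs to, so order across calls does not matter.
late_microbatch = [
    Action(index=3, llm_response="\\boxed{5}"),
    Action(index=2, llm_response="\\boxed{12}"),
]
early_microbatch = [
    Action(index=1, llm_response="\\boxed{-4}"),
    Action(index=0, llm_response="\\boxed{27}"),
]

step_results = await env.step(late_microbatch)
step_results += await env.step(early_microbatch)
```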
The output of the `step` function contains the next observation (which in this case is just a placeholder, since the episode is over), a reward, a reward dict showing exactly which rubric contributed which reward, a `done` flag indicating that the episode is over, and some additional info.
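Assuming each entry of `step_results` above is such a result object, inspecting one of them might look like this (the field names follow the description above and may differ between CAMEL versions):

```python
result = step_results[0]

print(result.observation)   # placeholder next observation; the episode is over
print(result.reward)        # total scalar reward for this answer
print(result.rewards_dict)  # breakdown of which rubric contributed which reward
print(result.done)          # True, since a single-step episode ends after one answer
print(result.info)          # any additional info attached by the environment
```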
In this case, we get the accuracy reward, i.e. the reward for a correct final answer. It can be accessed and changed via the `self.ACCURACY_REWARD` attribute.
Since we did not implement any other reward components, such as a formatting reward, the accuracy reward is our total reward.
This is how to use the batched Single Step environment!