CoT Data Generation with CAMEL and Upload Data to Huggingface#

You can also check this cookbook in colab here (Use the colab share link)

This notebook demonstrates how to set up and leverage CAMEL’s CoTDataGenerator for generating high-quality question-answer pairs like o1 thinking data, uploading the data to Hugging Face.

In this notebook, you’ll explore:

  • CAMEL: A powerful multi-agent framework that enables SFT data generation and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.

  • CoTDataGenerator: A tool for generating like o1 thinking data.

  • Hugging Face Integration: Uploading datasets to the Hugging Face platform for sharing

Star the Repo

If you find CAMEL useful or interesting, please consider giving it a star on our CAMEL GitHub Repo! Your stars help others find this project and motivate us to continue improving it.

📦 Installation#

[ ]:
%%capture
!pip install camel-ai==0.2.16
[ ]:
import os
from datetime import datetime
import json
from camel.datagen.cotdatagen import CoTDataGenerator

🔑 Setting Up API Keys#

First we will set the OPENAI_API_KEY that will be used to generate the data.

[ ]:
from getpass import getpass
[ ]:
openai_api_key = getpass('Enter your OpenAI API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key
Enter your OpenAI API key: ··········

Set ChatAgent#

Create a system message to define agent’s default role and behaviors.

[ ]:
sys_msg = 'You are a genius at slow-thinking data and code'

Use ModelFactory to set up the backend model for agent

CAMEL supports many other models. See here for a list.

[ ]:
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.configs import ChatGPTConfig
[ ]:
# Define the model, here in this case we use gpt-4o-mini
model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
    model_config_dict=ChatGPTConfig().as_dict(), # [Optional] the config for model
)
[ ]:
from camel.agents import ChatAgent
chat_agent = ChatAgent(
    system_message=sys_msg,
    model=model,
    message_window_size=10,
)

Load Q&A data from a JSON file#

please prepare the qa data like below in json file:#

‘’’ { “question1”: “answer1”, “question2”: “answer2”, … } ‘’’

[ ]:
!pwd
/content

The script fetches a example JSON file containing question-answer pairs from a GitHub repository and saves it locally. The JSON file is then loaded into the qa_data variable.

[ ]:
#get example json data

import requests
import json

# URL of the JSON file
url = 'https://raw.githubusercontent.com/zjrwtx/alldata/refs/heads/main/qa_data.json'

# Send a GET request to fetch the JSON file
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response content as JSON
    json_data = response.json()

    # Specify the file path to save the JSON data
    file_path = 'qa_data.json'

    # Write the JSON data to the file
    with open(file_path, 'w', encoding='utf-8') as json_file:
        json.dump(json_data, json_file, ensure_ascii=False, indent=4)

    print(f"JSON data successfully saved to {file_path}")
else:
    print(f"Failed to retrieve JSON file. Status code: {response.status_code}")
JSON data successfully saved to qa_data.json
[ ]:
with open(file_path, 'r', encoding='utf-8') as f:
    qa_data = json.load(f)

Create an instance of CoTDataGenerator#

[ ]:
# Create an instance of CoTDataGenerator
testo1 = CoTDataGenerator(chat_agent, golden_answers=qa_data)
[ ]:
# Record generated answers
generated_answers = {}

Test Q&A#

The script iterates through the questions, generates answers, and verifies their correctness. The generated answers are stored in a dictionary

[ ]:
# Test Q&A
for question in qa_data.keys():
    print(f"Question: {question}")

    # Get AI's thought process and answer
    answer = testo1.get_answer(question)
    generated_answers[question] = answer
    print(f"AI's thought process and answer:\n{answer}")

    # Verify the answer
    is_correct = testo1.verify_answer(question, answer)
    print(f"Answer verification result: {'Correct' if is_correct else 'Incorrect'}")
    print("-" * 50)
    print()  # Add a new line at the end of each iteration
Question: What is the coefficient of $x^2y^6$ in the expansion of $\left(\frac{3}{5}x-\frac{y}{2}\right)^8$?  Express your answer as a common fraction
AI's thought process and answer:
To find the coefficient of \(x^2y^6\) in the expansion of \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\), we will follow a systematic approach.

### Step 1: Analyze the Problem Requirements
We need to expand the expression \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\) and identify the specific term that contains \(x^2y^6\). This involves using the binomial theorem, which states that:

\[
(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k
\]

In our case, \(a = \frac{3}{5}x\) and \(b = -\frac{y}{2}\), and \(n = 8\).

### Step 2: List the Steps to Solve the Problem
1. Identify the general term in the binomial expansion.
2. Set up the equation to find the specific term that corresponds to \(x^2y^6\).
3. Solve for the coefficients and powers.
4. Calculate the coefficient for the term \(x^2y^6\).

### Step 3: Execute the Solution Process
1. **Identify the General Term**: The general term in the expansion can be expressed as:

\[
T_k = \binom{8}{k} \left(\frac{3}{5}x\right)^{8-k} \left(-\frac{y}{2}\right)^k
\]

2. **Set Up the Equation**: We need to find \(k\) such that the term \(T_k\) contains \(x^2y^6\). This means we need:

- The power of \(x\) to be 2: \(8 - k = 2\) implies \(k = 6\).
- The power of \(y\) to be 6: \(k = 6\).

Since both conditions are satisfied with \(k = 6\), we will use this value to find the corresponding term.

3. **Calculate the Coefficient**: Substitute \(k = 6\) into the general term:

\[
T_6 = \binom{8}{6} \left(\frac{3}{5}x\right)^{2} \left(-\frac{y}{2}\right)^{6}
\]

Now calculate each component:

- \(\binom{8}{6} = \binom{8}{2} = \frac{8 \times 7}{2 \times 1} = 28\)
- \(\left(\frac{3}{5}\right)^{2} = \frac{9}{25}\)
- \(\left(-\frac{1}{2}\right)^{6} = \frac{1}{64}\)

Now, combine these to find the coefficient:

\[
T_6 = 28 \cdot \frac{9}{25} \cdot \frac{1}{64} \cdot x^2 \cdot y^6
\]

Calculating the coefficient:

\[
\text{Coefficient} = 28 \cdot \frac{9}{25} \cdot \frac{1}{64} = \frac{28 \cdot 9}{25 \cdot 64}
\]

Calculating \(28 \cdot 9 = 252\):

\[
\text{Coefficient} = \frac{252}{1600}
\]

4. **Simplify the Fraction**: We can simplify \(\frac{252}{1600}\) by finding the greatest common divisor (GCD) of 252 and 1600. The GCD is 4.

\[
\frac{252 \div 4}{1600 \div 4} = \frac{63}{400}
\]

### Step 4: Provide the Final Answer
The coefficient of \(x^2y^6\) in the expansion of \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\) is:

\[
\boxed{\frac{63}{400}}
\]
Answer verification result: Correct
--------------------------------------------------

Question: how many r in strawberry?
AI's thought process and answer:
To solve the problem of how many times the letter "r" appears in the word "strawberry," we will follow the outlined requirements step by step.

### Step 1: Analyze the Problem Requirements
The problem requires us to determine the frequency of the letter "r" in the word "strawberry." This involves:
- Identifying the specific letter we are counting, which is "r."
- Understanding that we need to count occurrences of this letter in the given word.

### Step 2: List the Steps to Solve the Problem
To solve the problem, we can break it down into the following steps:
1. Write down the word "strawberry."
2. Identify and isolate the letter "r" within the word.
3. Count how many times the letter "r" appears in the word.

### Step 3: Execute the Solution Process
Now, let's execute the steps we have outlined:

1. **Write down the word**: The word is "strawberry."
2. **Identify the letter "r"**: We will look at each letter in the word:
   - s
   - t
   - r (1st occurrence)
   - a
   - w
   - b
   - e
   - r (2nd occurrence)
   - r (3rd occurrence)
   - y

3. **Count the occurrences**: As we go through the letters, we find:
   - The first "r" is the 3rd letter.
   - The second "r" is the 8th letter.
   - The third "r" is the 9th letter.

Thus, we have identified that the letter "r" appears **three times** in the word "strawberry."

### Step 4: Provide the Final Answer
After counting the occurrences, we conclude that the letter "r" appears **3 times** in the word "strawberry."

### Summary of Thought Process
- We carefully analyzed the problem to understand what was being asked.
- We broke down the solution into manageable steps to ensure clarity and accuracy.
- We executed the counting process methodically, ensuring that we did not miss any occurrences of the letter "r."
- Finally, we provided a clear and concise answer based on our findings.

The final answer is: **3**.
Answer verification result: Correct
--------------------------------------------------

Export the generated answers to a JSON file and transform these to Alpaca traing data format#

[ ]:
simplified_output = {
    'timestamp': datetime.now().isoformat(),
    'qa_pairs': generated_answers
}
simplified_file = f'generated_answers_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
with open(simplified_file, 'w', encoding='utf-8') as f:
    json.dump(simplified_output, f, ensure_ascii=False, indent=2)
print(f"The generated answers have been exported to: {simplified_file}")
The generated answers have been exported to: generated_answers_20241227_111615.json

The script transforms the Q&A data into the Alpaca training data format, which is suitable for supervised fine-tuning (SFT). The transformed data is saved to a new JSON file.

[ ]:
import json
from datetime import datetime

def transform_qa_format(input_file):
    # Read the input JSON file
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Transform the data
    transformed_data = []
    for question, answer in data['qa_pairs'].items():
        transformed_pair = {
            "instruction": question,
            "input": "",
            "output": answer
        }
        transformed_data.append(transformed_pair)

    # Generate output filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'transformed_qa_{timestamp}.json'

    # Write the transformed data
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(transformed_data, f, ensure_ascii=False, indent=2)

    return output_file, transformed_data
[ ]:
output_file, transformed_data = transform_qa_format(simplified_file)
print(f"Transformation complete. Output saved to: {output_file}")
Transformation complete. Output saved to: transformed_qa_20241227_111622.json

Upload the Data to Huggingface#

This defines a function upload_to_huggingface that uploads a dataset to Hugging Face. The script is modular, with helper functions handling specific tasks such as dataset name generation, dataset creation, metadata card creation, and record addition

[ ]:
# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.

    Returns:
        str: URL of the uploaded dataset.
    """
    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)

    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)

    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    # Return the dataset URL
    return dataset_url

# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.

    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.

    Returns:
        str: Formatted dataset name.
    """
    if dataset_name is None:
        # If no dataset name is provided, generate a default name with the username and current date
        dataset_name = f"{username}/qa-dataset-{datetime.now().strftime('%Y%m%d')}"
    else:
        # If a dataset name is provided, format it to include the username
        dataset_name = f"{username}/{dataset_name}"
    return dataset_name

# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.

    Returns:
        str: URL of the created dataset.
    """
    print(f"Creating dataset: {dataset_name}")
    # Use HuggingFaceDatasetManager to create the dataset
    dataset_url = manager.create_dataset(name=dataset_name)
    print(f"Dataset created: {dataset_url}")
    return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """
    print("Creating dataset card...")
    # Use HuggingFaceDatasetManager to create the dataset card
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="Question-Answer dataset generated by CAMEL CoTDataGenerator",  # Dataset description
        license="mit",  # Dataset license
        language=["en"],  # Dataset language
        size_category="<1MB",  # Dataset size category
        version="0.1.0",  # Dataset version
        tags=["camel", "question-answering"],  # Dataset tags
        task_categories=["question-answering"],  # Dataset task categories
        authors=[username]  # Dataset authors
    )
    print("Dataset card created successfully.")

# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.

    Returns:
        list: List of Record objects.
    """
    records = []
    # Iterate through the transformed data and convert each dictionary into a Record object
    for item in transformed_data:
        record = Record(**item)  # Use the dictionary key-value pairs to create a Record object
        records.append(record)
    return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """
    print("Adding records to the dataset...")
    # Use HuggingFaceDatasetManager to add the records to the dataset
    manager.add_records(dataset_name=dataset_name, records=records)
    print("Records added successfully.")

Config Access Token of Huggingface#

You can go to here to get API Key from Huggingface

[ ]:
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
Enter your HUGGING_FACE_TOKEN: ··········
[ ]:
# Set your personal huggingface config, then upload to HuggingFace
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter dataset name (press Enter to use default): ").strip()
if not dataset_name:
    dataset_name = None

try:
    dataset_url = upload_to_huggingface(transformed_data, username, dataset_name)
    print(f"\nData successfully uploaded to HuggingFace!")
    print(f"Dataset URL: {dataset_url}")
except Exception as e:
    print(f"Error uploading to HuggingFace: {str(e)}")

Enter your HuggingFace username: zjrwtxtechstudio
Enter dataset name (press Enter to use default): o1data88
Creating dataset: zjrwtxtechstudio/o1data88
Dataset created: https://huggingface.co/datasets/zjrwtxtechstudio/o1data88
Creating dataset card...
Dataset card created successfully.
Adding records to the dataset...
Records added successfully.

Data successfully uploaded to HuggingFace!
Dataset URL: https://huggingface.co/datasets/zjrwtxtechstudio/o1data88
image.png

🌟 Highlights#

This cookbook demonstrates the process of using CAMEL’s CoTDataGenerator to create high-quality question-answer pairs, similar to o1 thinking data. The notebook covers the following steps:

  1. Setup: Installation of the camel-ai library and configuration of the OpenAI API key.

  2. Data Generation: Utilization of the CoTDataGenerator to generate answers for predefined questions using llm model.

  3. Data Transformation: Conversion of the generated Q&A data into a format compliant with the Alpaca training data schema.

  4. Upload to Hugging Face: Integration with Hugging Face to upload the transformed dataset, including the creation of a dataset card and metadata.

The cookbook also includes detailed instructions for setting up the environment, handling API keys, and configuring the Hugging Face dataset upload process. The final output is a dataset uploaded to Hugging Face, ready for sharing and further use in AI training tasks.

Star the Repo

If you find CAMEL useful or interesting, please consider giving it a star on GitHub! Your stars help others find this project and motivate us to continue improving it.