CoT Data Generation and SFT Qwen With Unsloth#
You can also check this cookbook in Colab here.
To run this, press "Runtime" and then "Run all" on a free Tesla T4 Google Colab instance!
This notebook demonstrates how to set up and use CAMEL's CoTDataGenerator to generate high-quality question-answer pairs with o1-style thinking data, fine-tune a language model using Unsloth, and upload the results to Hugging Face.
In this notebook, you'll explore:
CAMEL: A powerful multi-agent framework that enables SFT data generation and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.
CoTDataGenerator: A tool for generating o1-style thinking data.
Unsloth: An efficient library for fine-tuning large language models with LoRA (Low-Rank Adaptation) and other optimization techniques.
Hugging Face Integration: Uploading datasets and fine-tuned models to the Hugging Face platform for sharing.
📦 Installation#
[ ]:
%%capture
!pip install camel-ai==0.2.16
Unsloth requires a GPU environment. To install Unsloth on your own computer, follow the installation instructions here.
[ ]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
[ ]:
import os
from datetime import datetime
import json
from camel.datagen.cotdatagen import CoTDataGenerator
🔑 Setting Up API Keys#
First, we will set the OPENAI_API_KEY that will be used to generate the data.
[ ]:
from getpass import getpass
[ ]:
openai_api_key = getpass('Enter your OpenAI API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key
Enter your OpenAI API key: ··········
Set ChatAgent#
Create a system message to define the agent's default role and behaviors.
[ ]:
sys_msg = 'You are a genius at slow-thinking data and code'
Use ModelFactory to set up the backend model for the agent.
CAMEL supports many other models. See here for a list.
[ ]:
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.configs import ChatGPTConfig
[ ]:
# Define the model; in this case we use gpt-4o-mini
model = ModelFactory.create(
model_platform=ModelPlatformType.OPENAI,
model_type=ModelType.GPT_4O_MINI,
model_config_dict=ChatGPTConfig().as_dict(), # [Optional] the config for model
)
[ ]:
from camel.agents import ChatAgent
chat_agent = ChatAgent(
system_message=sys_msg,
model=model,
message_window_size=10,
)
Load Q&A data from a JSON file#
Please prepare the QA data in a JSON file like the one below:#
```
{
    "question1": "answer1",
    "question2": "answer2",
    ...
}
```
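As a concrete illustration, the snippet below (a sketch using made-up questions and a hypothetical file name) writes and reloads a minimal QA file in this format:

```python
import json

# A minimal example of the expected QA format: a flat mapping
# from question strings to golden-answer strings.
qa_example = {
    "What is 2+2?": "4",
    "What is the capital of France?": "Paris",
}

# Write the mapping to disk in the same style as the cookbook.
with open("qa_data_example.json", "w", encoding="utf-8") as f:
    json.dump(qa_example, f, ensure_ascii=False, indent=4)

# Reload to confirm the round trip preserves the mapping.
with open("qa_data_example.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == qa_example)  # True
```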
The script fetches an example JSON file containing question-answer pairs from a GitHub repository and saves it locally. The JSON file is then loaded into the qa_data variable.
[ ]:
# Get example JSON data
import requests
import json
# URL of the JSON file
url = 'https://raw.githubusercontent.com/zjrwtx/alldata/refs/heads/main/qa_data.json'
# Send a GET request to fetch the JSON file
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
# Parse the response content as JSON
json_data = response.json()
# Specify the file path to save the JSON data
file_path = 'qa_data.json'
# Write the JSON data to the file
with open(file_path, 'w', encoding='utf-8') as json_file:
json.dump(json_data, json_file, ensure_ascii=False, indent=4)
print(f"JSON data successfully saved to {file_path}")
else:
print(f"Failed to retrieve JSON file. Status code: {response.status_code}")
JSON data successfully saved to qa_data.json
[ ]:
with open(file_path, 'r', encoding='utf-8') as f:
qa_data = json.load(f)
Create an instance of CoTDataGenerator#
[ ]:
# Create an instance of CoTDataGenerator
testo1 = CoTDataGenerator(chat_agent, golden_answers=qa_data)
[ ]:
# Record generated answers
generated_answers = {}
Test Q&A#
The script iterates through the questions, generates answers, and verifies their correctness. The generated answers are stored in a dictionary.
[ ]:
# Test Q&A
for question in qa_data.keys():
print(f"Question: {question}")
# Get AI's thought process and answer
answer = testo1.get_answer(question)
generated_answers[question] = answer
print(f"AI's thought process and answer:\n{answer}")
# Verify the answer
is_correct = testo1.verify_answer(question, answer)
print(f"Answer verification result: {'Correct' if is_correct else 'Incorrect'}")
print("-" * 50)
print() # Add a new line at the end of each iteration
Question: What is the coefficient of $x^2y^6$ in the expansion of $\left(\frac{3}{5}x-\frac{y}{2}\right)^8$? Express your answer as a common fraction
AI's thought process and answer:
To find the coefficient of \(x^2y^6\) in the expansion of \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\), we will follow a systematic approach.
### Step 1: Analyze the Problem Requirements
We need to expand the expression \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\) and identify the coefficient of the term \(x^2y^6\). This requires us to use the binomial theorem, which states that:
\[
(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k
\]
In our case, \(a = \frac{3}{5}x\), \(b = -\frac{y}{2}\), and \(n = 8\).
### Step 2: List the Steps to Solve the Problem
1. Identify the values of \(a\), \(b\), and \(n\).
2. Use the binomial theorem to express the expansion.
3. Determine the specific term that corresponds to \(x^2y^6\).
4. Calculate the coefficient of that term.
### Step 3: Execute the Solution Process
1. **Identify \(a\), \(b\), and \(n\)**:
- \(a = \frac{3}{5}x\)
- \(b = -\frac{y}{2}\)
- \(n = 8\)
2. **Use the Binomial Theorem**:
The general term in the expansion is given by:
\[
\binom{n}{k} a^{n-k} b^k
\]
Substituting our values, we have:
\[
\binom{8}{k} \left(\frac{3}{5}x\right)^{8-k} \left(-\frac{y}{2}\right)^k
\]
3. **Determine the specific term for \(x^2y^6\)**:
We need \(x^2\) and \(y^6\). This means:
- The power of \(x\) is \(2\), so \(8 - k = 2\) which gives \(k = 6\).
- The power of \(y\) is \(6\), which matches our \(k\).
4. **Calculate the coefficient**:
Now we substitute \(k = 6\) into the general term:
\[
\binom{8}{6} \left(\frac{3}{5}x\right)^{2} \left(-\frac{y}{2}\right)^{6}
\]
Calculating each part:
- \(\binom{8}{6} = \binom{8}{2} = \frac{8 \times 7}{2 \times 1} = 28\)
- \(\left(\frac{3}{5}\right)^{2} = \frac{9}{25}\)
- \(\left(-\frac{1}{2}\right)^{6} = \frac{1}{64}\)
Now, putting it all together:
\[
\text{Coefficient} = 28 \cdot \frac{9}{25} \cdot \frac{1}{64}
\]
Calculating this step-by-step:
1. Multiply \(28\) and \(9\):
\[
28 \cdot 9 = 252
\]
2. Now multiply by \(\frac{1}{25}\):
\[
\frac{252}{25}
\]
3. Finally, multiply by \(\frac{1}{64}\):
\[
\frac{252}{25 \cdot 64} = \frac{252}{1600}
\]
### Step 4: Simplify the Fraction
To simplify \(\frac{252}{1600}\), we find the greatest common divisor (GCD) of \(252\) and \(1600\). The prime factorization gives:
- \(252 = 2^2 \cdot 3^2 \cdot 7\)
- \(1600 = 2^6 \cdot 5^2\)
The GCD is \(4\). Dividing both the numerator and denominator by \(4\):
\[
\frac{252 \div 4}{1600 \div 4} = \frac{63}{400}
\]
### Final Answer
Thus, the coefficient of \(x^2y^6\) in the expansion of \(\left(\frac{3}{5}x - \frac{y}{2}\right)^8\) is:
\[
\boxed{\frac{63}{400}}
\]
Answer verification result: Correct
--------------------------------------------------
Question: how many a in banana?
AI's thought process and answer:
Sure! Let's break down the problem of counting how many times the letter "a" appears in the word "banana" step by step.
### Step 1: Analyze the Problem Requirements
The problem requires us to determine the frequency of the letter "a" in the word "banana." We need to:
- Identify the target letter, which is "a."
- Count how many times this letter appears in the given word.
### Step 2: List the Steps to Solve the Problem
To solve the problem, we can follow these steps:
1. Write down the word "banana."
2. Identify each letter in the word.
3. Count the occurrences of the letter "a."
4. Summarize the count.
### Step 3: Execute the Solution Process
Now, let's execute the steps we outlined:
1. The word we are analyzing is **"banana."**
2. The letters in "banana" are:
- b
- a
- n
- a
- n
- a
3. Now, we will count the occurrences of the letter "a":
- The first letter is **b** (not "a").
- The second letter is **a** (count = 1).
- The third letter is **n** (not "a").
- The fourth letter is **a** (count = 2).
- The fifth letter is **n** (not "a").
- The sixth letter is **a** (count = 3).
### Step 4: Provide the Final Answer
After counting, we find that the letter "a" appears **3 times** in the word "banana."
### Summary of Thought Process
- We started by understanding the requirement: counting a specific letter in a word.
- We broke down the problem into manageable steps, ensuring clarity in our approach.
- We executed the steps methodically, ensuring we counted each occurrence accurately.
- Finally, we summarized our findings to provide a clear answer.
Thus, the final answer is that there are **3 occurrences of the letter "a" in the word "banana."**
Answer verification result: Correct
--------------------------------------------------
Export the generated answers to a JSON file and transform them into the Alpaca training data format#
[ ]:
simplified_output = {
'timestamp': datetime.now().isoformat(),
'qa_pairs': generated_answers
}
simplified_file = f'generated_answers_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
with open(simplified_file, 'w', encoding='utf-8') as f:
json.dump(simplified_output, f, ensure_ascii=False, indent=2)
print(f"The generated answers have been exported to: {simplified_file}")
The generated answers have been exported to: generated_answers_20250111_114951.json
The script transforms the Q&A data into the Alpaca training data format, which is suitable for supervised fine-tuning (SFT). The transformed data is saved to a new JSON file.
[ ]:
import json
from datetime import datetime
def transform_qa_format(input_file):
# Read the input JSON file
with open(input_file, 'r', encoding='utf-8') as f:
data = json.load(f)
# Transform the data
transformed_data = []
for question, answer in data['qa_pairs'].items():
transformed_pair = {
"instruction": question,
"input": "",
"output": answer
}
transformed_data.append(transformed_pair)
# Generate output filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = f'transformed_qa_{timestamp}.json'
# Write the transformed data
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(transformed_data, f, ensure_ascii=False, indent=2)
return output_file, transformed_data
[ ]:
output_file, transformed_data = transform_qa_format(simplified_file)
print(f"Transformation complete. Output saved to: {output_file}")
Transformation complete. Output saved to: transformed_qa_20250111_115000.json
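Each entry in the transformed file should follow the Alpaca schema with instruction, input, and output keys. A quick sanity check, sketched here with a hypothetical sample record, can catch malformed entries before upload:

```python
# Hypothetical sample in the Alpaca format produced above.
sample = [
    {
        "instruction": "how many a in banana?",
        "input": "",
        "output": "There are 3 occurrences of the letter 'a' in 'banana'.",
    }
]

# Every Alpaca-style record must carry these three keys.
required_keys = {"instruction", "input", "output"}
for record in sample:
    missing = required_keys - record.keys()
    assert not missing, f"Record missing keys: {missing}"
print("All records match the Alpaca schema.")
```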
Upload the Data to Hugging Face#
This defines a function upload_to_huggingface that uploads a dataset to Hugging Face. The script is modular, with helper functions handling specific tasks such as dataset name generation, dataset creation, metadata card creation, and record addition.
[ ]:
# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record # Represents a single record in the dataset
from datetime import datetime # Handles date and time operations
# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
r"""Uploads transformed data to the Hugging Face dataset platform.
Args:
transformed_data (list): Transformed data, typically a list of dictionaries.
username (str): Hugging Face username.
dataset_name (str, optional): Custom dataset name.
Returns:
str: URL of the uploaded dataset.
"""
# Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
manager = HuggingFaceDatasetManager()
# Generate or validate the dataset name
dataset_name = generate_or_validate_dataset_name(username, dataset_name)
# Create the dataset on Hugging Face and get the dataset URL
dataset_url = create_dataset(manager, dataset_name)
# Create a dataset card to add metadata
create_dataset_card(manager, dataset_name, username)
# Convert the transformed data into a list of Record objects
records = create_records(transformed_data)
# Add the Record objects to the dataset
add_records_to_dataset(manager, dataset_name, records)
# Return the dataset URL
return dataset_url
# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
r"""Generates a default dataset name or validates and formats a user-provided name.
Args:
username (str): Hugging Face username.
dataset_name (str, optional): User-provided custom dataset name.
Returns:
str: Formatted dataset name.
"""
if dataset_name is None:
# If no dataset name is provided, generate a default name with the username and current date
dataset_name = f"{username}/qa-dataset-{datetime.now().strftime('%Y%m%d')}"
else:
# If a dataset name is provided, format it to include the username
dataset_name = f"{username}/{dataset_name}"
return dataset_name
# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
r"""Creates a new dataset on Hugging Face and returns the dataset URL.
Args:
manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
dataset_name (str): Name of the dataset.
Returns:
str: URL of the created dataset.
"""
print(f"Creating dataset: {dataset_name}")
# Use HuggingFaceDatasetManager to create the dataset
dataset_url = manager.create_dataset(name=dataset_name)
print(f"Dataset created: {dataset_url}")
return dataset_url
# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
r"""Creates a dataset card to add metadata
Args:
manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
dataset_name (str): Name of the dataset.
username (str): Hugging Face username.
"""
print("Creating dataset card...")
# Use HuggingFaceDatasetManager to create the dataset card
manager.create_dataset_card(
dataset_name=dataset_name,
description="Question-Answer dataset generated by CAMEL CoTDataGenerator", # Dataset description
license="mit", # Dataset license
language=["en"], # Dataset language
size_category="<1MB", # Dataset size category
version="0.1.0", # Dataset version
tags=["camel", "question-answering"], # Dataset tags
task_categories=["question-answering"], # Dataset task categories
authors=[username] # Dataset authors
)
print("Dataset card created successfully.")
# Convert transformed data into Record objects
def create_records(transformed_data):
r"""Converts transformed data into a list of Record objects.
Args:
transformed_data (list): Transformed data, typically a list of dictionaries.
Returns:
list: List of Record objects.
"""
records = []
# Iterate through the transformed data and convert each dictionary into a Record object
for item in transformed_data:
record = Record(**item) # Use the dictionary key-value pairs to create a Record object
records.append(record)
return records
# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
r"""Adds a list of Record objects to the dataset.
Args:
manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
dataset_name (str): Name of the dataset.
records (list): List of Record objects.
"""
print("Adding records to the dataset...")
# Use HuggingFaceDatasetManager to add the records to the dataset
manager.add_records(dataset_name=dataset_name, records=records)
print("Records added successfully.")
Configure Your Hugging Face Access Token#
You can go here to get an API key from Hugging Face.
[ ]:
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
Enter your HUGGING_FACE_TOKEN: ··········
[ ]:
# Set your personal huggingface config, then upload to HuggingFace
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter dataset name (press Enter to use default): ").strip()
if not dataset_name:
dataset_name = None
try:
dataset_url = upload_to_huggingface(transformed_data, username, dataset_name)
print(f"\nData successfully uploaded to HuggingFace!")
print(f"Dataset URL: {dataset_url}")
except Exception as e:
print(f"Error uploading to HuggingFace: {str(e)}")
Enter your HuggingFace username: zjrwtxtechstudio
Enter dataset name (press Enter to use default): cotdata01
Creating dataset: zjrwtxtechstudio/cotdata01
Dataset created: https://huggingface.co/datasets/zjrwtxtechstudio/cotdata01
Creating dataset card...
Dataset card created successfully.
Adding records to the dataset...
Records added successfully.
Data successfully uploaded to HuggingFace!
Dataset URL: https://huggingface.co/datasets/zjrwtxtechstudio/cotdata01
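As the output above shows, dataset pages live under the datasets namespace on Hugging Face. A minimal sketch of how such a URL is assembled (dataset_url_for is a hypothetical helper, not part of CAMEL):

```python
def dataset_url_for(username: str, dataset_name: str) -> str:
    # Hugging Face dataset pages live under the /datasets/ namespace,
    # keyed by "<username>/<dataset_name>".
    return f"https://huggingface.co/datasets/{username}/{dataset_name}"

print(dataset_url_for("zjrwtxtechstudio", "cotdata01"))
# https://huggingface.co/datasets/zjrwtxtechstudio/cotdata01
```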