# Self-instruct Data Generation Using Qwen

<div class="align-center">
  <a href="https://www.camel-ai.org/"><img src="https://i.postimg.cc/KzQ5rfBC/button.png" width="150"></a>
  <a href="https://discord.camel-ai.org"><img src="https://i.postimg.cc/L4wPdG9N/join-2.png" width="150"></a>

⭐ <i>Star us on [*Github*](https://github.com/camel-ai/camel), join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg)</i>
</div>


The self-instruct pipeline is a technique for automatically generating instruction data for large language models (LLMs). Creating instruction datasets by hand is time-consuming and expensive; the pipeline automates this process and can generate large numbers of instructions quickly.

## Installation and Setup
First, install the CAMEL package with all its dependencies:

In [None]:
!pip install "git+https://github.com/camel-ai/camel.git@c7bd39c898cb8d3bd434acd4219c3cb4f5f85ae2#egg=camel-ai[all]"

If you don't have a Qwen API key, you can obtain one by following these steps:

1. Visit the Alibaba Cloud Model Studio Console (https://www.alibabacloud.com/en?_p_lc=1) and follow the on-screen instructions to activate the model services.
2. In the upper-right corner of the console, click your account name and select API-KEY.
3. On the API Key management page, click the Create API Key button to generate a new key.

In [2]:
import os
from getpass import getpass

qwen_api_key = getpass('Enter your Qwen API key: ')
os.environ["QWEN_API_KEY"] = qwen_api_key

Enter your Qwen API key: ··········


In [3]:
from camel.configs import QwenConfig
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.agents import ChatAgent
from camel.messages import BaseMessage

qwen_model = ModelFactory.create(
    model_platform=ModelPlatformType.QWEN,
    model_type=ModelType.QWEN_TURBO,
    model_config_dict=QwenConfig(temperature=0.2).as_dict(),
)

## Basic Agent Setup

In [4]:
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

agent = ChatAgent(
    model=qwen_model,
)

## Basic Pipeline Setup


The pipeline works by starting with a small set of seed (human-written) instructions and then using an LLM to generate new instructions based on those seeds.

- The seed instructions are typically stored in a JSON Lines (JSONL) file. Each line in the file represents a single instruction in JSON format.

- The output is stored as a JSON file containing a single array of task records, making it easy to parse and use for further tasks, such as training or fine-tuning language models.
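For downstream training, the generated records usually need to be flattened into one example per instance. The following is a minimal sketch, not part of CAMEL: the helper name `to_training_pairs` is hypothetical, and the record shape (`instruction`, `instances`, `input`, `output`) is assumed from the generated output printed later in this notebook.

```python
import json


def to_training_pairs(tasks):
    """Flatten pipeline-style task records into one dict per instance."""
    pairs = []
    for task in tasks:
        for instance in task.get("instances", []):
            pairs.append({
                "instruction": task["instruction"],
                "input": instance.get("input", ""),
                "output": instance.get("output", ""),
            })
    return pairs


# A small hand-made record shaped like the pipeline output (not real output).
example_tasks = [
    {
        "id": "machine_task_1",
        "instruction": "Name a primary color.",
        "is_classification": False,
        "instances": [
            {"input": "", "output": "Red"},
            {"input": "", "output": "Blue"},
        ],
    }
]

for pair in to_training_pairs(example_tasks):
    print(json.dumps(pair))
```

Once the pipeline has run, the same helper could be applied to the list returned by `json.load` on the output file.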


Please replace `seed_path` with the path to your seed file, and replace `data_output_path` with your desired output location.

In [5]:
import os
import requests

# Create directory for local data
os.makedirs('local_data', exist_ok=True)

# Update the URL to the raw file content
url = "https://raw.githubusercontent.com/camel-ai/camel/master/examples/synthetic_datagen/self_instruct/seed_tasks.jsonl"

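# Fetch the raw file and fail early if the download did not succeed
response = requests.get(url)
response.raise_for_status()

# Save the seed tasks locally
with open('local_data/seed_tasks.jsonl', 'wb') as file:
    file.write(response.content)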


In [6]:
seed_path = 'local_data/seed_tasks.jsonl'
data_output_path = 'data_output.json'

The cell below shows some example instructions in the seed file. All seed files should follow this format.

In [7]:
# Print the first 10 seed tasks
with open('local_data/seed_tasks.jsonl', 'r') as file:
    for i, line in enumerate(file):
        print(line.strip())
        if i >= 9:
            break

{"id": "seed_task_0", "name": "breakfast_suggestion", "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?", "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."}], "is_classification": false}
{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}
{"id": "seed_task_2", "name": "one_sentence_description", "instruction": "Generate a one-sentence description for each of the following people.", "in
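Because every seed file is expected to follow this format, a quick sanity check before running the pipeline can catch malformed lines early. The sketch below is not part of CAMEL: the helper name `validate_seed_lines` is hypothetical, and the set of required keys is an assumption taken from the sample records printed above.

```python
import json

# Keys observed in the sample seed records above (an assumption, not a CAMEL schema).
REQUIRED_KEYS = {"id", "name", "instruction", "instances", "is_classification"}


def validate_seed_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed seed lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


# Hand-made sample lines: one valid (abbreviated) record and two broken ones.
sample = [
    '{"id": "seed_task_1", "name": "antonym_relation", '
    '"instruction": "What is the relation between the given pairs?", '
    '"instances": [{"input": "Night : Day :: Right : Left", '
    '"output": "They are opposites."}], "is_classification": false}',
    '{"id": "seed_task_x", "instruction": "Missing fields"}',
    'not json at all',
]
print(validate_seed_lines(sample))
```

To check a real file, pass the open file object directly, for example `validate_seed_lines(open(seed_path))`, since iterating over a file yields its lines.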

The self-instruct pipeline works iteratively. In each round:

1. It selects a certain number of human-written instructions (`num_human_sample`) from the `seed_path`.
2. It selects a certain number of machine-generated instructions (`num_machine_sample`) from previous rounds.
3. It uses these selected instructions to guide the language model in generating new instructions.
4. These new instructions are added to the pool of machine-generated instructions, and the process repeats until the desired number of instructions is generated.

The `human_to_machine_ratio` helps control the balance between human guidance and the model's creativity throughout this process. By adjusting this ratio, you can influence the quality and diversity of the generated instructions.
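The sampling step of each round can be sketched as follows. This is an illustration of the idea only, not CAMEL's implementation: the function name `sample_round` is hypothetical, and topping up from the human pool when too few machine-generated instructions exist (as in the first round) is a design choice of this sketch.

```python
import random


def sample_round(human_pool, machine_pool, num_human, num_machine, rng=random):
    """Pick the example instructions that guide one generation round."""
    # Early rounds may have fewer machine instructions than requested.
    machine_count = min(num_machine, len(machine_pool))
    # Assumption of this sketch: fill the shortfall with extra human instructions.
    human_count = min(num_human + (num_machine - machine_count), len(human_pool))
    return rng.sample(human_pool, human_count) + rng.sample(machine_pool, machine_count)


human = [f"human instruction {i}" for i in range(10)]
machine = []
rng = random.Random(0)  # seeded so the run is repeatable

first = sample_round(human, machine, 6, 2, rng)  # no machine instructions yet
machine += ["machine instruction 0", "machine instruction 1", "machine instruction 2"]
later = sample_round(human, machine, 6, 2, rng)

print(len(first), sum(s.startswith("machine") for s in first))
print(len(later), sum(s.startswith("machine") for s in later))
```

With a ratio of `(6, 2)`, each prompt contains eight examples; a higher share of human examples keeps generations closer to the seed style, while a higher share of machine examples lets the pool drift further from it.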

Feel free to adjust `num_human_sample` and `num_machine_sample`; both are passed to the pipeline later as `human_to_machine_ratio`.

In [8]:
num_human_sample = 6
num_machine_sample = 2

Please replace `target_num_instructions` with the number of machine-generated instructions you want to produce.


In [9]:
target_num_instructions = 5

Pass everything to our pipeline.

In [10]:
pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
)

Try generating it! You will see the generated data file being created at your desired location!

In [11]:
pipeline.generate()

Pretty print the generated data content

In [12]:
import json

with open(data_output_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))

[
    {
        "id": "machine_task_1",
        "instruction": "Design a simple logo that represents both unity and diversity for a community organization.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "A circular logo with a pattern composed of different colored puzzle pieces fitting together perfectly, symbolizing unity and diversity within the community."
            },
            {
                "input": "",
                "output": "An emblem featuring a tree with leaves of various shapes and colors, signifying growth, inclusion, and unity among diverse members of the community."
            }
        ]
    },
    {
        "id": "machine_task_2",
        "instruction": "Create a step-by-step guide explaining how to perform a specific magic trick.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "1. Begin by showin

## Filter functions

Newly generated instructions are filtered and evaluated before being added to the results; only those meeting predefined standards are included. CAMEL provides several filter functions that can be passed to the self-instruct pipeline, and it also supports custom filters for tailored evaluation. A filter function returns `True` if the instruction is valid and `False` otherwise.

### Length Filter

`LengthFilter` rejects instructions whose length is less than `min_len` or greater than `max_len`. As the example output below indicates, length is counted in words rather than characters.

In [13]:
from camel.datagen.self_instruct import LengthFilter

length_filter = LengthFilter(min_len=5, max_len=50)

instructions = [
    "Sort the numbers in ascending order.",
    "Calculate the sum.",
    "Create a report that details the monthly expenses and savings in a spreadsheet."
]

filtered_instructions = [instr for instr in instructions if length_filter.apply(instr)]
print(filtered_instructions)

['Sort the numbers in ascending order.', 'Create a report that details the monthly expenses and savings in a spreadsheet.']
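The output above is consistent with length being counted in words: "Calculate the sum." is 18 characters long, which would fit a 5 to 50 character window, but it has only 3 words and is rejected. A minimal sketch under that assumption follows; `passes_length` is a hypothetical stand-in, not CAMEL's implementation.

```python
def passes_length(instruction, min_len=5, max_len=50):
    """Keep an instruction whose whitespace-separated word count is in [min_len, max_len]."""
    word_count = len(instruction.split())
    return min_len <= word_count <= max_len


instructions = [
    "Sort the numbers in ascending order.",  # 6 words
    "Calculate the sum.",  # 3 words
    "Create a report that details the monthly expenses and savings in a spreadsheet.",  # 13 words
]

print([instr for instr in instructions if passes_length(instr)])
```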


### Keyword Filter

`KeywordFilter` rejects instructions that contain any of the specified undesirable keywords.

In [14]:
from camel.datagen.self_instruct import KeywordFilter

keyword_filter = KeywordFilter(keywords=["ban", "prohibit", "forbid"])

instructions = [
    "Ban the use of plastic bags.",
    "Encourage recycling programs.",
    "Prohibit smoking in public areas."
]

filtered_instructions = [instr for instr in instructions if keyword_filter.apply(instr)]
print(filtered_instructions)

['Encourage recycling programs.']
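Note that "Ban" in the first instruction matched the keyword "ban", so matching appears to be case-insensitive. The sketch below reproduces that behavior with a plain substring test. Whether CAMEL matches substrings or whole words is an assumption here, and `passes_keywords` is a hypothetical helper.

```python
def passes_keywords(instruction, keywords):
    """Keep an instruction only if none of the keywords occurs in it (case-insensitive)."""
    lowered = instruction.lower()
    return not any(keyword.lower() in lowered for keyword in keywords)


keywords = ["ban", "prohibit", "forbid"]
instructions = [
    "Ban the use of plastic bags.",
    "Encourage recycling programs.",
    "Prohibit smoking in public areas.",
    "Slice a banana for the smoothie.",  # contains "ban" inside "banana"
]

for instr in instructions:
    print(passes_keywords(instr, keywords), instr)
```

With substring matching, the last instruction is rejected as well, because "banana" contains "ban". If that is not what you want, choose keywords that are unlikely to appear inside other words.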


### Punctuation Filter

`PunctuationFilter` rejects instructions that begin with a non-alphanumeric character.

In [15]:
from camel.datagen.self_instruct import PunctuationFilter

punctuation_filter = PunctuationFilter()

instructions = [
    "Sort the data by category.",
    "#Analyze the trends over time.",
    "*Create a summary of results."
]

filtered_instructions = [instr for instr in instructions if punctuation_filter.apply(instr)]
print(filtered_instructions)

['Sort the data by category.']


### Non-English Filter

`NonEnglishFilter` rejects instructions that do not begin with English letters.

In [16]:
from camel.datagen.self_instruct import NonEnglishFilter

non_english_filter = NonEnglishFilter()

instructions = [
    "Analyze the performance metrics.",
    "计算结果的统计数据.",
    "Test the new algorithm."
]

filtered_instructions = [instr for instr in instructions if non_english_filter.apply(instr)]
print(filtered_instructions)

['Analyze the performance metrics.', 'Test the new algorithm.']
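Both `PunctuationFilter` and `NonEnglishFilter` are described in terms of how an instruction begins, so each can be approximated with an anchored regular expression. This is a sketch under that reading, not CAMEL's code; the two function names are hypothetical.

```python
import re


def starts_with_english_letter(instruction):
    """True if the first character is an ASCII letter (A-Z or a-z)."""
    return re.match(r"[A-Za-z]", instruction) is not None


def starts_with_punctuation(instruction):
    """True if the first character is neither a word character nor whitespace."""
    # \w covers letters (including non-Latin ones), digits and underscore.
    return re.match(r"[^\w\s]", instruction) is not None


samples = [
    "Analyze the performance metrics.",
    "计算结果的统计数据.",
    "#Analyze the trends over time.",
    "3 ways to sort a list.",
]

for text in samples:
    print(starts_with_english_letter(text), starts_with_punctuation(text), text)
```

The two checks are not redundant: the Chinese sample does not start with punctuation but still fails the English-letter test, and so does the sample that starts with a digit.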


### ROUGE Similarity Filter

`RougeSimilarityFilter` rejects instructions that are too similar to existing instructions, based on ROUGE scores.

In [17]:
from camel.datagen.self_instruct import RougeSimilarityFilter

existing_instructions = [
    "Summarize the article.",
    "Write a brief overview of the text."
]

similarity_filter = RougeSimilarityFilter(existing_instructions, threshold=0.5)

instructions = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text."
]

filtered_instructions = [instr for instr in instructions if similarity_filter.apply(instr)]
print(filtered_instructions)

['Create a summary for the text.', 'Provide an analysis of the text.']
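To see why "Summarize the content." is dropped while the other two survive, it helps to compute a ROUGE score by hand. The text does not say which ROUGE variant the filter uses; the sketch below assumes the ROUGE-L F-measure (based on the longest common subsequence of tokens) and a simple lowercase alphanumeric tokenizer. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure between two strings."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def passes_similarity(instruction, existing, threshold=0.5):
    """Keep an instruction only if no existing one scores above the threshold."""
    return all(rouge_l_f1(instruction, other) <= threshold for other in existing)


existing_instructions = ["Summarize the article.", "Write a brief overview of the text."]
instructions = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text.",
]

for instr in instructions:
    best = max(rouge_l_f1(instr, other) for other in existing_instructions)
    print(f"{best:.3f}", passes_similarity(instr, existing_instructions), instr)
```

Under these assumptions, "Summarize the content." shares two of three tokens with "Summarize the article." and scores about 0.667, above the 0.5 threshold. The other two instructions peak at about 0.462 and are kept, which matches the output above.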


### Custom Filter Function

Additionally, you can implement your own filter function by subclassing `FilterFunction` and overriding `apply`.

In [18]:
from camel.datagen.self_instruct import FilterFunction

class CustomFilter(FilterFunction):

    def apply(self, instruction: str) -> bool:
        # apply your logic here
        logic = ...
        return logic
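As a concrete illustration of the template above, here is a hypothetical filter that rejects instructions containing a URL. So that the snippet runs without CAMEL installed, it defines a local stand-in for `FilterFunction` with the same `apply(instruction) -> bool` interface shown above; in the notebook you would import the real base class from `camel.datagen.self_instruct` instead.

```python
import re
from abc import ABC, abstractmethod


class FilterFunction(ABC):
    """Local stand-in for CAMEL's FilterFunction (interface assumed from the cell above)."""

    @abstractmethod
    def apply(self, instruction: str) -> bool:
        """Return True if the instruction is valid, False otherwise."""


class NoUrlFilter(FilterFunction):
    """Hypothetical custom filter: reject instructions that contain a URL."""

    _url_pattern = re.compile(r"https?://\S+", re.IGNORECASE)

    def apply(self, instruction: str) -> bool:
        # Valid only when no http(s) link appears anywhere in the text.
        return self._url_pattern.search(instruction) is None


no_url_filter = NoUrlFilter()
print(no_url_filter.apply("Summarize the main argument of the essay."))
print(no_url_filter.apply("Summarize the page at https://example.com/post."))
```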

## Instruction Filter

`InstructionFilter` manages all filter functions, and a custom `InstructionFilter` can be used to initialize the pipeline.

Start by adding the filter functions you want and configuring them.

In [19]:
filter_config = {
  "length": {"min_len": 5, "max_len": 100},
  "keyword": {"keywords": ["image", "video"]},
  "non_english": {},
  "rouge_similarity": {
      "existing_instructions": ["Some existing instructions"],
      "threshold": 0.6
  }
}

Then, initialize an `InstructionFilter`:

In [20]:
from camel.datagen.self_instruct import InstructionFilter
filters = InstructionFilter(filter_config)
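When tuning a configuration like this, it is useful to know which filter rejected a given instruction. The sketch below is not the `InstructionFilter` API: it chains plain Python predicates that loosely mirror the first three entries of `filter_config`, with length counted in words as an assumption, and reports the first one that fails. The helper name `first_failure` is hypothetical.

```python
def first_failure(instruction, checks):
    """Return the name of the first check that rejects the instruction, or None."""
    for name, check in checks.items():
        if not check(instruction):
            return name
    return None


# Stand-in predicates; each returns True when the instruction is acceptable.
checks = {
    "length": lambda text: 5 <= len(text.split()) <= 100,
    "keyword": lambda text: not any(k in text.lower() for k in ("image", "video")),
    "non_english": lambda text: text[:1].isascii() and text[:1].isalpha(),
}

samples = [
    "List three renewable energy sources and their benefits.",
    "Describe the image shown in the attached file please.",
    "Sum it.",
    "¿Cuál es la capital de Francia?",
]

for text in samples:
    print(first_failure(text, checks), "|", text)
```

An instruction is kept only when the result is `None`, that is, when every check passes.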

## Pipeline Setup with Custom `InstructionFilter`
CAMEL has some default filter functions inside the pipeline, but you can also choose your own!

In [24]:
pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
    instruction_filter=filters,    # pass in your InstructionFilter
)

Alternatively, if you want to keep the default filter functions but use a different configuration, you can pass in just the filter configuration instead.

Finally, generate!

In [25]:
pipeline.generate()

Pretty print the generated data content

In [26]:
import json

with open(data_output_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))

[
    {
        "id": "machine_task_1",
        "instruction": "Create a crossword puzzle with clues related to famous scientists and their discoveries.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "Across:\n1. Scientist known for the theory of relativity (6) - Einstein\n5. Unit of frequency, named after a German physicist (7) - Hertz\n\nDown:\n1. Father of genetics, studied pea plants (7) - Mendel\n2. Invented the telephone, Scottish-born (8) - Bell"
            },
            {
                "input": "",
                "output": "Across:\n1. Proposed the heliocentric model of the solar system (6) - Copernicus\n4. Formulated the laws of motion and universal gravitation (8) - Newton\n\nDown:\n1. Discovered the structure of DNA, worked with Watson (6) - Crick\n2. Proposed the theory of evolution by natural selection (9) - Darwin"
            }
        ]
    },
    {
        "id": "machine_task_2",
  

That's everything! Got questions about 🐫 CAMEL-AI? Join us on [Discord](https://discord.camel-ai.org)! Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we'd love to have you in the community! 🤝

Check out some of our other work:

1. 🐫 Creating Your First CAMEL Agent [free Colab](https://docs.camel-ai.org/cookbooks/create_your_first_agent.html)

2. Graph RAG Cookbook [free Colab](https://colab.research.google.com/drive/1uZKQSuu0qW6ukkuSv9TukLB9bVaS1H0U?usp=sharing)

3. 🧑‍⚖️ Create A Hackathon Judge Committee with Workforce [free Colab](https://colab.research.google.com/drive/18ajYUMfwDx3WyrjHow3EvUMpKQDcrLtr?usp=sharing)

4. 🔥 3 ways to ingest data from websites with Firecrawl & CAMEL [free Colab](https://colab.research.google.com/drive/1lOmM3VmgR1hLwDKdeLGFve_75RFW0R9I?usp=sharing)

5. 🦥 Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth [free Colab](https://colab.research.google.com/drive/1lYgArBw7ARVPSpdwgKLYnp_NEXiNDOd-?usp=sharingg)

Thanks from everyone at 🐫 CAMEL-AI


