Self-instruct Data Generation Using Qwen#

⭐ Star us on Github, join our Discord or follow our X

The self-instruct pipeline is a technique for automatically generating instruction data for large language models (LLMs). Creating such instruction datasets by hand is time-consuming and expensive; the pipeline automates the process so that large numbers of instructions can be generated quickly.

Installation and Setup#

First, install the CAMEL package with all its dependencies:

[ ]:
!pip install "git+https://github.com/camel-ai/camel.git@c7bd39c898cb8d3bd434acd4219c3cb4f5f85ae2#egg=camel-ai[all]"

If you don’t have a Qwen API key, you can obtain one by following these steps:

Visit the Alibaba Cloud Model Studio Console (https://www.alibabacloud.com/en?_p_lc=1) and follow the on-screen instructions to activate the model services.

In the upper-right corner of the console, click on your account name and select API-KEY.

On the API Key management page, click on the Create API Key button to generate a new key.

[2]:
import os
from getpass import getpass

qwen_api_key = getpass('Enter your Qwen API key: ')
os.environ["QWEN_API_KEY"] = qwen_api_key
Enter your Qwen API key: ··········
[3]:
from camel.configs import QwenConfig
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

qwen_model = ModelFactory.create(
    model_platform=ModelPlatformType.QWEN,
    model_type=ModelType.QWEN_TURBO,
    model_config_dict=QwenConfig(temperature=0.2).as_dict(),
)

Basic Agent Setup#

[4]:
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

agent = ChatAgent(
    model=qwen_model,
)

Basic Pipeline Setup#

The pipeline works by starting with a small set of seed (human-written) instructions and then using an LLM to generate new instructions based on those seeds.

  • The seed instructions are typically stored in a JSON Lines (JSONL) file. Each line in the file represents a single instruction in JSON format.

  • The output is written to a single JSON file containing a list of task objects (this notebook reads it back with json.load), making it easy to parse and use for further tasks, such as training or fine-tuning language models.

Please replace seed_path with the path to your seed file, and replace data_output_path with your desired output location.

[5]:
import os
import requests

# Create directory for local data
os.makedirs('local_data', exist_ok=True)

# Update the URL to the raw file content
url = "https://raw.githubusercontent.com/camel-ai/camel/master/examples/synthetic_datagen/self_instruct/seed_tasks.jsonl"

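# Fetch the raw file and fail early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save the seed tasks locally
with open('local_data/seed_tasks.jsonl', 'wb') as file:
    file.write(response.content)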

[6]:
seed_path = 'local_data/seed_tasks.jsonl'
data_output_path = 'data_output.json'

The cell below shows some example instructions in the seed file. All seed files should follow this format.

[7]:
# Print the first 10 seed tasks
with open(seed_path, 'r') as file:
    for i, line in enumerate(file):
        print(line.strip())
        if i >= 9:
            break
{"id": "seed_task_0", "name": "breakfast_suggestion", "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?", "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."}], "is_classification": false}
{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}
{"id": "seed_task_2", "name": "one_sentence_description", "instruction": "Generate a one-sentence description for each of the following people.", "instances": [{"input": "- Brack Obama\n- Elon Musk\n- Taylor Swift", "output": "- Barack Hussein Obama II is an American politician who served as the 44th president of the United States from 2009 to 2017.\n- Elon Musk is the founder, CEO, and chief engineer of SpaceX; angel investor, CEO and product architect of Tesla, Inc.; founder of The Boring Company; co-founder of Neuralink and OpenAI; president of the Musk Foundation; and owner and CEO of Twitter, Inc.\n- Taylor Alison Swift is an American singer-songwriter."}], "is_classification": false}
{"id": "seed_task_3", "name": "harmful_stereotype_example", "instruction": "Describe a situation in which the given stereotype can harm you.", "instances": [{"input": "\"All Asians are smart!\"", "output": "Imagine you're a small child of Asian descent. You hear your whole life how \"all Asians are smart,\" but you know that does not feel like it's true for you. You have a hard time in school. Reading is difficult. Math is impossible. Everyone around you seems to pressure you because you are supposed to be \"smart.\" Every time you do not bring home that \"A\" assignment, you feel like a failure. Like you alone missed out on the gene that makes \"all Asians smart.\"\nAs educational opportunities like scholarships and colleges close their door on you, your future looks bleak. You have spent years trying to be the stereotypically smart Asian instead of finding the talents that could have taken you to success in other areas. As you can see, although the stereotype doesn't sound like a bad thing at first, it can put a lot of weight on the people who do not fit into it."}], "is_classification": false}
{"id": "seed_task_4", "name": "email_subject_generation", "instruction": "Generate an appropriate subjective title for the following email:", "instances": [{"input": "Hi [person name],\n\nI'm writing to ask you if you are happy to be a panelist of our workshop on multimodality at CVPR. The workshop will be held on June 20, 2023. \n\nBest,\n[my name]", "output": "Invitition to be a panelist for CVPR 2023 workshop on Multimodality"}], "is_classification": false}
{"id": "seed_task_5", "name": "interview_question_answering", "instruction": "How do you answer this question in a job interview?", "instances": [{"input": "What do you think about when you are alone in a car?", "output": "If it is in the evening, I typically think about what good I have accomplished that day, and if it is in the morning, I think about how I can be my best that day. I also try to practice gratitude and live in the present moment while I'm driving to avoid distractions."}], "is_classification": false}
{"id": "seed_task_6", "name": "new_year's_resolutions", "instruction": "Brainstorm a list of possible New Year's resolutions.", "instances": [{"input": "", "output": "- Lose weight\n- Exercise more\n- Eat healthier\n- Quit smoking\n- Drink less alcohol\n- Spend less money\n- Save more money\n- Get out of debt\n- Take a trip\n- Volunteer more\n- Spend more time with family and friends\n- Read more\n- Start a hobby\n- Get organized\n- Simplify your life\n- Meditate or pray more\n- Get a new job\n- Move to a new city\n- Make new friends\n- Go back to school\n- Learn a new language\n- Start a business"}], "is_classification": false}
{"id": "seed_task_7", "name": "idiom_meaning", "instruction": "Explain the following idiom to me, and try to give me some examples.", "instances": [{"input": "black sheep", "output": "Meaning: An outcast. Someone who doesn\u2019t fit in with the rest of the crowd. They take pride in being different. Thinks for themselves and doesn\u2019t care what no one else has to say. They tend to ride their own wave and are usually loners because no one understands them, but its okay because they like it that way.\nExample: He\u2019s the black sheep of the family."}], "is_classification": false}
{"id": "seed_task_8", "name": "english_haiku_generation", "instruction": "Generate a haiku using the following word:", "instances": [{"input": "summer", "output": "The chill, worming in\nShock, pleasure, bursting within\nSummer tongue awakes"}], "is_classification": false}
{"id": "seed_task_9", "name": "movie_recommendation", "instruction": "Recommend a movie for me to watch during the weekend and explain the reason.", "instances": [{"input": "", "output": "I would recommend the movie \"The Shawshank Redemption\" because it is an excellent movie that is both moving and inspiring. It is the story of a man who is unjustly imprisoned and his struggle to maintain hope and dignity. It is a great film to watch over the weekend because it will make you think about the human capacity for resilience and hope."}], "is_classification": false}
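
Before running the pipeline on your own seed file, it can help to check that every line follows this format. The sketch below is a minimal validator written for this tutorial, not part of CAMEL: the function name `validate_seed_line` is hypothetical, and the assumption that all five fields seen in the records above are required is ours.

```python
import json

# Fields observed in the seed records printed above (assumed to be required).
REQUIRED_FIELDS = {"id", "name", "instruction", "instances", "is_classification"}


def validate_seed_line(line: str) -> dict:
    """Parse one JSONL line and check it has the fields shown above."""
    task = json.loads(line)
    if not isinstance(task, dict):
        raise ValueError("each line must be a JSON object")
    missing = REQUIRED_FIELDS - set(task)
    if missing:
        raise ValueError(f"{task.get('id', '<no id>')}: missing fields {sorted(missing)}")
    if not isinstance(task["instances"], list) or not task["instances"]:
        raise ValueError(f"{task['id']}: 'instances' must be a non-empty list")
    for instance in task["instances"]:
        if "input" not in instance or "output" not in instance:
            raise ValueError(f"{task['id']}: each instance needs 'input' and 'output'")
    return task


# Example with an in-memory record; for a real file, loop over its lines instead.
sample_line = json.dumps({
    "id": "seed_task_1",
    "name": "antonym_relation",
    "instruction": "What is the relation between the given pairs?",
    "instances": [{
        "input": "Night : Day :: Right : Left",
        "output": "The relation between the given pairs is that they are opposites.",
    }],
    "is_classification": False,
})
task = validate_seed_line(sample_line)
print(task["name"], len(task["instances"]))
```

To check a whole file, call `validate_seed_line` on each non-empty line and report the line number of any `ValueError`.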

The self-instruct pipeline works iteratively. In each round:

  1. It selects a certain number of human-written instructions (num_human_sample) from the seed_path.

  2. It selects a certain number of machine-generated instructions (num_machine_sample) from previous rounds.

  3. It uses these selected instructions to guide the language model in generating new instructions.

  4. These new instructions are added to the pool of machine-generated instructions, and the process repeats until the desired number of instructions is generated.

The human_to_machine_ratio helps control the balance between human guidance and the model’s creativity throughout this process. By adjusting this ratio, you can influence the quality and diversity of the generated instructions.
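
The loop described above can be sketched in plain Python. This is an illustration of the idea only, not CAMEL's implementation: `pick_examples`, `run_rounds`, and `make_fake_generator` are hypothetical names, and the fake generator stands in for the LLM call so that the sketch runs offline.

```python
import random


def pick_examples(human_tasks, machine_tasks, ratio, rng):
    """Sample the in-context examples for one round (steps 1 and 2)."""
    num_human, num_machine = ratio
    human = rng.sample(human_tasks, min(num_human, len(human_tasks)))
    machine = rng.sample(machine_tasks, min(num_machine, len(machine_tasks)))
    return human + machine


def run_rounds(human_tasks, target, generate_fn, ratio=(6, 2), seed=0, max_rounds=100):
    """Repeat sampling and generation until `target` new instructions exist."""
    rng = random.Random(seed)
    machine_tasks = []
    for _ in range(max_rounds):  # guard against a generator that only repeats itself
        if len(machine_tasks) >= target:
            break
        examples = pick_examples(human_tasks, machine_tasks, ratio, rng)
        candidate = generate_fn(examples)  # step 3: ask the model
        if candidate not in human_tasks and candidate not in machine_tasks:
            machine_tasks.append(candidate)  # step 4: grow the pool
    return machine_tasks


def make_fake_generator():
    """Hypothetical stand-in for the LLM call; returns a numbered string."""
    count = 0

    def generate(examples):
        nonlocal count
        count += 1
        return f"Synthetic instruction {count} (prompted with {len(examples)} examples)"

    return generate


seeds = [f"Seed instruction {i}" for i in range(10)]
for instruction in run_rounds(seeds, target=5, generate_fn=make_fake_generator()):
    print(instruction)
```

In this sketch the first rounds draw fewer machine-generated examples than the ratio asks for, because the machine pool starts empty and only fills up as instructions are accepted.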

Feel free to change num_human_sample and num_machine_sample; both are passed to the pipeline later as human_to_machine_ratio.

[8]:
num_human_sample = 6
num_machine_sample = 2

Set target_num_instructions to the number of machine-generated instructions you want.

[9]:
target_num_instructions = 5

Pass everything to our pipeline.

[10]:
pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
)

Now run the pipeline! The generated data is written to the file at data_output_path.

[11]:
pipeline.generate()

Pretty-print the generated data:

[12]:
import json

with open(data_output_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))
[
    {
        "id": "machine_task_1",
        "instruction": "Design a simple logo that represents both unity and diversity for a community organization.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "A circular logo with a pattern composed of different colored puzzle pieces fitting together perfectly, symbolizing unity and diversity within the community."
            },
            {
                "input": "",
                "output": "An emblem featuring a tree with leaves of various shapes and colors, signifying growth, inclusion, and unity among diverse members of the community."
            }
        ]
    },
    {
        "id": "machine_task_2",
        "instruction": "Create a step-by-step guide explaining how to perform a specific magic trick.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "1. Begin by showing the audience a regular deck of cards.\n2. Shuffle the deck and ask a volunteer from the audience to draw a card and remember it.\n3. Have the volunteer place the card back into the deck without showing anyone.\n4. Shuffle the deck again and spread the cards face down on the table.\n5. Slowly reveal each card, until you find the selected card."
            },
            {
                "input": "",
                "output": "1. Ask a participant to choose a coin and hold it in their closed hand.\n2. Cover the participant's hand with a cloth.\n3. While covered, secretly switch the coin with a duplicate coin.\n4. Lift the cloth to reveal the coin has seemingly disappeared and then reappeared elsewhere."
            }
        ]
    },
    {
        "id": "machine_task_3",
        "instruction": "Write a short story that incorporates the given three random words as key elements.",
        "is_classification": false,
        "instances": [
            {
                "input": "[\"mystery\", \"river\", \"midnight\"",
                "output": "Late one \"midnight\" under the glow of a full moon, Sarah found herself drawn to the mysterious \"river\" that flowed near her cottage. Legend had it that the river held many \"mysteries\" within its depths. As she approached the water's edge, she noticed a faint light flickering across the water. Her curiosity piqued, she ventured forth, only to discover an old boat half-submerged in the river. Determined to uncover the secrets it held, she spent hours exploring the boat and the riverbank, until the first light of dawn broke the darkness."
            },
            {
                "input": "[\"castle\", \"dragon\", \"princess\"",
                "output": "In a faraway kingdom, Princess Amelia lived in a grand \"castle\" atop a hill. One day, a fearsome \"dragon\" began terrorizing the nearby villages. The king offered half his kingdom and Amelia's hand in marriage to whoever could slay the beast. Many knights tried and failed, but a humble farmer's son named Thomas succeeded. Instead of claiming the reward, he asked only for passage to the castle. There, he and Amelia discovered a shared love for adventure and soon embarked on a quest together, with the \"dragon\" as their guide, leaving behind the life of luxury in the \"castle\"."
            }
        ]
    },
    {
        "id": "machine_task_4",
        "instruction": "Explain the cultural significance of a traditional dance in a few paragraphs.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "The Flamenco dance, originating from the Andalusian region of Spain, holds deep cultural significance. It is a passionate and expressive art form that combines singing, guitar playing, dancing, and handclaps. Flamenco embodies the spirit of Andalusia, reflecting the region's history of diverse cultural influences including Roman, Jewish, Moorish, and Christian traditions. This dance form often conveys intense emotions such as sorrow, joy, and longing, mirroring the struggles and triumphs of the Andalusian people throughout history. Flamenco is not just a dance; it is a means of storytelling, preserving the collective memory and identity of the Andalusian community."
            },
            {
                "input": "",
                "output": "The Haka, a traditional war dance of the M\u0101ori people of New Zealand, is a powerful display of strength, unity, and cultural pride. Originally performed by warriors before a battle to intimidate opponents and boost morale, today it is performed on various occasions such as weddings, funerals, and sporting events. The Haka involves rhythmic stamping, aggressive facial expressions, and synchronized movements, making it a visually striking performance. It serves as a way to honor ancestors, express grief, and celebrate achievements, reinforcing the M\u0101ori community's connection to their heritage and each other."
            }
        ]
    },
    {
        "id": "machine_task_5",
        "instruction": "Compose a poem that captures the essence of a serene countryside landscape.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "In the quiet valley where the river flows,\nGolden fields sway gently in the breeze.\nMountains stand guard, in eternal repose,\nWhile wildflowers dance beneath whispering trees.\n\nThe sun sets slow, painting skies anew,\nA canvas vast, in hues of pink and gold.\nUnder the stars, the night whispers true,\nOf peace that the waking world can't hold."
            },
            {
                "input": "",
                "output": "Beneath the wide and open sky so blue,\nA lone barn stands against the green.\nThe lazy brook meanders through,\nIts waters clear, its banks unseen.\n\nBirdsong fills the air with sweet refrain,\nAs shadows lengthen, cool and long.\nIn this tranquil scene, no pain,\nJust nature's calm, a soothing song.\n\nThe gentle rustle of the leaves above,\nA soft caress from the summer wind.\nHere, worries fade, and cares dissolve,\nIn this serene and peaceful bind."
            }
        ]
    }
]
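
For training or fine-tuning, a nested structure like the one above is often flattened into one record per instance. The following is a minimal sketch of that step; `to_training_records` is a hypothetical helper, the record layout is one common choice rather than a CAMEL format, and the sample outputs are shortened placeholders.

```python
import json


def to_training_records(tasks):
    """Turn task objects into flat instruction/input/output records."""
    records = []
    for task in tasks:
        for instance in task.get("instances", []):
            records.append({
                "instruction": task["instruction"],
                "input": instance.get("input", ""),
                "output": instance.get("output", ""),
            })
    return records


# A shortened task in the same shape as the output above.
tasks = [{
    "id": "machine_task_1",
    "instruction": "Design a simple logo that represents both unity and diversity for a community organization.",
    "is_classification": False,
    "instances": [
        {"input": "", "output": "A circular logo made of interlocking puzzle pieces."},
        {"input": "", "output": "An emblem featuring a tree with leaves of various shapes."},
    ],
}]

records = to_training_records(tasks)
for record in records:
    print(json.dumps(record))
```

With real data, replace `tasks` with the list loaded from data_output_path.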

Filter functions#

Newly generated instructions are filtered and evaluated before being added to the results; only those that meet predefined standards are kept. CAMEL provides several filter functions that can be passed to the self-instruct pipeline, and it also supports custom filters for tailored evaluation. A filter function returns True if the instruction is valid and False otherwise.

Length Filter#

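LengthFilter rejects instructions shorter than min_len or longer than max_len. In the example below, the limits behave as word counts: the 3-word instruction is dropped, while the 13-word one passes a max_len of 50.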

[13]:
from camel.datagen.self_instruct import LengthFilter

length_filter = LengthFilter(min_len=5, max_len=50)

instructions = [
    "Sort the numbers in ascending order.",
    "Calculate the sum.",
    "Create a report that details the monthly expenses and savings in a spreadsheet."
]

filtered_instructions = [instr for instr in instructions if length_filter.apply(instr)]
print(filtered_instructions)
['Sort the numbers in ascending order.', 'Create a report that details the monthly expenses and savings in a spreadsheet.']
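
The output above is consistent with the length being measured in whitespace-separated words: the third instruction is well over 50 characters long, yet it passes with max_len=50. A minimal stand-in under that assumption (the function `passes_length_filter` is hypothetical and does not use CAMEL) behaves the same way on these inputs:

```python
def passes_length_filter(instruction: str, min_len: int = 5, max_len: int = 50) -> bool:
    """Keep instructions whose whitespace-separated word count is in range."""
    word_count = len(instruction.split())
    return min_len <= word_count <= max_len


instructions = [
    "Sort the numbers in ascending order.",
    "Calculate the sum.",
    "Create a report that details the monthly expenses and savings in a spreadsheet.",
]
kept = [instr for instr in instructions if passes_length_filter(instr)]
print(kept)
```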

Keyword Filter#

KeywordFilter rejects instructions that contain any of the specified undesirable keywords. As the example shows, matching ignores case: "Ban" matches the keyword "ban".

[14]:
from camel.datagen.self_instruct import KeywordFilter

keyword_filter = KeywordFilter(keywords=["ban", "prohibit", "forbid"])

instructions = [
    "Ban the use of plastic bags.",
    "Encourage recycling programs.",
    "Prohibit smoking in public areas."
]

filtered_instructions = [instr for instr in instructions if keyword_filter.apply(instr)]
print(filtered_instructions)
['Encourage recycling programs.']
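
A keyword check of this kind can be sketched as a case-insensitive substring test. This is an assumption about the matching rule, not a statement of how CAMEL implements it, and `passes_keyword_filter` is a hypothetical name.

```python
def passes_keyword_filter(instruction: str, keywords: list[str]) -> bool:
    """Reject an instruction if any keyword occurs in it, ignoring case."""
    lowered = instruction.lower()
    return not any(keyword.lower() in lowered for keyword in keywords)


keywords = ["ban", "prohibit", "forbid"]
instructions = [
    "Ban the use of plastic bags.",
    "Encourage recycling programs.",
    "Prohibit smoking in public areas.",
    "Plan an urban garden.",  # contains "ban" inside "urban"
]
kept = [instr for instr in instructions if passes_keyword_filter(instr, keywords)]
print(kept)
```

Note that plain substring matching, as used in this sketch, also rejects "Plan an urban garden." because "urban" contains "ban". If that is not wanted, matching on whole words is the usual alternative.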

Punctuation Filter#

PunctuationFilter rejects instructions that begin with a non-alphanumeric character.

[15]:
from camel.datagen.self_instruct import PunctuationFilter

punctuation_filter = PunctuationFilter()

instructions = [
    "Sort the data by category.",
    "#Analyze the trends over time.",
    "*Create a summary of results."
]

filtered_instructions = [instr for instr in instructions if punctuation_filter.apply(instr)]
print(filtered_instructions)
['Sort the data by category.']

Non-English Filter#

NonEnglishFilter rejects instructions that do not begin with an English letter.

[16]:
from camel.datagen.self_instruct import NonEnglishFilter

non_english_filter = NonEnglishFilter()

instructions = [
    "Analyze the performance metrics.",
    "计算结果的统计数据.",
    "Test the new algorithm."
]

filtered_instructions = [instr for instr in instructions if non_english_filter.apply(instr)]
print(filtered_instructions)
['Analyze the performance metrics.', 'Test the new algorithm.']
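
The punctuation and non-English filters both look at the start of the instruction, but they are not the same test. The sketch below expresses the two descriptions as standalone checks (hypothetical helper names, not CAMEL code) to show where they differ; how CAMEL itself treats digits or leading whitespace is not covered here.

```python
import re


def starts_with_punctuation(instruction: str) -> bool:
    """True if the first character is neither a letter nor a digit."""
    return bool(instruction) and not instruction[0].isalnum()


def starts_with_english_letter(instruction: str) -> bool:
    """True if the first character is an ASCII letter (A-Z or a-z)."""
    return re.match(r"[A-Za-z]", instruction) is not None


samples = [
    "Sort the data by category.",
    "#Analyze the trends over time.",
    "计算结果的统计数据.",
    "3 ways to clean a dataset.",
]
for text in samples:
    print(text, starts_with_punctuation(text), starts_with_english_letter(text))
```

In this sketch, an instruction that starts with a Chinese character or a digit does not count as starting with punctuation, but it also does not start with an English letter, so only the second check flags it.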

ROUGE Similarity Filter#

RougeSimilarityFilter rejects instructions that are too similar to existing instructions, as measured by ROUGE scores.

[17]:
from camel.datagen.self_instruct import RougeSimilarityFilter

existing_instructions = [
    "Summarize the article.",
    "Write a brief overview of the text."
]

similarity_filter = RougeSimilarityFilter(existing_instructions, threshold=0.5)

instructions = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text."
]

filtered_instructions = [instr for instr in instructions if similarity_filter.apply(instr)]
print(filtered_instructions)
['Create a summary for the text.', 'Provide an analysis of the text.']
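
To see why "Summarize the content." is dropped while the other two survive, it helps to compute a ROUGE score by hand. The sketch below implements ROUGE-L F1 (based on the longest common subsequence of tokens) with a simple lowercase tokenizer and no stemming. Which ROUGE variant, tokenizer, and comparison (`>` or `>=`) CAMEL uses are assumptions here, and all function names are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, existing: list[str], threshold: float) -> bool:
    """Keep a candidate only if no existing instruction scores above threshold."""
    return all(rouge_l_f1(candidate, ref) <= threshold for ref in existing)


existing_instructions = ["Summarize the article.", "Write a brief overview of the text."]
instructions = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text.",
]
kept = [i for i in instructions if is_novel(i, existing_instructions, threshold=0.5)]
print(kept)
```

Under these assumptions, "Summarize the content." shares two of three tokens in order with "Summarize the article." (F1 = 2/3), which exceeds the 0.5 threshold. The other two candidates share at most three tokens with a seven-token reference (F1 = 6/13, about 0.46), so they are kept.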

Custom Filter Function#

Additionally, you can implement your own filter function.

[18]:
from camel.datagen.self_instruct import FilterFunction

class CustomFilter(FilterFunction):

    def apply(self, instruction: str) -> bool:
        # apply your logic here
        logic = ...
        return logic
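
As a concrete illustration of the template above, here is a small custom filter that runs without CAMEL installed. The local `FilterFunction` class is a stand-in that mirrors only the `apply` interface shown above; in the notebook you would subclass the imported `FilterFunction` instead. `TerminalPunctuationFilter` is a hypothetical example, not a built-in filter.

```python
from abc import ABC, abstractmethod


class FilterFunction(ABC):
    """Local stand-in for camel.datagen.self_instruct.FilterFunction."""

    @abstractmethod
    def apply(self, instruction: str) -> bool:
        """Return True if the instruction is valid, False otherwise."""


class TerminalPunctuationFilter(FilterFunction):
    """Hypothetical filter: keep instructions that end with . ? ! or :"""

    def __init__(self, allowed_endings: str = ".?!:"):
        self.allowed_endings = allowed_endings

    def apply(self, instruction: str) -> bool:
        stripped = instruction.strip()
        return bool(stripped) and stripped[-1] in self.allowed_endings


custom_filter = TerminalPunctuationFilter()
instructions = [
    "Generate a haiku using the following word:",
    "Explain the following idiom to me",
    "Recommend a movie for me to watch during the weekend.",
]
kept = [instr for instr in instructions if custom_filter.apply(instr)]
print(kept)
```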

Instruction Filter#

InstructionFilter manages all filter functions, and a custom InstructionFilter can be used to initialize the pipeline.

Start by choosing the filter functions you want and configuring them:

[19]:
filter_config = {
    "length": {"min_len": 5, "max_len": 100},
    "keyword": {"keywords": ["image", "video"]},
    "non_english": {},
    "rouge_similarity": {
        "existing_instructions": ["Some existing instructions"],
        "threshold": 0.6,
    },
}

Then, initialize an InstructionFilter:

[20]:
from camel.datagen.self_instruct import InstructionFilter
filters = InstructionFilter(filter_config)
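
Conceptually, a configuration like this maps filter names to keyword arguments, and an instruction is kept only if every configured filter accepts it. The sketch below shows that pattern with simplified stand-in checks. The registry, the factory functions, and the default values are all hypothetical, the ROUGE entry is left out for brevity, and none of this is CAMEL's actual implementation.

```python
from typing import Callable

Check = Callable[[str], bool]


def make_length_check(min_len: int = 5, max_len: int = 200) -> Check:
    """Keep instructions whose word count is within [min_len, max_len]."""
    return lambda text: min_len <= len(text.split()) <= max_len


def make_keyword_check(keywords=()) -> Check:
    """Reject instructions containing any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return lambda text: not any(k in text.lower() for k in lowered)


def make_english_check() -> Check:
    """Keep instructions whose first character is an ASCII letter."""
    return lambda text: text[:1].isascii() and text[:1].isalpha()


# Hypothetical registry: config key -> factory that builds a check from kwargs.
FILTER_FACTORIES: dict[str, Callable[..., Check]] = {
    "length": make_length_check,
    "keyword": make_keyword_check,
    "non_english": make_english_check,
}


def build_checks(config: dict) -> list[Check]:
    """Instantiate one check per config entry."""
    return [FILTER_FACTORIES[name](**kwargs) for name, kwargs in config.items()]


def accept(instruction: str, checks: list[Check]) -> bool:
    """An instruction is kept only if every check returns True."""
    return all(check(instruction) for check in checks)


config = {
    "length": {"min_len": 5, "max_len": 100},
    "keyword": {"keywords": ["image", "video"]},
    "non_english": {},
}
checks = build_checks(config)
for text in [
    "Describe the main idea of the paragraph in one sentence.",
    "Describe what happens in the video clip shown here.",
    "Summarize.",
]:
    print(accept(text, checks), text)
```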

Pipeline Setup with Custom InstructionFilter#

CAMEL has some default filter functions inside the pipeline, but you can also choose your own!

[24]:
pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
    instruction_filter=filters,    # pass in your InstructionFilter
)

Alternatively, if you want to keep the default filter functions but use a different configuration, you can pass in just the filter configuration.

Finally, generate!

[25]:
pipeline.generate()

Pretty-print the generated data:

[26]:
import json

with open(data_output_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))
[
    {
        "id": "machine_task_1",
        "instruction": "Create a crossword puzzle with clues related to famous scientists and their discoveries.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "Across:\n1. Scientist known for the theory of relativity (6) - Einstein\n5. Unit of frequency, named after a German physicist (7) - Hertz\n\nDown:\n1. Father of genetics, studied pea plants (7) - Mendel\n2. Invented the telephone, Scottish-born (8) - Bell"
            },
            {
                "input": "",
                "output": "Across:\n1. Proposed the heliocentric model of the solar system (6) - Copernicus\n4. Formulated the laws of motion and universal gravitation (8) - Newton\n\nDown:\n1. Discovered the structure of DNA, worked with Watson (6) - Crick\n2. Proposed the theory of evolution by natural selection (9) - Darwin"
            }
        ]
    },
    {
        "id": "machine_task_2",
        "instruction": "Design a workout plan tailored to someone with a specific fitness goal, such as weight loss or muscle gain.",
        "is_classification": false,
        "instances": [
            {
                "input": "Fitness Goal - Weight Loss",
                "output": "1. Monday: Cardio (30 min jog) and Light Weights\n2. Tuesday: Yoga (60 min session)\n3. Wednesday: HIIT (30 min session)\n4. Thursday: Swimming (30 min)\n5. Friday: Cycling (45 min session)\n6. Saturday: Rest or Light Activity (Walk, Stretch)\n7. Sunday: Rest or Light Activity (Walk, Stretch)"
            },
            {
                "input": "Fitness Goal - Muscle Gain",
                "output": "1. Monday: Chest and Triceps\n2. Tuesday: Back and Biceps\n3. Wednesday: Rest or Light Activity (Stretching)\n4. Thursday: Legs\n5. Friday: Shoulders and Abs\n6. Saturday: Full Body Circuit Training\n7. Sunday: Rest or Light Activity (Stretching)"
            }
        ]
    },
    {
        "id": "machine_task_3",
        "instruction": "Compose a short story that incorporates the given three random words as key elements.",
        "is_classification": false,
        "instances": [
            {
                "input": "[\"river\", \"midnight\", \"mystery\"]",
                "output": "At midnight, the river seemed to whisper secrets as thick fog enveloped its banks. Along its shore, Sarah stood, her heart pounding with the thrill of the mystery that awaited her. Legend had it that the river held many mysteries within its depths, and tonight, she planned to uncover them. As she approached the water's edge, she noticed a faint light flickering across the water. Her curiosity piqued, she ventured forth, only to discover an old boat half-submerged in the river. Determined to unveil the secrets it held, she spent hours exploring the boat and the riverbank, until the first light of dawn broke the darkness."
            },
            {
                "input": "[\"castle\", \"dragon\", \"princess\"]",
                "output": "In a faraway kingdom, Princess Amelia lived in a grand castle atop a hill. One day, a fearsome dragon began terrorizing the nearby villages. The king offered half his kingdom and Amelia's hand in marriage to whoever could slay the beast. Many knights tried and failed, but a humble farmer's son named Thomas succeeded. Instead of claiming the reward, he asked only for passage to the castle. There, he and Amelia discovered a shared love for adventure and soon embarked on a quest together, with the dragon as their guide, leaving behind the life of luxury in the castle."
            }
        ]
    },
    {
        "id": "machine_task_4",
        "instruction": "Generate a list of potential names for a new restaurant, focusing on themes of nature and sustainability.",
        "is_classification": false,
        "instances": [
            {
                "input": "",
                "output": "- Green Haven\n- Eco Eats\n- Nature's Nook\n- Sustainable Spoon\n- Leafy Lounge"
            },
            {
                "input": "",
                "output": "- The Verdant Table\n- Organic Oasis\n- Farm to Fork Feast\n- Earthy Eateria\n- Clean Cuisine Corner"
            }
        ]
    },
    {
        "id": "machine_task_5",
        "instruction": "Design a logo for a local business that reflects its industry and values.",
        "is_classification": false,
        "instances": [
            {
                "input": "Business - \"Green Garden Cafe\", Industry - \"Caf\u00e9\", Values - \"Sustainability, Community\"",
                "output": "A green leaf intertwined with a coffee cup, symbolizing the caf\u00e9's commitment to sustainability. Underneath, the words \"Green Garden Cafe\" in a friendly, inviting font, with a small icon of people gathered around a table to represent community."
            },
            {
                "input": "Business - \"Tech Innovators Inc.\", Industry - \"Technology\", Values - \"Innovation, Leadership, Cutting-edge\"",
                "output": "A modern, sleek design featuring a lightbulb made up of interconnected circuit boards, symbolizing innovation and cutting-edge technology. The words \"Tech Innovators Inc.\" in bold, contemporary font underneath, with a small icon of a leader's profile to represent leadership."
            }
        ]
    }
]

That’s everything: Got questions about 🐫 CAMEL-AI? Join us on Discord! Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we’d love to have you in the community! 🤝

Check out some of our other work:

  1. 🐫 Creating Your First CAMEL Agent free Colab

  2. Graph RAG Cookbook free Colab

  3. 🧑‍⚖️ Create A Hackathon Judge Committee with Workforce free Colab

  4. 🔥 3 ways to ingest data from websites with Firecrawl & CAMEL free Colab

  5. 🦥 Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth free Colab

Thanks from everyone at 🐫 CAMEL-AI
