camel.datagen.self_instruct package

Subpackages

Submodules

camel.datagen.self_instruct.self_instruct module

class camel.datagen.self_instruct.self_instruct.AgentResponse(*, answer: bool)

Bases: BaseModel

answer: bool

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=bool, required=True, description='Indicates whether the task is classification (True/False).')}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class camel.datagen.self_instruct.self_instruct.SelfInstructPipeline(agent: ChatAgent, seed: str, num_machine_instructions: int = 5, data_output_path: str | None = './data_output.json', human_to_machine_ratio: tuple = (6, 2), instruction_filter: InstructionFilter | None = None, filter_config: Dict[str, Dict[str, Any]] | None = None, stop_on_first_failure: bool = False)

Bases: object

A pipeline to generate and manage machine-generated instructions for tasks, combining human and machine task samples.

Parameters:
  • agent (ChatAgent) – The agent used to interact and generate instructions.

  • seed (str) – The path to the human-written instructions.

  • num_machine_instructions (int) – Number of machine-generated instructions to generate. (default: 5)

  • data_output_path (Optional[str]) – Path to save the generated data. (default: ./data_output.json)

  • human_to_machine_ratio (tuple) – Ratio of human to machine tasks used for instruction generation. (default: (6, 2))

  • instruction_filter (Optional[InstructionFilter]) – A filter to validate generated instructions. (default: None)

  • filter_config (Optional[Dict[str, Dict[str, Any]]]) – Configuration for the filter functions registered in FILE_REGISTRY. (default: None)

  • stop_on_first_failure (bool) – If True, stops checking filters after the first failure. (default: False)
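
A minimal construction sketch based only on the signature above. The import path camel.agents.ChatAgent, the argument-free ChatAgent() call, and the seed file name used in the note below are assumptions; a real setup will usually need a configured model backend. The imports sit inside the helper (a hypothetical name) so the snippet loads even where camel is not installed.

```python
def build_pipeline(seed_path: str):
    """Build a SelfInstructPipeline with the documented defaults spelled out."""
    # Assumption: these import paths match the installed camel version.
    from camel.agents import ChatAgent
    from camel.datagen.self_instruct import SelfInstructPipeline

    # Assumption: a default model backend is available to ChatAgent().
    agent = ChatAgent()
    return SelfInstructPipeline(
        agent=agent,
        seed=seed_path,                         # path to human-written tasks
        num_machine_instructions=5,             # documented default
        data_output_path="./data_output.json",  # documented default
        human_to_machine_ratio=(6, 2),          # documented default
    )
```

Under these assumptions, build_pipeline("seed_tasks.jsonl").generate() would run the whole pipeline and write the results to data_output_path.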

construct_data()

Save the machine-generated tasks to the specified output path in JSON format.

generate(timeout_minutes=600)

Execute the entire pipeline to generate machine instructions and instances.

Parameters:

timeout_minutes (int) – Maximum time in minutes to run the generation process before timing out. (default: 600)

generate_machine_instance(instruction: str, classification: bool) -> list[dict]

Generate instances for a given instruction.

Parameters:
  • instruction (str) – The instruction to create instances for.

  • classification (bool) – Whether the instruction is a classification task.

Returns:

A list of generated instances in input-output format.

Return type:

List[dict]

generate_machine_instances()

Generate instances for each machine task based on its classification status.

generate_machine_instruction() -> List

Generate a machine instruction using the agent.

Combines human and machine tasks based on the configured ratio to create a prompt for instruction generation.

Returns:

The prompt and a machine-generated instruction.

Return type:

List

identify_instruction(instruction: str) -> bool

Determine if the given instruction is a classification task.

Parameters:

instruction (str) – The instruction to classify.

Returns:

True if the instruction is a classification task, otherwise False.

Return type:

bool

load_seed(path: str)

Load seed tasks from a file. Defaults to a predefined seed file if no path is provided.

Parameters:

path (str) – Path to the seed file.

Raises:

FileNotFoundError – If the seed file does not exist.

parse_classification_output(generated_text: str) -> List[Dict[str, str]]

Parse the generated text for classification tasks into input-output pairs.

Parameters:

generated_text (str) – The raw text generated by the agent for classification tasks.

Returns:

A list of dictionaries with 'input' and 'output' keys.

Return type:

List[Dict[str, str]]
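
As an illustration of the input-output shape described above, the following sketch parses text laid out in the "Class label:" format that output_first_template_for_clf asks for. It is not the library's parser: the function name and the splitting rule are assumptions chosen for clarity.

```python
import re
from typing import Dict, List


def parse_labeled_examples(generated_text: str) -> List[Dict[str, str]]:
    """Turn 'Class label: X' blocks into {'input': ..., 'output': ...} dicts."""
    pairs: List[Dict[str, str]] = []
    # Split at every line that starts with "Class label:"; the text before
    # the first marker (index 0) carries no example and is dropped.
    chunks = re.split(r"^[ \t]*Class label:", generated_text, flags=re.MULTILINE)[1:]
    for chunk in chunks:
        # First line of the chunk is the label, the rest is the example input.
        label, _, body = chunk.partition("\n")
        label, body = label.strip(), body.strip()
        if label and body:
            pairs.append({"input": body, "output": label})
    return pairs
```

Blocks that have a label but no example text are skipped in this sketch; whether the library keeps them is not stated in this reference.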

parse_non_classification_output(generated_text: str) -> List[Dict[str, str]]

Parse the generated text for non-classification tasks into input-output pairs.

Parameters:

generated_text (str) – The raw text generated by the agent for non-classification tasks.

Returns:

A list of dictionaries with 'input' and 'output' keys.

Return type:

List[Dict[str, str]]

sample_human_tasks(count: int) -> List[dict]

Sample a specified number of human tasks from the loaded seed.

Parameters:

count (int) – Number of human tasks to sample.

Returns:

A list of sampled human tasks.

Return type:

List[dict]

sample_machine_tasks(count: int) -> List[dict]

Sample a specified number of machine tasks.

Parameters:

count (int) – Number of machine tasks to sample.

Returns:

A list of sampled machine tasks, with placeholders if insufficient tasks are available.

Return type:

List[dict]
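
The placeholder behavior can be pictured with the sketch below. The function name, the optional rng argument, and the shape of the placeholder entry ({"instruction": ""}) are hypothetical; this reference does not say what the real placeholders look like.

```python
import random
from typing import Dict, List, Optional


def sample_with_placeholders(
    tasks: List[Dict[str, str]],
    count: int,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Return `count` tasks, padding with placeholders when too few exist."""
    rng = rng or random.Random()
    if len(tasks) >= count:
        # Enough tasks available: sample without replacement.
        return rng.sample(tasks, count)
    # Too few tasks: keep all of them and pad with empty placeholders
    # (hypothetical placeholder shape).
    missing = count - len(tasks)
    return list(tasks) + [{"instruction": ""} for _ in range(missing)]
```

Padding keeps the prompt layout stable during early iterations, when few machine tasks have been generated yet.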

camel.datagen.self_instruct.templates module

class camel.datagen.self_instruct.templates.SelfInstructTemplates

Bases: object

Contains prompt templates for self-instruct data generation.

clf_template = ' \'\'\'Can the following task be regarded as a classification task with finite output labels?\n\n    Task: Given my personality and the job, tell me if I would be suitable.\n    Is it classification? Yes\n    \n    Task: Give me an example of a time when you had to use your sense of humor.\n    Is it classification? No\n    \n    Task: Replace the placeholders in the given text with appropriate named entities.\n    Is it classification? No\n    \n    Task: Fact checking - tell me if the statement is true, false, or unknown, based on your knowledge and common sense.\n    Is it classification? Yes\n    \n    Task: Return the SSN number for the person.\n    Is it classification? No\n    \n    Task: Detect if the Reddit thread contains hate speech.\n    Is it classification? Yes\n    \n    Task: Analyze the sentences below to identify biases.\n    Is it classification? No\n    \n    Task: Select the longest sentence in terms of the number of words in the paragraph, output the sentence index.\n    Is it classification? Yes\n    \n    Task: Find out the toxic word or phrase in the sentence.\n    Is it classification? No\n    \n    Task: Rank these countries by their population.\n    Is it classification? No\n    \n    Task: You are provided with a news article, and you need to identify all the categories that this article belongs to. Possible categories include: Music, Sports, Politics, Tech, Finance, Basketball, Soccer, Tennis, Entertainment, Digital Game, World News. Output its categories one by one, seperated by comma.\n    Is it classification? Yes\n    \n    Task: Given the name of an exercise, explain how to do it.\n    Is it classification? No\n    \n    Task: Select the oldest person from the list.\n    Is it classification? Yes\n    \n    Task: Find the four smallest perfect numbers.\n    Is it classification? No\n    \n    Task: Does the information in the document supports the claim? 
You can answer "Support" or "Unsupport".\n    Is it classification? Yes\n    \n    Task: Create a detailed budget for the given hypothetical trip.\n    Is it classification? No\n    \n    Task: Given a sentence, detect if there is any potential stereotype in it. If so, you should explain the stereotype. Else, output no.\n    Is it classification? No\n    \n    Task: Explain the following idiom to me, and try to give me some examples.\n    Is it classification? No\n    \n    Task: Is there anything I can eat for a breakfast that doesn\'t include eggs, yet includes protein, and has roughly 700-1000 calories?\n    Is it classification? No\n    \n    Task: Answer the following multiple choice question. Select A, B, C, or D for the final answer.\n    Is it classification? Yes\n    \n    Task: Decide whether the syllogism is logically sound.\n    Is it classification? Yes\n    \n    Task: How can individuals and organizations reduce unconscious bias?\n    Is it classification? No\n    \n    Task: What are some things you can do to de-stress?\n    Is it classification? No\n    \n    Task: Find out the largest one from a set of numbers. Output the number directly.\n    Is it classification? Yes\n    \n    Task: Replace the <mask> token in the text with proper words that are consistent with the context. You can use multiple words for each <mask> token.\n    Is it classification? No\n    \n    Task: Write a cover letter based on the given facts.\n    Is it classification? No\n    \n    Task: Identify the pos tag of the word in the given sentence.\n    Is it classification? Yes\n    \n    Task: Write a program to compute the sum of integers from k to n.\n    Is it classification? No\n    \n    Task: In this task, you need to compare the meaning of the two sentences and tell if they are the same. Output yes or no.\n    Is it classification? Yes\n    \n    Task: To make the pairs have the same analogy, write the fourth word.\n    Is it classification? 
No\n    \n    Task: Given a set of numbers, find all possible subsets that sum to a given number.\n    Is it classification? No\n    \n    '
input_first_template_for_gen = "You will be given a task, \n    Your job is to generate at most two example instances demonstrating how to \n    perform this task. For each instance:\n    - If the task requires input (as an actual example of the task), provide it.\n    - If the task can be answered directly without requiring input, omit the input section.\n    \n    Example 1\n    Input: [Provide input here if needed, otherwise omit this section]\n    Output: [Provide the correct output]\n    \n    Example 2\n    Input: [Provide input here if needed, otherwise omit this section]\n    Output: [Provide the correct output]\n\n    Do not include any additional commentary, explanations, or more than two instances.\n        \n    Below are some examples:\n\n    Task: Which exercises are best for reducing belly fat at home?\n    Output:\n    - Lying Leg Raises\n    - Leg In And Out\n    - Plank\n    - Side Plank\n    - Sit-ups\n\n    Task: Extract all the country names in the paragraph, list them separated by commas.\n    Example 1\n    Paragraph: Dr. No is the sixth novel by the English author Ian Fleming to feature his British Secret Service agent James Bond. Written at Fleming's Goldeneye estate in Jamaica, it was first published in the United Kingdom by Jonathan Cape in 1958. In the novel Bond looks into the disappearance in Jamaica of two fellow MI6 operatives who had been investigating Doctor No. Bond travels to No's Caribbean island and meets Honeychile Rider, who is there to collect shells. They are captured and taken to a luxurious facility carved into a mountain. The character of Doctor No, the son of a German missionary and a Chinese woman, was influenced by Sax Rohmer's Fu Manchu stories. Dr. 
No was the first of Fleming's novels to face widespread negative reviews in Britain, but it was received more favourably in the United States.\n    Output: English, British, Jamaica, the United Kingdom, German, Chinese, Britain, the United States.\n\n    Task: Converting 85 F to Celsius.\n    Output: 85°F = 29.44°C\n\n    Task: Sort the given list ascendingly. \n    Example 1\n    List: [10, 92, 2, 5, -4, 92, 5, 101]\n    Output: [-4, 2, 5, 5, 10, 92, 92, 101]\n    Example 2\n    Input 2 - List: [9.99, 10, -5, -1000, 5e6, 999]\n    Output: [-1000, -5, 9.99, 10, 999, 5e6]\n\n    Task: Suggest a better and more professional rephrasing of the following sentence.\n    Example 1\n    Sentence: This house is surprisingly not constructed very well, and you probably need more money to fix it after you buy it. If you ask me, I would suggest you to consider other candidates.\n    Output: This house does not seem to be constructed well, so you may need to spend more money to fix it after you purchase it. I would suggest that you look at other properties.\n    Example 2\n    Sentence: Just so you know, we did an experiment last week and found really surprising results - language model can improve itself!\n    Output: Our experiments last week demonstrated surprising results, proving that the language model can improve itself.\n\n    Task: Read the following paragraph and answer a math question about the paragraph. You need to write out the calculation for getting the final answer.\n    Example 1\n    Paragraph: Gun violence in the United States results in tens of thousands of deaths and injuries annually, and was the leading cause of death for children 19 and younger in 2020. In 2018, the most recent year for which data are available as of 2021, the Centers for Disease Control and Prevention's (CDC) National Center for Health Statistics reports 38,390 deaths by firearm, of which 24,432 were by suicide. 
The rate of firearm deaths per 100,000 people rose from 10.3 per 100,000 in 1999 to 12 per 100,000 in 2017, with 109 people dying per day or about 14,542 homicides in total, being 11.9 per 100,000 in 2018. In 2010, there were 19,392 firearm-related suicides, and 11,078 firearm-related homicides in the U.S. In 2010, 358 murders were reported involving a rifle while 6,009 were reported involving a handgun; another 1,939 were reported with an unspecified type of firearm. In 2011, a total of 478,400 fatal and nonfatal violent crimes were committed with a firearm.\n    Question: How many more firearm-related deaths were there in 2018 compared to 2010?\n    Output:\n    38390 - (19392 + 11078) = 38390 - 30470 = 7920. \n    So, in 2018, there were 7920 more deaths by firearm than in 2010.\n\n    Task: Write Python code to solve this leetcode problem.\n    Example 1\n    Problem: You are given two non-empty linked lists representing two non-negative integers. The digits are stored in reverse order, and each of their nodes contains a single digit. Add the two numbers and return the sum as a linked list. You may assume the two numbers do not contain any leading zero, except the number 0 itself.\n    Output:\n    class Solution(object):\n        def addTwoNumbers(self, l1, l2):\n            carry = 0\n            root = n = ListNode(0)\n            while l1 or l2 or carry:\n                v1 = v2 = 0\n                if l1:\n                    v1 = l1.val\n                    l1 = l1.next\n                if l2:\n                    v2 = l2.val\n                    l2 = l2.next\n                carry, val = divmod(v1+v2+carry, 10)\n                n.next = ListNode(val)\n                n = n.next\n            return root.next\n\n    Task: Solve the equation and find the value of X. 
Show your steps.\n    Example 1\n    Equation: 10X + 5 = 10\n    Output: 10X = 5,  X = 0.5\n    Example 2\n    Equation: X + Y + 120 = 100\n    Output: X + Y = -20, X = -20 - Y\n\n    Task: Write a program to compute the sum of integers from k to n.\n    Output:\n    def sum(k, n):\n        sum = 0\n        for i in range(k, n+1):\n            sum += i\n        return sum\n\n    Task: Select the oldest person from the given list.\n    Example 1\n    List: George Washington, Confucius, Michael Jordan, Michelangelo\n    Output: Confucious\n    Example 2\n    List: Alan Turing, Geoffrey Hinton, Yann LeCun, Yoshua Bengio\n    Output: Alan Turing\n\n    Task: Turn down a job offer by sending an email to a recruiter explaining the reason.\n    Output: Hi  [Recruiter],\n    Thank you so much for the generous offer to join your team. As we discussed, I’ve admired the company for a number of years, and am a proud endorser of its products. However, after further consideration of where I currently am in my career, I’ve decided to accept an offer at another company.\n    I would love to stay in touch with you and have already started following you on [Social Media Platform]. Again, thank you so much for your time and consideration.\n    Thanks again,\n    [Your Name]\n\n    Task: {instruction}\n    "
output_first_template_for_clf = 'You are given a classification instruction. \n    \n    Produce multiple labeled examples following the format below. For each example:\n    - Begin with a "Class label:" line identifying one possible category.\n    - Follow that with one line specifying the example input (e.g., "Sentence:", "Dialogue:", "Opinion:", or "Email:").\n    - The content after these lines should serve as an illustrative example of that label.\n    \n    Do not restate or include the "Task:" line. Do not add additional commentary. Just produce the labeled examples.\n\n    Example format (no initial task line, task will be provided) when task is Task: Classify the sentiment of the sentence into positive, negative, or mixed.:\n        Class label: mixed\n        Sentence: I enjoy the flavor of the restaurant but their service is too slow.\n        Class label: Positive\n        Sentence: I had a great day today. The weather was beautiful and I spent time with friends and family.\n        Class label: Negative\n        Sentence: I was really disappointed by the latest superhero movie. I would not recommend it to anyone.\n    \n    Below are more examples:\n    \n    Task: Given a dialogue, classify whether the user is satisfied with the service. You should respond with "Satisfied" or "Unsatisfied".\n    Class label: Satisfied\n    Dialogue:\n    - Agent: Thank you for your feedback. We will work to improve our service in the future.\n    - Customer: I am happy with the service you provided. Thank you for your help.\n    Class label: Unsatisfied\n    Dialogue:\n    - Agent: I am sorry we will cancel that order for you, and you will get a refund within 7 business days.\n    - Customer: oh that takes too long. 
I want you to take quicker action on this.\n\n    Task: Given some political opinions, classify whether the person belongs to Democrats or Republicans.\n    Class label: Democrats\n    Opinion: I believe that everyone should have access to quality healthcare regardless of their income level.\n    Class label: Republicans\n    Opinion: I believe that people should be able to keep more of their hard-earned money and should not be taxed at high rates.\n\n    Task: Tell me if the following email is a promotion email or not.\n    Class label: Promotion\n    Email: Check out our amazing new sale! We\'ve got discounts on all of your favorite products.\n    Class label: Not Promotion\n    Email: We hope you are doing well. Let us know if you need any help.\n\n    Task: Detect if the Reddit thread contains hate speech.\n    Class label: Hate Speech\n    Thread: All people of color are stupid and should not be allowed to vote.\n    Class label: Not Hate Speech\n    Thread: The best way to cook a steak on the grill.\n\n    Task:  Does the information in the document supports the claim? You can answer "Support" or "Unsupport".\n    Class label: Unsupport\n    Document: After a record-breaking run that saw mortgage rates plunge to all-time lows and home prices soar to new highs, the U.S. housing market finally is slowing. While demand and price gains are cooling, any correction is likely to be a modest one, housing economists and analysts say. No one expects price drops on the scale of the declines experienced during the Great Recession.\n    Claim: The US housing market is going to crash soon.\n    Class label: Support\n    Document: The U.S. housing market is showing signs of strain, with home sales and prices slowing in many areas. Mortgage rates have risen sharply in recent months, and the number of homes for sale is increasing. 
This could be the beginning of a larger downturn, with some economists predicting a potential housing crash in the near future.\n    Claim: The US housing market is going to crash soon.\n\n    Task: Answer the following multiple-choice question. Select A, B, C, or D for the final answer.\n    Class label: C\n    Question: What is the capital of Germany?\n    A. London\n    B. Paris\n    C. Berlin\n    D. Rome\n    Class label: D\n    Question: What is the largest planet in our solar system?\n    A) Earth\n    B) Saturn\n    C) Mars\n    D) Jupiter\n    Class label: A\n    Question: What is the process by which plants make their own food through photosynthesis?\n    A) Respiration\n    B) Fermentation\n    C) Digestion\n    D) Metabolism\n    Class label: B\n    Question: Who wrote the novel "The Great Gatsby"?\n    A) Ernest Hemingway\n    B) F. Scott Fitzgerald\n    C) J.D. Salinger\n    D) Mark Twain\n\n    Task: You need to read a code and detect if there is a syntax error or not. Output true if there is an error, output false if there is not.\n    Class label: true\n    Code:\n    def quick_sort(arr):\n        if len(arr) < 2\n            return arr\n    Class label: False\n    Code:\n    def calculate_average(numbers):\n        total = 0\n        for number in numbers:\n            total += number\n        return total / len(numbers)\n\n    Task: You are provided with a news article, and you need to identify all the categories that this article belongs to. Possible categories include Sports and Politics. 
Output its categories one by one, separated by a comma.\n    Class label: Sports\n    Article: The Golden State Warriors have won the NBA championship for the second year in a row.\n    Class label: Politics\n    Article: The United States has withdrawn from the Paris Climate Agreement.\n    Class label: Politics, Sports\n    Article: The government has proposed cutting funding for youth sports programs.\n\n    Task: Given a credit card statement, the cardholder\'s spending habits, and the account balance, classify whether the cardholder is at risk of defaulting on their payments or not.\n    Class label: At risk\n    Credit card statement: Purchases at high-end clothing stores and luxury hotels.\n    Cardholder\'s spending habits: Frequent purchases at luxury brands and high-end establishments.\n    Account balance: Over the credit limit and multiple missed payments.\n    Class label: Not at risk\n    Credit card statement: Purchases at grocery stores and gas stations.\n    Cardholder\'s spending habits: Regular purchases for necessary expenses and occasional dining out.\n    Account balance: Slightly below the credit limit and no missed payments.\n\n    Task: Given a social media post, the hashtags used, and a topic. classify whether the post is relevant to the topic or not.\n    Class label: Relevant\n    Post: I can\'t believe the government is still not taking action on climate change. It\'s time for us to take matters into our own hands.\n    Hashtags: #climatechange #actnow\n    Topic: Climate change\n    Class label: Not relevant \n    Post: I just bought the new iPhone and it is amazing!\n    Hashtags: #apple #technology\n    Topic: Travel\n\n    Task: The answer will be \'yes\' if the provided sentence contains an explicit mention that answers the given question. Otherwise, answer \'no\'. 
\n    Class label: Yes\n    Sentence: Jack played basketball for an hour after school.\n    Question: How long did Jack play basketball?\n    Class label: No\n    Sentence: The leaders of the Department of Homeland Security now appear before 88 committees and subcommittees of Congress.\n    Question: How often are they required to appear?\n\n    Task: Tell me what\'s the second largest city by population in Canada.\n    Class label: Montreal\n\n    Task: Classifying different types of mathematical equations, such as linear, and quadratic equations, based on the coefficients and terms in the equation.\n    Class label: Linear equation\n    Equation: y = 2x + 5\n    Class label: Quadratic equation\n    Equation: y = x^2 - 4x + 3\n\n    Task: Tell me the first number of the given list.\n    Class label: 1\n    List: 1, 2, 3\n    Class label: 2\n    List: 2, 9, 10\n\n    Task: Which of the following is not an input type? (a) number (b) date (c) phone number (d) email address (e) all of these are valid inputs.\n    Class label: (e)\n\n    Now, using the given instruction, produce several formatted examples accordingly:\n    Task: {instruction}\n    '
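
Both generation templates end with a {instruction} placeholder. The sketch below shows one way such a placeholder could be filled, using a shortened stand-in string rather than the real template; TEMPLATE and build_prompt are hypothetical names, and the use of str.format is an assumption about how the pipeline substitutes the value.

```python
# Shortened stand-in for the tail of a template (not the library's string).
TEMPLATE = (
    "Now, using the given instruction, produce several formatted examples "
    "accordingly:\n    Task: {instruction}\n    "
)


def build_prompt(template: str, instruction: str) -> str:
    """Substitute the instruction into the template's named placeholder."""
    return template.format(instruction=instruction)


prompt = build_prompt(TEMPLATE, "Classify the sentiment of the sentence.")
```

Note that str.format treats any other braces in a template as placeholders too, so a template containing literal braces would need them doubled or a different substitution method.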

Module contents

class camel.datagen.self_instruct.FilterFunction

Bases: ABC

A base abstract class for filter functions.

Subclasses must implement the apply method, which determines whether a given instruction passes the filter criteria.

abstract apply(instruction: str) -> bool

Evaluate the given instruction based on the filter's criteria.

Parameters:

instruction (str) – The instruction to evaluate.

Returns:

True if the instruction passes the filter, False otherwise.

Return type:

bool
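
A custom filter only needs to subclass the base class and implement apply. The sketch below uses a local stand-in for FilterFunction so that it runs without camel installed; EndsWithPeriodFilter is a hypothetical example, not a filter shipped with the package.

```python
from abc import ABC, abstractmethod


class FilterFunction(ABC):
    """Local stand-in mirroring the documented abstract base class."""

    @abstractmethod
    def apply(self, instruction: str) -> bool:
        """Return True if the instruction passes the filter."""


class EndsWithPeriodFilter(FilterFunction):
    """Hypothetical filter: passes instructions that end with a period."""

    def apply(self, instruction: str) -> bool:
        return instruction.strip().endswith(".")
```

With the real package, the same subclass would derive from camel.datagen.self_instruct.FilterFunction and could be registered through InstructionFilter.add_filter.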

class camel.datagen.self_instruct.InstructionFilter(filters_config: Dict[str, Dict[str, Any]], stop_on_first_failure: bool = False)

Bases: object

add_filter(filter_function: FilterFunction)

Add a custom filter function to the InstructionFilter. This allows adding filters that are not in the registry.

Parameters:

filter_function (FilterFunction) – The filter function to be added.

filter(prompt: str, instruction: str, return_details: bool = False) -> bool | Tuple[bool, List[str]]

Check if the given instruction passes all filter functions.

Parameters:
  • prompt (str) – The prompt used to generate the instruction.

  • instruction (str) – The instruction to evaluate.

  • return_details (bool) – If True, returns a tuple (bool, List[str]) where the list contains the names of filters that failed. (default: False)

Returns:

True if the instruction passes all filters, False otherwise. If return_details is True, a tuple (bool, List[str]) is returned instead.

Return type:

bool | Tuple[bool, List[str]]
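
The interaction between stop_on_first_failure and return_details can be sketched as follows. This is a simplified model, not the library's code: run_filters is a hypothetical helper, filters are plain callables keyed by name, and the prompt argument of the real method is left out.

```python
from typing import Callable, Dict, List, Tuple, Union


def run_filters(
    filters: Dict[str, Callable[[str], bool]],
    instruction: str,
    stop_on_first_failure: bool = False,
    return_details: bool = False,
) -> Union[bool, Tuple[bool, List[str]]]:
    """Apply each named check and report which ones failed."""
    failed: List[str] = []
    for name, check in filters.items():
        if not check(instruction):
            failed.append(name)
            if stop_on_first_failure:
                break  # skip the remaining checks after the first failure
    passed = not failed
    return (passed, failed) if return_details else passed


# Hypothetical checks used only for this illustration.
checks = {
    "length": lambda s: len(s.split()) >= 3,
    "starts_alpha": lambda s: s[:1].isalpha(),
}
```

Stopping early saves work on expensive checks, at the cost of a shorter list of failed filter names in the details.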

class camel.datagen.self_instruct.KeywordFilter(keywords: List[str])

Bases: FilterFunction

Filters instructions that contain specific undesirable keywords.

Parameters:

keywords (List[str]) – A list of keywords to filter out.

apply(instruction: str) -> bool

Filter the instruction.

Parameters:

instruction (str) – The instruction to be filtered.

Returns:

True if the instruction does not contain any of the keywords.

Return type:

bool
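
A minimal sketch of the keyword check. Case-insensitive substring matching is an assumption made here; this reference does not specify how the library compares keywords. KeywordFilterSketch is a hypothetical stand-in class.

```python
from typing import List


class KeywordFilterSketch:
    """Stand-in: rejects instructions containing any listed keyword."""

    def __init__(self, keywords: List[str]) -> None:
        # Assumption: matching ignores case.
        self.keywords = [keyword.lower() for keyword in keywords]

    def apply(self, instruction: str) -> bool:
        text = instruction.lower()
        # Pass only when none of the keywords occurs as a substring.
        return not any(keyword in text for keyword in self.keywords)
```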

class camel.datagen.self_instruct.LengthFilter(min_len: int = 5, max_len: int = 200)

Bases: FilterFunction

Filters instructions based on their word count.

Parameters:
  • min_len (int) – The minimum word count required for an instruction. (default: 5)

  • max_len (int) – The maximum word count allowed for an instruction. (default: 200)

apply(instruction: str) -> bool

Filter the instruction.

Parameters:

instruction (str) – The instruction to be filtered.

Returns:

True if the length of the instruction is within the range [min_len, max_len].

Return type:

bool
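
The inclusive range check can be sketched as below. Counting words by splitting on whitespace is an assumption; LengthFilterSketch is a hypothetical stand-in that reuses the documented defaults.

```python
class LengthFilterSketch:
    """Stand-in: passes instructions whose word count is in [min_len, max_len]."""

    def __init__(self, min_len: int = 5, max_len: int = 200) -> None:
        self.min_len = min_len
        self.max_len = max_len

    def apply(self, instruction: str) -> bool:
        # Assumption: words are whitespace-separated tokens.
        word_count = len(instruction.split())
        return self.min_len <= word_count <= self.max_len
```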

class camel.datagen.self_instruct.NonEnglishFilter

Bases: FilterFunction

Filters instructions that do not begin with English letters.

apply(instruction: str) -> bool

Filter the instruction.

Parameters:

instruction (str) – The instruction to be filtered.

Returns:

True if the instruction starts with an English letter.

Return type:

bool
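
One way to express "starts with an English letter" is a regular expression anchored at the first character. The sketch below is an illustration under that assumption; the class name is hypothetical and the library's exact pattern is not shown in this reference.

```python
import re


class NonEnglishFilterSketch:
    """Stand-in: passes instructions whose first character is A-Z or a-z."""

    def apply(self, instruction: str) -> bool:
        # re.match only tests the beginning of the string.
        return bool(re.match(r"[A-Za-z]", instruction))
```

In this sketch an empty string or an instruction with leading whitespace is rejected.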

class camel.datagen.self_instruct.PunctuationFilter

Bases: FilterFunction

Filters instructions that begin with a non-alphanumeric character.

apply(instruction: str) -> bool

Filter the instruction.

Parameters:

instruction (str) – The instruction to be filtered.

Returns:

True if the instruction does not start with punctuation.

Return type:

bool
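
A sketch of the leading-character check using str.isalnum. Rejecting empty strings is an assumption of this sketch, and PunctuationFilterSketch is a hypothetical stand-in.

```python
class PunctuationFilterSketch:
    """Stand-in: passes instructions whose first character is alphanumeric."""

    def apply(self, instruction: str) -> bool:
        # Assumption: an empty instruction does not pass.
        return bool(instruction) and instruction[0].isalnum()
```

Unlike the English-letter check, str.isalnum also accepts digits and non-ASCII letters, so the two filters are complementary rather than redundant.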

class camel.datagen.self_instruct.RougeSimilarityFilter(existing_instructions: List[str], threshold: float = 0.7)

Bases: FilterFunction

Filters instructions that are too similar to existing instructions based on ROUGE scores.

Parameters:
  • existing_instructions (List[str]) – A list of existing instructions to compare against.

  • threshold (float) – The similarity threshold for filtering. (default: 0.7)

apply(instruction: str) -> bool

Filter the instruction.

Parameters:

instruction (str) – The instruction to be filtered.

Returns:

True if the instruction's similarity to any existing instruction is below the threshold.

Return type:

bool
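
To make the thresholding concrete, the sketch below computes a ROUGE-L F1 score from the longest common subsequence of lowercased, whitespace-split tokens and rejects an instruction when any existing instruction reaches the threshold. The scoring variant, the tokenization, and the reading of "any" as "reject if at least one existing instruction is too similar" are all assumptions; the library may rely on a dedicated ROUGE package with different tokenization.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


class RougeSimilarityFilterSketch:
    """Stand-in: passes instructions that stay below the threshold
    against every existing instruction."""

    def __init__(self, existing_instructions: List[str], threshold: float = 0.7) -> None:
        self.existing_instructions = existing_instructions
        self.threshold = threshold

    def apply(self, instruction: str) -> bool:
        return all(
            rouge_l_f1(instruction, existing) < self.threshold
            for existing in self.existing_instructions
        )
```

A lower threshold makes the filter stricter and pushes the generated set toward more diverse instructions.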

class camel.datagen.self_instruct.SelfInstructPipeline(agent: ChatAgent, seed: str, num_machine_instructions: int = 5, data_output_path: str | None = './data_output.json', human_to_machine_ratio: tuple = (6, 2), instruction_filter: InstructionFilter | None = None, filter_config: Dict[str, Dict[str, Any]] | None = None, stop_on_first_failure: bool = False)

Bases: object

The package-level entry for camel.datagen.self_instruct.self_instruct.SelfInstructPipeline. Its description, parameters, and methods are identical to those documented under that class in the self_instruct module section above.