camel.datagen.self_instruct.self_instruct

SelfInstructPipeline

class SelfInstructPipeline:

A pipeline to generate and manage machine-generated instructions for tasks, combining human and machine task samples. Parameters:

agent (ChatAgent): The agent used to interact and generate instructions.
seed (str): The path to the human-written instructions.
num_machine_instructions (int): Number of machine-generated instructions to generate. (default: 5)
data_output_path (Optional[str]): Path to save the generated data. (default: ./data_output.json)
human_to_machine_ratio (tuple): Ratio of human to machine tasks used for instruction generation. (default: (6, 2))
instruction_filter (Optional[InstructionFilter]): A filter to validate generated instructions. (default: None)
filter_config (Optional[Dict[str, Dict[str, Any]]]): Configuration for the filter functions registered in FILE_REGISTRY. (default: None)
stop_on_first_failure (bool): If True, stops checking filters after the first failure. (default: False)
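
The effect of stop_on_first_failure can be illustrated with a small, library-independent sketch. The `apply_filters` helper and the two example checks below are hypothetical and are not part of the pipeline's API; they only show the difference between collecting every failure and stopping at the first one.

```python
from typing import Callable, Dict, List


def apply_filters(
    instruction: str,
    filters: Dict[str, Callable[[str], bool]],
    stop_on_first_failure: bool = False,
) -> List[str]:
    """Return the names of the filters that the instruction fails.

    Hypothetical helper: each filter is a callable returning True on pass.
    """
    failed: List[str] = []
    for name, check in filters.items():
        if not check(instruction):
            failed.append(name)
            if stop_on_first_failure:
                # Skip the remaining filters once one has failed.
                break
    return failed


# Two illustrative checks (assumed, not the library's registered filters).
example_filters = {
    "non_empty": lambda text: bool(text.strip()),
    "short": lambda text: len(text) <= 20,
}

blank_and_long = " " * 30
all_failures = apply_filters(blank_and_long, example_filters)
first_failure = apply_filters(
    blank_and_long, example_filters, stop_on_first_failure=True
)
```

Stopping early saves work when filters are expensive, at the cost of not knowing which other filters would also have rejected the instruction.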

__init__

def __init__(
    self,
    agent: ChatAgent,
    seed: str,
    num_machine_instructions: int = 5,
    data_output_path: Optional[str] = './data_output.json',
    human_to_machine_ratio: tuple = (6, 2),
    instruction_filter: Optional[InstructionFilter] = None,
    filter_config: Optional[Dict[str, Dict[str, Any]]] = None,
    stop_on_first_failure: bool = False
):
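
As a rough sketch of how the constructor might be called, the snippet below collects the documented defaults in one dictionary and passes them through. The `build_pipeline` helper is hypothetical, and the import path inside it is an assumption inferred from the module name at the top of this page; check it against your installed version.

```python
from typing import Any, Dict

# Defaults copied from the signature above.
DEFAULT_KWARGS: Dict[str, Any] = {
    "num_machine_instructions": 5,
    "data_output_path": "./data_output.json",
    "human_to_machine_ratio": (6, 2),
    "instruction_filter": None,
    "filter_config": None,
    "stop_on_first_failure": False,
}


def merge_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return the documented defaults with any caller overrides applied."""
    return {**DEFAULT_KWARGS, **overrides}


def build_pipeline(agent: Any, seed: str, **overrides: Any) -> Any:
    """Hypothetical convenience wrapper around the constructor."""
    # Assumed import path; imported lazily so this sketch loads without camel.
    from camel.datagen.self_instruct import SelfInstructPipeline

    return SelfInstructPipeline(agent=agent, seed=seed, **merge_kwargs(**overrides))
```

A call such as `build_pipeline(agent, "seed_tasks.jsonl", num_machine_instructions=10)` would then override only the instruction count; the seed file name here is a placeholder.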

load_seed

def load_seed(self, path: str):

Load seed tasks from a file. Defaults to a predefined seed file if no path is provided. Parameters:

path (str): Path to the seed file.

sample_human_tasks

def sample_human_tasks(self, count: int):

Sample a specified number of human tasks from the loaded seed. Parameters:

count (int): Number of human tasks to sample.

Returns: List[dict]: A list of sampled human tasks.
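
Sampling without replacement from the loaded seed could look like the sketch below. The `sample_tasks` function and the task dictionaries are illustrative assumptions; the actual method may sample differently.

```python
import random
from typing import Dict, List


def sample_tasks(tasks: List[Dict[str, str]], count: int) -> List[Dict[str, str]]:
    """Draw up to `count` distinct tasks at random.

    Hypothetical stand-in for the sampling step; it caps the sample size
    at the number of available tasks so random.sample never raises.
    """
    k = min(count, len(tasks))
    return random.sample(tasks, k)


# Illustrative seed tasks (the real seed file's schema may differ).
seed_tasks = [{"instruction": f"Seed task {i}"} for i in range(8)]
human_sample = sample_tasks(seed_tasks, 6)
```

With the default human_to_machine_ratio of (6, 2), a count of 6 would presumably be the usual request for human tasks.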

sample_machine_tasks

def sample_machine_tasks(self, count: int):

Sample a specified number of machine tasks. Parameters:

count (int): Number of machine tasks to sample.

Returns: List[dict]: A list of sampled machine tasks, with placeholders if insufficient tasks are available.
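
The placeholder behaviour described above can be sketched as follows. The shape of the placeholder (a dictionary with an empty "instruction" value) is an assumption made for illustration, as is the `sample_with_placeholders` name.

```python
import random
from typing import Dict, List, Optional


def sample_with_placeholders(
    tasks: List[Dict[str, str]],
    count: int,
    placeholder: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """Sample `count` tasks, padding with placeholders when too few exist."""
    available = min(count, len(tasks))
    sampled = random.sample(tasks, available)
    # Assumed placeholder shape; the library's real placeholder may differ.
    filler = {"instruction": ""} if placeholder is None else placeholder
    # Copy the filler so each padded entry is an independent dictionary.
    sampled.extend(dict(filler) for _ in range(count - available))
    return sampled


machine_tasks = [{"instruction": "Summarize the paragraph."}]
padded = sample_with_placeholders(machine_tasks, 2)
```

Padding keeps the prompt layout stable early in a run, when few or no machine tasks have been generated yet.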

generate_machine_instruction

def generate_machine_instruction(self):

Returns: List: The prompt and a machine-generated instruction.

identify_instruction

def identify_instruction(self, instruction: str):

Determine if the given instruction is a classification task. Parameters:

instruction (str): The instruction to classify.

Returns: bool: True if the instruction is a classification task, otherwise False.

generate_machine_instances

def generate_machine_instances(self):

Generate instances for each machine task based on its classification status.

generate_machine_instance

def generate_machine_instance(self, instruction: str, classification: bool):

Generate instances for a given instruction. Parameters:

instruction (str): The instruction to create instances for.
classification (bool): Whether the instruction is a classification task.

Returns: List[dict]: A list of generated instances in input-output format.

parse_classification_output

def parse_classification_output(self, generated_text: str):

Parse the generated text for classification tasks into input-output pairs. Parameters:

generated_text (str): The raw text generated by the agent for classification tasks.

Returns: List[Dict[str, str]]: A list of dictionaries with 'input' and 'output' keys.

parse_non_classification_output

def parse_non_classification_output(self, generated_text: str):

Parse the generated text for non-classification tasks into input-output pairs. Parameters:

generated_text (str): The raw text generated by the agent for non-classification tasks.

Returns: List[Dict[str, str]]: A list of dictionaries with 'input' and 'output' keys.
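
To make the return shape concrete, here is a minimal parsing sketch. It assumes the agent's text consists of repeated "Input: ... Output: ..." blocks, which is a guess about the format and not something this page specifies; `parse_pairs` is a hypothetical name.

```python
import re
from typing import Dict, List

# Matches one "Input: ... Output: ..." block, stopping before the next
# "Input:" marker or at the end of the text. The format is an assumption.
_PAIR_PATTERN = re.compile(
    r"Input:\s*(.*?)\s*Output:\s*(.*?)\s*(?=Input:|\Z)",
    re.DOTALL,
)


def parse_pairs(generated_text: str) -> List[Dict[str, str]]:
    """Split raw generated text into a list of input-output dictionaries."""
    return [
        {"input": task_input.strip(), "output": task_output.strip()}
        for task_input, task_output in _PAIR_PATTERN.findall(generated_text)
    ]


raw_text = "Input: 2 + 2\nOutput: 4\nInput: 3 * 3\nOutput: 9\n"
pairs = parse_pairs(raw_text)
```

Text that contains no recognizable block yields an empty list in this sketch, so callers can treat an empty result as a failed generation.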

construct_data

def construct_data(self):

Save the machine-generated tasks to the specified output path in JSON format.
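
Writing the tasks out as JSON might look like the sketch below. The `save_tasks` helper and the record layout are assumptions for illustration; only the JSON format and the default output path come from this page.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_tasks(
    tasks: List[Dict[str, Any]],
    output_path: str = "./data_output.json",
) -> Path:
    """Serialize the task list to a JSON file and return the path written."""
    path = Path(output_path)
    # Create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(tasks, handle, indent=2, ensure_ascii=False)
    return path


# Hypothetical record layout for one machine-generated task.
example_tasks = [
    {
        "instruction": "Translate the sentence to French.",
        "is_classification": False,
        "instances": [{"input": "Good morning", "output": "Bonjour"}],
    }
]
```

Calling `save_tasks(example_tasks)` would write the list to ./data_output.json relative to the current working directory.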

generate

def generate(self, timeout_minutes=600):

Execute the entire pipeline to generate machine instructions and instances. Parameters:

timeout_minutes (int): Maximum time in minutes to run the generation process before timing out. (default: 600)
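
A time-bounded generation loop can be sketched as below. The `run_with_timeout` function and its `step` callable are hypothetical; whether the real method raises an error or simply stops when the limit is reached is not stated here, and this sketch just stops.

```python
import time
from typing import Callable, List


def run_with_timeout(
    step: Callable[[], str],
    target_count: int,
    timeout_minutes: float = 600,
) -> List[str]:
    """Call `step` until `target_count` results exist or time runs out."""
    # time.monotonic is unaffected by system clock changes.
    deadline = time.monotonic() + timeout_minutes * 60
    results: List[str] = []
    while len(results) < target_count:
        if time.monotonic() >= deadline:
            # Assumed behaviour: return what has been produced so far.
            break
        results.append(step())
    return results


completed = run_with_timeout(lambda: "instruction", target_count=3)
timed_out = run_with_timeout(lambda: "instruction", target_count=3, timeout_minutes=0)
```

The deadline is only checked between steps, so a single slow step can still overrun the limit in this sketch.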
