SelfInstructPipeline

class SelfInstructPipeline:

A pipeline to generate and manage machine-generated instructions for tasks, combining human and machine task samples.

Parameters:

  • agent (ChatAgent): The agent used to interact and generate instructions.
  • seed (str): The path to the human-written instructions.
  • num_machine_instructions (int): Number of machine-generated instructions to generate. (default: 5)
  • data_output_path (Optional[str]): Path to save the generated data. (default: './data_output.json')
  • human_to_machine_ratio (tuple): Ratio of human to machine tasks used for instruction generation. (default: (6, 2))
  • instruction_filter (Optional[InstructionFilter]): A filter to validate generated instructions. (default: None)
  • filter_config (Optional[Dict[str, Dict[str, Any]]]): Configuration for the filter functions registered in FILTER_REGISTRY. (default: None)
  • stop_on_first_failure (bool): If True, stops checking filters after the first failure. (default: False)
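
The interaction between filter_config and stop_on_first_failure can be pictured with a small standalone sketch. It does not use the library: SKETCH_REGISTRY, the filter names "max_length" and "no_keyword", and failed_filters are hypothetical names chosen for illustration, and the only thing taken from the parameter list above is the shape of the config (filter name mapped to keyword arguments) and the early-exit flag.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: maps a filter name to a factory that builds a
# predicate from keyword arguments. These names are illustrative only.
SKETCH_REGISTRY: Dict[str, Callable[..., Callable[[str], bool]]] = {
    "max_length": lambda limit=20: lambda text: len(text.split()) <= limit,
    "no_keyword": lambda banned=(): lambda text: not any(
        word in text.lower() for word in banned
    ),
}


def failed_filters(
    instruction: str,
    filter_config: Dict[str, Dict[str, Any]],
    stop_on_first_failure: bool = False,
) -> List[str]:
    """Return the names of the filters that reject the instruction."""
    failures: List[str] = []
    for name, kwargs in filter_config.items():
        predicate = SKETCH_REGISTRY[name](**kwargs)
        if not predicate(instruction):
            failures.append(name)
            if stop_on_first_failure:
                break  # skip the remaining filters
    return failures


config = {"max_length": {"limit": 5}, "no_keyword": {"banned": ("image",)}}
text = "Describe the image in at least ten full sentences"
print(failed_filters(text, config))
print(failed_filters(text, config, stop_on_first_failure=True))
```

With the flag off, every configured filter runs and all failures are reported; with it on, checking ends at the first rejection, which saves work when filters are expensive.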

__init__

def __init__(
    self,
    agent: ChatAgent,
    seed: str,
    num_machine_instructions: int = 5,
    data_output_path: Optional[str] = './data_output.json',
    human_to_machine_ratio: tuple = (6, 2),
    instruction_filter: Optional[InstructionFilter] = None,
    filter_config: Optional[Dict[str, Dict[str, Any]]] = None,
    stop_on_first_failure: bool = False
):

load_seed

def load_seed(self, path: str):

Load seed tasks from a file. Defaults to a predefined seed file if no path is provided.

Parameters:

  • path (str): Path to the seed file.

sample_human_tasks

def sample_human_tasks(self, count: int):

Sample a specified number of human tasks from the loaded seed.

Parameters:

  • count (int): Number of human tasks to sample.

Returns:

List[dict]: A list of sampled human tasks.

sample_machine_tasks

def sample_machine_tasks(self, count: int):

Sample a specified number of machine tasks.

Parameters:

  • count (int): Number of machine tasks to sample.

Returns:

List[dict]: A list of sampled machine tasks, with placeholders if insufficient tasks are available.
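
As a rough illustration of sampling with placeholder padding, the sketch below draws as many tasks as are available and fills the remainder with a dummy record. The helper name sample_with_placeholders and the PLACEHOLDER record are assumptions; the content of the real placeholder is not documented here.

```python
import random
from typing import Dict, List

# Hypothetical placeholder record; the real placeholder content may differ.
PLACEHOLDER = {"instruction": "<placeholder>"}


def sample_with_placeholders(tasks: List[Dict], count: int) -> List[Dict]:
    """Sample up to `count` tasks, padding with placeholders when short."""
    available = min(count, len(tasks))
    sampled = random.sample(tasks, available)
    # Copy the placeholder so callers cannot mutate the shared constant.
    padding = [dict(PLACEHOLDER) for _ in range(count - available)]
    return sampled + padding


machine_tasks = [{"instruction": "Summarize the paragraph."}]
result = sample_with_placeholders(machine_tasks, 2)
print(result)
```

Padding keeps the returned list at a fixed length, so a prompt template that expects a set number of machine examples still works early in a run, before many machine tasks exist.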

generate_machine_instruction

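def generate_machine_instruction(self):

Generate a machine instruction with the agent, using a prompt built from sampled human and machine tasks.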

Returns:

List: The prompt and a machine-generated instruction.
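
One way such a prompt could be assembled from a human-to-machine ratio is sketched below. The function build_prompt and the "Task N:" wording are hypothetical; the actual prompt template used by the pipeline is not shown on this page.

```python
import random
from typing import Dict, List, Tuple


def build_prompt(
    human_tasks: List[Dict],
    machine_tasks: List[Dict],
    ratio: Tuple[int, int] = (6, 2),
) -> str:
    """Mix human and machine instructions into a numbered few-shot prompt."""
    human_count, machine_count = ratio
    examples = random.sample(human_tasks, min(human_count, len(human_tasks)))
    examples += random.sample(machine_tasks, min(machine_count, len(machine_tasks)))
    lines = [f"Task {i}: {task['instruction']}" for i, task in enumerate(examples, 1)]
    # Leave the final slot open for the model to complete.
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)


human = [{"instruction": f"Human task {i}"} for i in range(8)]
machine = [{"instruction": f"Machine task {i}"} for i in range(3)]
prompt = build_prompt(human, machine)
print(prompt)
```

With the default ratio of (6, 2), the prompt holds six human examples, two machine examples, and one open slot for the new instruction.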

identify_instruction

def identify_instruction(self, instruction: str):

Determine if the given instruction is a classification task.

Parameters:

  • instruction (str): The instruction to classify.

Returns:

bool: True if the instruction is a classification task, otherwise False.

generate_machine_instances

def generate_machine_instances(self):

Generate instances for each machine task based on its classification status.

generate_machine_instance

def generate_machine_instance(self, instruction: str, classification: bool):

Generate instances for a given instruction.

Parameters:

  • instruction (str): The instruction to create instances for.
  • classification (bool): Whether the instruction is a classification task.

Returns:

List[dict]: A list of generated instances in input-output format.

parse_classification_output

def parse_classification_output(self, generated_text: str):

Parse the generated text for classification tasks into input-output pairs.

Parameters:

  • generated_text (str): The raw text generated by the agent for classification tasks.

Returns:

List[Dict[str, str]]: A list of dictionaries with 'input' and 'output' keys.
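
The raw text format the agent produces is not specified here. Purely as an illustration, the sketch below assumes a label-first layout ("Class label:" followed by "Input:"), which is a common convention for classification prompts, and turns it into the documented list of input/output dictionaries. The function name parse_label_first and the pattern are assumptions, not the library's parser.

```python
import re
from typing import Dict, List

# Assumed layout: a "Class label:" line followed by an "Input:" block,
# repeated once per instance.
LABEL_FIRST = re.compile(
    r"Class label:\s*(?P<label>.+?)\s*\n\s*Input:\s*(?P<input>.+?)"
    r"(?=\n\s*Class label:|\Z)",
    re.DOTALL,
)


def parse_label_first(generated_text: str) -> List[Dict[str, str]]:
    """Convert label-first text into input/output pairs."""
    return [
        {"input": m.group("input").strip(), "output": m.group("label").strip()}
        for m in LABEL_FIRST.finditer(generated_text)
    ]


raw = (
    "Class label: positive\n"
    "Input: I loved this film.\n"
    "Class label: negative\n"
    "Input: The plot was dull.\n"
)
pairs = parse_label_first(raw)
print(pairs)
```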

parse_non_classification_output

def parse_non_classification_output(self, generated_text: str):

Parse the generated text for non-classification tasks into input-output pairs.

Parameters:

  • generated_text (str): The raw text generated by the agent for non-classification tasks.

Returns:

List[Dict[str, str]]: A list of dictionaries with 'input' and 'output' keys.
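
For non-classification tasks, a plausible but assumed layout is input-first: an "Input:" line followed by an "Output:" line, with possible continuation lines. The line-based sketch below handles that layout; parse_input_first is a hypothetical name and the real parser may expect a different format.

```python
from typing import Dict, List, Optional


def parse_input_first(generated_text: str) -> List[Dict[str, str]]:
    """Convert "Input: ... / Output: ..." text into input/output pairs."""
    pairs: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    field = "input"
    for line in generated_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Input:"):
            # A new "Input:" marker starts a new instance.
            current = {"input": stripped[len("Input:"):].strip(), "output": ""}
            pairs.append(current)
            field = "input"
        elif stripped.startswith("Output:") and current is not None:
            current["output"] = stripped[len("Output:"):].strip()
            field = "output"
        elif stripped and current is not None:
            # Continuation line: append to whichever field is open.
            current[field] = (current[field] + "\n" + stripped).strip()
    # Drop instances that never received an output.
    return [pair for pair in pairs if pair["output"]]


raw = (
    "Input: 3, 1, 2\n"
    "Output: 1, 2, 3\n"
    "\n"
    "Input: None\n"
    "Output: A haiku about rain\n"
    "falls on two lines\n"
)
pairs = parse_input_first(raw)
print(pairs)
```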

construct_data

def construct_data(self):

Save the machine-generated tasks to the specified output path in JSON format.
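
A minimal sketch of writing tasks to a JSON file follows. The record fields ("instruction", "is_classification", "instances") and the helper save_tasks are assumptions made for illustration; the exact schema the pipeline writes is not documented on this page.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_tasks(tasks: List[Dict], output_path: str) -> None:
    """Write the task list to `output_path` as indented UTF-8 JSON."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, indent=2, ensure_ascii=False)


# Hypothetical record layout for one machine-generated task.
machine_tasks = [
    {
        "instruction": "Translate the sentence into French.",
        "is_classification": False,
        "instances": [{"input": "Good morning", "output": "Bonjour"}],
    }
]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_output.json")
    save_tasks(machine_tasks, path)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == machine_tasks)
```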

generate

def generate(self, timeout_minutes = 600):

Execute the entire pipeline to generate machine instructions and instances.

Parameters:

  • timeout_minutes (int): Maximum time in minutes to run the generation process before timing out. (default: 600)
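
A time-bounded generation loop of this kind might look like the sketch below. It is not the pipeline's code: generate_until and make_one are hypothetical names, and whether the real method stops quietly or raises on timeout is not stated here, so the sketch simply stops and returns what it has.

```python
import time
from typing import Callable, List


def generate_until(
    make_one: Callable[[], str],
    target_count: int,
    timeout_minutes: float = 600,
) -> List[str]:
    """Collect items until the target is reached or the deadline passes."""
    # A monotonic clock is unaffected by system clock adjustments.
    deadline = time.monotonic() + timeout_minutes * 60
    results: List[str] = []
    while len(results) < target_count:
        if time.monotonic() >= deadline:
            print("Timed out before reaching the target count.")
            break
        results.append(make_one())
    return results


counter = iter(range(100))
finished = generate_until(lambda: f"instruction {next(counter)}", 3)
print(finished)

# A zero-minute budget expires immediately, so nothing is generated.
expired = generate_until(lambda: "never produced", 3, timeout_minutes=0)
print(expired)
```

Checking the deadline before each item means a long-running call to the agent can still overshoot the budget by one item; a stricter bound would need a timeout on the call itself.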