CodeChunker

class CodeChunker(BaseChunker):

A class for chunking code or text while respecting structure and token limits.

This class ensures that structured elements such as functions, classes, and regions are not arbitrarily split across chunks. It also handles oversized lines and Base64-encoded images.

Attributes:

  • chunk_size (int, optional): The maximum token size per chunk. (default: 8192)
  • remove_image (bool, optional): Whether the chunker should skip images. (default: True)
  • model_name (str, optional): The tokenizer model name used for token counting. (default: "cl100k_base")

__init__

def __init__(
    self,
    chunk_size: int = 8192,
    model_name: str = 'cl100k_base',
    remove_image: Optional[bool] = True
):

count_tokens

def count_tokens(self, text: str):

Counts the number of tokens in the given text.

Parameters:

  • text (str): The input text to be tokenized.

Returns:

int: The number of tokens in the input text.

_split_oversized

def _split_oversized(self, line: str):

Splits an oversized line into multiple chunks based on token limits.

Parameters:

  • line (str): The oversized line to be split.

Returns:

List[str]: A list of smaller chunks after splitting the oversized line.
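The idea of splitting one oversized line by token budget can be illustrated with a minimal sketch. This is not the library's implementation: the function name split_oversized, the max_tokens parameter, and the whitespace tokenizer are all assumptions made for illustration, and the real class counts tokens with the tokenizer named by model_name.

```python
from typing import Callable, List


def split_oversized(
    line: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
) -> List[str]:
    """Split one line into pieces of at most `max_tokens` tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(line)
    pieces: List[str] = []
    # Walk the token list in fixed-size windows and rejoin each window.
    for start in range(0, len(tokens), max_tokens):
        window = tokens[start:start + max_tokens]
        pieces.append(" ".join(window))
    return pieces


parts = split_oversized("a b c d e f g", max_tokens=3)
print(parts)  # ['a b c', 'd e f', 'g']
```

With a subword tokenizer the pieces would be decoded from token IDs rather than rejoined with spaces, so piece boundaries may fall inside words.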

chunk

def chunk(self, content: List[str]):

Splits the content into smaller chunks while preserving structure and adhering to token constraints.

Parameters:

  • content (List[str]): The content to be chunked.

Returns:

List[str]: A list of chunked text segments.
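As a rough illustration of chunking under a token budget, the sketch below greedily packs lines into chunks and starts a new chunk whenever the next line would exceed the limit. It is a simplified stand-in, not the actual algorithm of this class: pack_lines, max_tokens, and the whitespace-based token counter are hypothetical, and the sketch does not detect functions, classes, or regions.

```python
from typing import Callable, List


def pack_lines(
    lines: List[str],
    max_tokens: int,
    # Stand-in token counter (assumption): one token per whitespace-separated word.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily group lines into chunks of at most `max_tokens` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in lines:
        n = count_tokens(line)
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    # Flush whatever is left after the last line.
    if current:
        chunks.append("\n".join(current))
    return chunks


source = ["def f():", "    return 1", "", "def g():", "    return 2"]
for chunk in pack_lines(source, max_tokens=4):
    print(repr(chunk))
```

In this sketch a single line larger than the budget still ends up in a chunk of its own; a fuller version would first pass such lines through an oversized-line splitter.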