BM25Retriever

class BM25Retriever(BaseRetriever):

An implementation of the BaseRetriever using the BM25 model.

This class facilitates the retrieval of relevant information using a query-based approach. It ranks documents based on the occurrence and frequency of the query terms.

Attributes:

  • bm25 (BM25Okapi): An instance of the BM25Okapi class used for calculating document scores.
  • content_input_path (str): The path to the content that has been processed and stored.
  • unstructured_modules (UnstructuredIO): A module for parsing files and URLs and chunking content based on specified parameters.

References: https://github.com/dorianbrown/rank_bm25
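
To make the ranking idea concrete, the following is a minimal, standard-library sketch of textbook Okapi BM25 scoring. It is not the code this class runs: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the non-negative IDF variant are all assumptions chosen for illustration, and the rank_bm25 package may differ in its details.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    corpus_tokens: List[List[str]],
    k1: float = 1.5,  # assumed term-frequency saturation parameter
    b: float = 0.75,  # assumed document-length normalization strength
) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Illustrative corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "bm25 ranks documents by query term frequency".split(),
]
scores = bm25_scores("query term frequency".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
```

Documents that contain none of the query terms score zero, so only the third document receives a positive score here.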

__init__

def __init__(self):

Initializes the BM25Retriever.

process

def process(
    self,
    content_input_path: str,
    chunk_type: str = 'chunk_by_title',
    **kwargs: Any
):

Processes content from a file or URL, divides it into chunks using Unstructured IO, and stores the chunks internally. This method must be called before executing queries with the retriever.

Parameters:

  • content_input_path (str): File path or URL of the content to be processed.
  • chunk_type (str): Type of chunking to apply. Defaults to 'chunk_by_title'.
  • **kwargs (Any): Additional keyword arguments for content parsing.
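
The process-before-query contract described above can be sketched with a small stand-in class. Everything here is hypothetical: `ToyRetriever` and `is_ready` are illustrative names, the method takes raw text rather than a file path or URL, and splitting on blank lines is only a placeholder for the chunking that Unstructured IO performs.

```python
from typing import List, Optional


class ToyRetriever:
    """Hypothetical stand-in that mimics the process-then-query contract."""

    def __init__(self) -> None:
        # Nothing is indexed until process() runs.
        self.chunks: Optional[List[str]] = None
        self.tokenized_chunks: Optional[List[List[str]]] = None

    def process(self, text: str) -> None:
        # Assumption: split on blank lines instead of calling Unstructured IO.
        self.chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
        # Whitespace tokenization, one token list per chunk.
        self.tokenized_chunks = [chunk.lower().split() for chunk in self.chunks]

    def is_ready(self) -> bool:
        # A query should only be attempted once chunks have been stored.
        return self.tokenized_chunks is not None


retriever = ToyRetriever()
ready_before = retriever.is_ready()
retriever.process("Intro\nBM25 basics.\n\nUsage\nCall process first.")
ready_after = retriever.is_ready()
```

Storing the tokenized chunks on the instance is what lets a later query score them without re-reading the source content.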

query

def query(self, query: str, top_k: int = DEFAULT_TOP_K_RESULTS):

Executes a query and compiles the results.

Parameters:

  • query (str): Query string for information retrieval.
  • top_k (int, optional): The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.

Returns:

List[Dict[str, Any]]: Concatenated list of the query results.
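
As a rough illustration of the top_k selection step, the sketch below picks the highest-scoring chunks from a list of precomputed scores. The helper name `top_k_results`, the result keys "score" and "text", the default of 1, and the choice to raise ValueError for a non-positive top_k are assumptions made for this example; the dictionaries returned by the actual method may use different keys.

```python
from typing import Any, Dict, List


def top_k_results(
    scores: List[float],
    chunks: List[str],
    top_k: int = 1,  # placeholder default, not DEFAULT_TOP_K_RESULTS
) -> List[Dict[str, Any]]:
    """Return the top_k chunks by score, highest first."""
    if top_k <= 0:
        # Assumed behavior for enforcing "must be a positive integer".
        raise ValueError("top_k must be a positive integer.")
    # Indices of chunks sorted by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Hypothetical result shape: one dictionary per selected chunk.
    return [{"score": scores[i], "text": chunks[i]} for i in ranked[:top_k]]


results = top_k_results([0.2, 1.7, 0.9], ["alpha", "beta", "gamma"], top_k=2)
```

If top_k exceeds the number of chunks, the slice simply returns every chunk in ranked order.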