HybridRetriever

class HybridRetriever(BaseRetriever):

init

def __init__(
    self,
    embedding_model: Optional[BaseEmbedding] = None,
    vector_storage: Optional[BaseVectorStorage] = None
):

Initializes the HybridRetriever with optional embedding model and vector storage.

Parameters:

  • embedding_model (Optional[BaseEmbedding]): An optional embedding model used by the VectorRetriever. Defaults to None.
  • vector_storage (Optional[BaseVectorStorage]): An optional vector storage used by the VectorRetriever. Defaults to None.

process

def process(self, content_input_path: str):

Processes the content input path for both vector and BM25 retrievers.

Parameters:

  • content_input_path (str): File path or URL of the content to be processed.

_sort_rrf_scores

def _sort_rrf_scores(
    self,
    vector_retriever_results: List[Dict[str, Any]],
    bm25_retriever_results: List[Dict[str, Any]],
    top_k: int,
    vector_weight: float,
    bm25_weight: float,
    rank_smoothing_factor: float
):

Sorts and combines results from vector and BM25 retrievers using Reciprocal Rank Fusion (RRF).

Parameters:

  • vector_retriever_results: A list of dictionaries containing the results from the vector retriever, where each dictionary contains a ‘text’ entry.
  • bm25_retriever_results: A list of dictionaries containing the results from the BM25 retriever, where each dictionary contains a ‘text’ entry.
  • top_k: The number of top results to return after sorting by RRF score.
  • vector_weight: The weight to assign to the vector retriever results in the RRF calculation.
  • bm25_weight: The weight to assign to the BM25 retriever results in the RRF calculation.
  • rank_smoothing_factor: A hyperparameter for the RRF calculation that helps smooth the rank positions.

Returns:

List[Dict[str, Union[str, float]]]: A list of dictionaries representing the sorted results. Each dictionary contains the ‘text’from the retrieved items and their corresponding ‘rrf_score’.

query

def query(
    self,
    query: str,
    top_k: int = 20,
    vector_weight: float = 0.8,
    bm25_weight: float = 0.2,
    rank_smoothing_factor: int = 60,
    vector_retriever_top_k: int = 50,
    vector_retriever_similarity_threshold: float = 0.5,
    bm25_retriever_top_k: int = 50,
    return_detailed_info: bool = False
):

Executes a hybrid retrieval query using both vector and BM25 retrievers.

Parameters:

  • query (str): The search query.
  • top_k (int): Number of top results to return (default 20).
  • vector_weight (float): Weight for vector retriever results in RRF.
  • bm25_weight (float): Weight for BM25 retriever results in RRF.
  • rank_smoothing_factor (int): RRF hyperparameter for rank smoothing.
  • vector_retriever_top_k (int): Top results from vector retriever.
  • vector_retriever_similarity_threshold (float): Similarity threshold for vector retriever.
  • bm25_retriever_top_k (int): Top results from BM25 retriever.
  • return_detailed_info (bool): Return detailed info if True.

Returns:

Union[ dict[str, Sequence[Collection[str]]], dict[str, Sequence[Union[str, float]]] ]: By default, returns only the text information. If return_detailed_info is True, return detailed information including rrf scores.