VectorRetriever

class VectorRetriever(BaseRetriever):

An implementation of the BaseRetriever by using vector storage and embedding model.

This class facilitates the retriever of relevant information using a query-based approach, backed by vector embeddings.

Attributes: embedding_model (BaseEmbedding): Embedding model used to generate vector embeddings. storage (BaseVectorStorage): Vector storage to query. unstructured_modules (UnstructuredIO): A module for parsing files and URLs and chunking content based on specified parameters.

init

def __init__(
    self,
    embedding_model: Optional[BaseEmbedding] = None,
    storage: Optional[BaseVectorStorage] = None
):

Initializes the retriever class with an optional embedding model.

Parameters:

  • embedding_model (Optional[BaseEmbedding]): The embedding model instance. Defaults to OpenAIEmbedding if not provided.
  • storage (BaseVectorStorage): Vector storage to query.

process

def process(
    self,
    content: Union[str, 'Element', IO[bytes]],
    chunk_type: str = 'chunk_by_title',
    max_characters: int = 500,
    embed_batch: int = 50,
    should_chunk: bool = True,
    extra_info: Optional[dict] = None,
    metadata_filename: Optional[str] = None,
    chunker: Optional[BaseChunker] = None,
    **kwargs: Any
):

Processes content from local file path, remote URL, string content, Element object, or a binary file object, divides it into chunks by using Unstructured IO, and stores their embeddings in the specified vector storage.

Parameters:

  • content (Union[str, Element, IO[bytes]]): Local file path, remote URL, string content, Element object, or a binary file object.
  • chunk_type (str): Type of chunking going to apply. Defaults to “chunk_by_title”.
  • max_characters (int): Max number of characters in each chunk. Defaults to 500.
  • embed_batch (int): Size of batch for embeddings. Defaults to 50. (default: 50)
  • should_chunk (bool): If True, divide the content into chunks, otherwise skip chunking. Defaults to True.
  • extra_info (Optional[dict]): Extra information to be added to the payload. Defaults to None.
  • metadata_filename (Optional[str]): The metadata filename to be used for storing metadata. Defaults to None. **kwargs (Any): Additional keyword arguments for content parsing.

query

def query(
    self,
    query: str,
    top_k: int = Constants.DEFAULT_TOP_K_RESULTS,
    similarity_threshold: float = Constants.DEFAULT_SIMILARITY_THRESHOLD
):

Executes a query in vector storage and compiles the retrieved results into a dictionary.

Parameters:

  • query (str): Query string for information retriever.
  • similarity_threshold (float, optional): The similarity threshold for filtering results. Defaults to DEFAULT_SIMILARITY_THRESHOLD.
  • top_k (int, optional): The number of top results to return during retriever. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.

Returns:

List[Dict[str, Any]]: Concatenated list of the query results.