camel.retrievers package#
Submodules#
camel.retrievers.auto_retriever module#
- class camel.retrievers.auto_retriever.AutoRetriever(url_and_api_key: Tuple[str, str] | None = None, vector_storage_local_path: str | None = None, storage_type: StorageType | None = None, embedding_model: BaseEmbedding | None = None)[source]#
Bases:
object
Facilitates the automatic retrieval of information using a query-based approach with pre-defined elements.
- url_and_api_key#
URL and API key for accessing the vector storage remotely.
- Type:
Optional[Tuple[str, str]]
- vector_storage_local_path#
Local path for vector storage, if applicable.
- Type:
Optional[str]
- storage_type#
The type of vector storage to use. Defaults to StorageType.QDRANT.
- Type:
Optional[StorageType]
- embedding_model#
Model used for embedding queries and documents. Defaults to OpenAIEmbedding().
- Type:
Optional[BaseEmbedding]
- run_vector_retriever(query: str, contents: str | List[str] | Element | List[Element], top_k: int = 1, similarity_threshold: float = 0.7, return_detailed_info: bool = False, max_characters: int = 500) dict[str, Sequence[Collection[str]]] [source]#
Executes the automatic vector retriever process using vector storage.
- Parameters:
query (str) – Query string for information retrieval.
contents (Union[str, List[str], Element, List[Element]]) – Local file paths, remote URLs, string contents or Element objects.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
similarity_threshold (float, optional) – The similarity threshold for filtering results. Defaults to DEFAULT_SIMILARITY_THRESHOLD.
return_detailed_info (bool, optional) – Whether to return detailed information including similarity score, content path and metadata. Defaults to False.
max_characters (int) – Max number of characters in each chunk. Defaults to 500.
- Returns:
By default, returns only the text information. If return_detailed_info is True, returns detailed information including similarity score, content path, and metadata.
- Return type:
dict[str, Sequence[Collection[str]]]
- Raises:
ValueError – If a vector storage already exists for the content name in the vector path but its payload is None, or if contents is empty.
RuntimeError – If any error occurs during the retrieval process.
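The contents parameter accepts either a single item or a list, and an empty value raises ValueError. A minimal sketch of that input handling, assuming a hypothetical helper named normalize_contents that is not part of camel and that ignores Element objects for brevity:

```python
from typing import List, Union


def normalize_contents(contents: Union[str, List[str]]) -> List[str]:
    """Return `contents` as a non-empty list of strings (hypothetical helper)."""
    # An empty string or an empty list leaves nothing to retrieve from.
    if not contents:
        raise ValueError("contents is empty")
    # A single path, URL, or text becomes a one-item list.
    if isinstance(contents, str):
        return [contents]
    return list(contents)


print(normalize_contents("https://example.com/doc.html"))
print(normalize_contents(["notes.md", "report.pdf"]))
```

The real method may validate and dispatch its inputs differently; the sketch only illustrates the documented contract.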
camel.retrievers.base module#
- class camel.retrievers.base.BaseRetriever[source]#
Bases:
ABC
Abstract base class for implementing various types of information retrievers.
- process(*input: Any) None #
Defines the process behavior performed at every call.
Processes content from a file or URL, divides it into chunks using Unstructured IO, and stores the chunks internally. This method must be called before executing queries with the retriever.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the BaseRetriever instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- query(*input: Any) None #
Defines the query behavior performed at every call.
Queries the results. Subclasses should implement this method according to their specific needs.
It should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the BaseRetriever instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
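The two-method contract (process first, then query) can be sketched with Python's abc module. The names SketchRetriever and KeywordRetriever are hypothetical stand-ins, and the substring matching is only a placeholder for a real retrieval strategy:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchRetriever(ABC):
    """Hypothetical stand-in for a retriever base class."""

    @abstractmethod
    def process(self, *args: Any, **kwargs: Any) -> None:
        """Ingest content so that it can be queried later."""

    @abstractmethod
    def query(self, *args: Any, **kwargs: Any) -> List[Dict[str, Any]]:
        """Return results for a query over the ingested content."""


class KeywordRetriever(SketchRetriever):
    """Toy subclass: stores lines of text and matches by substring."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def process(self, text: str) -> None:
        # Keep each non-blank line as one retrievable unit.
        self.lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    def query(self, query: str, top_k: int = 1) -> List[Dict[str, Any]]:
        if top_k <= 0:
            raise ValueError("top_k must be a positive integer")
        hits = [ln for ln in self.lines if query.lower() in ln.lower()]
        return [{"text": ln} for ln in hits[:top_k]]


retriever = KeywordRetriever()
retriever.process("BM25 ranks by term frequency.\nVectors rank by similarity.")
print(retriever.query("similarity"))
```

Because both methods are abstract in the sketch, instantiating SketchRetriever directly raises TypeError, which is one way to enforce that every subclass overrides them.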
camel.retrievers.bm25_retriever module#
- class camel.retrievers.bm25_retriever.BM25Retriever[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using the BM25 model.
This class facilitates the retrieval of relevant information using a query-based approach. It ranks documents based on the occurrence and frequency of the query terms.
- bm25#
An instance of the BM25Okapi class used for calculating document scores.
- Type:
BM25Okapi
- content_input_path#
The path to the content that has been processed and stored.
- Type:
str
- unstructured_modules#
A module for parsing files and URLs and chunking content based on specified parameters.
- Type:
References
- process(content_input_path: str, chunk_type: str = 'chunk_by_title', **kwargs: Any) None [source]#
Processes content from a file or URL, divides it into chunks using Unstructured IO, and stores the chunks internally. This method must be called before executing queries with the retriever.
- Parameters:
content_input_path (str) – File path or URL of the content to be processed.
chunk_type (str) – Type of chunking to apply. Defaults to “chunk_by_title”.
**kwargs (Any) – Additional keyword arguments for content parsing.
- query(query: str, top_k: int = 1) List[Dict[str, Any]] [source]#
Executes a query and compiles the results.
- Parameters:
query (str) – Query string for information retrieval.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If top_k is less than or equal to 0, or if the BM25 model has not been initialized by calling process first.
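To make the ranking idea concrete, here is a minimal pure-Python sketch of BM25 scoring followed by a top_k cut. It uses a common textbook form of the formula with whitespace tokenization. The function names, the k1 and b values, and the result keys are assumptions, and the BM25Okapi class referenced above may differ in its IDF details:

```python
import math
from collections import Counter
from typing import Any, Dict, List


def bm25_scores(query: str, chunks: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every chunk against the query with a textbook BM25 formula."""
    if not chunks:
        return []
    tokenized = [chunk.lower().split() for chunk in chunks]
    n_chunks = len(tokenized)
    # Average chunk length in tokens; fall back to 1.0 if all chunks are empty.
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_chunks) or 1.0
    # Document frequency: number of chunks that contain each term.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_chunks - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def top_k_chunks(query: str, chunks: List[str], top_k: int = 1) -> List[Dict[str, Any]]:
    """Return the top_k highest-scoring chunks, best first."""
    if top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    if not chunks:
        raise ValueError("no chunks available; process content first")
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [{"text": chunks[i], "score": scores[i]} for i in ranked[:top_k]]


docs = ["the cat sat on the mat", "dogs chase cats", "the stock market fell"]
print(top_k_chunks("stock market", docs, top_k=1)[0]["text"])
```

Chunks that share no term with the query score exactly 0.0, so they sort to the end of the ranking.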
camel.retrievers.cohere_rerank_retriever module#
- class camel.retrievers.cohere_rerank_retriever.CohereRerankRetriever(model_name: str = 'rerank-multilingual-v2.0', api_key: str | None = None)[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using the Cohere Re-ranking model.
- model_name#
The model name to use for re-ranking.
- Type:
str
- api_key#
The API key for authenticating with the Cohere service.
- Type:
Optional[str]
References
https://txt.cohere.com/rerank/
- query(query: str, retrieved_result: List[Dict[str, Any]], top_k: int = 1) List[Dict[str, Any]] [source]#
Queries and compiles results using the Cohere re-ranking model.
- Parameters:
query (str) – Query string for information retrieval.
retrieved_result (List[Dict[str, Any]]) – The content to be re-ranked. It should be the output of a BaseRetriever subclass such as VectorRetriever.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
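Re-ranking takes results that another retriever already produced and reorders them by a second relevance score. The sketch below shows only that data flow. It swaps the Cohere model for a toy token-overlap scorer, and the "text" and "rerank_score" keys are assumptions rather than the library's actual payload format:

```python
from typing import Any, Callable, Dict, List


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(query: str,
           retrieved_result: List[Dict[str, Any]],
           top_k: int = 1,
           scorer: Callable[[str, str], float] = overlap_score) -> List[Dict[str, Any]]:
    """Reorder previously retrieved items by a second relevance score."""
    if top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    # Copy each item and attach its new score; the input list is not mutated.
    scored = [
        {**item, "rerank_score": scorer(query, item["text"])}
        for item in retrieved_result
    ]
    scored.sort(key=lambda item: item["rerank_score"], reverse=True)
    return scored[:top_k]


candidates = [
    {"text": "vector storage basics"},
    {"text": "how to rerank search results"},
]
print(rerank("rerank results", candidates, top_k=1)[0]["text"])
```

Passing the scorer as a parameter keeps the ordering logic independent of whichever model supplies the scores.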
camel.retrievers.vector_retriever module#
- class camel.retrievers.vector_retriever.VectorRetriever(embedding_model: BaseEmbedding | None = None, storage: BaseVectorStorage | None = None)[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using vector storage and an embedding model.
This class facilitates the retrieval of relevant information using a query-based approach, backed by vector embeddings.
- embedding_model#
Embedding model used to generate vector embeddings.
- Type:
- storage#
Vector storage to query.
- Type:
- unstructured_modules#
A module for parsing files and URLs and chunking content based on specified parameters.
- Type:
- process(content: str | Element | IO[bytes], chunk_type: str = 'chunk_by_title', max_characters: int = 500, embed_batch: int = 50, should_chunk: bool = True, extra_info: dict | None = None, metadata_filename: str | None = None, **kwargs: Any) None [source]#
Processes content from local file path, remote URL, string content, Element object, or a binary file object, divides it into chunks by using Unstructured IO, and stores their embeddings in the specified vector storage.
- Parameters:
content (Union[str, Element, IO[bytes]]) – Local file path, remote URL, string content, Element object, or a binary file object.
chunk_type (str) – Type of chunking to apply. Defaults to “chunk_by_title”.
max_characters (int) – Max number of characters in each chunk. Defaults to 500.
embed_batch (int) – Size of batch for embeddings. Defaults to 50.
should_chunk (bool) – If True, divide the content into chunks, otherwise skip chunking. Defaults to True.
extra_info (Optional[dict]) – Extra information to be added to the payload. Defaults to None.
metadata_filename (Optional[str]) – The metadata filename to be used for storing metadata. Defaults to None.
**kwargs (Any) – Additional keyword arguments for content parsing.
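The roles of max_characters and embed_batch can be illustrated with a small sketch: text is split into chunks no longer than max_characters, and the chunks are then grouped into batches of embed_batch for embedding. The greedy word-packing shown here is an assumption for illustration and is not the chunk_by_title strategy from Unstructured IO. Both helper names are hypothetical:

```python
from typing import Iterator, List


def chunk_text(text: str, max_characters: int = 500) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Each chunk is at most max_characters long, except that a single word
    longer than the limit is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_characters or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def batched(items: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = chunk_text("aaa bbb ccc ddd", max_characters=7)
print(chunks)
for batch in batched(chunks, batch_size=1):
    print(batch)
```

Batching bounds the size of each embedding request, which matters when the embedding model is a remote service with per-call limits.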
- query(query: str, top_k: int = 1, similarity_threshold: float = 0.7) List[Dict[str, Any]] [source]#
Executes a query in vector storage and compiles the retrieved results into a list of dictionaries.
- Parameters:
query (str) – Query string for information retrieval.
similarity_threshold (float, optional) – The similarity threshold for filtering results. Defaults to DEFAULT_SIMILARITY_THRESHOLD.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If top_k is less than or equal to 0, if the vector storage is empty, or if the payload of the vector storage is None.
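How top_k and similarity_threshold interact can be sketched with plain cosine similarity over in-memory vectors. This assumes cosine similarity as the metric and uses a list of (text, vector) pairs in place of a real vector storage. The function names and result keys are hypothetical:

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_vectors(query_vec: Sequence[float],
                  stored: List[Tuple[str, Sequence[float]]],
                  top_k: int = 1,
                  similarity_threshold: float = 0.7) -> List[Dict[str, Any]]:
    """Return up to top_k stored texts whose similarity meets the threshold."""
    if top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    if not stored:
        raise ValueError("vector storage is empty")
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Take the best top_k, then drop any that fall below the threshold.
    return [
        {"similarity_score": score, "text": text}
        for score, text in scored[:top_k]
        if score >= similarity_threshold
    ]


store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for hit in query_vectors([1.0, 0.0], store, top_k=2, similarity_threshold=0.7):
    print(hit["text"], round(hit["similarity_score"], 3))
```

With the threshold applied after ranking, a query can return fewer than top_k results, or none at all, when too few stored vectors are similar enough.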
Module contents#
- class camel.retrievers.AutoRetriever(url_and_api_key: Tuple[str, str] | None = None, vector_storage_local_path: str | None = None, storage_type: StorageType | None = None, embedding_model: BaseEmbedding | None = None)[source]#
Bases:
object
Facilitates the automatic retrieval of information using a query-based approach with pre-defined elements.
- url_and_api_key#
URL and API key for accessing the vector storage remotely.
- Type:
Optional[Tuple[str, str]]
- vector_storage_local_path#
Local path for vector storage, if applicable.
- Type:
Optional[str]
- storage_type#
The type of vector storage to use. Defaults to StorageType.QDRANT.
- Type:
Optional[StorageType]
- embedding_model#
Model used for embedding queries and documents. Defaults to OpenAIEmbedding().
- Type:
Optional[BaseEmbedding]
- run_vector_retriever(query: str, contents: str | List[str] | Element | List[Element], top_k: int = 1, similarity_threshold: float = 0.7, return_detailed_info: bool = False, max_characters: int = 500) dict[str, Sequence[Collection[str]]] [source]#
Executes the automatic vector retriever process using vector storage.
- Parameters:
query (str) – Query string for information retrieval.
contents (Union[str, List[str], Element, List[Element]]) – Local file paths, remote URLs, string contents or Element objects.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
similarity_threshold (float, optional) – The similarity threshold for filtering results. Defaults to DEFAULT_SIMILARITY_THRESHOLD.
return_detailed_info (bool, optional) – Whether to return detailed information including similarity score, content path and metadata. Defaults to False.
max_characters (int) – Max number of characters in each chunk. Defaults to 500.
- Returns:
By default, returns only the text information. If return_detailed_info is True, returns detailed information including similarity score, content path, and metadata.
- Return type:
dict[str, Sequence[Collection[str]]]
- Raises:
ValueError – If a vector storage already exists for the content name in the vector path but its payload is None, or if contents is empty.
RuntimeError – If any error occurs during the retrieval process.
- class camel.retrievers.BM25Retriever[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using the BM25 model.
This class facilitates the retrieval of relevant information using a query-based approach. It ranks documents based on the occurrence and frequency of the query terms.
- bm25#
An instance of the BM25Okapi class used for calculating document scores.
- Type:
BM25Okapi
- content_input_path#
The path to the content that has been processed and stored.
- Type:
str
- unstructured_modules#
A module for parsing files and URLs and chunking content based on specified parameters.
- Type:
References
- process(content_input_path: str, chunk_type: str = 'chunk_by_title', **kwargs: Any) None [source]#
Processes content from a file or URL, divides it into chunks using Unstructured IO, and stores the chunks internally. This method must be called before executing queries with the retriever.
- Parameters:
content_input_path (str) – File path or URL of the content to be processed.
chunk_type (str) – Type of chunking to apply. Defaults to “chunk_by_title”.
**kwargs (Any) – Additional keyword arguments for content parsing.
- query(query: str, top_k: int = 1) List[Dict[str, Any]] [source]#
Executes a query and compiles the results.
- Parameters:
query (str) – Query string for information retrieval.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If top_k is less than or equal to 0, or if the BM25 model has not been initialized by calling process first.
- class camel.retrievers.BaseRetriever[source]#
Bases:
ABC
Abstract base class for implementing various types of information retrievers.
- process(*input: Any) None #
Defines the process behavior performed at every call.
Processes content from a file or URL, divides it into chunks using Unstructured IO, and stores the chunks internally. This method must be called before executing queries with the retriever.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the BaseRetriever instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- query(*input: Any) None #
Defines the query behavior performed at every call.
Queries the results. Subclasses should implement this method according to their specific needs.
It should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the BaseRetriever instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class camel.retrievers.CohereRerankRetriever(model_name: str = 'rerank-multilingual-v2.0', api_key: str | None = None)[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using the Cohere Re-ranking model.
- model_name#
The model name to use for re-ranking.
- Type:
str
- api_key#
The API key for authenticating with the Cohere service.
- Type:
Optional[str]
References
https://txt.cohere.com/rerank/
- query(query: str, retrieved_result: List[Dict[str, Any]], top_k: int = 1) List[Dict[str, Any]] [source]#
Queries and compiles results using the Cohere re-ranking model.
- Parameters:
query (str) – Query string for information retrieval.
retrieved_result (List[Dict[str, Any]]) – The content to be re-ranked. It should be the output of a BaseRetriever subclass such as VectorRetriever.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
- class camel.retrievers.VectorRetriever(embedding_model: BaseEmbedding | None = None, storage: BaseVectorStorage | None = None)[source]#
Bases:
BaseRetriever
An implementation of the BaseRetriever using vector storage and an embedding model.
This class facilitates the retrieval of relevant information using a query-based approach, backed by vector embeddings.
- embedding_model#
Embedding model used to generate vector embeddings.
- Type:
- storage#
Vector storage to query.
- Type:
- unstructured_modules#
A module for parsing files and URLs and chunking content based on specified parameters.
- Type:
- process(content: str | Element | IO[bytes], chunk_type: str = 'chunk_by_title', max_characters: int = 500, embed_batch: int = 50, should_chunk: bool = True, extra_info: dict | None = None, metadata_filename: str | None = None, **kwargs: Any) None [source]#
Processes content from local file path, remote URL, string content, Element object, or a binary file object, divides it into chunks by using Unstructured IO, and stores their embeddings in the specified vector storage.
- Parameters:
content (Union[str, Element, IO[bytes]]) – Local file path, remote URL, string content, Element object, or a binary file object.
chunk_type (str) – Type of chunking to apply. Defaults to “chunk_by_title”.
max_characters (int) – Max number of characters in each chunk. Defaults to 500.
embed_batch (int) – Size of batch for embeddings. Defaults to 50.
should_chunk (bool) – If True, divide the content into chunks, otherwise skip chunking. Defaults to True.
extra_info (Optional[dict]) – Extra information to be added to the payload. Defaults to None.
metadata_filename (Optional[str]) – The metadata filename to be used for storing metadata. Defaults to None.
**kwargs (Any) – Additional keyword arguments for content parsing.
- query(query: str, top_k: int = 1, similarity_threshold: float = 0.7) List[Dict[str, Any]] [source]#
Executes a query in vector storage and compiles the retrieved results into a list of dictionaries.
- Parameters:
query (str) – Query string for information retrieval.
similarity_threshold (float, optional) – The similarity threshold for filtering results. Defaults to DEFAULT_SIMILARITY_THRESHOLD.
top_k (int, optional) – The number of top results to return during retrieval. Must be a positive integer. Defaults to DEFAULT_TOP_K_RESULTS.
- Returns:
Concatenated list of the query results.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If top_k is less than or equal to 0, if the vector storage is empty, or if the payload of the vector storage is None.