camel.utils.deduplication
DeduplicationResult
The result of deduplication.
Attributes:
- original_texts (List[str]): The original texts.
- unique_ids (List[int]): A list of ids that are unique (not duplicates).
- unique_embeddings_dict (Dict[int, List[float]]): A mapping from the index of each unique text to its embedding.
- duplicate_to_target_map (Dict[int, int]): A mapping from the index of the duplicate text to the index of the text it is considered a duplicate of.
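To make the relationship between these fields concrete, here is a minimal sketch of a container with the same attribute names. The class name `DedupResultSketch` and the sample values are hypothetical stand-ins for illustration, not the library's actual definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DedupResultSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    original_texts: List[str]
    unique_ids: List[int]
    unique_embeddings_dict: Dict[int, List[float]]
    duplicate_to_target_map: Dict[int, int]


# Illustrative values: text 2 is treated as a duplicate of text 0.
result = DedupResultSketch(
    original_texts=["hello world", "goodbye", "hello, world!"],
    unique_ids=[0, 1],
    unique_embeddings_dict={0: [1.0, 0.0], 1: [0.0, 1.0]},
    duplicate_to_target_map={2: 0},
)

# In this example every index is either unique or mapped to a target.
covered = set(result.unique_ids) | set(result.duplicate_to_target_map)
```

In this sketch, only unique texts carry an embedding, and each duplicate points at the index of the text it duplicates.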
deduplicate_internally
Deduplicate a list of strings based on their cosine similarity.
You can either:
- Provide a CAMEL `BaseEmbedding` instance via `embedding_instance` to let this function handle the embedding internally, OR
- Directly pass a list of pre-computed embeddings to `embeddings`.
If both `embedding_instance` and `embeddings` are provided, the function will raise a ValueError to avoid ambiguous usage.
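The either/or rule for the two embedding sources can be sketched as a small argument check. The helper name `check_embedding_args` and its error messages are hypothetical; this only illustrates the documented behavior and is not the library's code.

```python
from typing import List, Optional


def check_embedding_args(
    embedding_instance: Optional[object] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> None:
    """Hypothetical helper: require exactly one embedding source."""
    if embedding_instance is None and embeddings is None:
        raise ValueError(
            "Provide either `embedding_instance` or `embeddings`."
        )
    if embedding_instance is not None and embeddings is not None:
        raise ValueError(
            "Provide only one of `embedding_instance` or `embeddings`."
        )


# Passing exactly one source is accepted silently.
check_embedding_args(embeddings=[[0.1, 0.2], [0.3, 0.4]])
```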
`strategy` selects the deduplication strategy: `"top1"` selects the candidate with the highest similarity, and `"llm-supervise"` uses an LLM to determine whether texts are duplicates (not yet implemented).
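As a rough illustration of threshold-based deduplication with a top-1 rule, the sketch below maps each text to its single most similar earlier text when that cosine similarity exceeds the threshold. Comparing only against earlier items, the pure-Python `cosine` helper, and the name `top1_dedup_sketch` are all assumptions made for this example; the library's actual implementation may differ.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_dedup_sketch(
    embeddings: List[List[float]], threshold: float = 0.65
) -> Tuple[List[int], Dict[int, int]]:
    """Assumed top-1 rule: map each item to its most similar earlier item
    when that similarity exceeds the threshold; otherwise keep it."""
    unique_ids: List[int] = []
    duplicate_to_target: Dict[int, int] = {}
    for i, emb in enumerate(embeddings):
        best_j, best_sim = -1, threshold
        for j in range(i):  # only compare against earlier items
            sim = cosine(emb, embeddings[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0:
            duplicate_to_target[i] = best_j
        else:
            unique_ids.append(i)
    return unique_ids, duplicate_to_target


# The third vector points almost the same way as the first one.
unique_ids, dup_map = top1_dedup_sketch(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], threshold=0.65
)
```

In this sketch a target may itself be a duplicate of an even earlier item, so callers who need a canonical representative would have to follow the mapping until they reach a unique index.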
Parameters:
- texts (List[str]): The list of texts to be deduplicated.
- threshold (float, optional): The similarity threshold for considering two texts as duplicates. (default: `0.65`)
- embedding_instance (Optional[BaseEmbedding[str]], optional): A CAMEL embedding instance for automatic embedding. (default: `None`)
- embeddings (Optional[List[List[float]]], optional): Pre-computed embeddings of `texts`. Each element in the list corresponds to the embedding of the text at the same index of `texts`. (default: `None`)
- strategy (Literal["top1", "llm-supervise"], optional): The strategy to use for deduplication. (default: `"top1"`)
- batch_size (int, optional): The size of the batch to use for calculating cosine similarities. (default: `1000`)
Returns:
DeduplicationResult: An object that contains:
- `original_texts`: The original texts.
- `unique_ids`: The unique ids after deduplication.
- `unique_embeddings_dict`: A dict mapping from (unique) text id to its embedding.
- `duplicate_to_target_map`: A dict mapping from the id of a duplicate text to the id of the text it is considered a duplicate of.
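A typical way to consume these fields is to rebuild the list of surviving texts and to see which texts were dropped in favor of which. The sketch below uses plain illustrative values shaped like the documented fields rather than a real function call, and the variable names `unique_texts` and `dropped_by_target` are hypothetical.

```python
from typing import Dict, List

# Illustrative field values shaped like the documented return object.
original_texts: List[str] = ["What is AI?", "Define AI.", "How do planes fly?"]
unique_ids: List[int] = [0, 2]
duplicate_to_target_map: Dict[int, int] = {1: 0}

# Keep only the texts that survived deduplication.
unique_texts = [original_texts[i] for i in unique_ids]

# Group each dropped text under the index of the text it was matched to.
dropped_by_target: Dict[int, List[str]] = {}
for dup_id, target_id in duplicate_to_target_map.items():
    dropped_by_target.setdefault(target_id, []).append(original_texts[dup_id])
```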
Raises:
- NotImplementedError: If the strategy is not `"top1"`.
- ValueError: If neither `embeddings` nor `embedding_instance` is provided, or if both are provided.
- ValueError: If the length of
embeddings
does not match the length of