BaseEmbedding instance via embedding_instance to let this function handle the embedding internally, OR pass pre-computed embeddings via embeddings.
If both embedding_instance and embeddings are provided, the function will raise a ValueError to avoid ambiguous usage.
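The either/or rule above can be sketched as a small argument check. This is only an illustration under stated assumptions, not the library's actual code: `resolve_embeddings`, `SupportsEmbedList`, and `ToyEmbedding` are hypothetical names, the `embed_list` method is assumed for the sketch, and raising a ValueError when neither source is given is an assumption as well.

```python
from typing import List, Optional, Protocol


class SupportsEmbedList(Protocol):
    """Hypothetical stand-in for an embedding backend (not the library's class)."""

    def embed_list(self, objs: List[str]) -> List[List[float]]: ...


def resolve_embeddings(
    texts: List[str],
    embedding_instance: Optional[SupportsEmbedList] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    """Return one embedding per text from exactly one of the two sources."""
    if embedding_instance is not None and embeddings is not None:
        # Documented behavior: supplying both sources is ambiguous.
        raise ValueError(
            "Provide either embedding_instance or embeddings, not both."
        )
    if embeddings is not None:
        # Pre-computed embeddings are used as given.
        return embeddings
    if embedding_instance is not None:
        # Let the backend compute the embeddings.
        return embedding_instance.embed_list(texts)
    # Assumption: with neither source there is nothing to embed with.
    raise ValueError("Provide embedding_instance or embeddings.")


class ToyEmbedding:
    """Toy backend for illustration: embeds a text as [length, vowel count]."""

    def embed_list(self, objs: List[str]) -> List[List[float]]:
        return [
            [float(len(s)), float(sum(c in "aeiou" for c in s))] for s in objs
        ]
```

Calling `resolve_embeddings(texts, embedding_instance=ToyEmbedding())` computes the vectors, `resolve_embeddings(texts, embeddings=vectors)` passes them through unchanged, and supplying both raises a ValueError.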
strategy specifies the deduplication strategy: "top1" selects the one with the highest similarity, and "llm-supervise" uses an LLM to determine whether texts are duplicates (not yet implemented).
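One plausible reading of the "top1" strategy is sketched below. It rests on several assumptions that the text does not confirm: similarity is taken to be cosine similarity, each text is compared only against earlier texts, and the 0.65 default is treated as a similarity cutoff. The names `cosine_similarity`, `top1_deduplicate`, and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_deduplicate(
    embeddings: List[List[float]],
    threshold: float = 0.65,  # assumed to be a similarity cutoff
) -> Tuple[List[int], Dict[int, int]]:
    """Map each text to its most similar earlier text if similar enough."""
    duplicate_to_target_map: Dict[int, int] = {}
    for i in range(1, len(embeddings)):
        # Find the single most similar earlier text (the "top 1").
        best_j, best_sim = -1, -1.0
        for j in range(i):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Only a match above the cutoff counts as a duplicate.
        if best_sim > threshold:
            duplicate_to_target_map[i] = best_j
    unique_ids = [
        i for i in range(len(embeddings)) if i not in duplicate_to_target_map
    ]
    return unique_ids, duplicate_to_target_map
```

For example, with the embeddings `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]` the second vector is nearly parallel to the first and is mapped to it, while the third is orthogonal to the first and stays unique. In this sketch a duplicate may point at a text that is itself a duplicate; whether the real function resolves such chains is not stated here.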
Parameters:
(default: :obj:0.65)
(default: :obj:None)
embeddings: Embeddings of texts. Each element in the list corresponds to the embedding of the text at the same index in texts. (default: :obj:None)
strategy (Literal["top1", "llm-supervise"], optional): The strategy to use for deduplication. (default: :obj:"top1")
(default: :obj:1000)

Returns:
original_texts: The original texts.
unique_ids: The unique ids after deduplication.
unique_embeddings_dict: A dict mapping from (unique) text id to its embedding.
duplicate_to_target_map: A dict mapping from the id of a duplicate text to the id of the text it is considered a duplicate of.

embeddings does not match the length of