
# camel.utils.deduplication

<a id="camel.utils.deduplication" />

<a id="camel.utils.deduplication.DeduplicationResult" />

## DeduplicationResult

```python theme={"system"}
class DeduplicationResult(BaseModel):
```

The result of deduplication.

**Parameters:**

* **original\_texts** (List\[str]): The original texts.
* **unique\_ids** (List\[int]): The indices of the texts that are unique (not duplicates).
* **unique\_embeddings\_dict** (Dict\[int, List\[float]]): A mapping from the index of each unique text to its embedding.
* **duplicate\_to\_target\_map** (Dict\[int, int]): A mapping from the index of the duplicate text to the index of the text it is considered a duplicate of.
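
The fields above can be combined to recover the deduplicated texts and to inspect which text each duplicate was matched to. The following is a minimal sketch: `ResultSketch` is a hypothetical dataclass stand-in that only mirrors the documented field names (it is not the real Pydantic model), and the helper functions and sample values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResultSketch:
    """Hypothetical stand-in that mirrors the documented fields."""

    original_texts: List[str]
    unique_ids: List[int]
    unique_embeddings_dict: Dict[int, List[float]]
    duplicate_to_target_map: Dict[int, int]


def unique_texts(result: ResultSketch) -> List[str]:
    # Keep only the texts whose index is listed as unique.
    return [result.original_texts[i] for i in result.unique_ids]


def describe_duplicates(result: ResultSketch) -> List[str]:
    # Show each duplicate next to the text it was matched to.
    return [
        f"{result.original_texts[dup]!r} -> {result.original_texts[target]!r}"
        for dup, target in sorted(result.duplicate_to_target_map.items())
    ]


# Made-up values, only to show how the fields relate to each other.
result = ResultSketch(
    original_texts=["What is AI?", "What is AI", "Hello"],
    unique_ids=[0, 2],
    unique_embeddings_dict={0: [0.1, 0.2], 2: [0.9, 0.1]},
    duplicate_to_target_map={1: 0},
)

print(unique_texts(result))         # ['What is AI?', 'Hello']
print(describe_duplicates(result))  # ["'What is AI' -> 'What is AI?'"]
```

Every index of `original_texts` appears either in `unique_ids` or as a key of `duplicate_to_target_map` in this sketch, which is how the documented fields read.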

<a id="camel.utils.deduplication.deduplicate_internally" />

## deduplicate\_internally

```python theme={"system"}
def deduplicate_internally(
    texts: List[str],
    threshold: float = 0.65,
    embedding_instance: Optional[BaseEmbedding[str]] = None,
    embeddings: Optional[List[List[float]]] = None,
    strategy: Literal['top1', 'llm-supervise'] = 'top1',
    batch_size: int = 1000
):
```

Deduplicate a list of strings based on their cosine similarity.

You can either:

1. Provide a CAMEL `BaseEmbedding` instance via `embedding_instance` to let
   this function handle the embedding internally, OR
2. Directly pass a list of pre-computed embeddings to `embeddings`.

If both `embedding_instance` and `embeddings` are provided, the function
will raise a ValueError to avoid ambiguous usage.

The `strategy` parameter selects how duplicates are resolved: `'top1'` picks
the text with the highest similarity, while `'llm-supervise'` is meant to use
an LLM to decide whether texts are duplicates (not yet implemented).
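
To make the `'top1'` idea concrete, here is a minimal pure-Python sketch of threshold-based deduplication on pre-computed embeddings. It is an illustration, not CAMEL's code: the names `cosine` and `top1_sketch` are hypothetical, and it assumes that each text is compared only with the texts before it and that a similarity strictly greater than `threshold` counts as a duplicate. It also ignores batching.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity of two vectors; 0.0 if either has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_sketch(
    embeddings: List[List[float]], threshold: float = 0.65
) -> Tuple[List[int], Dict[int, int]]:
    unique_ids: List[int] = []
    duplicate_to_target: Dict[int, int] = {}
    for i, emb in enumerate(embeddings):
        # Assumption: only earlier texts are candidate targets.
        best_j, best_sim = -1, threshold
        for j in range(i):
            sim = cosine(emb, embeddings[j])
            if sim > best_sim:  # keep the single most similar earlier text
                best_j, best_sim = j, sim
        if best_j >= 0:
            duplicate_to_target[i] = best_j
        else:
            unique_ids.append(i)
    return unique_ids, duplicate_to_target


# Toy 2-D embeddings: the second vector is close to the first.
unique_ids, dup_map = top1_sketch([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(unique_ids)  # [0, 2]
print(dup_map)     # {1: 0}
```

The nested loop is quadratic in the number of texts; processing similarities in batches, as the `batch_size` parameter suggests, is one way to bound memory when the same comparison is done with matrix operations.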

**Parameters:**

* **texts** (List\[str]): The list of texts to be deduplicated.
* **threshold** (float, optional): The similarity threshold for considering two texts as duplicates. (default: `0.65`)
* **embedding\_instance** (Optional\[BaseEmbedding\[str]], optional): A CAMEL embedding instance for automatic embedding. (default: `None`)
* **embeddings** (Optional\[List\[List\[float]]], optional): Pre-computed embeddings of `texts`. Each element in the list corresponds to the embedding of the text in the same index of `texts`. (default: `None`)
* **strategy** (`Literal["top1", "llm-supervise"]`, optional): The strategy to use for deduplication. (default: `"top1"`)
* **batch\_size** (int, optional): The size of the batch to use for calculating cosine similarities. (default: `1000`)

**Returns:**

DeduplicationResult: An object that contains:

* `original_texts`: The original texts.
* `unique_ids`: The unique ids after deduplication.
* `unique_embeddings_dict`: A dict mapping from (unique) text id
  to its embedding.
* `duplicate_to_target_map`: A dict mapping from the id of a
  duplicate text to the id of the text it is considered a duplicate
  of.

**Raises:**

* **NotImplementedError**: If the strategy is not "top1".
* **ValueError**: If neither embeddings nor embedding\_instance is provided, or if both are provided.
* **ValueError**: If the length of `embeddings` does not match the length of `texts`.
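
The documented error conditions can be pictured as a small validation step. The sketch below is an assumption-laden illustration, not the library's code: `check_inputs_sketch` is a hypothetical helper, the order of the checks and the error messages are made up, and it assumes the length check compares `embeddings` against `texts`.

```python
from typing import List, Optional


def check_inputs_sketch(
    texts: List[str],
    embedding_instance: Optional[object] = None,
    embeddings: Optional[List[List[float]]] = None,
    strategy: str = "top1",
) -> None:
    # Only the "top1" strategy is described as implemented.
    if strategy != "top1":
        raise NotImplementedError(f"strategy {strategy!r} is not implemented")
    # Exactly one embedding source must be supplied.
    if embedding_instance is None and embeddings is None:
        raise ValueError("provide embedding_instance or embeddings")
    if embedding_instance is not None and embeddings is not None:
        raise ValueError("provide only one of embedding_instance or embeddings")
    # Assumption: one pre-computed embedding per text.
    if embeddings is not None and len(embeddings) != len(texts):
        raise ValueError("embeddings and texts differ in length")


# A valid call passes silently; an ambiguous one raises ValueError.
check_inputs_sketch(["a", "b"], embeddings=[[1.0], [0.5]])
try:
    check_inputs_sketch(["a"], embedding_instance=object(), embeddings=[[1.0]])
except ValueError as err:
    print(f"rejected: {err}")
```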
