camel.loaders package


Submodules#

camel.loaders.base_io module#

class camel.loaders.base_io.DocxFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) DocxFile[source]#

Creates a DocxFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the docx file.

  • filename (str) – The name of the file.

Returns:

A DocxFile object.

Return type:

DocxFile

class camel.loaders.base_io.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: ABC

Represents an uploaded file comprised of Documents.

Parameters:
  • name (str) – The name of the file.

  • file_id (str) – The unique identifier of the file.

  • metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.

  • docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.

  • raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.

copy() File[source]#

Creates a deep copy of this File.

static create_file(file: BytesIO, filename: str) File[source]#

Reads an uploaded file and returns a File object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

static create_file_from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Reads raw bytes and returns a File object.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

abstract classmethod from_bytes(file: BytesIO, filename: str) File[source]#

Creates a File object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

classmethod from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Creates a File object from raw bytes.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File
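The relationship between the constructors above can be sketched as follows. This is a minimal illustration, not the camel implementation: the class names SketchFile, SketchTxtFile and SketchJsonFile are hypothetical, and both the extension-based dispatch in create_file and the BytesIO wrapping in from_raw_bytes are assumptions about how such an interface is typically wired together.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from typing import Dict, Type


class SketchFile(ABC):
    """Hypothetical stand-in for File; not the camel implementation."""

    def __init__(self, name: str, raw_bytes: bytes = b"") -> None:
        self.name = name
        self.raw_bytes = raw_bytes

    @classmethod
    @abstractmethod
    def from_bytes(cls, file: BytesIO, filename: str) -> "SketchFile":
        """Each concrete subclass parses its own format here."""

    @classmethod
    def from_raw_bytes(cls, raw_bytes: bytes, filename: str) -> "SketchFile":
        # Assumed behaviour: wrap the bytes in a stream and reuse from_bytes.
        return cls.from_bytes(BytesIO(raw_bytes), filename)

    @staticmethod
    def create_file(file: BytesIO, filename: str) -> "SketchFile":
        # Assumed behaviour: pick the subclass from the filename extension.
        extension = filename.rsplit(".", 1)[-1].lower()
        try:
            file_class = _EXTENSIONS[extension]
        except KeyError:
            raise NotImplementedError(
                f"Unsupported file type: {extension}"
            ) from None
        return file_class.from_bytes(file, filename)


class SketchTxtFile(SketchFile):
    @classmethod
    def from_bytes(cls, file: BytesIO, filename: str) -> "SketchTxtFile":
        return cls(name=filename, raw_bytes=file.getvalue())


class SketchJsonFile(SketchFile):
    @classmethod
    def from_bytes(cls, file: BytesIO, filename: str) -> "SketchJsonFile":
        return cls(name=filename, raw_bytes=file.getvalue())


# Hypothetical extension table used by create_file.
_EXTENSIONS: Dict[str, Type[SketchFile]] = {
    "txt": SketchTxtFile,
    "json": SketchJsonFile,
}
```

With this layout, callers that only know the filename go through create_file, while callers that already know the format call from_bytes or from_raw_bytes on the concrete subclass.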

class camel.loaders.base_io.HtmlFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) HtmlFile[source]#

Creates an HtmlFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the html file.

  • filename (str) – The name of the file.

Returns:

An HtmlFile object.

Return type:

HtmlFile

class camel.loaders.base_io.JsonFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) JsonFile[source]#

Creates a JsonFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the json file.

  • filename (str) – The name of the file.

Returns:

A JsonFile object.

Return type:

JsonFile

class camel.loaders.base_io.PdfFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) PdfFile[source]#

Creates a PdfFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the pdf file.

  • filename (str) – The name of the file.

Returns:

A PdfFile object.

Return type:

PdfFile

class camel.loaders.base_io.TxtFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) TxtFile[source]#

Creates a TxtFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the txt file.

  • filename (str) – The name of the file.

Returns:

A TxtFile object.

Return type:

TxtFile

camel.loaders.base_io.strip_consecutive_newlines(text: str) str[source]#

Strips consecutive newlines from a string.

Parameters:

text (str) – The string to strip.

Returns:

The string with consecutive newlines stripped.

Return type:

str
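One plausible way to collapse runs of newlines is a single regular-expression substitution. The sketch below is an assumption about the intended behaviour, not the library's code: the name collapse_newlines is hypothetical, and the choice to also drop whitespace surrounding each newline is a design decision of this sketch.

```python
import re


def collapse_newlines(text: str) -> str:
    """Replace each run of newlines, plus surrounding whitespace, with one newline."""
    return re.sub(r"\s*\n\s*", "\n", text)
```

Text without any newline is returned unchanged, since the pattern requires at least one newline to match.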

camel.loaders.firecrawl_reader module#

class camel.loaders.firecrawl_reader.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#

Bases: object

Firecrawl allows you to turn entire websites into LLM-ready markdown.

Parameters:
  • api_key (Optional[str]) – API key for authenticating with the Firecrawl API.

  • api_url (Optional[str]) – Base URL for the Firecrawl API.

References

https://docs.firecrawl.dev/introduction

check_crawl_job(job_id: str) Dict[source]#

Check the status of a crawl job.

Parameters:

job_id (str) – The ID of the crawl job.

Returns:

The response including status of the crawl job.

Return type:

Dict

Raises:

RuntimeError – If the check process fails.

crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any[source]#

Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.

Parameters:
  • url (str) – The URL to crawl.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.

  • **kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.

Returns:

The crawl job ID or the crawl results if waiting until completion.

Return type:

Any

Raises:

RuntimeError – If the crawling process fails.

map_site(url: str, params: Dict[str, Any] | None = None) list[source]#

Map a website to retrieve all accessible URLs.

Parameters:
  • url (str) – The URL of the site to map.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.

Returns:

A list containing the URLs found on the site.

Return type:

list

Raises:

RuntimeError – If the mapping process fails.

markdown_crawl(url: str) str[source]#

Crawl a URL and all accessible subpages and return the content in Markdown format.

Parameters:

url (str) – The URL to crawl.

Returns:

The content of the URL in Markdown format.

Return type:

str

Raises:

RuntimeError – If the crawling process fails.

scrape(url: str, params: Dict[str, Any] | None = None) Dict[source]#

Scrapes a single URL. This function supports advanced scraping through additional parameters and returns the full scraped data as a dictionary.

Reference: https://docs.firecrawl.dev/advanced-scraping-guide

Parameters:
  • url (str) – The URL to read.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.

Returns:

The scraped data.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

structured_scrape(url: str, response_format: BaseModel) Dict[source]#

Uses an LLM to extract structured data from the given URL.

Parameters:
  • url (str) – The URL to read.

  • response_format (BaseModel) – A pydantic model whose value types and field descriptions are used by the LLM to generate a structured response. This schema defines the expected output format.

Returns:

The content of the URL.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

camel.loaders.jina_url_reader module#

class camel.loaders.jina_url_reader.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#

Bases: object

URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than that of the UnstructuredIO URL Reader. It can be configured to replace the UnstructuredIO URL Reader in the pipeline.

Parameters:
  • api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.

  • return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. Screenshots are not currently supported. Defaults to JinaReturnFormat.DEFAULT.

  • json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.

  • timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.

  • **kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. They should align with the HTTP header field and value pairs listed in the reference.

References

https://jina.ai/reader

read_content(url: str) str[source]#

Reads the content of a URL and returns it as a string in the configured return format.

Parameters:

url (str) – The URL to read.

Returns:

The content of the URL.

Return type:

str
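The constructor options above map naturally onto HTTP request headers. The sketch below shows one way such a request could be assembled without sending it. The function name build_reader_request is hypothetical, and the endpoint prefix and header names (Authorization, X-Return-Format, Accept, X-Timeout) are assumptions to be checked against the Jina reference above, not a statement of what JinaURLReader does internally.

```python
from typing import Dict, Optional, Tuple

# Assumed endpoint prefix for the Jina reader service.
JINA_READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(
    url: str,
    api_key: Optional[str] = None,
    return_format: str = "default",
    json_response: bool = False,
    timeout: int = 30,
) -> Tuple[str, Dict[str, str]]:
    """Return the request URL and headers for reading `url` (nothing is sent)."""
    headers: Dict[str, str] = {}
    if api_key:
        # Without a key the request is anonymous and rate-limited more tightly.
        headers["Authorization"] = f"Bearer {api_key}"
    if return_format != "default":
        headers["X-Return-Format"] = return_format
    if json_response:
        headers["Accept"] = "application/json"
    headers["X-Timeout"] = str(timeout)
    return JINA_READER_PREFIX + url, headers
```

Keeping request construction separate from sending makes the option handling easy to inspect and test without network access.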

camel.loaders.unstructured_io module#

class camel.loaders.unstructured_io.UnstructuredIO[source]#

Bases: object

A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.

References

https://docs.unstructured.io/

static chunk_elements(elements: List[Element], chunk_type: str, **kwargs) List[Element][source]#

Chunks elements by titles.

Parameters:
  • elements (List[Element]) – List of Element objects to be chunked.

  • chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.

  • **kwargs – Additional keyword arguments for chunking.

Returns:

List of chunked sections.

Return type:

List[Element]

References

https://unstructured-io.github.io/unstructured/
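The idea behind title-based chunking can be illustrated with plain tuples. This is a simplified sketch, not the unstructured library's algorithm: elements are modelled as hypothetical (category, text) pairs, the function name chunk_by_title_sketch is invented for illustration, and any further options the library accepts through **kwargs are ignored.

```python
from typing import List, Tuple

# Hypothetical stand-in for an Element: (category, text).
SketchElement = Tuple[str, str]


def chunk_by_title_sketch(
    elements: List[SketchElement],
) -> List[List[SketchElement]]:
    """Start a new chunk at every element whose category is "Title"."""
    chunks: List[List[SketchElement]] = []
    for element in elements:
        category, _text = element
        # Open a new chunk on a title, or for leading content before any title.
        if category == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append(element)
    return chunks
```

Content that appears before the first title ends up in its own leading chunk, so no element is dropped.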

static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str[source]#

Cleans text data using a variety of cleaning functions provided by the unstructured library.

This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing Unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.

If no cleaning options are provided, a default set of cleaning operations is applied: “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.

Parameters:
  • text (str) – The text to be cleaned.

  • clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (name, parameters) tuples specifying which cleaning options to apply. Each name should match the name of a cleaning function, and each parameters dictionary should contain the keyword arguments for that function. Defaults to None. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’.

Returns:

The cleaned text.

Return type:

str

Raises:

AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.

Notes

Each option name in clean_options must correspond to a valid cleaning brick name from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with that name.

References

https://unstructured-io.github.io/unstructured/
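The shape of clean_options, a list of (name, parameters) pairs, suggests a simple pipeline in which each named cleaner is looked up and applied in turn. The sketch below assumes that reading. The cleaners namespace and its two functions are hypothetical stand-ins written with the standard library, not the unstructured cleaning bricks, and the single-entry default is chosen only to keep the example short.

```python
import re
from types import SimpleNamespace
from typing import Any, Dict, List, Optional, Tuple


def _clean_extra_whitespace(text: str) -> str:
    # Stand-in: collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def _clean_prefix(text: str, pattern: str) -> str:
    # Stand-in: drop a leading regex match, then leading whitespace.
    return re.sub(rf"^{pattern}", "", text).lstrip()


# Hypothetical stand-in for the module that holds the cleaning functions.
cleaners = SimpleNamespace(
    clean_extra_whitespace=_clean_extra_whitespace,
    clean_prefix=_clean_prefix,
)


def apply_clean_options(
    text: str,
    clean_options: Optional[List[Tuple[str, Dict[str, Any]]]] = None,
) -> str:
    """Apply each named cleaner to the text, in list order."""
    if clean_options is None:
        clean_options = [("clean_extra_whitespace", {})]
    for name, params in clean_options:
        # getattr raises AttributeError when the name is not a known cleaner.
        cleaner = getattr(cleaners, name)
        text = cleaner(text, **params)
    return text
```

A call such as apply_clean_options(raw, [("clean_prefix", {"pattern": "SUMMARY:"})]) shows why each entry carries its own parameter dictionary: cleaners that need arguments receive them as keyword arguments, while the rest take an empty dictionary.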

static create_element_from_text(text: str, element_id: str | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | None = None) Element[source]#

Creates a Text element from a given text input, with optional metadata and embeddings.

Parameters:
  • text (str) – The text content for the element.

  • element_id (Optional[str], optional) – Unique identifier for the element. (default: None)

  • embeddings (List[float], optional) – A list of float numbers representing the text embeddings. (default: None)

  • filename (Optional[str], optional) – The name of the file the element is associated with. (default: None)

  • file_directory (Optional[str], optional) – The directory path where the file is located. (default: None)

  • last_modified (Optional[str], optional) – The last modified date of the file. (default: None)

  • filetype (Optional[str], optional) – The type of the file. (default: None)

  • parent_id (Optional[str], optional) – The identifier of the parent element. (default: None)

Returns:

An instance of Text with the provided content and metadata.

Return type:

Element

static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any[source]#

Extracts various types of data from text using functions from unstructured.cleaners.extract.

Parameters:
  • text (str) – Text to extract data from.

  • extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.

  • **kwargs – Additional keyword arguments for specific extraction functions.

Returns:

The extracted data, type depends on extract_type.

Return type:

Any

References

https://unstructured-io.github.io/unstructured/

static parse_bytes(file: IO[bytes], **kwargs: Any) List[Element] | None[source]#

Parses a bytes stream and converts its contents into elements.

Parameters:
  • file (IO[bytes]) – The file in bytes format to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file if successful, otherwise None.

Return type:

Union[List[Element], None]

Notes

Supported file types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://docs.unstructured.io/open-source/core-functionality/partitioning

static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None[source]#

Loads a file or a URL and parses its contents into elements.

Parameters:
  • input_path (str) – Path to the file or URL to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file or URL if successful.

Return type:

Union[List[Element],None]

Raises:

FileNotFoundError – If the file does not exist at the path specified.

Notes

Supported file types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://unstructured-io.github.io/unstructured/

static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any[source]#

Stages elements for various platforms based on the specified staging type.

This function applies multiple staging utilities to format data for different NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations like converting to CSV, DataFrame, dictionary, or formatting for specific platforms like Prodigy, etc.

Parameters:
  • elements (List[Any]) – List of Element objects to be staged.

  • stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.

  • **kwargs – Additional keyword arguments specific to the staging type.

Returns:

Staged data in the format appropriate for the specified staging type.

Return type:

Union[str, List[Dict], Any]

Raises:

ValueError – If the staging type is not supported or a required argument is missing.

References

https://unstructured-io.github.io/unstructured/
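Because the return type depends on the staging type, callers need to know which shape to expect. The sketch below illustrates that with two of the listed staging types, using only the standard library. It is an assumption-laden illustration: stage_sketch is a hypothetical name, elements are modelled as plain dictionaries with "type" and "text" keys, and the CSV column layout is invented rather than taken from unstructured.staging.

```python
import csv
import io
from typing import Any, Dict, List


def stage_sketch(elements: List[Dict[str, str]], stage_type: str) -> Any:
    """Return a list of dicts or a CSV string, depending on stage_type."""
    if stage_type == "convert_to_dict":
        # Copy each element so callers cannot mutate the originals.
        return [dict(element) for element in elements]
    if stage_type == "convert_to_csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["type", "text"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(elements)
        return buffer.getvalue()
    # Mirrors the documented ValueError for unsupported staging types.
    raise ValueError(f"Unsupported stage type: {stage_type}")
```

The same input therefore yields a list for ‘convert_to_dict’ and a single string for ‘convert_to_csv’, which is why the documented return type is a union.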

Module contents#

class camel.loaders.Apify(api_key: str | None = None)[source]#

Bases: object

Apify is a platform that allows you to automate any web workflow.

Parameters:

api_key (Optional[str]) – API key for authenticating with the Apify API.

get_dataset(dataset_id: str) dict | None[source]#

Get a dataset from the Apify platform.

Parameters:

dataset_id (str) – The ID of the dataset to get.

Returns:

The dataset.

Return type:

dict

Raises:

RuntimeError – If the dataset fails to be retrieved.

get_dataset_client(dataset_id: str) DatasetClient[source]#

Get a dataset client from the Apify platform.

Parameters:

dataset_id (str) – The ID of the dataset to get the client for.

Returns:

The dataset client.

Return type:

DatasetClient

Raises:

RuntimeError – If the dataset client fails to be retrieved.

get_dataset_items(dataset_id: str) List[source]#

Get items from a dataset on the Apify platform.

Parameters:

dataset_id (str) – The ID of the dataset to get items from.

Returns:

The items in the dataset.

Return type:

list

Raises:

RuntimeError – If the items fail to be retrieved.

get_datasets(unnamed: bool | None = None, limit: int | None = None, offset: int | None = None, desc: bool | None = None) List[dict][source]#

Get all named datasets from the Apify platform.

Parameters:
  • unnamed (bool, optional) – Whether to include unnamed datasets in the list.

  • limit (int, optional) – How many datasets to retrieve.

  • offset (int, optional) – Which dataset to include first when retrieving the list.

  • desc (bool, optional) – Whether to sort the datasets in descending order based on their modification date.

Returns:

The datasets.

Return type:

List[dict]

Raises:

RuntimeError – If the datasets fail to be retrieved.

run_actor(actor_id: str, run_input: dict | None = None, content_type: str | None = None, build: str | None = None, max_items: int | None = None, memory_mbytes: int | None = None, timeout_secs: int | None = None, webhooks: list | None = None, wait_secs: int | None = None) dict | None[source]#

Run an actor on the Apify platform.

Parameters:
  • actor_id (str) – The ID of the actor to run.

  • run_input (Optional[dict]) – The input data for the actor. Defaults to None.

  • content_type (str, optional) – The content type of the input.

  • build (str, optional) – Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest).

  • max_items (int, optional) – Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit.

  • memory_mbytes (int, optional) – Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor.

  • timeout_secs (int, optional) – Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor.

  • webhooks (list, optional) – Optional webhooks (https://docs.apify.com/webhooks) associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor, you do not have to add it again here.

  • wait_secs (int, optional) – The maximum number of seconds the server waits for finish. If not provided, waits indefinitely.

Returns:

The output data from the actor if successful. Use the ‘defaultDatasetId’ value in the output to retrieve the dataset.

Return type:

Optional[dict]

Raises:

RuntimeError – If the actor fails to run.
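The typical flow is to run an actor and then read the items of the dataset named by ‘defaultDatasetId’. The sketch below demonstrates that flow against an in-memory stand-in so that it runs without credentials. FakeApify, run_and_fetch, the actor ID and every key other than ‘defaultDatasetId’ are hypothetical and exist only for illustration.

```python
from typing import List, Optional


class FakeApify:
    """Hypothetical in-memory stand-in exposing two of the documented methods."""

    def __init__(self) -> None:
        self._datasets = {
            "ds-1": [{"url": "https://example.com", "title": "Example"}],
        }

    def run_actor(
        self, actor_id: str, run_input: Optional[dict] = None
    ) -> Optional[dict]:
        # A real run would start the actor; here a canned result is returned.
        return {"id": "run-1", "defaultDatasetId": "ds-1"}

    def get_dataset_items(self, dataset_id: str) -> List:
        return list(self._datasets[dataset_id])


def run_and_fetch(client, actor_id: str, run_input: Optional[dict] = None) -> List:
    """Run an actor, then fetch the items of its default dataset."""
    run = client.run_actor(actor_id, run_input=run_input)
    if run is None:
        # run_actor is documented as returning Optional[dict].
        return []
    return client.get_dataset_items(run["defaultDatasetId"])


items = run_and_fetch(FakeApify(), "example/actor", {"startUrls": []})
```

Handling the None case explicitly keeps the helper safe to call even when the run produces no output.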

update_dataset(dataset_id: str, name: str) dict[source]#

Update a dataset on the Apify platform.

Parameters:
  • dataset_id (str) – The ID of the dataset to update.

  • name (str) – The new name for the dataset.

Returns:

The updated dataset.

Return type:

dict

Raises:

RuntimeError – If the dataset fails to be updated.

class camel.loaders.ChunkrReader(api_key: str | None = None, url: str | None = 'https://api.chunkr.ai/api/v1/task', timeout: int = 30, **kwargs: Any)[source]#

Bases: object

Chunkr Reader for processing documents and returning content in various formats.

Parameters:
  • api_key (Optional[str], optional) – The API key for Chunkr API. If not provided, it will be retrieved from the environment variable CHUNKR_API_KEY. (default: None)

  • url (Optional[str], optional) – The url to the Chunkr service. (default: https://api.chunkr.ai/api/v1/task)

  • timeout (int, optional) – The maximum time in seconds to wait for the API responses. (default: 30)

  • **kwargs (Any) – Additional keyword arguments for request headers.

get_task_output(task_id: str, max_retries: int = 5) str[source]#

Polls the Chunkr API to check the task status and returns the task result.

Parameters:
  • task_id (str) – The task ID to check the status for.

  • max_retries (int, optional) – Maximum number of retry attempts. (default: 5)

Returns:

The formatted task result in JSON format.

Return type:

str

Raises:
  • ValueError – If the task status cannot be retrieved.

  • RuntimeError – If the maximum number of retries is reached without a successful task completion.
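The polling behaviour described above, with a bounded number of retries and two distinct failure modes, can be sketched as follows. This is a generic illustration rather than the ChunkrReader code: poll_task and its fetch_status callback are hypothetical, as are the "status" and "output" keys, the "Succeeded" status string and the delay_seconds parameter.

```python
import time
from typing import Callable, Dict


def poll_task(
    fetch_status: Callable[[], Dict[str, str]],
    max_retries: int = 5,
    delay_seconds: float = 0.0,
) -> str:
    """Poll until the task succeeds or the retry budget is exhausted."""
    for _attempt in range(max_retries):
        response = fetch_status()
        if "status" not in response:
            # Corresponds to the documented ValueError case.
            raise ValueError("Task status could not be retrieved")
        if response["status"] == "Succeeded":
            return response.get("output", "")
        time.sleep(delay_seconds)
    # Corresponds to the documented RuntimeError case.
    raise RuntimeError(f"Task did not complete after {max_retries} attempts")
```

Passing the status fetcher as a callable keeps the retry logic independent of the HTTP layer, so it can be exercised with canned responses.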

submit_task(file_path: str, model: str = 'Fast', ocr_strategy: str = 'Auto', target_chunk_length: str = '512') str[source]#

Submits a file to the Chunkr API and returns the task ID.

Parameters:
  • file_path (str) – The path to the file to be uploaded.

  • model (str, optional) – The model to be used for the task. (default: Fast)

  • ocr_strategy (str, optional) – The OCR strategy. (default: Auto)

  • target_chunk_length (str, optional) – The target chunk length. (default: 512)

Returns:

The task ID.

Return type:

str

class camel.loaders.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: ABC

Represents an uploaded file comprised of Documents.

Parameters:
  • name (str) – The name of the file.

  • file_id (str) – The unique identifier of the file.

  • metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.

  • docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.

  • raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.

copy() File[source]#

Creates a deep copy of this File.

static create_file(file: BytesIO, filename: str) File[source]#

Reads an uploaded file and returns a File object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

static create_file_from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Reads raw bytes and returns a File object.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

abstract classmethod from_bytes(file: BytesIO, filename: str) File[source]#

Creates a File object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

classmethod from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Creates a File object from raw bytes.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

class camel.loaders.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#

Bases: object

Firecrawl allows you to turn entire websites into LLM-ready markdown.

Parameters:
  • api_key (Optional[str]) – API key for authenticating with the Firecrawl API.

  • api_url (Optional[str]) – Base URL for the Firecrawl API.

References

https://docs.firecrawl.dev/introduction

check_crawl_job(job_id: str) Dict[source]#

Check the status of a crawl job.

Parameters:

job_id (str) – The ID of the crawl job.

Returns:

The response including status of the crawl job.

Return type:

Dict

Raises:

RuntimeError – If the check process fails.

crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any[source]#

Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.

Parameters:
  • url (str) – The URL to crawl.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.

  • **kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.

Returns:

The crawl job ID or the crawl results if waiting until completion.

Return type:

Any

Raises:

RuntimeError – If the crawling process fails.

map_site(url: str, params: Dict[str, Any] | None = None) list[source]#

Map a website to retrieve all accessible URLs.

Parameters:
  • url (str) – The URL of the site to map.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.

Returns:

A list containing the URLs found on the site.

Return type:

list

Raises:

RuntimeError – If the mapping process fails.

markdown_crawl(url: str) str[source]#

Crawl a URL and all accessible subpages and return the content in Markdown format.

Parameters:

url (str) – The URL to crawl.

Returns:

The content of the URL in Markdown format.

Return type:

str

Raises:

RuntimeError – If the crawling process fails.

scrape(url: str, params: Dict[str, Any] | None = None) Dict[source]#

Scrapes a single URL. This function supports advanced scraping through additional parameters and returns the full scraped data as a dictionary.

Reference: https://docs.firecrawl.dev/advanced-scraping-guide

Parameters:
  • url (str) – The URL to read.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.

Returns:

The scraped data.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

structured_scrape(url: str, response_format: BaseModel) Dict[source]#

Uses an LLM to extract structured data from the given URL.

Parameters:
  • url (str) – The URL to read.

  • response_format (BaseModel) – A pydantic model whose value types and field descriptions are used by the LLM to generate a structured response. This schema defines the expected output format.

Returns:

The content of the URL.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

class camel.loaders.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#

Bases: object

URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than that of the UnstructuredIO URL Reader. It can be configured to replace the UnstructuredIO URL Reader in the pipeline.

Parameters:
  • api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.

  • return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. Screenshots are not currently supported. Defaults to JinaReturnFormat.DEFAULT.

  • json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.

  • timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.

  • **kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. They should align with the HTTP header field and value pairs listed in the reference.

References

https://jina.ai/reader

read_content(url: str) str[source]#

Reads the content of a URL and returns it as a string in the configured return format.

Parameters:

url (str) – The URL to read.

Returns:

The content of the URL.

Return type:

str

class camel.loaders.UnstructuredIO[source]#

Bases: object

A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.

References

https://docs.unstructured.io/

static chunk_elements(elements: List[Element], chunk_type: str, **kwargs) List[Element][source]#

Chunks elements by titles.

Parameters:
  • elements (List[Element]) – List of Element objects to be chunked.

  • chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.

  • **kwargs – Additional keyword arguments for chunking.

Returns:

List of chunked sections.

Return type:

List[Element]

References

https://unstructured-io.github.io/unstructured/

static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str[source]#

Cleans text data using a variety of cleaning functions provided by the unstructured library.

This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing Unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.

If no cleaning options are provided, a default set of cleaning operations is applied: “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.

Parameters:
  • text (str) – The text to be cleaned.

  • clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (name, parameters) tuples specifying which cleaning options to apply. Each name should match the name of a cleaning function, and each parameters dictionary should contain the keyword arguments for that function. Defaults to None. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’.

Returns:

The cleaned text.

Return type:

str

Raises:

AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.

Notes

Each option name in clean_options must correspond to a valid cleaning brick name from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with that name.

References

https://unstructured-io.github.io/unstructured/
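The way a list of (name, parameters) tuples can drive a sequence of cleaning steps is sketched below using only the standard library. The two cleaners and the `cleaners` namespace are hypothetical stand-ins, not the unstructured functions of the same names, and looking names up with `getattr` is an assumption about how such a dispatch could work.

```python
import types
from typing import Any, Dict, List, Tuple


def clean_extra_whitespace(text: str) -> str:
    """Stand-in cleaner: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def clean_prefix(text: str, pattern: str) -> str:
    """Stand-in cleaner: drop a literal prefix if the text starts with it."""
    return text[len(pattern):] if text.startswith(pattern) else text


# Hypothetical namespace standing in for the module that holds the cleaners.
cleaners = types.SimpleNamespace(
    clean_extra_whitespace=clean_extra_whitespace,
    clean_prefix=clean_prefix,
)


def apply_clean_options(
    text: str, clean_options: List[Tuple[str, Dict[str, Any]]]
) -> str:
    """Apply each (name, params) pair in order."""
    for name, params in clean_options:
        # getattr raises AttributeError when the name is not a known cleaner.
        func = getattr(cleaners, name)
        text = func(text, **params)
    return text


options = [
    ("clean_prefix", {"pattern": "NOTE: "}),
    ("clean_extra_whitespace", {}),
]
cleaned = apply_clean_options("NOTE: too    many   spaces ", options)
```

Because the options form an ordered list rather than a mapping, the same cleaner can appear more than once and the order of application is explicit.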

static create_element_from_text(text: str, element_id: str | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | None = None) Element[source]#

Creates a Text element from a given text input, with optional metadata and embeddings.

Parameters:
  • text (str) – The text content for the element.

  • element_id (Optional[str], optional) – Unique identifier for the element. (default: None)

  • embeddings (List[float], optional) – A list of float numbers representing the text embeddings. (default: None)

  • filename (Optional[str], optional) – The name of the file the element is associated with. (default: None)

  • file_directory (Optional[str], optional) – The directory path where the file is located. (default: None)

  • last_modified (Optional[str], optional) – The last modified date of the file. (default: None)

  • filetype (Optional[str], optional) – The type of the file. (default: None)

  • parent_id (Optional[str], optional) – The identifier of the parent element. (default: None)

Returns:

An instance of Text with the provided content and metadata.

Return type:

Element

static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any[source]#

Extracts various types of data from text using functions from unstructured.cleaners.extract.

Parameters:
  • text (str) – Text to extract data from.

  • extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.

  • **kwargs – Additional keyword arguments for specific extraction functions.

Returns:

The extracted data; its type depends on extract_type.

Return type:

Any

References

https://unstructured-io.github.io/unstructured/
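Why the return type is Any can be seen in a small standard-library sketch: different extraction types naturally return different shapes. The extractors below are simplified, hypothetical stand-ins written with `re`, not the functions from unstructured.cleaners.extract, and the `EXTRACTORS` table is an assumption made for illustration.

```python
import re
from typing import Any, Callable, Dict, List


def extract_email_address(text: str) -> List[str]:
    """Stand-in extractor: find simple email-like tokens."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)


def extract_text_after(text: str, pattern: str) -> str:
    """Stand-in extractor: text following the first match of pattern."""
    match = re.search(pattern, text)
    return text[match.end():].strip() if match else ""


# Hypothetical lookup table from extract_type to a stand-in function.
EXTRACTORS: Dict[str, Callable[..., Any]] = {
    "extract_email_address": extract_email_address,
    "extract_text_after": extract_text_after,
}


def extract_sketch(text: str, extract_type: str, **kwargs: Any) -> Any:
    """Forward text and keyword arguments to the selected extractor."""
    return EXTRACTORS[extract_type](text, **kwargs)


emails = extract_sketch("Contact: jane@example.com today", "extract_email_address")
tail = extract_sketch(
    "SPEAKER 1: Look at me", "extract_text_after", pattern=r"SPEAKER \d+:"
)
```

Here `emails` is a list of strings while `tail` is a single string, so callers need to know which extract_type they requested before using the result.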

static parse_bytes(file: IO[bytes], **kwargs: Any) List[Element] | None[source]#

Parses a bytes stream and converts its contents into elements.

Parameters:
  • file (IO[bytes]) – The file in bytes format to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file if successful, otherwise None.

Return type:

Union[List[Element], None]

Notes

Supported file types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://docs.unstructured.io/open-source/core-functionality/partitioning

static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None[source]#

Loads a file or a URL and parses its contents into elements.

Parameters:
  • input_path (str) – Path to the file or URL to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file or URL if successful, otherwise None.

Return type:

Union[List[Element], None]

Raises:

FileNotFoundError – If the file does not exist at the path specified.

Notes

Supported file types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://unstructured-io.github.io/unstructured/
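A single string argument that may be either a URL or a local path has to be classified before it can be parsed. The sketch below shows one plausible way to do that with the standard library. `classify_input_path` is a hypothetical helper, and treating only http and https schemes as URLs is an assumption rather than documented behaviour.

```python
import os
from urllib.parse import urlparse


def classify_input_path(input_path: str) -> str:
    """Return "url" or "file"; raise FileNotFoundError for a missing file."""
    parsed = urlparse(input_path)
    # Assumption: only http(s) inputs with a host are treated as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"No such file: {input_path}")
    return "file"


kind = classify_input_path("https://example.com/report.pdf")
```

Checking the scheme explicitly keeps Windows-style paths such as `C:\docs\a.pdf` from being mistaken for URLs, since `urlparse` would otherwise report `c` as their scheme.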

static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any[source]#

Stages elements for various platforms based on the specified staging type.

This function applies the staging utility selected by stage_type to format data for NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations such as converting to CSV, DataFrame, or dictionary, or formatting for specific platforms such as Prodigy.

Parameters:
  • elements (List[Any]) – List of Element objects to be staged.

  • stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.

  • **kwargs – Additional keyword arguments specific to the staging type.

Returns:

Staged data in the format appropriate for the specified staging type.

Return type:

Union[str, List[Dict], Any]

Raises:

ValueError – If the staging type is not supported or a required argument is missing.

References

https://unstructured-io.github.io/unstructured/
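The staging behaviour described above, including the ValueError for unsupported types, can be sketched with the standard library. The two stagers below are hypothetical stand-ins that operate on plain (type, text) tuples instead of Element objects. They do not reproduce the output format of the ‘unstructured.staging’ functions.

```python
import csv
import io
from typing import Any, Callable, Dict, List, Tuple

SimpleElement = Tuple[str, str]  # hypothetical (type, text) stand-in


def convert_to_dict_sketch(elements: List[SimpleElement]) -> List[Dict[str, str]]:
    """Stand-in stager: one dictionary per element."""
    return [{"type": kind, "text": text} for kind, text in elements]


def convert_to_csv_sketch(elements: List[SimpleElement]) -> str:
    """Stand-in stager: CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["type", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(convert_to_dict_sketch(elements))
    return buffer.getvalue()


# Hypothetical lookup table from stage_type to a stand-in stager.
STAGERS: Dict[str, Callable[..., Any]] = {
    "convert_to_dict": convert_to_dict_sketch,
    "convert_to_csv": convert_to_csv_sketch,
}


def stage_sketch(elements: List[SimpleElement], stage_type: str, **kwargs: Any) -> Any:
    """Dispatch to the selected stager; reject unknown staging types."""
    if stage_type not in STAGERS:
        raise ValueError(f"Unsupported stage type: {stage_type}")
    return STAGERS[stage_type](elements, **kwargs)


elements = [("Title", "Intro"), ("NarrativeText", "Hello world")]
as_dicts = stage_sketch(elements, "convert_to_dict")
as_csv = stage_sketch(elements, "convert_to_csv")
```

The same input yields a list of dictionaries for one staging type and a string for another, which is why the documented return type is a union.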