camel.loaders package#

Submodules#

camel.loaders.base_io module#

class camel.loaders.base_io.DocxFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) DocxFile[source]#

Creates a DocxFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the docx file.

  • filename (str) – The name of the file.

Returns:

A DocxFile object.

Return type:

DocxFile

class camel.loaders.base_io.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: ABC

Represents an uploaded file comprised of Documents.

Parameters:
  • name (str) – The name of the file.

  • file_id (str) – The unique identifier of the file.

  • metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.

  • docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.

  • raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.

copy() File[source]#

Creates a deep copy of this File.

static create_file(file: BytesIO, filename: str) File[source]#

Reads an uploaded file and returns a File object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

static create_file_from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Reads raw bytes and returns a File object.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

abstract classmethod from_bytes(file: BytesIO, filename: str) File[source]#

Creates a File object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

classmethod from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Creates a File object from raw bytes.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File
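The relationship between the abstract from_bytes, the from_raw_bytes convenience wrapper, and the create_file factory can be illustrated with a small stand-in hierarchy. This is only a sketch: SketchFile, SketchTxtFile, and create_file_sketch are hypothetical names, and the extension-based dispatch, the hash-based file_id, and the "page_content" key are assumptions that this page does not confirm.

```python
import hashlib
import os
from abc import ABC, abstractmethod
from io import BytesIO
from typing import Any, Dict, List, Optional, Type


class SketchFile(ABC):
    """Hypothetical stand-in that mirrors the documented File constructor."""

    def __init__(
        self,
        name: str,
        file_id: str,
        metadata: Optional[Dict[str, Any]] = None,
        docs: Optional[List[Dict[str, Any]]] = None,
        raw_bytes: bytes = b"",
    ) -> None:
        self.name = name
        self.file_id = file_id
        self.metadata = metadata or {}
        self.docs = docs or []
        self.raw_bytes = raw_bytes

    @classmethod
    @abstractmethod
    def from_bytes(cls, file: BytesIO, filename: str) -> "SketchFile":
        """Parse one file format; every concrete subclass overrides this."""

    @classmethod
    def from_raw_bytes(cls, raw_bytes: bytes, filename: str) -> "SketchFile":
        # Wrap the bytes in a BytesIO and reuse the subclass parser.
        return cls.from_bytes(BytesIO(raw_bytes), filename)


class SketchTxtFile(SketchFile):
    @classmethod
    def from_bytes(cls, file: BytesIO, filename: str) -> "SketchTxtFile":
        data = file.read()
        # Assumption: the identifier is a content hash.
        file_id = hashlib.sha256(data).hexdigest()
        # Assumption: each document is a dict with a "page_content" key.
        docs = [{"page_content": data.decode("utf-8")}]
        return cls(filename, file_id, docs=docs, raw_bytes=data)


# Assumption: the factory picks a subclass from the filename extension.
_BY_EXTENSION: Dict[str, Type[SketchFile]] = {".txt": SketchTxtFile}


def create_file_sketch(file: BytesIO, filename: str) -> SketchFile:
    extension = os.path.splitext(filename)[1].lower()
    if extension not in _BY_EXTENSION:
        raise NotImplementedError(f"Unsupported file type: {extension!r}")
    return _BY_EXTENSION[extension].from_bytes(file, filename)
```

With the documented API, the equivalent calls would be File.create_file(file, filename) for a BytesIO object and File.create_file_from_raw_bytes(raw_bytes, filename) for raw bytes.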

class camel.loaders.base_io.HtmlFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) HtmlFile[source]#

Creates an HtmlFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the html file.

  • filename (str) – The name of the file.

Returns:

An HtmlFile object.

Return type:

HtmlFile

class camel.loaders.base_io.JsonFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) JsonFile[source]#

Creates a JsonFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the json file.

  • filename (str) – The name of the file.

Returns:

A JsonFile object.

Return type:

JsonFile

class camel.loaders.base_io.PdfFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) PdfFile[source]#

Creates a PdfFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the pdf file.

  • filename (str) – The name of the file.

Returns:

A PdfFile object.

Return type:

PdfFile

class camel.loaders.base_io.TxtFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: File

classmethod from_bytes(file: BytesIO, filename: str) TxtFile[source]#

Creates a TxtFile object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the txt file.

  • filename (str) – The name of the file.

Returns:

A TxtFile object.

Return type:

TxtFile

camel.loaders.base_io.strip_consecutive_newlines(text: str) str[source]#

Strips consecutive newlines from a string.

Parameters:

text (str) – The string to strip.

Returns:

The string with consecutive newlines stripped.

Return type:

str
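One plausible reading of "stripping consecutive newlines" is collapsing every run of newlines, together with the whitespace around them, into a single newline. The function below is a sketch under that assumption; strip_consecutive_newlines_sketch is a hypothetical name and the library's actual behavior may differ.

```python
import re


def strip_consecutive_newlines_sketch(text: str) -> str:
    # Collapse each run of newlines (plus surrounding whitespace)
    # into exactly one newline.
    return re.sub(r"\s*\n\s*", "\n", text)


print(repr(strip_consecutive_newlines_sketch("first\n\n\nsecond")))
# prints 'first\nsecond'
```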

camel.loaders.firecrawl_reader module#

class camel.loaders.firecrawl_reader.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#

Bases: object

Firecrawl allows you to turn entire websites into LLM-ready markdown.

Parameters:
  • api_key (Optional[str]) – API key for authenticating with the Firecrawl API.

  • api_url (Optional[str]) – Base URL for the Firecrawl API.

References

https://docs.firecrawl.dev/introduction

check_crawl_job(job_id: str) Dict[source]#

Check the status of a crawl job.

Parameters:

job_id (str) – The ID of the crawl job.

Returns:

The response including status of the crawl job.

Return type:

Dict

Raises:

RuntimeError – If the check process fails.

crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any[source]#

Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.

Parameters:
  • url (str) – The URL to crawl.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.

  • **kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.

Returns:

The crawl job ID, or the crawl results if waiting until completion.

Return type:

Any

Raises:

RuntimeError – If the crawling process fails.
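When crawl returns a job ID rather than the results, the job can be polled with check_crawl_job until it finishes. The loop below sketches that pattern against a fake status function so that it runs offline. The "status" key and the values "completed" and "failed" are assumptions about the response shape, and wait_for_crawl and make_fake_check are hypothetical helpers, not part of the library.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    check: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    max_polls: int = 30,
) -> Dict[str, Any]:
    """Poll `check(job_id)` until the job reports completion."""
    for _ in range(max_polls):
        response = check(job_id)
        # Assumption: the response dict carries a "status" field.
        status = response.get("status")
        if status == "completed":
            return response
        if status == "failed":
            raise RuntimeError(f"Crawl job {job_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl job {job_id} did not finish in time")


def make_fake_check(polls_until_done: int) -> Callable[[str], Dict[str, Any]]:
    """Build a fake status function that completes after N calls."""
    state = {"calls": 0}

    def check(job_id: str) -> Dict[str, Any]:
        state["calls"] += 1
        done = state["calls"] >= polls_until_done
        return {"id": job_id, "status": "completed" if done else "scraping"}

    return check


result = wait_for_crawl(make_fake_check(3), "job-1", poll_interval=0.0)
```

With a real client, the check argument would be the bound method of a Firecrawl instance, for example wait_for_crawl(firecrawl.check_crawl_job, job_id).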

map_site(url: str, params: Dict[str, Any] | None = None) list[source]#

Map a website to retrieve all accessible URLs.

Parameters:
  • url (str) – The URL of the site to map.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.

Returns:

A list containing the URLs found on the site.

Return type:

list

Raises:

RuntimeError – If the mapping process fails.

markdown_crawl(url: str) str[source]#

Crawl a URL and all accessible subpages and return the content in Markdown format.

Parameters:

url (str) – The URL to crawl.

Returns:

The content of the URL in Markdown format.

Return type:

str

Raises:

RuntimeError – If the crawling process fails.

scrape(url: str, params: Dict[str, Any] | None = None) Dict[source]#

Scrapes a single URL. This function supports advanced scraping through additional parameters and returns the full scraped data as a dictionary.

Reference: https://docs.firecrawl.dev/advanced-scraping-guide

Parameters:
  • url (str) – The URL to read.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.

Returns:

The scraped data.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

structured_scrape(url: str, output_schema: BaseModel) Dict[source]#

Uses an LLM to extract structured data from a given URL.

Parameters:
  • url (str) – The URL to read.

  • output_schema (BaseModel) – A pydantic model whose value types and field descriptions the LLM uses to generate a structured response. This schema defines the expected output format.

Returns:

The content of the URL.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

camel.loaders.jina_url_reader module#

class camel.loaders.jina_url_reader.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#

Bases: object

URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than that of the UnstructuredIO URL Reader. It can be configured to replace the UnstructuredIO URL Reader in the pipeline.

Parameters:
  • api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.

  • return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. Screenshots are not currently supported. Defaults to JinaReturnFormat.DEFAULT.

  • json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.

  • timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.

  • **kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. It should align with the HTTP Header field and value pairs listed in the reference.

References

https://jina.ai/reader

read_content(url: str) str[source]#

Reads the content of a URL and returns it as a string in the configured return format.

Parameters:

url (str) – The URL to read.

Returns:

The content of the URL.

Return type:

str
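The constructor arguments plausibly map onto an HTTP request to the Jina Reader service. The sketch below only builds the request URL and headers and sends nothing. The https://r.jina.ai/ prefix and the header names (Authorization, X-Return-Format, X-Timeout, Accept) reflect the public Reader service as best understood and should be treated as assumptions; build_reader_request is a hypothetical helper, not a library function.

```python
from typing import Dict, Optional, Tuple

# Assumption: the Reader service is addressed by prefixing the target URL.
JINA_READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(
    url: str,
    api_key: Optional[str] = None,
    return_format: Optional[str] = None,
    json_response: bool = False,
    timeout: int = 30,
) -> Tuple[str, Dict[str, str]]:
    """Return the (request URL, headers) pair for one read."""
    headers: Dict[str, str] = {"X-Timeout": str(timeout)}
    if api_key:
        # Without a key the request is still valid, at a lower rate limit.
        headers["Authorization"] = f"Bearer {api_key}"
    if return_format:
        headers["X-Return-Format"] = return_format
    if json_response:
        headers["Accept"] = "application/json"
    return JINA_READER_PREFIX + url, headers


request_url, request_headers = build_reader_request(
    "https://example.com", return_format="markdown", json_response=True
)
```

Extra keyword arguments passed to JinaURLReader would, under the same assumption, become additional header entries in this dictionary.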

camel.loaders.unstructured_io module#

class camel.loaders.unstructured_io.UnstructuredIO[source]#

Bases: object

A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.

References

https://docs.unstructured.io/

static chunk_elements(elements: List[Any], chunk_type: str, **kwargs) List[Element][source]#

Chunks elements by titles.

Parameters:
  • elements (List[Element]) – List of Element objects to be chunked.

  • chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.

  • **kwargs – Additional keyword arguments for chunking.

Returns:

List of chunked sections.

Return type:

List[Element]

References

https://unstructured-io.github.io/unstructured/
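The idea behind title-based chunking can be shown without the unstructured library: each title starts a new section, and the elements that follow belong to it. The sketch below uses plain (category, text) tuples as stand-ins for Element objects. chunk_by_title_sketch is a hypothetical name, and the real chunker also applies size limits and other options that are omitted here.

```python
from typing import List, Tuple

# Stand-in for an Element: (category, text).
SketchElement = Tuple[str, str]


def chunk_by_title_sketch(
    elements: List[SketchElement],
) -> List[List[SketchElement]]:
    chunks: List[List[SketchElement]] = []
    for category, text in elements:
        # A title opens a new chunk; so does the very first element.
        if category == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append((category, text))
    return chunks


sections = chunk_by_title_sketch(
    [
        ("Title", "Introduction"),
        ("NarrativeText", "Some opening text."),
        ("Title", "Methods"),
        ("NarrativeText", "First step."),
        ("NarrativeText", "Second step."),
    ]
)
```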

static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str[source]#

Cleans text data using a variety of cleaning functions provided by the unstructured library.

This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.

If no cleaning options are provided, a default set of cleaning operations is applied. These defaults include “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.

Parameters:
  • text (str) – The text to be cleaned.

  • clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (name, parameters) tuples specifying which cleaning options to apply. Each name should match the name of a cleaning function, and each parameters dictionary holds the arguments for that function. Defaults to None. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’.

Returns:

The cleaned text.

Return type:

str

Raises:

AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.

Notes

The names in clean_options must correspond to valid cleaning brick names from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with its name.

References

https://unstructured-io.github.io/unstructured/
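The shape of clean_options, a list of (name, parameters) tuples applied in order, can be sketched with a few simplified stand-in cleaners. The three functions below only imitate their namesakes in the unstructured library and are not its implementations. clean_text_sketch is a hypothetical name, and its two default operations are a shortened version of the documented defaults.

```python
import re
from types import SimpleNamespace
from typing import Any, Dict, List, Optional, Tuple


def clean_extra_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def replace_unicode_quotes(text: str) -> str:
    for curly, straight in (("\u201c", '"'), ("\u201d", '"'),
                            ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, straight)
    return text


def clean_prefix(text: str, pattern: str) -> str:
    return re.sub(rf"^{pattern}", "", text).lstrip()


# Stand-in for the module that holds the cleaning functions.
cleaners = SimpleNamespace(
    clean_extra_whitespace=clean_extra_whitespace,
    replace_unicode_quotes=replace_unicode_quotes,
    clean_prefix=clean_prefix,
)


def clean_text_sketch(
    text: str,
    clean_options: Optional[List[Tuple[str, Dict[str, Any]]]] = None,
) -> str:
    if clean_options is None:
        # Shortened defaults for this sketch.
        clean_options = [
            ("replace_unicode_quotes", {}),
            ("clean_extra_whitespace", {}),
        ]
    for name, params in clean_options:
        # getattr raises AttributeError for an unknown cleaner name.
        func = getattr(cleaners, name)
        text = func(text, **params)
    return text


cleaned = clean_text_sketch(
    "SUMMARY:   a  short   note",
    [("clean_prefix", {"pattern": "SUMMARY:"}), ("clean_extra_whitespace", {})],
)
```

Because the options form an ordered list rather than a dictionary, the same cleaner can appear more than once and the order of application is explicit.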

static create_element_from_text(text: str, element_id: str | UUID | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | UUID | None = None) Element[source]#

Creates a Text element from a given text input, with optional metadata and embeddings.

Parameters:
  • text (str) – The text content for the element.

  • element_id (Optional[Union[str, uuid.UUID]], optional) – Unique identifier for the element. Defaults to None.

  • embeddings (Optional[List[float]], optional) – A list of float numbers representing the text embeddings. Defaults to None.

  • filename (Optional[str], optional) – The name of the file the element is associated with. Defaults to None.

  • file_directory (Optional[str], optional) – The directory path where the file is located. Defaults to None.

  • last_modified (Optional[str], optional) – The last modified date of the file. Defaults to None.

  • filetype (Optional[str], optional) – The type of the file. Defaults to None.

  • parent_id (Optional[Union[str, uuid.UUID]], optional) – The identifier of the parent element. Defaults to None.

Returns:

An instance of Text with the provided content and metadata.

Return type:

Element

static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any[source]#

Extracts various types of data from text using functions from unstructured.cleaners.extract.

Parameters:
  • text (str) – Text to extract data from.

  • extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.

  • **kwargs – Additional keyword arguments for specific extraction functions.

Returns:

The extracted data, type depends on extract_type.

Return type:

Any

References

https://unstructured-io.github.io/unstructured/
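The extract_type argument selects an extraction function by name, and **kwargs are forwarded to it. The sketch below shows that dispatch with two simplified, regex-based stand-ins. The function names carry a _sketch suffix to mark them as hypothetical, and the patterns are far less thorough than those in unstructured.cleaners.extract.

```python
import re
from typing import Any, Callable, Dict, List

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_email_address_sketch(text: str) -> List[str]:
    # Return every email-like token, lowercased.
    return [match.lower() for match in _EMAIL.findall(text)]


def extract_text_after_sketch(text: str, pattern: str) -> str:
    # Return whatever follows the first match of `pattern`.
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"Pattern {pattern!r} not found")
    return text[match.end():].strip()


_EXTRACTORS: Dict[str, Callable[..., Any]] = {
    "extract_email_address": extract_email_address_sketch,
    "extract_text_after": extract_text_after_sketch,
}


def extract_data_sketch(text: str, extract_type: str, **kwargs: Any) -> Any:
    # The return type depends on which extractor is selected.
    return _EXTRACTORS[extract_type](text, **kwargs)


emails = extract_data_sketch(
    "Contact Me@Example.com or you@test.org.", "extract_email_address"
)
speech = extract_data_sketch(
    "SPEAKER 1: hello", "extract_text_after", pattern=r"SPEAKER \d:"
)
```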

static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None[source]#

Loads a file or a URL and parses its contents into elements.

Parameters:
  • input_path (str) – Path to the file or URL to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file or URL, if successful.

Return type:

Union[List[Element],None]

Raises:

FileNotFoundError – If the file does not exist at the path specified.

Notes

Available document types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://unstructured-io.github.io/unstructured/
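Since input_path may be either a local path or a URL, a loader has to tell the two apart before parsing. The sketch below shows one way to do that with the standard library. classify_input_sketch is a hypothetical helper, and treating only http and https schemes as URLs is an assumption rather than the library's documented rule.

```python
import os
from urllib.parse import urlparse


def classify_input_sketch(input_path: str) -> str:
    """Return "url" or "file"; raise if a local file is missing."""
    parsed = urlparse(input_path)
    # Assumption: only http(s) addresses with a host count as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if not os.path.isfile(input_path):
        raise FileNotFoundError(f"No file found at {input_path!r}")
    return "file"


kind = classify_input_sketch("https://example.com/report.pdf")
```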

static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any[source]#

Stages elements for various platforms based on the specified staging type.

This function applies multiple staging utilities to format data for different NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations like converting to CSV, DataFrame, dictionary, or formatting for specific platforms like Prodigy, etc.

Parameters:
  • elements (List[Any]) – List of Element objects to be staged.

  • stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.

  • **kwargs – Additional keyword arguments specific to the staging type.

Returns:

Staged data in the format appropriate for the specified staging type.

Return type:

Union[str, List[Dict], Any]

Raises:

ValueError – If the staging type is not supported or a required argument is missing.

References

https://unstructured-io.github.io/unstructured/
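Staging amounts to selecting a converter by name and rejecting names that are not supported. The sketch below imitates two of the listed staging types on a minimal stand-in element. SketchElement and stage_sketch are hypothetical, and the output columns are an assumption that is much simpler than what unstructured.staging produces.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass
class SketchElement:
    """Stand-in for an Element with just a category and its text."""
    type: str
    text: str


def stage_sketch(elements: List[SketchElement], stage_type: str) -> Any:
    if stage_type == "convert_to_dict":
        return [asdict(element) for element in elements]
    if stage_type == "convert_to_csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["type", "text"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(asdict(element) for element in elements)
        return buffer.getvalue()
    # Mirrors the documented ValueError for unsupported staging types.
    raise ValueError(f"Unsupported stage type: {stage_type!r}")


rows = stage_sketch([SketchElement("Title", "Hi")], "convert_to_dict")
```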

Module contents#

class camel.loaders.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#

Bases: ABC

Represents an uploaded file comprised of Documents.

Parameters:
  • name (str) – The name of the file.

  • file_id (str) – The unique identifier of the file.

  • metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.

  • docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.

  • raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.

copy() File[source]#

Creates a deep copy of this File.

static create_file(file: BytesIO, filename: str) File[source]#

Reads an uploaded file and returns a File object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

static create_file_from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Reads raw bytes and returns a File object.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

abstract classmethod from_bytes(file: BytesIO, filename: str) File[source]#

Creates a File object from a BytesIO object.

Parameters:
  • file (BytesIO) – A BytesIO object representing the contents of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

classmethod from_raw_bytes(raw_bytes: bytes, filename: str) File[source]#

Creates a File object from raw bytes.

Parameters:
  • raw_bytes (bytes) – The raw bytes content of the file.

  • filename (str) – The name of the file.

Returns:

A File object.

Return type:

File

class camel.loaders.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#

Bases: object

Firecrawl allows you to turn entire websites into LLM-ready markdown.

Parameters:
  • api_key (Optional[str]) – API key for authenticating with the Firecrawl API.

  • api_url (Optional[str]) – Base URL for the Firecrawl API.

References

https://docs.firecrawl.dev/introduction

check_crawl_job(job_id: str) Dict[source]#

Check the status of a crawl job.

Parameters:

job_id (str) – The ID of the crawl job.

Returns:

The response including status of the crawl job.

Return type:

Dict

Raises:

RuntimeError – If the check process fails.

crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any[source]#

Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.

Parameters:
  • url (str) – The URL to crawl.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.

  • **kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.

Returns:

The crawl job ID, or the crawl results if waiting until completion.

Return type:

Any

Raises:

RuntimeError – If the crawling process fails.

map_site(url: str, params: Dict[str, Any] | None = None) list[source]#

Map a website to retrieve all accessible URLs.

Parameters:
  • url (str) – The URL of the site to map.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.

Returns:

A list containing the URLs found on the site.

Return type:

list

Raises:

RuntimeError – If the mapping process fails.

markdown_crawl(url: str) str[source]#

Crawl a URL and all accessible subpages and return the content in Markdown format.

Parameters:

url (str) – The URL to crawl.

Returns:

The content of the URL in Markdown format.

Return type:

str

Raises:

RuntimeError – If the crawling process fails.

scrape(url: str, params: Dict[str, Any] | None = None) Dict[source]#

Scrapes a single URL. This function supports advanced scraping through additional parameters and returns the full scraped data as a dictionary.

Reference: https://docs.firecrawl.dev/advanced-scraping-guide

Parameters:
  • url (str) – The URL to read.

  • params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.

Returns:

The scraped data.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

structured_scrape(url: str, output_schema: BaseModel) Dict[source]#

Uses an LLM to extract structured data from a given URL.

Parameters:
  • url (str) – The URL to read.

  • output_schema (BaseModel) – A pydantic model whose value types and field descriptions the LLM uses to generate a structured response. This schema defines the expected output format.

Returns:

The content of the URL.

Return type:

Dict

Raises:

RuntimeError – If the scrape process fails.

class camel.loaders.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#

Bases: object

URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than that of the UnstructuredIO URL Reader. It can be configured to replace the UnstructuredIO URL Reader in the pipeline.

Parameters:
  • api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.

  • return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. Screenshots are not currently supported. Defaults to JinaReturnFormat.DEFAULT.

  • json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.

  • timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.

  • **kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. It should align with the HTTP Header field and value pairs listed in the reference.

References

https://jina.ai/reader

read_content(url: str) str[source]#

Reads the content of a URL and returns it as a string in the configured return format.

Parameters:

url (str) – The URL to read.

Returns:

The content of the URL.

Return type:

str

class camel.loaders.UnstructuredIO[source]#

Bases: object

A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.

References

https://docs.unstructured.io/

static chunk_elements(elements: List[Any], chunk_type: str, **kwargs) List[Element][source]#

Chunks elements by titles.

Parameters:
  • elements (List[Element]) – List of Element objects to be chunked.

  • chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.

  • **kwargs – Additional keyword arguments for chunking.

Returns:

List of chunked sections.

Return type:

List[Element]

References

https://unstructured-io.github.io/unstructured/

static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str[source]#

Cleans text data using a variety of cleaning functions provided by the unstructured library.

This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.

If no cleaning options are provided, a default set of cleaning operations is applied. These defaults include “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.

Parameters:
  • text (str) – The text to be cleaned.

  • clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (name, parameters) tuples specifying which cleaning options to apply. Each name should match the name of a cleaning function, and each parameters dictionary holds the arguments for that function. Defaults to None. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’.

Returns:

The cleaned text.

Return type:

str

Raises:

AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.

Notes

The names in clean_options must correspond to valid cleaning brick names from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with its name.

References

https://unstructured-io.github.io/unstructured/

static create_element_from_text(text: str, element_id: str | UUID | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | UUID | None = None) Element[source]#

Creates a Text element from a given text input, with optional metadata and embeddings.

Parameters:
  • text (str) – The text content for the element.

  • element_id (Optional[Union[str, uuid.UUID]], optional) – Unique identifier for the element. Defaults to None.

  • embeddings (Optional[List[float]], optional) – A list of float numbers representing the text embeddings. Defaults to None.

  • filename (Optional[str], optional) – The name of the file the element is associated with. Defaults to None.

  • file_directory (Optional[str], optional) – The directory path where the file is located. Defaults to None.

  • last_modified (Optional[str], optional) – The last modified date of the file. Defaults to None.

  • filetype (Optional[str], optional) – The type of the file. Defaults to None.

  • parent_id (Optional[Union[str, uuid.UUID]], optional) – The identifier of the parent element. Defaults to None.

Returns:

An instance of Text with the provided content and metadata.

Return type:

Element

static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any[source]#

Extracts various types of data from text using functions from unstructured.cleaners.extract.

Parameters:
  • text (str) – Text to extract data from.

  • extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.

  • **kwargs – Additional keyword arguments for specific extraction functions.

Returns:

The extracted data, type depends on extract_type.

Return type:

Any

References

https://unstructured-io.github.io/unstructured/

static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None[source]#

Loads a file or a URL and parses its contents into elements.

Parameters:
  • input_path (str) – Path to the file or URL to be parsed.

  • **kwargs – Extra kwargs passed to the partition function.

Returns:

List of elements after parsing the file or URL, if successful.

Return type:

Union[List[Element],None]

Raises:

FileNotFoundError – If the file does not exist at the path specified.

Notes

Available document types:

“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.

References

https://unstructured-io.github.io/unstructured/

static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any[source]#

Stages elements for various platforms based on the specified staging type.

This function applies multiple staging utilities to format data for different NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations like converting to CSV, DataFrame, dictionary, or formatting for specific platforms like Prodigy, etc.

Parameters:
  • elements (List[Any]) – List of Element objects to be staged.

  • stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.

  • **kwargs – Additional keyword arguments specific to the staging type.

Returns:

Staged data in the format appropriate for the specified staging type.

Return type:

Union[str, List[Dict], Any]

Raises:

ValueError – If the staging type is not supported or a required argument is missing.

References

https://unstructured-io.github.io/unstructured/