camel.loaders package#
Submodules#
camel.loaders.base_io module#
- class camel.loaders.base_io.DocxFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
File
- class camel.loaders.base_io.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
ABC
Represents an uploaded file comprised of Documents.
- Parameters:
name (str) – The name of the file.
file_id (str) – The unique identifier of the file.
metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.
docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.
raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.
- class camel.loaders.base_io.HtmlFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
File
- class camel.loaders.base_io.JsonFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
File
- class camel.loaders.base_io.PdfFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
File
- class camel.loaders.base_io.TxtFile(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
File
- camel.loaders.base_io.create_file(file: BytesIO, filename: str) File [source]#
Reads an uploaded file and returns a File object.
- Parameters:
file (BytesIO) – A BytesIO object representing the contents of the file.
filename (str) – The name of the file.
- Returns:
A File object.
- Return type:
File
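A minimal sketch of preparing input for create_file, which expects a BytesIO object rather than raw bytes. The helper name file_from_bytes is hypothetical, and create_file is passed in as an argument so the snippet runs without the camel package installed.

```python
from io import BytesIO
from typing import Any, Callable


def file_from_bytes(
    data: bytes,
    filename: str,
    create_file: Callable[[BytesIO, str], Any],
) -> Any:
    """Wrap raw bytes in a BytesIO buffer and hand it to ``create_file``."""
    buffer = BytesIO(data)  # create_file expects a BytesIO, not raw bytes
    return create_file(buffer, filename)


# Intended use (requires the camel package to be installed):
#   from camel.loaders.base_io import create_file
#   txt_file = file_from_bytes(b"hello", "note.txt", create_file)
```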
camel.loaders.firecrawl_reader module#
- class camel.loaders.firecrawl_reader.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#
Bases:
object
Firecrawl allows you to turn entire websites into LLM-ready markdown.
- Parameters:
api_key (Optional[str]) – API key for authenticating with the Firecrawl API.
api_url (Optional[str]) – Base URL for the Firecrawl API.
References
https://docs.firecrawl.dev/introduction
- check_crawl_job(job_id: str) Dict [source]#
Check the status of a crawl job.
- Parameters:
job_id (str) – The ID of the crawl job.
- Returns:
The response including status of the crawl job.
- Return type:
Dict
- Raises:
RuntimeError – If the check process fails.
- crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any [source]#
Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.
- Parameters:
url (str) – The URL to crawl.
params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.
**kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.
- Returns:
The crawl job ID, or the crawl results if waiting until completion.
- Return type:
Any
- Raises:
RuntimeError – If the crawling process fails.
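When crawl returns a job ID, check_crawl_job can be polled until the job finishes. The sketch below shows one way to do that. The helper wait_for_crawl, its poll_interval and max_polls parameters, and the assumption that the response holds a "status" key equal to "completed" are all illustrative and not confirmed by this reference.

```python
import time
from typing import Any, Dict


def wait_for_crawl(
    firecrawl: Any,
    job_id: str,
    poll_interval: float = 2.0,
    max_polls: int = 30,
) -> Dict[str, Any]:
    """Poll check_crawl_job until the job reports completion or polls run out."""
    for _ in range(max_polls):
        response = firecrawl.check_crawl_job(job_id)
        # Assumption: the response carries a "status" key whose value is
        # "completed" once the crawl has finished.
        if response.get("status") == "completed":
            return response
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl job {job_id} did not complete in time")
```

Here firecrawl is expected to be a Firecrawl instance; any object with a compatible check_crawl_job method also works, which keeps the helper easy to test.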
- map_site(url: str, params: Dict[str, Any] | None = None) list [source]#
Map a website to retrieve all accessible URLs.
- Parameters:
url (str) – The URL of the site to map.
params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.
- Returns:
A list containing the URLs found on the site.
- Return type:
list
- Raises:
RuntimeError – If the mapping process fails.
- scrape(url: str, params: Dict[str, Any] | None = None) Dict [source]#
Scrapes a single URL. This function supports advanced scraping by setting different parameters and returns the full scraped data as a dictionary.
Reference: https://docs.firecrawl.dev/advanced-scraping-guide
- Parameters:
url (str) – The URL to read.
params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.
- Returns:
The scraped data.
- Return type:
Dict
- Raises:
RuntimeError – If the scrape process fails.
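Because scrape raises RuntimeError on failure, callers that process many URLs may prefer to catch the error and continue. The following is a small sketch of that pattern; try_scrape is a hypothetical helper, not part of camel.

```python
from typing import Any, Dict, Optional


def try_scrape(
    firecrawl: Any,
    url: str,
    params: Optional[Dict[str, Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Return the scraped data, or None if scrape raises RuntimeError."""
    try:
        return firecrawl.scrape(url, params=params)
    except RuntimeError as exc:
        # Report the failure and let the caller decide how to proceed.
        print(f"Scrape failed for {url}: {exc}")
        return None
```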
- structured_scrape(url: str, response_format: BaseModel) Dict [source]#
Uses an LLM to extract structured data from the given URL.
- Parameters:
url (str) – The URL to read.
response_format (BaseModel) – A pydantic model that includes value types and field descriptions used to generate a structured response by LLM. This schema helps in defining the expected output format.
- Returns:
The content of the URL.
- Return type:
Dict
- Raises:
RuntimeError – If the scrape process fails.
camel.loaders.jina_url_reader module#
- class camel.loaders.jina_url_reader.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#
Bases:
object
URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than the URL Reader of UnstructuredIO. Can be configured to replace the UnstructuredIO URL Reader in the pipeline.
- Parameters:
api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.
return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. For now screenshots are not supported. Defaults to JinaReturnFormat.DEFAULT.
json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.
timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.
**kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. It should align with the HTTP Header field and value pairs listed in the reference.
camel.loaders.unstructured_io module#
- class camel.loaders.unstructured_io.UnstructuredIO[source]#
Bases:
object
A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.
- static chunk_elements(elements: List[Element], chunk_type: str, **kwargs) List[Element] [source]#
Chunks elements by titles.
- Parameters:
elements (List[Element]) – List of Element objects to be chunked.
chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.
**kwargs – Additional keyword arguments for chunking.
- Returns:
List of chunked sections.
- Return type:
List[Element]
- static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str [source]#
Cleans text data using a variety of cleaning functions provided by the unstructured library.
This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing Unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.
If no cleaning options are provided, a default set of cleaning operations is applied. These defaults include the operations “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.
- Parameters:
text (str) – The text to be cleaned.
clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (function name, parameters) pairs specifying which cleaning operations to apply. Each name should match a cleaning function, and each parameters dictionary holds the keyword arguments for that function. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’. Defaults to None.
- Returns:
The cleaned text.
- Return type:
str
- Raises:
AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.
Notes
Each function name in clean_options must correspond to a valid cleaning brick name from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with that name.
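The signature types clean_options as a list of (name, parameters) tuples. The sketch below illustrates how such a list can be dispatched by name, and why an unknown name surfaces as AttributeError. It is a stand-in using two simplified, hypothetical cleaners, not the implementation used by camel or unstructured.

```python
import re
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple

# Hypothetical stand-ins for two cleaning functions; the real ones live in
# the unstructured library and may behave differently.
_cleaners = SimpleNamespace(
    clean_extra_whitespace=lambda text: re.sub(r"\s+", " ", text).strip(),
    clean_prefix=lambda text, pattern: re.sub(rf"^{pattern}\s*", "", text),
)


def apply_clean_options(
    text: str, clean_options: List[Tuple[str, Dict[str, Any]]]
) -> str:
    """Apply each (function name, keyword arguments) pair in order."""
    for name, params in clean_options:
        # getattr raises AttributeError when the name matches no cleaner.
        cleaner = getattr(_cleaners, name)
        text = cleaner(text, **params)
    return text


options = [
    ("clean_prefix", {"pattern": "NOTE:"}),
    ("clean_extra_whitespace", {}),
]
cleaned = apply_clean_options("NOTE:   spaced    out   text ", options)
print(cleaned)  # spaced out text
```

The same options list shape is what clean_text_data accepts, for example UnstructuredIO.clean_text_data(text, clean_options=options), assuming the named functions exist in the installed unstructured version.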
- static create_element_from_text(text: str, element_id: str | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | None = None) Element [source]#
Creates a Text element from a given text input, with optional metadata and embeddings.
- Parameters:
text (str) – The text content for the element.
element_id (Optional[str], optional) – Unique identifier for the element. (default: None)
embeddings (List[float], optional) – A list of float numbers representing the text embeddings. (default: None)
filename (Optional[str], optional) – The name of the file the element is associated with. (default: None)
file_directory (Optional[str], optional) – The directory path where the file is located. (default: None)
last_modified (Optional[str], optional) – The last modified date of the file. (default: None)
filetype (Optional[str], optional) – The type of the file. (default: None)
parent_id (Optional[str], optional) – The identifier of the parent element. (default: None)
- Returns:
An instance of Text with the provided content and metadata.
- Return type:
Element
- static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any [source]#
Extracts various types of data from text using functions from unstructured.cleaners.extract.
- Parameters:
text (str) – Text to extract data from.
extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.
**kwargs – Additional keyword arguments for specific extraction functions.
- Returns:
The extracted data, type depends on extract_type.
- Return type:
Any
- static parse_bytes(file: IO[bytes], **kwargs: Any) List[Element] | None [source]#
Parses a bytes stream and converts its contents into elements.
- Parameters:
file (IO[bytes]) – The file in bytes format to be parsed.
**kwargs – Extra kwargs passed to the partition function.
- Returns:
List of elements after parsing the file if successful, otherwise None.
- Return type:
Union[List[Element], None]
Notes
- Supported file types:
“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.
References
https://docs.unstructured.io/open-source/core-functionality/partitioning
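Since parse_bytes takes a binary stream and may return None, a thin wrapper can open the file in binary mode and normalise the result. This is a sketch under those assumptions; parse_path_as_bytes is a hypothetical name, and the parser is injected so the snippet does not require camel or unstructured to run.

```python
from typing import Any, Callable, List, Optional


def parse_path_as_bytes(
    path: str,
    parse_bytes: Callable[..., Optional[List[Any]]],
    **kwargs: Any,
) -> List[Any]:
    """Open a file in binary mode, parse it, and normalise None to []."""
    with open(path, "rb") as stream:  # parse_bytes expects IO[bytes]
        elements = parse_bytes(stream, **kwargs)
    # The documented return value is None when parsing does not succeed.
    return elements if elements is not None else []


# Intended use (requires camel and unstructured to be installed):
#   from camel.loaders import UnstructuredIO
#   elements = parse_path_as_bytes("report.pdf", UnstructuredIO.parse_bytes)
```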
- static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None [source]#
Loads a file or a URL and parses its contents into elements.
- Parameters:
input_path (str) – Path to the file or URL to be parsed.
**kwargs – Extra kwargs passed to the partition function.
- Returns:
List of elements after parsing the file or URL if successful, otherwise None.
- Return type:
Union[List[Element], None]
- Raises:
FileNotFoundError – If the file does not exist at the path specified.
Notes
- Supported file types:
“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.
- static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any [source]#
Stages elements for various platforms based on the specified staging type.
This function applies multiple staging utilities to format data for different NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations like converting to CSV, DataFrame, dictionary, or formatting for specific platforms like Prodigy, etc.
- Parameters:
elements (List[Any]) – List of Element objects to be staged.
stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.
**kwargs – Additional keyword arguments specific to the staging type.
- Returns:
Staged data in the format appropriate for the specified staging type.
- Return type:
Union[str, List[Dict], Any]
- Raises:
ValueError – If the staging type is not supported or a required argument is missing.
Module contents#
- class camel.loaders.Apify(api_key: str | None = None)[source]#
Bases:
object
Apify is a platform that allows you to automate any web workflow.
- Parameters:
api_key (Optional[str]) – API key for authenticating with the Apify API.
- get_dataset(dataset_id: str) dict | None [source]#
Get a dataset from the Apify platform.
- Parameters:
dataset_id (str) – The ID of the dataset to get.
- Returns:
The dataset.
- Return type:
dict
- Raises:
RuntimeError – If the dataset fails to be retrieved.
- get_dataset_client(dataset_id: str) DatasetClient [source]#
Get a dataset client from the Apify platform.
- Parameters:
dataset_id (str) – The ID of the dataset to get the client for.
- Returns:
The dataset client.
- Return type:
DatasetClient
- Raises:
RuntimeError – If the dataset client fails to be retrieved.
- get_dataset_items(dataset_id: str) List [source]#
Get items from a dataset on the Apify platform.
- Parameters:
dataset_id (str) – The ID of the dataset to get items from.
- Returns:
The items in the dataset.
- Return type:
list
- Raises:
RuntimeError – If the items fail to be retrieved.
- get_datasets(unnamed: bool | None = None, limit: int | None = None, offset: int | None = None, desc: bool | None = None) List[dict] [source]#
Get all named datasets from the Apify platform.
- Parameters:
unnamed (bool, optional) – Whether to include unnamed datasets in the list.
limit (int, optional) – How many datasets to retrieve.
offset (int, optional) – Which dataset to include first when retrieving the list.
desc (bool, optional) – Whether to sort the datasets in descending order based on their modification date.
- Returns:
The datasets.
- Return type:
List[dict]
- Raises:
RuntimeError – If the datasets fail to be retrieved.
- run_actor(actor_id: str, run_input: dict | None = None, content_type: str | None = None, build: str | None = None, max_items: int | None = None, memory_mbytes: int | None = None, timeout_secs: int | None = None, webhooks: list | None = None, wait_secs: int | None = None) dict | None [source]#
Run an actor on the Apify platform.
- Parameters:
actor_id (str) – The ID of the actor to run.
run_input (Optional[dict]) – The input data for the actor. Defaults to None.
content_type (str, optional) – The content type of the input.
build (str, optional) – Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest).
max_items (int, optional) – Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit.
memory_mbytes (int, optional) – Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor.
timeout_secs (int, optional) – Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor.
webhooks (list, optional) – Optional webhooks (https://docs.apify.com/webhooks) associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor, you do not have to add it again here.
wait_secs (int, optional) – The maximum number of seconds the server waits for finish. If not provided, waits indefinitely.
- Returns:
The output data from the actor if successful. Use the ‘defaultDatasetId’ field of the output to retrieve the resulting dataset.
- Return type:
Optional[dict]
- Raises:
RuntimeError – If the actor fails to run.
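A typical flow is to run an actor and then read its results from the dataset named by ‘defaultDatasetId’. The sketch below chains run_actor and get_dataset_items; the helper name run_actor_and_fetch_items is hypothetical, and apify is expected to be an Apify instance or any object with compatible methods.

```python
from typing import Any, Dict, List, Optional


def run_actor_and_fetch_items(
    apify: Any,
    actor_id: str,
    run_input: Optional[Dict[str, Any]] = None,
) -> List[Any]:
    """Run an actor, then fetch the items of its default dataset."""
    run = apify.run_actor(actor_id, run_input=run_input)
    if run is None:
        # run_actor is documented as returning Optional[dict].
        return []
    dataset_id = run["defaultDatasetId"]
    return apify.get_dataset_items(dataset_id)
```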
- update_dataset(dataset_id: str, name: str) dict [source]#
Update a dataset on the Apify platform.
- Parameters:
dataset_id (str) – The ID of the dataset to update.
name (str) – The new name for the dataset.
- Returns:
The updated dataset.
- Return type:
dict
- Raises:
RuntimeError – If the dataset fails to be updated.
- class camel.loaders.ChunkrReader(api_key: str | None = None, url: str | None = 'https://api.chunkr.ai/api/v1/task', timeout: int = 30, **kwargs: Any)[source]#
Bases:
object
Chunkr Reader for processing documents and returning content in various formats.
- Parameters:
api_key (Optional[str], optional) – The API key for Chunkr API. If not provided, it will be retrieved from the environment variable CHUNKR_API_KEY. (default: None)
url (Optional[str], optional) – The URL of the Chunkr service. (default: https://api.chunkr.ai/api/v1/task)
timeout (int, optional) – The maximum time in seconds to wait for the API responses. (default: 30)
**kwargs (Any) – Additional keyword arguments for request headers.
- get_task_output(task_id: str, max_retries: int = 5) str [source]#
Polls the Chunkr API to check the task status and returns the task result.
- Parameters:
task_id (str) – The task ID to check the status for.
max_retries (int, optional) – Maximum number of retry attempts. (default: 5)
- Returns:
The formatted task result in JSON format.
- Return type:
str
- Raises:
ValueError – If the task status cannot be retrieved.
RuntimeError – If the maximum number of retries is reached without a successful task completion.
- submit_task(file_path: str, model: str = 'Fast', ocr_strategy: str = 'Auto', target_chunk_length: str = '512') str [source]#
Submits a file to the Chunkr API and returns the task ID.
- Parameters:
file_path (str) – The path to the file to be uploaded.
model (str, optional) – The model to be used for the task. (default: Fast)
ocr_strategy (str, optional) – The OCR strategy. (default: Auto)
target_chunk_length (str, optional) – The target chunk length. (default: 512)
- Returns:
The task ID.
- Return type:
str
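The two ChunkrReader methods are meant to be used together: submit_task returns a task ID, and get_task_output returns the result for that ID as a string. A minimal sketch follows, assuming the returned string is valid JSON as the description suggests; chunk_document is a hypothetical helper name.

```python
import json
from typing import Any


def chunk_document(reader: Any, file_path: str, max_retries: int = 5) -> Any:
    """Submit a file, wait for the task output, and decode it from JSON."""
    task_id = reader.submit_task(file_path)
    raw_output = reader.get_task_output(task_id, max_retries=max_retries)
    # Assumption: the output string is JSON and can be decoded directly.
    return json.loads(raw_output)
```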
- class camel.loaders.File(name: str, file_id: str, metadata: Dict[str, Any] | None = None, docs: List[Dict[str, Any]] | None = None, raw_bytes: bytes = b'')[source]#
Bases:
ABC
Represents an uploaded file comprised of Documents.
- Parameters:
name (str) – The name of the file.
file_id (str) – The unique identifier of the file.
metadata (Dict[str, Any], optional) – Additional metadata associated with the file. Defaults to None.
docs (List[Dict[str, Any]], optional) – A list of documents contained within the file. Defaults to None.
raw_bytes (bytes, optional) – The raw bytes content of the file. Defaults to b''.
- class camel.loaders.Firecrawl(api_key: str | None = None, api_url: str | None = None)[source]#
Bases:
object
Firecrawl allows you to turn entire websites into LLM-ready markdown.
- Parameters:
api_key (Optional[str]) – API key for authenticating with the Firecrawl API.
api_url (Optional[str]) – Base URL for the Firecrawl API.
References
https://docs.firecrawl.dev/introduction
- check_crawl_job(job_id: str) Dict [source]#
Check the status of a crawl job.
- Parameters:
job_id (str) – The ID of the crawl job.
- Returns:
The response including status of the crawl job.
- Return type:
Dict
- Raises:
RuntimeError – If the check process fails.
- crawl(url: str, params: Dict[str, Any] | None = None, **kwargs: Any) Any [source]#
Crawl a URL and all accessible subpages. Customize the crawl by setting different parameters, and receive the full response or a job ID based on the specified options.
- Parameters:
url (str) – The URL to crawl.
params (Optional[Dict[str, Any]]) – Additional parameters for the crawl request. Defaults to None.
**kwargs (Any) – Additional keyword arguments, such as poll_interval, idempotency_key.
- Returns:
The crawl job ID, or the crawl results if waiting until completion.
- Return type:
Any
- Raises:
RuntimeError – If the crawling process fails.
- map_site(url: str, params: Dict[str, Any] | None = None) list [source]#
Map a website to retrieve all accessible URLs.
- Parameters:
url (str) – The URL of the site to map.
params (Optional[Dict[str, Any]]) – Additional parameters for the map request. Defaults to None.
- Returns:
A list containing the URLs found on the site.
- Return type:
list
- Raises:
RuntimeError – If the mapping process fails.
- scrape(url: str, params: Dict[str, Any] | None = None) Dict [source]#
Scrapes a single URL. This function supports advanced scraping by setting different parameters and returns the full scraped data as a dictionary.
Reference: https://docs.firecrawl.dev/advanced-scraping-guide
- Parameters:
url (str) – The URL to read.
params (Optional[Dict[str, Any]]) – Additional parameters for the scrape request.
- Returns:
The scraped data.
- Return type:
Dict
- Raises:
RuntimeError – If the scrape process fails.
- structured_scrape(url: str, response_format: BaseModel) Dict [source]#
Uses an LLM to extract structured data from the given URL.
- Parameters:
url (str) – The URL to read.
response_format (BaseModel) – A pydantic model that includes value types and field descriptions used to generate a structured response by LLM. This schema helps in defining the expected output format.
- Returns:
The content of the URL.
- Return type:
Dict
- Raises:
RuntimeError – If the scrape process fails.
- class camel.loaders.JinaURLReader(api_key: str | None = None, return_format: JinaReturnFormat = JinaReturnFormat.DEFAULT, json_response: bool = False, timeout: int = 30, **kwargs: Any)[source]#
Bases:
object
URL Reader provided by Jina AI. The output is cleaner and more LLM-friendly than the URL Reader of UnstructuredIO. Can be configured to replace the UnstructuredIO URL Reader in the pipeline.
- Parameters:
api_key (Optional[str], optional) – The API key for Jina AI. If not provided, the reader will have a lower rate limit. Defaults to None.
return_format (JinaReturnFormat, optional) – The level of detail of the returned content, which is optimized for LLMs. For now screenshots are not supported. Defaults to JinaReturnFormat.DEFAULT.
json_response (bool, optional) – Whether to return the response in JSON format. Defaults to False.
timeout (int, optional) – The maximum time in seconds to wait for the page to be rendered. Defaults to 30.
**kwargs (Any) – Additional keyword arguments, including proxies, cookies, etc. It should align with the HTTP Header field and value pairs listed in the reference.
- class camel.loaders.PandaReader(config: Dict[str, Any] | None = None)[source]#
Bases:
object
- load(data: DataFrame | str, *args: Any, **kwargs: Dict[str, Any]) SmartDataframe [source]#
Loads a file or DataFrame and returns a SmartDataframe object.
- Parameters:
data (Union[DataFrame, str]) – The data to load.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The SmartDataframe object.
- Return type:
SmartDataframe
- read_clipboard(*args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads the clipboard contents and returns a DataFrame.
- Parameters:
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_csv(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a CSV file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the CSV file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_excel(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads an Excel file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the Excel file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_feather(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a Feather file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the Feather file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_hdf(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads an HDF file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the HDF file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_html(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads an HTML file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the HTML file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_json(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a JSON file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the JSON file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_orc(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads an ORC file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the ORC file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_parquet(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a Parquet file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the Parquet file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_pickle(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a Pickle file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the Pickle file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_sas(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a SAS file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the SAS file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_sql(*args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a SQL query or database table and returns a DataFrame.
- Parameters:
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_stata(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a Stata file and returns a DataFrame.
- Parameters:
file_path (str) – The path to the Stata file.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
- read_table(file_path: str, *args: Any, **kwargs: Dict[str, Any]) DataFrame [source]#
Reads a table and returns a DataFrame.
- Parameters:
file_path (str) – The path to the table.
*args (Any) – Additional positional arguments.
**kwargs (Dict[str, Any]) – Additional keyword arguments.
- Returns:
The DataFrame object.
- Return type:
DataFrame
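Because PandaReader exposes one read_* method per format, a caller can pick the method from the file extension. The sketch below shows one way to do this. The helper read_by_extension and the extension mapping are hypothetical and cover only some of the methods listed above.

```python
import os
from typing import Any

# Hypothetical mapping from file extension to PandaReader method name.
_READERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
    ".orc": "read_orc",
    ".dta": "read_stata",
}


def read_by_extension(reader: Any, file_path: str, **kwargs: Any) -> Any:
    """Call the read_* method that matches the extension of file_path."""
    extension = os.path.splitext(file_path)[1].lower()
    try:
        method_name = _READERS[extension]
    except KeyError:
        raise ValueError(f"No reader configured for {extension!r}") from None
    return getattr(reader, method_name)(file_path, **kwargs)
```

With camel installed, reader would be a PandaReader instance, for example read_by_extension(PandaReader(), "sales.csv").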
- class camel.loaders.UnstructuredIO[source]#
Bases:
object
A class to handle various functionalities provided by the Unstructured library, including version checking, parsing, cleaning, extracting, staging, chunking data, and integrating with cloud services like S3 and Azure for data connection.
- static chunk_elements(elements: List[Element], chunk_type: str, **kwargs) List[Element] [source]#
Chunks elements by titles.
- Parameters:
elements (List[Element]) – List of Element objects to be chunked.
chunk_type (str) – The type of chunking to apply. Supported types: ‘chunk_by_title’.
**kwargs – Additional keyword arguments for chunking.
- Returns:
List of chunked sections.
- Return type:
List[Element]
- static clean_text_data(text: str, clean_options: List[Tuple[str, Dict[str, Any]]] | None = None) str [source]#
Cleans text data using a variety of cleaning functions provided by the unstructured library.
This function applies multiple text cleaning utilities by calling the unstructured library’s cleaning bricks for operations like replacing Unicode quotes, removing extra whitespace, dashes, non-ascii characters, and more.
If no cleaning options are provided, a default set of cleaning operations is applied. These defaults include the operations “replace_unicode_quotes”, “clean_non_ascii_chars”, “group_broken_paragraphs”, and “clean_extra_whitespace”.
- Parameters:
text (str) – The text to be cleaned.
clean_options (List[Tuple[str, Dict[str, Any]]], optional) – A list of (function name, parameters) pairs specifying which cleaning operations to apply. Each name should match a cleaning function, and each parameters dictionary holds the keyword arguments for that function. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’. Defaults to None.
- Returns:
The cleaned text.
- Return type:
str
- Raises:
AttributeError – If a cleaning option does not correspond to a valid cleaning function in unstructured.
Notes
Each function name in clean_options must correspond to a valid cleaning brick name from the unstructured library. Each brick’s parameters must be provided in the dictionary paired with that name.
- static create_element_from_text(text: str, element_id: str | None = None, embeddings: List[float] | None = None, filename: str | None = None, file_directory: str | None = None, last_modified: str | None = None, filetype: str | None = None, parent_id: str | None = None) Element [source]#
Creates a Text element from a given text input, with optional metadata and embeddings.
- Parameters:
text (str) – The text content for the element.
element_id (Optional[str], optional) – Unique identifier for the element. (default: None)
embeddings (List[float], optional) – A list of float numbers representing the text embeddings. (default: None)
filename (Optional[str], optional) – The name of the file the element is associated with. (default: None)
file_directory (Optional[str], optional) – The directory path where the file is located. (default: None)
last_modified (Optional[str], optional) – The last modified date of the file. (default: None)
filetype (Optional[str], optional) – The type of the file. (default: None)
parent_id (Optional[str], optional) – The identifier of the parent element. (default: None)
- Returns:
An instance of Text with the provided content and metadata.
- Return type:
Element
- static extract_data_from_text(text: str, extract_type: Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number'], **kwargs) Any [source]#
Extracts various types of data from text using functions from unstructured.cleaners.extract.
- Parameters:
text (str) – Text to extract data from.
extract_type (Literal['extract_datetimetz', 'extract_email_address', 'extract_ip_address', 'extract_ip_address_name', 'extract_mapi_id', 'extract_ordered_bullets', 'extract_text_after', 'extract_text_before', 'extract_us_phone_number']) – Type of data to extract.
**kwargs – Additional keyword arguments for specific extraction functions.
- Returns:
The extracted data, type depends on extract_type.
- Return type:
Any
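Why the return type is Any can be seen in a small dispatch sketch: each extract_type maps to a function with its own return type and its own keyword arguments. The code below is a minimal illustration using only the standard library. The `EXTRACTORS` table, the `extract` helper, and both stand-in functions (including the simplified e-mail pattern) are assumptions made for this example, not the functions from unstructured.cleaners.extract.

```python
import re
from typing import Any, Callable, Dict, List


def _extract_email_address(text: str) -> List[str]:
    # Simplified stand-in pattern, not the one used by unstructured.
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


def _extract_text_after(text: str, pattern: str) -> str:
    # Stand-in: return everything after the first match of `pattern`.
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found")
    return text[match.end():].strip()


# Hypothetical lookup table keyed by the extract_type string.
EXTRACTORS: Dict[str, Callable[..., Any]] = {
    "extract_email_address": _extract_email_address,
    "extract_text_after": _extract_text_after,
}


def extract(text: str, extract_type: str, **kwargs: Any) -> Any:
    if extract_type not in EXTRACTORS:
        raise ValueError(f"Unsupported extract_type: {extract_type}")
    # Extra keyword arguments are forwarded to the chosen function.
    return EXTRACTORS[extract_type](text, **kwargs)


emails = extract(
    "Contact me at example@email.com or test@example.org.",
    "extract_email_address",
)
after = extract(
    "SPEAKER 1: Look at me.",
    "extract_text_after",
    pattern=r"SPEAKER \d+:",
)
```

In this sketch the e-mail extractor returns a list of strings while the text-after extractor returns a single string, which is the kind of variation the Any return annotation covers.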
- static parse_bytes(file: IO[bytes], **kwargs: Any) List[Element] | None [source]#
Parses a bytes stream and converts its contents into elements.
- Parameters:
file (IO[bytes]) – The file in bytes format to be parsed.
**kwargs – Extra kwargs passed to the partition function.
- Returns:
List of elements after parsing the file if successful, otherwise None.
- Return type:
Union[List[Element], None]
Notes
- Supported file types:
“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.
References
https://docs.unstructured.io/open-source/core-functionality/partitioning
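A value of type IO[bytes] can be built from in-memory bytes with io.BytesIO. The sketch below only prepares such a stream with the standard library; the commented-out call shows where it would be passed, and the class name UnstructuredIO used there is an assumption, since this section does not name the class.

```python
import io

# Any bytes payload works; this one looks like a small Markdown file.
raw = b"# Title\n\nSome paragraph text.\n"
stream = io.BytesIO(raw)  # satisfies the IO[bytes] parameter type

# With CAMEL and unstructured installed, the stream would be passed as
# the `file` argument (class name assumed, not confirmed here):
# elements = UnstructuredIO.parse_bytes(file=stream)

# A reader consumes the stream from its current position, so rewind
# with seek(0) before handing the same stream to another consumer.
first_read = stream.read()
stream.seek(0)
second_read = stream.read()
```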
- static parse_file_or_url(input_path: str, **kwargs: Any) List[Element] | None [source]#
Loads a file or a URL and parses its contents into elements.
- Parameters:
input_path (str) – Path to the file or URL to be parsed.
**kwargs – Extra kwargs passed to the partition function.
- Returns:
List of elements after parsing the file or URL if successful, otherwise None.
- Return type:
Union[List[Element], None]
- Raises:
FileNotFoundError – If the file does not exist at the path specified.
Notes
- Supported file types:
“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”, “org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.
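Because input_path may be either a URL or a local path, a caller may want to check which one it has before parsing. The following is a minimal sketch of one way to make that distinction with the standard library; `classify_input` is a hypothetical helper and is not claimed to be how the method itself decides.

```python
import os
from urllib.parse import urlparse


def classify_input(input_path: str) -> str:
    """Return "url" or "file", raising if a local path is missing."""
    parsed = urlparse(input_path)
    # Treat the input as a URL only when both a scheme and a network
    # location are present (for example "https" and a host name).
    if parsed.scheme and parsed.netloc:
        return "url"
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"No such file: {input_path}")
    return "file"


kind = classify_input(
    "https://docs.unstructured.io/open-source/core-functionality/partitioning"
)
```

Requiring both a scheme and a network location keeps Windows paths such as `C:\data\report.pdf` from being mistaken for URLs, since they have no network location.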
- static stage_elements(elements: List[Any], stage_type: Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate'], **kwargs) str | List[Dict] | Any [source]#
Stages elements for various platforms based on the specified staging type.
This function applies multiple staging utilities to format data for different NLP annotation and machine learning tools. It uses the ‘unstructured.staging’ module’s functions for operations like converting to CSV, DataFrame, dictionary, or formatting for specific platforms like Prodigy, etc.
- Parameters:
elements (List[Any]) – List of Element objects to be staged.
stage_type (Literal['convert_to_csv', 'convert_to_dataframe', 'convert_to_dict', 'dict_to_elements', 'stage_csv_for_prodigy', 'stage_for_prodigy', 'stage_for_baseplate', 'stage_for_datasaur', 'stage_for_label_box', 'stage_for_label_studio', 'stage_for_weaviate']) – Type of staging to perform.
**kwargs – Additional keyword arguments specific to the staging type.
- Returns:
Staged data in the format appropriate for the specified staging type.
- Return type:
Union[str, List[Dict], Any]
- Raises:
ValueError – If the staging type is not supported or a required argument is missing.
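The relationship between stage_type, the varying return type, and the ValueError can be sketched as a lookup table of staging functions. This is a standard-library illustration only: `FakeElement`, `STAGERS`, `stage`, and the two converters are hypothetical stand-ins, not the functions from unstructured.staging.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FakeElement:
    # Hypothetical stand-in for an unstructured Element.
    type: str
    text: str


def _convert_to_dict(elements: List[FakeElement]) -> List[Dict[str, str]]:
    return [asdict(element) for element in elements]


def _convert_to_csv(elements: List[FakeElement]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["type", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(asdict(element) for element in elements)
    return buffer.getvalue()


# Hypothetical lookup table keyed by the stage_type string.
STAGERS: Dict[str, Callable[..., Any]] = {
    "convert_to_dict": _convert_to_dict,
    "convert_to_csv": _convert_to_csv,
}


def stage(elements: List[FakeElement], stage_type: str, **kwargs: Any) -> Any:
    if stage_type not in STAGERS:
        raise ValueError(f"Unsupported stage type: {stage_type}")
    return STAGERS[stage_type](elements, **kwargs)


elements = [
    FakeElement("Title", "Intro"),
    FakeElement("NarrativeText", "Hello, world"),
]
as_dicts = stage(elements, "convert_to_dict")
as_csv = stage(elements, "convert_to_csv")
```

Here one stage type yields a list of dictionaries and the other a CSV string, which matches the Union[str, List[Dict], Any] return annotation in spirit.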
References