1. Concept
CAMEL introduced two IO modules, Base IO and Unstructured IO, which are designed for handling various file types and processing unstructured data. Additionally, four new data readers were added, Apify Reader, Chunkr Reader, Firecrawl Reader, and Jina_url Reader, which enable retrieval of external data for improved data integration and analysis.
2. Types
2.1. Base IO
The Base IO module focuses on fundamental file input/output operations. It includes functionality for representing, reading, and processing different file formats.
2.2. Unstructured IO
The Unstructured IO module handles the parsing and processing of unstructured data. It provides tools for parsing files or URLs, cleaning data, extracting specific information, staging elements for different platforms, and chunking elements. The core of this module is its ETL capabilities, which turn unstructured data into a form usable by applications such as Retrieval-Augmented Generation (RAG).
2.3. Apify Reader
Apify Reader provides a Python interface to the Apify platform for automating web workflows. It allows users to authenticate via an API key and offers methods to execute and manage actors (automated web tasks) and datasets on the platform.
It includes functionality for client initialization, actor management, and dataset operations.
2.4. Chunkr Reader
Chunkr Reader is a Python client for the Chunkr API, which processes documents and returns content in various formats. It includes functionality for client initialization, task management, and response formatting.
2.5. Firecrawl Reader
Firecrawl Reader provides a Python interface to interact with the Firecrawl API, allowing users to turn websites into large language model (LLM)-ready markdown format.
2.6. Jina_url Reader
JinaURL Reader is a Python client for Jina AI’s URL reading service, optimized to provide cleaner, LLM-friendly content from URLs.
2.7. MarkItDown Reader
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
3. Get Started
3.1. Using Base IO
This module reads files of various formats, extracts their contents, and represents them as File objects, each tailored to a specific file type.
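The idea of one object type per file format can be illustrated with a small standard-library sketch. It is not the Base IO implementation: the names SketchFile, TxtSketchFile, JsonSketchFile, and load_sketch_file are hypothetical, and the "page_content" key is an assumed layout chosen for illustration.

```python
import json
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class SketchFile:
    """Hypothetical stand-in for a parsed file: a name plus extracted docs."""

    name: str
    docs: list = field(default_factory=list)

    @classmethod
    def from_raw_bytes(cls, raw, name):
        raise NotImplementedError("Subclasses implement format-specific parsing")


class TxtSketchFile(SketchFile):
    @classmethod
    def from_raw_bytes(cls, raw, name):
        # Plain text: decode the bytes and keep them as a single document.
        return cls(name=name, docs=[{"page_content": raw.decode("utf-8")}])


class JsonSketchFile(SketchFile):
    @classmethod
    def from_raw_bytes(cls, raw, name):
        # JSON: parse first so that malformed input fails early.
        data = json.loads(raw.decode("utf-8"))
        return cls(name=name, docs=[{"page_content": json.dumps(data, indent=2)}])


# Map each supported extension to the class that knows how to read it.
_READERS = {".txt": TxtSketchFile, ".json": JsonSketchFile}


def load_sketch_file(raw, name):
    """Pick a reader class from the file extension and parse the bytes."""
    suffix = PurePath(name).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return reader.from_raw_bytes(raw, name)


file_obj = load_sketch_file(b"hello world", "notes.txt")
print(type(file_obj).__name__, file_obj.docs[0]["page_content"])
```

Dispatching on the extension keeps each format's parsing logic in its own class, so supporting a new format only requires one new class and one registry entry.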
3.2. Using Unstructured IO
To get started with the Unstructured IO module, first import it and initialize an instance. You can then use the instance for parsing, cleaning, and extracting data, and for integrating with cloud services such as AWS S3 and Azure. Here’s a basic guide to help you begin:
Use parse_file_or_url to load and parse unstructured data from a file or URL.
Use clean_text_data to apply text cleaning operations.
Use extract_data_from_text to extract specific information from text.
Use chunk_elements to chunk the content.
Use stage_elements to stage elements for a target platform.
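To make the clean, extract, and chunk steps concrete, here is a standard-library sketch of the kind of work each step does. It does not call the Unstructured IO module: sketch_clean, sketch_extract_emails, and sketch_chunk are hypothetical helpers, and the cleaning rules and chunk size are assumptions chosen for illustration.

```python
import re

# Curly quotes mapped to their ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def sketch_clean(text):
    """Normalize curly quotes and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(_QUOTES)).strip()


def sketch_extract_emails(text):
    """Return every substring that looks like an email address."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


def sketch_chunk(paragraphs, max_chars):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars still becomes its own chunk.
    """
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


raw = "  \u201cContact\u201d   me@email.com or   you@email.com.  "
cleaned = sketch_clean(raw)
print(cleaned)
print(sketch_extract_emails(cleaned))
print(sketch_chunk(["aaaa", "bbbb", "cc"], max_chars=10))
```

The real module delegates these steps to its underlying parsing library and supports many more options; the sketch only shows the order of operations and the shape of the data passed between them.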
This is a basic guide to get you started with the Unstructured IO module. For more advanced usage, refer to the specific method documentation and the Unstructured IO Documentation.
3.3. Using Apify Reader
Initialize the client and set up the required actor and its input parameters.
Retrieve the resulting dataset ID and access the dataset contents using the get_dataset_items method.
>>>[{'url': 'https://www.camel-ai.org/', 'crawl': {'loadedUrl': 'https://www.camel
-ai.org/', 'loadedTime': '2024-10-27T04:51:16.651Z', 'referrerUrl': 'https://ww
w.camel-ai.org/', 'depth': 0, 'httpStatusCode': 200}, 'metadata': {'canonicalUr
l': 'https://www.camel-ai.org/', 'title': 'CAMEL-AI', 'description': 'CAMEL-AI.
org is the 1st LLM multi-agent framework and an open-source community dedicated
to finding the scaling law of agents.', 'author': None, 'keywords': None, 'lan
guageCode': 'en', 'openGraph': [{'property': 'og:title', 'content': 'CAMEL-AI'
}, {'property': 'og:description', 'content': 'CAMEL-AI.org is the 1st LLM mult
i-agent framework and an open-source community dedicated to finding the scaling
law of agents.'}, {'property': 'twitter:title', 'content': 'CAMEL-AI'}, {'pr
operty': 'twitter:description', 'content': 'CAMEL-AI.org is the 1st LLM multi-
agent framework and an open-source community dedicated to finding the scaling
law of agents.'}, {'property': 'og:type', 'content': 'website'}], 'jsonLd': No
ne, 'headers': {'date': 'Sun, 27 Oct 2024 04:50:18 GMT', 'content-type': 'text
/html', 'cf-ray': '8d901082dae7efbe-PDX', 'cf-cache-status': 'HIT', 'age': '10
81', 'content-encoding': 'gzip', 'last-modified': 'Sat, 26 Oct 2024 11:51:32 G
MT', 'strict-transport-security': 'max-age=31536000', 'surrogate-control': 'ma
x-age=432000', 'surrogate-key': 'www.camel-ai.org 6659a154491a54a40551bc78 pag
eId:6686a2bcb7ece5fb40457491 668181a0a818ade34e653b24 6659a155491a54a40551bd7e
', 'x-lambda-id': 'd6c4424b-ac67-4c54-b52a-cb2a22ca09f0', 'vary': 'Accept-Enco
ding', 'set-cookie': '__cf_bm=oX5EmZ2SNJDOBQXI8dL_reCYlCpp1FMzu40qCNxiopU-1730
004618-1.0.1.1-3teEeqUoemzHWAeGCtlPJVqdmAbiFkyu3JxopKfQFFndSCi_Z56RR.UDjLGZiq.
L_4LvAZYmNKxQ.k6VRhbA7g; path=/; expires=Sun, 27-Oct-24 05:20:18 GMT; domain=.
cdn.webflow.com; HttpOnly; Secure; SameSite=None\n_cfuvid=om_8lj9jNMIh.HEIxEAq
gszhEWaKlyS2kdXKwqGedSM-1730004618924-0.0.1.1-604800000; path=/; domain=.cdn.w
ebflow.com; HttpOnly; Secure; SameSite=None', 'alt-svc': 'h3=":443"; ma=86400'
, 'x-cluster-name': 'us-west-2-prod-hosting-red', 'x-firefox-spdy': 'h2'}}, 's
creenshotUrl': None, 'text': 'Build Multi-Agent Systems for _\nFEATURES & Inte
grations\nSeamless integrations with\npopular platforms \nScroll to explore ou
r features & integrations.', 'markdown': '# Build Multi-Agent Systems for \\_
\n\nFEATURES & Integrations\n\n## Seamless integrations with \npopular platfo
rms\n\nScroll to explore our features & integrations.'}]
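Each dataset item is a nested dictionary, so downstream code usually keeps only a few fields. The sketch below uses a trimmed copy of the item shown above and a hypothetical helper, summarize_items, to keep the URL, title, and Markdown of pages that loaded successfully. The key names come from the sample output, and other actors may return a different layout.

```python
# Trimmed copy of the dataset item shown above (most keys omitted).
items = [
    {
        "url": "https://www.camel-ai.org/",
        "crawl": {"depth": 0, "httpStatusCode": 200},
        "metadata": {"title": "CAMEL-AI", "languageCode": "en"},
        "markdown": "# Build Multi-Agent Systems for \\_\n\nFEATURES & Integrations",
    }
]


def summarize_items(items):
    """Keep url, title, and markdown for pages that returned HTTP 200."""
    rows = []
    for item in items:
        crawl = item.get("crawl") or {}
        metadata = item.get("metadata") or {}
        if crawl.get("httpStatusCode") != 200:
            continue  # Skip pages that failed to load.
        rows.append(
            {
                "url": item.get("url"),
                "title": metadata.get("title"),
                "markdown": item.get("markdown") or "",
            }
        )
    return rows


for row in summarize_items(items):
    print(row["url"], row["title"], len(row["markdown"]))
```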
3.4. Using Firecrawl Reader
Initialize the client and set the URL from which we want to retrieve information. When the status is “completed,” the information retrieval is finished and ready for reading.
Directly retrieve information from the returned results.
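A small guard makes the status check explicit before reading the results. This sketch assumes the response is a dictionary with a "status" field and a "data" list whose entries carry a "markdown" field; that layout is an assumption, so check it against the response you actually receive. The helper read_markdown and the sample response are hypothetical.

```python
def read_markdown(response):
    """Return the markdown of each crawled page once the crawl has completed."""
    status = response.get("status")
    if status != "completed":
        raise RuntimeError(f"Crawl not finished (status={status!r})")
    return [page.get("markdown", "") for page in response.get("data", [])]


# Hypothetical response, shaped as described in the paragraph above.
sample_response = {
    "status": "completed",
    "data": [{"markdown": "# Example page\n\nSome content."}],
}

for markdown in read_markdown(sample_response):
    print(markdown)
```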
3.5. Using Chunkr Reader
Initialize the ChunkrReader and ChunkrReaderConfig. Set the local PDF file path and configuration, then submit the task. Use the generated task ID to fetch the output.
The submit_task and get_task_output methods are asynchronous, so you’ll need to run them using an event loop (e.g., asyncio.run()).
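The calling pattern for two awaitable methods can be sketched as follows. StubChunkrClient is a hypothetical in-memory stand-in that does not contact the Chunkr API; only the method names submit_task and get_task_output come from the text above, and their parameters and return values here are assumptions.

```python
import asyncio
import uuid


class StubChunkrClient:
    """Hypothetical stand-in with the same two awaitable method names."""

    def __init__(self):
        self._tasks = {}

    async def submit_task(self, file_path):
        await asyncio.sleep(0)  # Placeholder for the network round trip.
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {"status": "Succeeded", "file": file_path}
        return task_id

    async def get_task_output(self, task_id):
        await asyncio.sleep(0)  # Placeholder for the network round trip.
        return self._tasks[task_id]


async def main():
    client = StubChunkrClient()
    # Both calls must be awaited inside a coroutine.
    task_id = await client.submit_task("paper.pdf")
    output = await client.get_task_output(task_id)
    return task_id, output


# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
task_id, output = asyncio.run(main())
print(f"Task ID: {task_id}")
print(f"Task Output: {output}")
```

Wrapping both calls in a single coroutine keeps them on one event loop, which avoids calling asyncio.run() twice.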
```
>>>Task ID: 7becf001-6f07-4f63-bddf-5633df363bbb
>>>Task Output:
>>>{
  "task_id": "7becf001-6f07-4f63-bddf-5633df363bbb",
  "status": "Succeeded",
  "created_at": "2024-11-08T12:45:04.260765Z",
  "finished_at": "2024-11-08T12:45:48.942365Z",
  "expires_at": null,
  "message": "Task succeeded",
  "output": {
    "chunks": [
      {
        "segments": [
          {
            "segment_id": "d53ec931-3779-41be-a220-3fe4da2770c5",
            "bbox": { "left": 224.16666, "top": 370.0, "width": 2101.6665, "height": 64.166664 },
            "page_number": 1,
            "page_width": 2550.0,
            "page_height": 3300.0,
            "content": "Large Language Model based Multi-Agents: A Survey of Progress and Challenges",
            "segment_type": "Title",
            "ocr": null,
            "image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/.../d53ec931-3779-41be-a220-3fe4da2770c5.jpg?...",
            "html": "<h1>Large Language Model based Multi-Agents: A Survey of Progress and Challenges</h1>",
            "markdown": "# Large Language Model based Multi-Agents: A Survey of Progress and Challenges\n\n"
          }
        ],
        "chunk_length": 11
      },
      {
        "segments": [
          {
            "segment_id": "7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a",
            "bbox": { "left": 432.49997, "top": 474.16666, "width": 1659.9999, "height": 122.49999 },
            "page_number": 1,
            "page_width": 2550.0,
            "page_height": 3300.0,
            "content": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020",
            "segment_type": "Text",
            "ocr": null,
            "image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/.../7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a.jpg?...",
            "html": "<p>Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020</p>",
            "markdown": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020\n\n"
          }
        ],
        "chunk_length": 100  # Example, actual length may vary
      }
      // ... other chunks and segments truncated for brevity ...
    ]
  }
}
```
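Once the task output is available as a dictionary, the Markdown of every segment can be joined into a single document. The sketch below uses a trimmed copy of the output above (only the fields it reads) and a hypothetical helper, collect_markdown. If the output arrives as a JSON string rather than a dictionary, parse it with json.loads first.

```python
# Trimmed copy of the task output shown above (only the fields used here).
task_output = {
    "status": "Succeeded",
    "output": {
        "chunks": [
            {
                "segments": [
                    {
                        "segment_type": "Title",
                        "markdown": "# Large Language Model based Multi-Agents: "
                        "A Survey of Progress and Challenges\n\n",
                    }
                ]
            },
            {
                "segments": [
                    {
                        "segment_type": "Text",
                        "markdown": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , "
                        "Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , "
                        "Olaf Wiest 1 , Xiangliang Zhang 1 \u2020\n\n",
                    }
                ]
            },
        ]
    },
}


def collect_markdown(task_output):
    """Join the markdown of every segment, in document order."""
    if task_output.get("status") != "Succeeded":
        raise RuntimeError(f"Task not finished: {task_output.get('status')!r}")
    parts = []
    for chunk in task_output["output"]["chunks"]:
        for segment in chunk["segments"]:
            parts.append(segment["markdown"])
    return "".join(parts)


print(collect_markdown(task_output))
```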
3.6. Using Jina_url Reader
Initialize the client and set the URL from which we want to retrieve information, then print the response.
```python
from camel.loaders import JinaURLReader
from camel.types.enums import JinaReturnFormat

# Ask the reader to return the page content as Markdown.
jina_reader = JinaURLReader(return_format=JinaReturnFormat.MARKDOWN)

# Fetch the page and print its LLM-friendly content.
response = jina_reader.read_content("https://docs.camel-ai.org/")
print(response)
```
3.7. Using MarkItDown Reader
Initialize the loader and pass in the path of the file to retrieve information, then print the response.