Loaders
camel.loaders.mineru_extractor
MinerU
Document extraction service supporting OCR, formula recognition and tables.
Parameters:
- api_key (str, optional): Authentication key for the MinerU API service. If not provided, the MINERU_API_KEY environment variable is used. (default: None)
- api_url (str, optional): Base URL endpoint for the MinerU API service. (default: "https://mineru.net/api/v4")
Note:
- Single file size limit: 200MB
- Page limit per file: 600 pages
- Daily high-priority parsing quota: 2000 pages
- Some URLs (GitHub, AWS) may time out due to network restrictions
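The per-file limits in the note can be checked locally before a document is submitted. The following is a minimal sketch, not part of CAMEL: the helper name `check_mineru_limits` and both constants are hypothetical, and it assumes "200MB" means 200 * 1024 * 1024 bytes, which the source does not specify.

```python
from typing import List

# Assumption: "200MB" is interpreted as mebibytes; the source does not say.
MAX_FILE_BYTES = 200 * 1024 * 1024
# Per-file page limit taken from the note above.
MAX_PAGES = 600


def check_mineru_limits(size_bytes: int, page_count: int) -> List[str]:
    """Return human-readable limit violations (an empty list if none).

    Hypothetical helper for illustration; MinerU itself enforces the
    real limits on the server side.
    """
    problems: List[str] = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES} bytes"
        )
    if page_count > MAX_PAGES:
        problems.append(f"file has {page_count} pages, limit is {MAX_PAGES}")
    return problems
```

A caller might skip or split any document for which the returned list is non-empty. The daily quota and network restrictions cannot be checked this way.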
__init__
Initialize MinerU extractor.
Parameters:
- api_key (str, optional): Authentication key for the MinerU API service. If not provided, the MINERU_API_KEY environment variable is used.
- api_url (str, optional): Base URL endpoint for the MinerU API service. (default: "https://mineru.net/api/v4")
- is_ocr (bool, optional): Enable optical character recognition. (default: False)
- enable_formula (bool, optional): Enable formula recognition. (default: False)
- enable_table (bool, optional): Enable table detection and extraction. (default: True)
- layout_model (str, optional): Model for document layout detection. Options are "doclayout_yolo" or "layoutlmv3". (default: "doclayout_yolo")
- language (str, optional): Primary language of the document. (default: "en")
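The constructor options can be collected and sanity-checked before the extractor is created. This is a rough sketch under stated assumptions: `build_mineru_kwargs` and `VALID_LAYOUT_MODELS` are hypothetical names, the defaults mirror the parameter list above, and whether MinerU performs the same validation itself is not stated in the source.

```python
import os
from typing import Any, Dict, Optional

# The two layout models named in the parameter list above.
VALID_LAYOUT_MODELS = {"doclayout_yolo", "layoutlmv3"}


def build_mineru_kwargs(
    api_key: Optional[str] = None,
    *,
    is_ocr: bool = False,
    enable_formula: bool = False,
    enable_table: bool = True,
    layout_model: str = "doclayout_yolo",
    language: str = "en",
) -> Dict[str, Any]:
    """Assemble keyword arguments matching the documented parameters.

    Hypothetical helper: it only mirrors the documented defaults and the
    documented MINERU_API_KEY fallback.
    """
    key = api_key or os.environ.get("MINERU_API_KEY")
    if not key:
        raise ValueError("pass api_key or set MINERU_API_KEY")
    if layout_model not in VALID_LAYOUT_MODELS:
        raise ValueError(f"unknown layout_model: {layout_model!r}")
    return {
        "api_key": key,
        "is_ocr": is_ocr,
        "enable_formula": enable_formula,
        "enable_table": enable_table,
        "layout_model": layout_model,
        "language": language,
    }
```

The resulting dictionary could then be unpacked into the constructor, for example `MinerU(**build_mineru_kwargs(is_ocr=True))`, assuming the class is imported from `camel.loaders`.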
extract_url
Extract content from a URL document.
Parameters:
- url (str): Document URL to extract content from.
Returns:
Dict: Task identifier for tracking extraction progress.
batch_extract_urls
Extract content from multiple document URLs in batch.
Parameters:
- files (List[Dict[str, Union[str, bool]]]): List of document configurations. Each document requires ‘url’ and optionally ‘is_ocr’ and ‘data_id’ parameters.
Returns:
str: Batch identifier for tracking extraction progress.
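The `files` argument is a list of per-document dictionaries. One way to build it from a plain list of URLs is sketched below. The helper `make_batch_files` and the `doc-0`, `doc-1`, ... scheme for `data_id` are assumptions made for illustration; the source only says that `url` is required and that `is_ocr` and `data_id` are optional.

```python
from typing import Dict, Iterable, List, Union


def make_batch_files(
    urls: Iterable[str],
    is_ocr: bool = False,
    id_prefix: str = "doc",
) -> List[Dict[str, Union[str, bool]]]:
    """Build one configuration dict per URL.

    Hypothetical helper: 'data_id' values are generated locally so that
    batch results can be matched back to the inputs.
    """
    files: List[Dict[str, Union[str, bool]]] = []
    for index, url in enumerate(urls):
        files.append(
            {
                "url": url,
                "is_ocr": is_ocr,
                "data_id": f"{id_prefix}-{index}",
            }
        )
    return files
```

The returned list would be passed as `batch_extract_urls(files=...)`, and the batch identifier that comes back can be handed to `get_batch_status`.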
get_task_status
Retrieve status of a single extraction task.
Parameters:
- task_id (str): Unique identifier of the extraction task.
Returns:
Dict: Current task status and results if completed.
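Submitting a URL and then checking its status once might look like the sketch below. It is written against a small structural type rather than the real class so that it can be exercised without network access. The `UrlExtractor` protocol and `submit_and_check` are hypothetical, and the `"task_id"` key is an assumption: the source says only that `extract_url` returns a Dict carrying the task identifier.

```python
from typing import Any, Dict, Protocol


class UrlExtractor(Protocol):
    """The two documented methods this sketch relies on."""

    def extract_url(self, url: str) -> Dict[str, Any]: ...

    def get_task_status(self, task_id: str) -> Dict[str, Any]: ...


def submit_and_check(
    extractor: UrlExtractor,
    url: str,
    task_id_key: str = "task_id",  # assumed key name, not confirmed by the source
) -> Dict[str, Any]:
    """Submit one URL and return its status from a single check."""
    submitted = extractor.extract_url(url)
    task_id = submitted[task_id_key]
    return extractor.get_task_status(task_id)
```

If the returned dictionary uses a different key, pass it as `task_id_key`. A single check will usually show an unfinished task, so real callers would follow up with `wait_for_completion`.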
get_batch_status
Retrieve status of a batch extraction task.
Parameters:
- batch_id (str): Unique identifier of the batch extraction task.
Returns:
Dict: Current status and results for all documents in the batch.
wait_for_completion
Monitor task until completion or timeout.
Parameters:
- task_id (str): Unique identifier of the task or batch.
- is_batch (bool, optional): Indicates whether the task is a batch operation. (default: False)
- timeout (float, optional): Maximum wait time in seconds. (default: 100)
- check_interval (float, optional): Time between status checks in seconds. (default: 5)
Returns:
Dict: Final task status and extraction results.
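The `timeout` and `check_interval` parameters describe a polling loop. The sketch below shows the general pattern, not the library's actual implementation. `poll_until` is a hypothetical function, and because the source does not document the shape of the status dictionary, the caller supplies the `is_finished` predicate.

```python
import time
from typing import Any, Callable, Dict


def poll_until(
    fetch_status: Callable[[], Dict[str, Any]],
    is_finished: Callable[[Dict[str, Any]], bool],
    timeout: float = 100.0,
    check_interval: float = 5.0,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until is_finished accepts the result.

    Raises TimeoutError if the deadline passes first. Defaults mirror
    the documented timeout (100 s) and check_interval (5 s).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if is_finished(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not finished after {timeout} seconds")
        time.sleep(check_interval)
```

With a real extractor, `fetch_status` could be `lambda: extractor.get_task_status(task_id)` or the batch equivalent. The status is always fetched at least once, even with a timeout of zero, and a monotonic clock keeps the deadline unaffected by system clock changes.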