MinerUToolkit
- Maximum file size: 200MB per file
- Maximum pages: 600 pages per file
- Daily quota: 2000 pages for high-priority parsing
- Network restrictions may affect certain URLs (e.g., GitHub, AWS)
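
The limits above can be checked locally before a document is submitted. The following is a minimal sketch, not part of the toolkit: `check_limits` and the constant names are hypothetical, the page count is assumed to be known already, and reading "200MB" as 200 × 1024 × 1024 bytes is an assumption.

```python
from typing import List

# Limits quoted in the notes above. Reading "200MB" as MiB is an assumption.
MAX_FILE_BYTES = 200 * 1024 * 1024
MAX_PAGES_PER_FILE = 600
DAILY_PRIORITY_PAGES = 2000


def check_limits(size_bytes: int, pages: int, pages_used_today: int = 0) -> List[str]:
    """Return a list of problems; an empty list means no documented limit is hit."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 200MB")
    if pages > MAX_PAGES_PER_FILE:
        problems.append("file exceeds 600 pages")
    if pages_used_today + pages > DAILY_PRIORITY_PAGES:
        problems.append("beyond daily high-priority quota")
    return problems
```

A caller might run this check for each file and skip or split any document for which the returned list is not empty.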
__init__
- api_key (Optional[str]): Authentication key for MinerU API access. If not provided, uses the MINERU_API_KEY environment variable. (default: None)
- api_url (Optional[str]): Base endpoint URL for MinerU API service. (default: "https://mineru.net/api/v4")
- is_ocr (bool): Enable Optical Character Recognition for image-based text extraction. (default: False)
- enable_formula (bool): Enable mathematical formula detection and recognition. (default: False)
- enable_table (bool): Enable table structure detection and extraction. (default: True)
- layout_model (str): Document layout analysis model selection. Available options: 'doclayout_yolo', 'layoutlmv3'. (default: "doclayout_yolo")
- language (str): Primary language of the document for processing. (default: "en")
- wait (bool): Block execution until processing completion. (default: True)
- timeout (float): Maximum duration in seconds to wait for task completion. (default: 300)
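
The constructor parameters and their defaults can be collected in one place before the toolkit is created. The sketch below is illustrative only: `MinerUSettings` is a hypothetical helper, not part of the toolkit, and it simply mirrors the documented defaults, the fallback to `MINERU_API_KEY`, and the two listed `layout_model` options.

```python
import os
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MinerUSettings:  # hypothetical container, not part of the toolkit
    api_key: Optional[str] = None
    api_url: Optional[str] = "https://mineru.net/api/v4"
    is_ocr: bool = False
    enable_formula: bool = False
    enable_table: bool = True
    layout_model: str = "doclayout_yolo"
    language: str = "en"
    wait: bool = True
    timeout: float = 300

    def __post_init__(self) -> None:
        # Mirror the documented fallback to the MINERU_API_KEY variable.
        if self.api_key is None:
            self.api_key = os.environ.get("MINERU_API_KEY")
        # Only the two documented layout models are accepted here.
        if self.layout_model not in ("doclayout_yolo", "layoutlmv3"):
            raise ValueError(f"unsupported layout_model: {self.layout_model}")


# Scanned documents: turn OCR on and allow a longer wait.
settings = MinerUSettings(is_ocr=True, timeout=600)
kwargs = asdict(settings)  # keys match the parameter names listed above
```

Assuming the constructor accepts these names as keyword arguments, as the parameter list suggests, the resulting dictionary could be passed as `MinerUToolkit(**kwargs)`.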
extract_from_urls
- urls (str | List[str]): A single URL string or a list of URLs to extract content from.
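
Because `urls` accepts either a string or a list, calling code may want to normalize its input first. The sketch below shows one way a caller could do that; `normalize_urls` is a hypothetical helper, and whether the toolkit performs a similar step internally is not stated here.

```python
from typing import List, Union
from urllib.parse import urlparse


def normalize_urls(urls: Union[str, List[str]]) -> List[str]:
    """Accept one URL or a list and always return a list of http(s) URLs."""
    items = [urls] if isinstance(urls, str) else list(urls)
    for url in items:
        parsed = urlparse(url)
        # Reject anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
    return items
```

Validating up front keeps malformed entries from consuming part of the page quota or failing halfway through a batch.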
get_task_status
- task_id (str): Unique identifier for the extraction task to check.
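
When the toolkit is created with `wait` set to False, the caller has to check the task status itself. The loop below is a generic sketch of polling with a timeout. `poll_until_done` and `fetch_status` are hypothetical names, and the response shape (a dictionary with a `"state"` key whose terminal values are `"done"` and `"failed"`) is an assumption, not a documented format.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300,
    interval: float = 2.0,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # "state", "done" and "failed" are assumed names, not confirmed ones.
        if status.get("state") in ("done", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not finished after {timeout} seconds")
        time.sleep(interval)
```

With a toolkit instance named `toolkit` (an assumed variable), usage might look like `poll_until_done(lambda: toolkit.get_task_status(task_id))`; the same loop could wrap `get_batch_status` for batch jobs.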
get_batch_status
- batch_id (str): Unique identifier for the batch extraction task to check.