MinerUToolkit

class MinerUToolkit(BaseToolkit):

Toolkit for extracting and processing document content using MinerU API.

Provides comprehensive document processing capabilities including content extraction from URLs and files, with support for OCR, formula recognition, and table detection through the MinerU API service.

Note:

  • Maximum file size: 200MB per file
  • Maximum pages: 600 pages per file
  • Daily quota: 2000 pages for high-priority parsing
  • Network restrictions may affect certain URLs (e.g., GitHub, AWS)

init

def __init__(
    self,
    api_key: Optional[str] = None,
    api_url: Optional[str] = 'https://mineru.net/api/v4',
    is_ocr: bool = False,
    enable_formula: bool = False,
    enable_table: bool = True,
    layout_model: str = 'doclayout_yolo',
    language: str = 'en',
    wait: bool = True,
    timeout: float = 300
):

Initialize the MinerU document processing toolkit.

Parameters:

  • api_key (Optional[str]): Authentication key for MinerU API access. If not provided, uses MINERU_API_KEY environment variable. (default: :obj:None)
  • api_url (Optional[str]): Base endpoint URL for MinerU API service. (default: :obj:"https://mineru.net/api/v4")
  • is_ocr (bool): Enable Optical Character Recognition for image-based text extraction. (default: :obj:False)
  • enable_formula (bool): Enable mathematical formula detection and recognition. (default: :obj:False)
  • enable_table (bool): Enable table structure detection and extraction. (default: :obj:True)
  • layout_model (str): Document layout analysis model selection. Available options: ‘doclayout_yolo’, ‘layoutlmv3’. (default: :obj:"doclayout_yolo")
  • language (str): Primary language of the document for processing. (default: :obj:"en")
  • wait (bool): Block execution until processing completion. (default: :obj:True)
  • timeout (float): Maximum duration in seconds to wait for task completion. (default: :obj:300)

extract_from_urls

def extract_from_urls(self, urls: str | List[str]):

Process and extract content from one or multiple URLs.

Parameters:

  • urls (str | List[str]): Target URL or list of URLs for content extraction. Supports both single URL string and multiple URLs in a list.

Returns:

Dict: Response containing either completed task results when wait is True, or task/batch identifiers for status tracking when wait is False.

get_task_status

def get_task_status(self, task_id: str):

Retrieve current status of an individual extraction task.

Parameters:

  • task_id (str): Unique identifier for the extraction task to check.

Returns:

Dict: Status information and results (if task is completed) for the specified task.

Note: This is a low-level status checking method. For most use cases, prefer using extract_from_url with wait=True for automatic completion handling.

get_batch_status

def get_batch_status(self, batch_id: str):

Retrieve current status of a batch extraction task.

Parameters:

  • batch_id (str): Unique identifier for the batch extraction task to check.

Returns:

Dict: Comprehensive status information and results for all files in the batch task.

Note: This is a low-level status checking method. For most use cases, prefer using batch_extract_from_urls with wait=True for automatic completion handling.

get_tools

def get_tools(self):

Returns:

List[FunctionTool]: Collection of FunctionTool objects representing the available document processing functions in this toolkit.