MinerU

class MinerU:

Document extraction service supporting OCR, formula recognition and tables.

Parameters:

  • api_key (str, optional): Authentication key for MinerU API service. If not provided, will use MINERU_API_KEY environment variable. (default: :obj:None)
  • api_url (str, optional): Base URL endpoint for the MinerU API service. (default: :obj:"https://mineru.net/api/v4")
  • Note: - Single file size limit: 200MB - Page limit per file: 600 pages - Daily high-priority parsing quota: 2000 pages - Some URLs (GitHub, AWS) may timeout due to network restrictions

init

def __init__(
    self,
    api_key: Optional[str] = None,
    api_url: Optional[str] = 'https://mineru.net/api/v4',
    is_ocr: bool = False,
    enable_formula: bool = False,
    enable_table: bool = True,
    layout_model: str = 'doclayout_yolo',
    language: str = 'en'
):

Initialize MinerU extractor.

Parameters:

  • api_key (str, optional): Authentication key for MinerU API service. If not provided, will use MINERU_API_KEY environment variable.
  • api_url (str, optional): Base URL endpoint for MinerU API service. (default: “https://mineru.net/api/v4”)
  • is_ocr (bool, optional): Enable optical character recognition. (default: :obj:False)
  • enable_formula (bool, optional): Enable formula recognition. (default: :obj:False)
  • enable_table (bool, optional): Enable table detection, extraction. (default: :obj:True)
  • layout_model (str, optional): Model for document layout detection. Options are ‘doclayout_yolo’ or ‘layoutlmv3’. (default: :obj:"doclayout_yolo")
  • language (str, optional): Primary language of the document. (default: :obj:"en")

extract_url

def extract_url(self, url: str):

Extract content from a URL document.

Parameters:

  • url (str): Document URL to extract content from.

Returns:

Dict: Task identifier for tracking extraction progress.

batch_extract_urls

def batch_extract_urls(self, files: List[Dict[str, Union[str, bool]]]):

Extract content from multiple document URLs in batch.

Parameters:

  • files (List[Dict[str, Union[str, bool]]]): List of document configurations. Each document requires ‘url’ and optionally ‘is_ocr’ and ‘data_id’ parameters.

Returns:

str: Batch identifier for tracking extraction progress.

get_task_status

def get_task_status(self, task_id: str):

Retrieve status of a single extraction task.

Parameters:

  • task_id (str): Unique identifier of the extraction task.

Returns:

Dict: Current task status and results if completed.

get_batch_status

def get_batch_status(self, batch_id: str):

Retrieve status of a batch extraction task.

Parameters:

  • batch_id (str): Unique identifier of the batch extraction task.

Returns:

Dict: Current status and results for all documents in the batch.

wait_for_completion

def wait_for_completion(
    self,
    task_id: str,
    is_batch: bool = False,
    timeout: float = 100,
    check_interval: float = 5
):

Monitor task until completion or timeout.

Parameters:

  • task_id (str): Unique identifier of the task or batch.
  • is_batch (bool, optional): Indicates if task is a batch operation. (default: :obj:False)
  • timeout (float, optional): Maximum wait time in seconds. (default: :obj:100)
  • check_interval (float, optional): Time between status checks in seconds. (default: :obj:5)

Returns:

Dict: Final task status and extraction results.