MarkItDownLoader
- Microsoft Office documents:
  - Word (.doc, .docx)
  - Excel (.xls, .xlsx)
  - PowerPoint (.ppt, .pptx)
- EPUB
- HTML
- Images (with EXIF metadata and OCR support)
- Audio files (with EXIF metadata and speech transcription)
- Text-based formats:
  - CSV
  - JSON
  - XML
- ZIP archives (iterates over contents)
- YouTube URLs (via transcript extraction)
__init__
- llm_client (Optional[object]): Optional client for LLM integration. (default: :obj:`None`)
- llm_model (Optional[str]): Optional model name for the LLM. (default: :obj:`None`)
_validate_format
- file_path (str): Path to the input file.
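The source does not show how `_validate_format` decides whether a file is acceptable. A minimal sketch, assuming the check is a case-insensitive comparison of the file extension against a fixed set: the helper name `is_supported` and the set `ASSUMED_EXTENSIONS` are hypothetical, and the set contains only extensions implied by the format list above (image and audio extensions are left out because the list does not name them).

```python
from pathlib import Path

# Hypothetical extension set, inferred from the documented format list.
# The real loader may accept more (or fewer) extensions than these.
ASSUMED_EXTENSIONS = frozenset({
    ".doc", ".docx",   # Word
    ".xls", ".xlsx",   # Excel
    ".ppt", ".pptx",   # PowerPoint
    ".epub", ".html",
    ".csv", ".json", ".xml",
    ".zip",
})


def is_supported(file_path: str) -> bool:
    """Return True if the path's extension is in the assumed set.

    Only the extension is inspected; the file is not opened or
    checked for existence.
    """
    return Path(file_path).suffix.lower() in ASSUMED_EXTENSIONS
```

Lower-casing the suffix means `report.DOCX` and `report.docx` are treated the same, while a path with no extension is rejected.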
convert_file
- file_path (str): Path to the input file.
convert_files
- file_paths (List[str]): List of file paths to convert.
- parallel (bool): Whether to process files in parallel. (default: :obj:`False`)
- skip_failed (bool): Whether to skip failed files instead of including error messages. (default: :obj:`False`)
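The interaction between `parallel` and `skip_failed` can be sketched without the library itself. The example below is an illustration under stated assumptions, not the loader's actual implementation: `convert_many` and `fake_convert` are hypothetical names, results are assumed to be a mapping from file path to converted text, the error message format is invented, and a thread pool is assumed for the parallel case.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def convert_many(
    file_paths: List[str],
    convert_one: Callable[[str], str],
    parallel: bool = False,
    skip_failed: bool = False,
) -> Dict[str, str]:
    """Convert each path, mapping path -> text (or an error message)."""

    def attempt(path: str) -> Tuple[str, str, bool]:
        # Capture failures per file so one bad input does not abort the batch.
        try:
            return path, convert_one(path), True
        except Exception as exc:
            return path, f"Error: {exc}", False

    if parallel:
        # Assumption: threads; pool.map preserves input order.
        with ThreadPoolExecutor() as pool:
            outcomes = list(pool.map(attempt, file_paths))
    else:
        outcomes = [attempt(path) for path in file_paths]

    results: Dict[str, str] = {}
    for path, text, ok in outcomes:
        # With skip_failed=True, failed files are dropped from the result;
        # otherwise their error message takes the place of the content.
        if ok or not skip_failed:
            results[path] = text
    return results


def fake_convert(path: str) -> str:
    """Stand-in converter used only to exercise the sketch."""
    if path.endswith(".bad"):
        raise ValueError("unsupported format")
    return f"# {path}"
```

Calling `convert_many(["a.csv", "b.bad"], fake_convert)` keeps an error entry for `b.bad`, while adding `skip_failed=True` leaves only `a.csv` in the result. Setting `parallel=True` changes how the work is scheduled, not what is returned.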