Docling
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich, unified representation including document layout, tables, etc., making documents ready for generative AI workflows like RAG.
This integration provides Docling's capabilities via the DoclingLoader document loader.
Overview
The presented DoclingLoader component enables you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.
DoclingLoader supports two different export modes:
- ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and to then capture each individual chunk as a separate LangChain Document downstream, or
- ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain Document.
The example allows exploring both modes via parameter EXPORT_TYPE; depending on the value set, the example pipeline is then set up accordingly.
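As a minimal sketch of how a pipeline can branch on the selected mode, the snippet below constructs the loader (anticipating the Initialization section further down) and then handles the two export types differently; the Markdown-mode splitting with MarkdownHeaderTextSplitter and the specific header levels are illustrative choices, not requirements:

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from langchain_text_splitters import MarkdownHeaderTextSplitter

EXPORT_TYPE = ExportType.DOC_CHUNKS  # or ExportType.MARKDOWN

loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    export_type=EXPORT_TYPE,
)
docs = loader.load()

if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    # Each loaded document is already an individual Docling chunk; use it as-is.
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    # Each loaded document is a full Markdown export; split it further, e.g. on headers.
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")]
    )
    splits = [chunk for doc in docs for chunk in splitter.split_text(doc.page_content)]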
Setup
%pip install -qU langchain-docling
Note: you may need to restart the kernel to use updated packages.
For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.
Initialization
Basic initialization looks as follows:
from langchain_docling import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
loader = DoclingLoader(file_path=FILE_PATH)
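Loading then works as with any other LangChain document loader; the inspection below is just illustrative:

docs = loader.load()
print(docs[0].page_content[:200])  # peek at the first document's content
print(docs[0].metadata)            # metadata attached by the loader

For large inputs, the standard loader.lazy_load() can be used to yield documents one at a time instead of materializing the whole list.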
For advanced usage, DoclingLoader has the following parameters:
- file_path: source as a single str (URL or local file) or an iterable thereof
- converter (optional): any specific Docling converter instance to use
- convert_kwargs (optional): any specific kwargs for conversion execution
- export_type (optional): export mode to use: ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN
- md_export_kwargs (optional): any specific Markdown export kwargs (for Markdown mode)
- chunker (optional): any specific Docling chunker instance to use (for doc-chunk mode)
- meta_extractor (optional): any specific metadata extractor to use
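As an illustration of these parameters in combination, the sketch below passes an explicit converter, export type, and chunker; the particular choices (a default DocumentConverter and a HybridChunker with the tokenizer id shown) are assumptions for the example, not requirements:

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

loader = DoclingLoader(
    file_path=["https://arxiv.org/pdf/2408.09869"],  # one or more URLs / local paths
    converter=DocumentConverter(),                   # specific Docling converter instance
    export_type=ExportType.DOC_CHUNKS,               # chunked export (the default)
    chunker=HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2"),
)
docs = loader.load()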