Document Loaders are LangChain components that standardize data ingestion from diverse sources, with a dedicated implementation for each kind of source.
Each loader reads data from a source such as files, databases, or websites and converts it into a standardized Document object, which holds the text in page_content and descriptive fields in metadata. Loaders implement load(), which returns all documents at once, and lazy_load(), which yields them one at a time so that large datasets do not have to fit in memory. Choosing the right loader depends on the source type and the structure of the data, and specialized loaders exist for most common cases.
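The relationship between load() and lazy_load() can be sketched without installing LangChain. The example below is a minimal illustration, not LangChain's actual code: Document is a stand-in dataclass with the same two fields, and LineLoader, its constructor argument, and the "line" metadata key are hypothetical names chosen for this sketch.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterator


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that emits one Document per non-empty line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so the whole file never sits in memory.
        with self.path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.strip()
                if text:
                    yield Document(text, {"source": str(self.path), "line": number})

    def load(self) -> list[Document]:
        # Eager variant: simply materialize the lazy stream into a list.
        return list(self.lazy_load())


# Usage: write a small sample file, then load it eagerly.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("first line\n\nsecond line\n", encoding="utf-8")
    docs = LineLoader(str(sample)).load()

print([d.page_content for d in docs])  # ['first line', 'second line']
```

Defining load() in terms of lazy_load() keeps the parsing logic in one place; callers that only need to stream can iterate over lazy_load() directly.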
When selecting a loader, consider the file format, the size of the data, and how much of the original structure must be preserved. For simple PDF text extraction, PyPDFLoader is usually sufficient; for complex documents with tables or images, use UnstructuredPDFLoader. For web scraping, WebBaseLoader works for static HTML, while PlaywrightURLLoader is needed for JavaScript-heavy sites because it renders the page in a browser before extracting text. Notion content can be loaded from an exported directory of markdown files (NotionDirectoryLoader) or through the Notion API (NotionDBLoader). For SQL databases, use SQLDatabaseLoader with a custom query that maps each row to a document, specifying which column becomes page_content and which columns become metadata.
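The row-to-document mapping described for SQL sources can be illustrated with the standard library alone. This is a rough sketch of the idea, not SQLDatabaseLoader's API: it uses sqlite3 with an in-memory database, Document is a stand-in dataclass, and rows_to_documents, content_column, and the articles table are hypothetical names invented for the example.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    conn: sqlite3.Connection, query: str, content_column: str
) -> Iterator[Document]:
    """Map each result row to a Document (hypothetical helper).

    One column supplies page_content; every other selected column
    is carried along as metadata.
    """
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    if content_column not in columns:
        raise ValueError(f"query does not return column {content_column!r}")
    for row in cursor:
        record = dict(zip(columns, row))
        text = str(record.pop(content_column))
        yield Document(page_content=text, metadata=record)


# Usage: a throwaway in-memory table with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "Intro", "Loaders read data."), (2, "Usage", "Call load() or lazy_load().")],
)
docs = list(
    rows_to_documents(conn, "SELECT id, title, body FROM articles ORDER BY id", "body")
)
conn.close()

print(docs[0].page_content)  # Loaders read data.
print(docs[0].metadata)      # {'id': 1, 'title': 'Intro'}
```

Selecting only the needed columns in the query keeps the metadata small, which matters when documents are later stored alongside embeddings.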