14th of 46 Questions.

How do you preserve and propagate source metadata (filename, page number, URL, timestamp) through the loading and splitting pipeline?

Preserve metadata by splitting with split_documents or create_documents, both of which copy the metadata dictionary of each original Document onto every chunk produced from it, and by attaching any missing fields yourself when you build Document objects in a custom loader.

When using LangChain, each loader can attach metadata (e.g., source, page, url) to the Document objects it creates. split_documents, which splits a list of Document objects, carries that metadata over to every chunk automatically. If you start from raw strings, use create_documents instead of split_text: split_text returns plain strings with no metadata, whereas create_documents accepts a list of metadata dictionaries, one per input text, and copies each onto the chunks produced from that text. For custom loaders, build the Document objects with their metadata yourself before splitting.

Preserving Metadata with split_documents
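
As a rough illustration of what split_documents does with metadata, the sketch below models the behaviour in plain Python. The Document dataclass and the split_documents function here are simplified, hypothetical stand-ins written for this example, not LangChain's own classes, and the fixed-size chunking is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(docs, chunk_size):
    """Split each document into fixed-size chunks, copying its metadata.

    Simplified model only: real splitters also honour separators and overlap.
    """
    chunks = []
    for doc in docs:
        text = doc.page_content
        for start in range(0, len(text), chunk_size):
            chunks.append(
                Document(
                    page_content=text[start:start + chunk_size],
                    # Copy the dict so later edits to one chunk's metadata
                    # do not leak into its siblings or the parent document.
                    metadata=dict(doc.metadata),
                )
            )
    return chunks


# Two "pages" as a PDF loader might produce them (illustrative values).
pages = [
    Document("A" * 25, {"source": "report.pdf", "page": 1}),
    Document("B" * 10, {"source": "report.pdf", "page": 2}),
]

chunks = split_documents(pages, chunk_size=10)
for chunk in chunks:
    print(chunk.metadata, len(chunk.page_content))
```

With LangChain itself, the corresponding call would be along the lines of RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...).split_documents(docs), where docs is the list returned by a loader.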

Custom Metadata with create_documents
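
The pairing of raw texts with metadata dictionaries can be sketched in the same way. This is a plain-Python model under stated assumptions: Document and create_documents below are hypothetical stand-ins written for this example, the URLs and the fetched_at field are made-up illustrative values, and chunking is again fixed-size for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_documents(texts, metadatas=None, chunk_size=20):
    """Chunk raw strings, attaching the matching metadata dict to each chunk."""
    if metadatas is None:
        metadatas = [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("expected one metadata dict per input text")

    docs = []
    for text, meta in zip(texts, metadatas):
        for start in range(0, len(text), chunk_size):
            # Each chunk gets its own copy of the metadata for its source text.
            docs.append(Document(text[start:start + chunk_size], dict(meta)))
    return docs


texts = ["x" * 45, "y" * 15]
metadatas = [
    {"url": "https://example.com/a", "fetched_at": "2024-01-01T00:00:00Z"},
    {"url": "https://example.com/b", "fetched_at": "2024-01-02T00:00:00Z"},
]

docs = create_documents(texts, metadatas=metadatas, chunk_size=20)
for doc in docs:
    print(doc.metadata["url"], len(doc.page_content))
```

In LangChain the equivalent is splitter.create_documents(texts, metadatas=[...]), with one dictionary per entry in texts.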

Key points: Prefer split_documents (which operates on Document objects) or create_documents (which pairs raw strings with metadata dictionaries) over split_text, which returns bare strings and drops metadata. If you need to add or modify metadata after splitting, iterate through the chunks and update their metadata dictionaries. Many loaders (e.g., UnstructuredPDFLoader in elements mode) add detailed metadata such as page numbers, element types, and coordinates, which can be invaluable for precise retrieval.
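
Updating metadata after splitting can be as simple as a loop over the chunks. The sketch below is a minimal illustration, not a LangChain API: the Document dataclass is a hypothetical stand-in, and the chunk_index and ingested_at field names are assumptions chosen for this example. It also assumes each chunk holds its own metadata dictionary, since a shared dictionary would end up with only the last index written to it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def annotate_chunks(chunks, ingested_at):
    """Add a position index and an ingestion timestamp to every chunk in place."""
    for index, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = index      # hypothetical field name
        chunk.metadata["ingested_at"] = ingested_at  # hypothetical field name
    return chunks


# Chunks as they might look after splitting (illustrative values).
chunks = [
    Document("first part", {"source": "notes.txt"}),
    Document("second part", {"source": "notes.txt"}),
    Document("third part", {"source": "notes.txt"}),
]

stamp = datetime.now(timezone.utc).isoformat()
annotate_chunks(chunks, stamp)
for chunk in chunks:
    print(chunk.metadata)
```

Existing keys such as source are left untouched, so the fields set by the loader and the fields added afterwards end up side by side on each chunk.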
