LangChain provides specialized splitters for structured documents: MarkdownHeaderTextSplitter for markdown, RecursiveCharacterTextSplitter with language-specific separators for code, and custom logic for tables to preserve their integrity.
Splitting structured documents well means preserving their hierarchical and syntactic relationships. For markdown, MarkdownHeaderTextSplitter splits on the header levels you specify and records the enclosing headers in each chunk's metadata, so every chunk carries its header context. For code, use RecursiveCharacterTextSplitter with separators suited to the programming language: its from_language constructor supplies built-in separator lists for supported languages, or you can pass your own (e.g., \n\n, \n, ;, {, }). For tables, treat each table as a single chunk, since splitting a table across chunk boundaries breaks its logical structure, and use metadata to mark the chunk as a table.
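To make the idea of "inheriting header context" concrete, here is a minimal plain-Python sketch of the mechanism. It is not LangChain's implementation; the function name split_by_headers and the h1/h2 metadata keys are made up for illustration, and it ignores edge cases such as "#" lines inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

# Matches ATX-style markdown headers such as "## Install".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Return (header_context, body) pairs, one per section that has body text.

    Hypothetical helper for illustration only.
    """
    chunks: List[Tuple[Dict[str, str], str]] = []
    context: Dict[int, str] = {}  # header level -> title currently in effect
    body: List[str] = []

    def flush() -> None:
        # Emit the accumulated body together with a snapshot of the headers above it.
        text = "\n".join(body).strip()
        if text:
            meta = {f"h{level}": title for level, title in sorted(context.items())}
            chunks.append((meta, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in context if lvl >= level]:
                del context[stale]
            context[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Guide\nIntro text.\n## Install\nRun pip.\n# FAQ\nNone yet."
    for meta, text in split_by_headers(doc):
        print(meta, "->", text)
```

Each chunk keeps only the headers that enclose it, which is what lets a retriever show where a passage came from even after the document has been cut apart.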
To keep tables intact in practice, use a loader that can detect tables (e.g., UnstructuredPDFLoader with mode="elements") and keep each detected table as a single document, or write a custom function that treats tables as atomic units by exempting them from the chunk-size limit, so a table's chunk is always as long as the table itself.
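The custom-function route could look roughly like the sketch below. It assumes markdown-style pipe tables separated from the surrounding prose by blank lines; the names is_table and chunk_keeping_tables and the is_table metadata flag are hypothetical, and the table-detection heuristic is deliberately simple.

```python
from typing import Dict, List


def is_table(block: str) -> bool:
    """Heuristic: at least two lines, each starting with a pipe character."""
    lines = block.splitlines()
    return len(lines) >= 2 and all(line.strip().startswith("|") for line in lines)


def chunk_keeping_tables(text: str, chunk_size: int = 200) -> List[Dict[str, object]]:
    """Pack prose blocks into chunks of up to chunk_size characters,
    but always emit each table as its own unsplit chunk."""
    chunks: List[Dict[str, object]] = []
    buffer = ""

    def flush() -> None:
        nonlocal buffer
        if buffer:
            chunks.append({"text": buffer, "is_table": False})
            buffer = ""

    for block in (b.strip() for b in text.split("\n\n")):
        if not block:
            continue
        if is_table(block):
            # Tables are atomic: close the current prose chunk, then emit the
            # table whole, regardless of chunk_size.
            flush()
            chunks.append({"text": block, "is_table": True})
        elif buffer and len(buffer) + 2 + len(block) > chunk_size:
            # Adding this block would exceed the limit, so start a new chunk.
            flush()
            buffer = block
        else:
            buffer = f"{buffer}\n\n{block}" if buffer else block
    flush()
    return chunks


if __name__ == "__main__":
    sample = "Intro paragraph.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing paragraph."
    for chunk in chunk_keeping_tables(sample, chunk_size=10):
        print(chunk["is_table"], repr(chunk["text"]))
```

A single prose block longer than chunk_size is kept whole here for brevity; in a real pipeline you would likely hand the non-table chunks to a regular text splitter and leave only the flagged table chunks untouched.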