Chunk size is the maximum length of each text segment, while chunk overlap is the amount of text shared between consecutive chunks to preserve context at boundaries; optimal values depend on the LLM's token limit and the need for contextual continuity.
Chunk size defines the maximum length (in characters, tokens, or other units) of each text segment returned by the splitter. Capping it keeps each chunk small enough to fit within the LLM's context window. Chunk overlap is the number of characters or tokens repeated from the end of one chunk at the beginning of the next. This reduces the chance that meaning is lost at a boundary where a sentence or idea is split in two.
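To make the two parameters concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is not LangChain's implementation (real splitters usually try to break on separators such as paragraphs or sentences); the function name `split_with_overlap` is hypothetical and used only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts chunk_overlap characters before the
    end of the previous one, so that many characters are repeated.
    Illustrative sketch only; it ignores word and sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3):
        print(chunk)
```

With a 26-character input, a chunk size of 10, and an overlap of 3, the last three characters of each chunk reappear as the first three of the next. An input shorter than `chunk_size` comes back as a single chunk with nothing repeated.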
Choosing the right values depends on your LLM's token limit and the nature of your content. A common starting point is 512-1024 tokens (roughly 2000-4000 characters). For LLMs with larger context windows (e.g., 8k-128k tokens), you can increase chunk size to reduce the number of chunks. Overlap typically ranges from 10-20% of the chunk size. An overlap toward the higher end of that range is beneficial for narrative text, where context carries across sentences; a lower overlap may suffice for factual, self-contained content.
Note: in LangChain, chunk_overlap only takes effect when the text is actually split because it exceeds chunk_size. Text that already fits within a single chunk is returned as is, with nothing repeated. If overlap does not seem to be working, check that your input is long enough for the splitter to hit the chunk_size limit.
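The trade-off between chunk size, overlap, and number of chunks can be estimated with simple arithmetic. The sketch below assumes a plain sliding window that advances by `chunk_size - chunk_overlap` each step; separator-aware splitters will produce somewhat different counts. The helper name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many chunks a fixed-stride splitter would produce.

    All arguments use the same unit (characters or tokens).
    Assumes a simple sliding window; real splitters may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1  # the text fits in one chunk, so no overlap is ever applied

    step = chunk_size - chunk_overlap
    # One full first chunk, then one more chunk per step needed to cover the rest.
    return 1 + math.ceil((text_length - chunk_size) / step)


if __name__ == "__main__":
    text_length = 100_000  # characters, an arbitrary example size
    for chunk_size in (2000, 4000):
        chunk_overlap = chunk_size * 15 // 100  # 15%, inside the 10-20% range
        count = estimate_chunk_count(text_length, chunk_size, chunk_overlap)
        print(chunk_size, chunk_overlap, count)
```

Doubling the chunk size roughly halves the number of chunks, while raising the overlap shrinks the step and increases the count. The single-chunk branch also shows why no overlap appears when the input never reaches the size limit.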