Self-RAG trains the LLM to generate special reflection tokens that govern retrieval and generation decisions: it decides whether to retrieve, evaluates retrieved document relevance, checks answer grounding, and optionally critiques the final response.
Self-RAG (Self‑Reflective Retrieval‑Augmented Generation) extends standard RAG by training an LLM to emit special reflection tokens that control the RAG process. The model learns to decide four things: whether retrieval is needed at all (the Retrieve token), whether each retrieved document is relevant to the query (the Relevant token, called ISREL in the original paper), whether the generated answer is grounded in the retrieved documents (the Grounded token, ISSUP in the paper), and optionally whether the answer is useful or needs refinement (a utility token, ISUSE in the paper). This self‑reflection mechanism is designed to improve factual accuracy and reduce hallucination.[reference:10][reference:11]
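To make the role of the reflection tokens concrete, the following minimal Python sketch separates them from the answer text in a generated segment. The bracketed spellings (such as `[Retrieve]` and `[Grounded]`), the `REFLECTION_TOKENS` table, and the `parse_segment` helper are all assumptions made for illustration. They are not the vocabulary or API of any released Self-RAG checkpoint or library.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical token spellings, chosen for illustration only.
REFLECTION_TOKENS = {
    "retrieve": {"[Retrieve]", "[No Retrieve]"},
    "relevance": {"[Relevant]", "[Irrelevant]"},
    "grounding": {"[Grounded]", "[Partially Grounded]", "[Not Grounded]"},
    "utility": {f"[Utility:{i}]" for i in range(1, 6)},
}
ALL_TOKENS = set().union(*REFLECTION_TOKENS.values())

# Matches any bracketed span; only recognised reflection tokens are acted on.
BRACKETED = re.compile(r"\[[^\[\]]+\]")


@dataclass
class ParsedSegment:
    text: str
    retrieve: Optional[str] = None
    relevance: Optional[str] = None
    grounding: Optional[str] = None
    utility: Optional[str] = None


def parse_segment(raw: str) -> ParsedSegment:
    """Split a generated segment into plain text and its reflection tokens."""
    parsed = ParsedSegment(text="")
    for token in BRACKETED.findall(raw):
        for category, vocabulary in REFLECTION_TOKENS.items():
            if token in vocabulary:
                setattr(parsed, category, token)
    # Remove recognised reflection tokens but keep other bracketed text, e.g. "[1]".
    stripped = BRACKETED.sub(
        lambda m: "" if m.group(0) in ALL_TOKENS else m.group(0), raw
    )
    parsed.text = " ".join(stripped.split())
    return parsed


if __name__ == "__main__":
    raw = "[Retrieve] [Relevant] Paris is the capital of France. [Grounded] [Utility:5]"
    print(parse_segment(raw))
```

In a real system the tokens would normally be read from the tokenizer's special-token IDs and their probabilities rather than recovered by string matching. The string form is used here only to keep the example self-contained.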
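Training: Fine‑tune an LLM (e.g., Llama) on a dataset in which reflection tokens have been inserted offline by a teacher model (the original paper trains a critic model on GPT‑4 labels for this purpose) or by human annotators. The model then learns to predict reflection tokens as part of ordinary next‑token generation.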
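Inference: The model generates its output segment by segment. Before each segment it decides whether to retrieve; if it does, it judges which retrieved documents are relevant, generates a candidate continuation from each, and keeps the candidate that is best grounded in its source.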
Advantages: Adaptability to query complexity, improved faithfulness, and reduced latency when retrieval is unnecessary.
Limitations: Higher computational cost when retrieval is triggered (several candidates are generated and scored per query), a need for specialized training data, and limited availability in off‑the‑shelf libraries.
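The inference procedure described above can be sketched as a short control loop. This is a simplified illustration under stated assumptions, not the reference implementation. `ReflectiveModel`, `Candidate`, `self_rag_answer`, the toy corpus, and the word-overlap heuristics are all hypothetical stand-ins. In an actual Self-RAG model the four decisions come from the reflection-token probabilities of a single fine-tuned LM rather than from separate callables.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    answer: str
    doc: Optional[str]  # passage the answer was conditioned on, if any
    score: float        # relevance + grounding; 0.0 when nothing was retrieved


@dataclass
class ReflectiveModel:
    """Hypothetical interface standing in for reflection-token decisions."""
    should_retrieve: Callable[[str], bool]
    relevance: Callable[[str, str], float]
    generate: Callable[[str, Optional[str]], str]
    grounding: Callable[[str, str], float]


def self_rag_answer(
    query: str,
    model: ReflectiveModel,
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
    min_relevance: float = 0.3,
) -> Candidate:
    # Step 1: skip retrieval entirely when the model judges it unnecessary.
    if not model.should_retrieve(query):
        return Candidate(model.generate(query, None), None, 0.0)

    # Step 2: retrieve, then generate one candidate per relevant passage.
    candidates = []
    for doc in retrieve(query, top_k):
        rel = model.relevance(query, doc)
        if rel < min_relevance:
            continue
        answer = model.generate(query, doc)
        # Step 3: score how well the answer is grounded in its passage.
        candidates.append(Candidate(answer, doc, rel + model.grounding(answer, doc)))

    # Fall back to a retrieval-free answer if no passage was relevant.
    if not candidates:
        return Candidate(model.generate(query, None), None, 0.0)
    return max(candidates, key=lambda c: c.score)


# Toy components (assumptions) so the sketch runs without a model.
CORPUS = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
]


def _words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_retrieve(query: str, k: int) -> List[str]:
    # Rank passages by word overlap with the query.
    q = _words(query)
    return sorted(CORPUS, key=lambda d: len(q & _words(d)), reverse=True)[:k]


toy_model = ReflectiveModel(
    should_retrieve=lambda q: not q.lower().startswith("write"),
    relevance=lambda q, d: len(_words(q) & _words(d)) / max(len(_words(q)), 1),
    generate=lambda q, d: d if d is not None else f"(no retrieval) {q}",
    grounding=lambda a, d: len(_words(a) & _words(d)) / max(len(_words(a)), 1),
)

if __name__ == "__main__":
    print(self_rag_answer("What is the capital of France?", toy_model, toy_retrieve))
    print(self_rag_answer("Write a haiku about autumn", toy_model, toy_retrieve))
```

The early return in step 1 is where the latency saving for retrieval-free queries comes from, while the per-passage loop in steps 2 and 3 is the source of the extra cost when retrieval is triggered.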