Self-Rag is NOT Self-Rag: An In-Depth Analysis of the Highly Cited Paper SELF-RAG

Paper title: SELF-RAG: Learning to Retrieve, Generate and Critique through Self-reflection

Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

Affiliations: University of Washington, Allen Institute for AI, IBM Research AI

Publication status: ICLR 2024 (preprint on arXiv, October 2023)

Download: https://arxiv.org/abs/2310.11511

Paper code

Install dependencies

pip install -r requirements.txt

Short-form task

python run_short_form.py \
--model_name selfrag/selfrag_llama2_7b \
--input_file eval_data/popqa_longtail_w_gs.jsonl \
--mode MODE --max_new_tokens 100 \
--threshold 0.2 \
--output_file YOUR_OUTPUT_FILE \
--metric match --ndocs 10 --use_groundness --use_utility --use_seqscore \
--dtype half


main.py

Interpretation

i. Special Tokens = Reflection Tokens = Retrieval Tokens ([Retrieval], [No Retrieval]) + Critique Tokens() [3][5][6]

ii. In its adaptive retrieval mode, Self-RAG outputs a retrieval token to decide, on demand, whether to perform retrieval. [2][4][7]

iii. Self-RAG inserts reflection tokens into the original corpus.
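
The taxonomy in item i can be written down as a small data structure. The following is only an illustrative Python sketch: the type name Retrieve comes from the Table 1 excerpt and ISSUP from excerpts 40 and 42, while ISREL and ISUSE are assumed here as the paper's names for the relevance and utility critiques described in excerpt 26. The constant and function names are hypothetical.

```python
# Illustrative grouping only; ISREL and ISUSE are assumed names (see lead-in).
RETRIEVAL_TYPES = frozenset({"Retrieve"})
CRITIQUE_TYPES = frozenset({"ISREL", "ISSUP", "ISUSE"})

# Reflection tokens are the union of the two groups (excerpt 6).
REFLECTION_TYPES = RETRIEVAL_TYPES | CRITIQUE_TYPES


def is_critique(token_type: str) -> bool:
    """True only for the critique subset of reflection tokens."""
    return token_type in CRITIQUE_TYPES


def is_reflection(token_type: str) -> bool:
    """True for both retrieval and critique token types."""
    return token_type in REFLECTION_TYPES
```

Under this grouping, critique tokens form a strict subset of reflection tokens, so the two terms are not interchangeable.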

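Problems

i. "Reflection Tokens" is wrongly used to refer to its subset, Critique Tokens. [11]

ii. [9] states that a "best one" strategy is to be adopted,

iii. Inserting reflection tokens into the original corpus may amount to cheating. According to RAGLAB's data, once the special tokens are removed, Self-RAG's performance falls back to roughly that of the original model.

iv. The inference algorithm in [24] is entirely fictional. It has never been implemented, and it cannot be implemented.

v. [27] is logically impossible to implement.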

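Summary: Self-RAG has the following problems:

1. The code does not implement the [Retrieval]/[No Retrieval] functionality that the paper claims.

2. The paper gives no precise definition of [Retrieval]/[No Retrieval].

3. The code does not implement the beam search that the paper claims.

4. The paper's formal definition of the model is segment (sentence)-level, while its description of beam search is request-level; the two contradict each other.

5. The code's implementation logic for adaptive retrieval is made up and has no justification.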
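
As a reference point for item 5, the thresholded rule as excerpt 34 words it can be sketched in a few lines. This is a minimal Python sketch of the paper's description, not the repository's code: the function name should_retrieve and the log-probability dictionary it takes as input are hypothetical, the labels yes/no/continue follow the Table 1 excerpt, and the default of 0.2 simply mirrors the --threshold value in the command above.

```python
import math
from typing import Mapping


def should_retrieve(logprobs: Mapping[str, float], threshold: float = 0.2) -> bool:
    """Trigger retrieval when P(Retrieve=yes), renormalized over the
    Retrieve tokens that were scored, exceeds the threshold."""
    # Convert log-probabilities to probabilities before normalizing.
    probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(probs.values())
    if total == 0.0:
        return False  # nothing to normalize over; do not retrieve
    return probs.get("yes", 0.0) / total > threshold


# Example input: P(yes)=0.1 and P(no)=0.3, so the normalized share is 0.25.
example = {"yes": math.log(0.1), "no": math.log(0.3)}
decision = should_retrieve(example)
```

The normalization is done in probability space, as the excerpt describes; whether an implementation normalizes over two or three Retrieve outputs is left open by the excerpt and is a free choice in this sketch.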

Excerpts

1. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (SELF-RAG) ...

2. Our framework trains a single arbitrary LM that adaptively ...

3. using special tokens, called reflection tokens.

4. This work introduces ... (SELF-RAG) to improve ... via on-demand retrieval and self-reflection.

5. We train an arbitrary LM in an end-to-end manner ... by generating both task output and intermittent special tokens (i.e., reflection tokens).

6. Reflection tokens are categorized into retrieval and critique tokens to indicate the need for retrieval and its generation quality respectively.

7. In particular, given an input prompt and preceding generations, SELF-RAG first determines if augmenting the continued generation with retrieved passages would be helpful. If so, it outputs a retrieval token that calls a retriever model on demand (Step 1).

8. Subsequently, SELF-RAG concurrently processes multiple retrieved passages, evaluating their relevance and then ... (Step 2).

9. It then generates critique tokens to criticize its own output and choose best one (Step 3) in terms of factuality and overall quality.

10. Moreover, SELF-RAG provides citations for each segment with its self-assessment of whether the output is supported by the passage, leading to easier fact verification.

11. SELF-RAG trains an arbitrary LM to generate text with reflection tokens by unifying them as the next token prediction from the expanded model vocabulary.

12. Reflection tokens, inspired by reward models used in reinforcement learning, are inserted offline into the original corpus by a trained critic model.

13. While we draw inspiration from studies that use control tokens to start and guide text generation (Lu et al., 2022; Keskar et al., 2019), our trained LM uses critique tokens to assess its own predictions after each generated segment as an integral part of the generation output.

14. SELF-RAG further enables a customizable decoding algorithm to satisfy hard or soft constraints, which are defined by reflection token predictions.

15. In particular, our inference-time algorithm enables us to (1) flexibly adjust retrieval frequency for different downstream applications and (2) customize models’ behaviors to user preferences by leveraging reflection tokens through segment-level beam search using the weighted linear sum of the reflection token probabilities as segment score.

16. We introduce a method to train an arbitrary LM to learn to use retrieval on-demand for diverse instruction-following queries and introduce controlled generation guided by reflections tokens to further improve generation quality and attributions.

17. While their value function simply indicates an overall score of each generation, SELF-RAG trains to an arbitrary LM to learn to generate fine-grained self-reflection and customizable inference.

18. Other works use general control tokens to guide LM generation (Lu et al., 2022; Korbak et al., 2023), while SELF-RAG uses reflection tokens to decide the need for retrieval and to self-evaluate generation quality.

19. Our end-to-end training lets an LM M generate text informed by retrieved passages, if needed, and criticize the output by learning to generate special tokens.

20. Formally, given input x, we train M to sequentially generate textual outputs y consisting of multiple segments y = [y1, . . . , yT ], where yt indicates a sequence of tokens for the t-th segment.

21. Generated tokens in yt include text from the original vocabulary as well as the reflection tokens (Table 1).

22. In this paper, we treat one sentence as a segment in our experiments, but our framework is applicable to any segment unit (i.e., sub-sentence).

23.

Type	Input	Output	Definitions
Retrieve	x / x, y	{yes, no, continue}	Decides when to retrieve with R

24.

25. For every x and preceding generation y<t, the model decodes a retrieval token to evaluate the utility of retrieval.

26. If retrieval is needed, the model generates: a critique token to evaluate the retrieved passage’s relevance, the next response segment, and a critique token to evaluate if the information in the response segment is supported by the passage. Finally, a new critique token evaluates the overall utility of the response.

27. To generate each segment, SELF-RAG processes multiple passages in parallel and uses its own generated reflection tokens to enforce soft constraints (Section 3.3) or hard control (Algorithm 1) over the generated task output.

28. Training overview. SELF-RAG enables an arbitrary LM to generate text with reflection tokens by unifying them as next token predictions from the expanded model vocabulary (i.e., the original vocabulary plus reflection tokens). Specifically, we train the generator model M on a curated corpus with interleaving passages retrieved by a retriever R and reflection tokens predicted by a critic model C (summarized in Appendix Algorithm 2). We train C to generate reflection tokens for evaluating retrieved passages and the quality of a given task output (Section 3.2.1). Using the critic model, we update the training corpus by inserting reflection tokens into task outputs offline. Subsequently, we train the final generator model (M) using the conventional LM objective (Section 3.2.2) to enable M to generate reflection tokens by itself without relying on the critic at inference time.

29. Generating reflection tokens to self-evaluate its own output makes SELF-RAG controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements.

30. For tasks demanding factual accuracy (Min et al., 2023), we aim for the model to retrieve passages more frequently to ensure that the output aligns closely with the available evidence.

31. Conversely, in more open-ended tasks, like composing a personal experience essay, the emphasis shifts towards retrieving less and prioritizing the overall creativity or utility score.

32. Adaptive retrieval with threshold. SELF-RAG dynamically decides when to retrieve text passages by predicting Retrieve.

33. Alternatively, our framework allows a threshold to be set.

34. Specifically, if the probability of generating the Retrieve=Yes token normalized over all output tokens in Retrieve surpasses a designated threshold, we trigger retrieval (details in Appendix Section A.3).

35. Tree-decoding with critique tokens. At each segment step t, when retrieval is required, based either on hard or soft conditions, R retrieves K passages, and the generator M processes each passage in parallel and outputs K different continuation candidates.

36. We conduct a segment-level beam search (with the beam size=B) to obtain the top-B segment continuations at each timestamp t, and return the best sequence at the end of generation.

37. The score of each segment yt with respect to passage d is updated with a critic score S that is the linear weighted sum of the normalized probability of each Critique token type.

38.

39. The weights wG in Eq. 4 are hyperparameters that can be adjusted at inference time to enable customized behaviors at test time.

40. For instance, to ensure that result y is mostly supported by evidence, we can set a weight term for the ISSUP score higher, while relatively lowering weights for other aspects.

41. Alternatively, we could further enforce hard constraints during decoding using Critique.

42. Instead of using a soft reward function in Eq. 4, we could explicitly filter out a segment continuation when the model generates an undesirable Critique token (e.g., ISSUP=No support).

43. Balancing the trade-off between multiple preferences has been studied in RLHF (Touvron et al., 2023; Wu et al., 2023), which often requires training to change models’ behaviors. SELF-RAG tailors an LM with no additional training.
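
The soft scoring of excerpts 37, 39 and 40 and the hard filter of excerpts 41 and 42 can be contrasted in a short sketch. This is a minimal Python illustration under stated assumptions: the candidate dictionary layout, the field names base_score, critique_probs and issup_token, the label "Fully supported", and the function names are all hypothetical; ISREL and ISUSE are assumed names for the relevance and utility critiques; and adding S to a base score is one reading of "updated with a critic score S". It picks a single best candidate for one segment and is not the segment-level beam search of excerpt 36.

```python
from typing import Dict, List, Optional


def critic_score(norm_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear weighted sum over critique token types (excerpt 37)."""
    return sum(w * norm_probs.get(g, 0.0) for g, w in weights.items())


def select_segment(candidates: List[dict], weights: Dict[str, float],
                   hard_filter: bool = False) -> Optional[dict]:
    """Return the highest-scoring candidate, or None if all are filtered out."""
    pool = candidates
    if hard_filter:
        # Excerpt 42: drop continuations whose ISSUP token is "No support".
        pool = [c for c in pool if c["issup_token"] != "No support"]
    if not pool:
        return None
    # Assumption: the segment score is the base score plus the critic score S.
    return max(pool, key=lambda c: c["base_score"]
               + critic_score(c["critique_probs"], weights))


# Made-up numbers for illustration only.
candidates = [
    {"text": "A", "base_score": 0.7, "issup_token": "No support",
     "critique_probs": {"ISREL": 0.9, "ISSUP": 0.2, "ISUSE": 0.9}},
    {"text": "B", "base_score": 0.4, "issup_token": "Fully supported",
     "critique_probs": {"ISREL": 0.7, "ISSUP": 0.9, "ISUSE": 0.6}},
]
equal = {"ISREL": 1.0, "ISSUP": 1.0, "ISUSE": 1.0}
support_heavy = {"ISREL": 0.5, "ISSUP": 3.0, "ISUSE": 0.5}

soft_pick = select_segment(candidates, equal)["text"]
weighted_pick = select_segment(candidates, support_heavy)["text"]
hard_pick = select_segment(candidates, equal, hard_filter=True)["text"]
```

With equal weights the soft score prefers candidate A; raising the ISSUP weight, or switching on the hard filter, moves the choice to candidate B, which is the kind of inference-time adjustment excerpts 39 to 42 describe.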

Aiden Leong