September 24, 2024
Backtracking Improves Generation Safety
(Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith)
Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% → 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
Using backtracking to steer the model toward safe responses. The model is trained to reset partway through generating an unsafe response and then correct course with a safe one. A minimal sketch of how this could look at inference time is below.
Self-correction is the hottest topic right now, so this can also be viewed from that angle.
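A minimal decoding sketch, assuming a checkpoint fine-tuned to emit a special [RESET] token when its own partial generation turns unsafe. The model path, the [RESET] string, and the "keep only what follows the last [RESET]" policy are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: generate normally, then treat everything before the last [RESET]
# as retracted and surface only the recovered response.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/backtracking-llama-3-8b"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def generate_with_backtracking(prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
    )
    # Keep special tokens so the [RESET] marker survives decoding.
    completion = tokenizer.decode(
        output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )
    if "[RESET]" in completion:
        # The model reset itself mid-generation: discard the unsafe prefix.
        completion = completion.split("[RESET]")[-1]
    return completion.strip()
```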
#rlhf #safety
Instruction Following without Instruction Tuning
(John Hewitt, Nelson F. Liu, Percy Liang, Christopher D. Manning)
Instruction tuning commonly means finetuning a language model on instruction-response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of responses. However, we then find it's not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models' responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model's distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words' probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.
A result showing that instruction-following ability emerges even when training only on responses without any instruction inputs, or when training on a single narrow task. Going further, instruction following also emerges from rule-based modifications of the token distribution; see the sketch below.
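The rule-based expert is concrete enough to sketch: combine the pretrained model's next-token logits with hand-written rules as a product of experts, i.e., a sum in log space. The specific coefficients, schedule, and choice of uplifted tokens below are illustrative guesses, not the paper's values.

```python
# Sketch of a rule-based "expert" combined with a pretrained LM's logits.
import torch

def product_of_experts_logits(
    lm_logits: torch.Tensor,    # (vocab_size,) next-token logits from the pretrained model
    generated_ids: list[int],   # tokens emitted so far
    eos_id: int,
    uplift_ids: list[int],      # ~15 token ids whose probability is shifted uniformly
    step: int,                  # current generation step
) -> torch.Tensor:
    rule_logits = torch.zeros_like(lm_logits)
    # Rule 1: slowly increase the probability of ending the sequence as length grows.
    rule_logits[eos_id] += 0.05 * step
    # Rule 2: penalize repetition of tokens already generated.
    for tok in set(generated_ids):
        rule_logits[tok] -= 1.0
    # Rule 3: uniformly shift the probabilities of a small fixed word set.
    rule_logits[uplift_ids] += 2.0
    # Product of experts in probability space = sum of logits in log space.
    return lm_logits + rule_logits
```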
#instruction-tuning #alignment
Direct Judgement Preference Optimization
(Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty)
Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analysis shows that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
Combining CoT with a model that evaluates responses. There was a similar attempt with reward models not long ago (https://arxiv.org/abs/2408.11791). Here they combine evaluation without CoT and a task of inferring the response from the evaluation result, then train with DPO. Positive/negative pairs are determined by the teacher model's size or by whether the teacher's output matches the ground-truth label; a rough sketch of the pair construction follows.
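A hedged sketch of assembling DPO preference pairs for a generative judge, assuming gold verdicts are available: judgments whose final verdict matches the label become "chosen" and mismatches become "rejected". The prompt template, the naive verdict parsing, and the dict format are hypothetical; the teacher-size criterion and the auxiliary tasks from the paper are omitted here.

```python
# Sketch: turn sampled CoT judgments into (prompt, chosen, rejected) triples
# suitable for standard DPO training.
def build_judge_preference_pairs(examples):
    pairs = []
    for ex in examples:
        prompt = (
            "Evaluate the following response to the instruction.\n"
            f"Instruction: {ex['instruction']}\n"
            f"Response: {ex['response']}\n"
            "Give your reasoning, then a final verdict (Good/Bad)."
        )
        chosen, rejected = None, None
        for judgment in ex["sampled_judgments"]:  # CoT evaluations sampled from judge/teacher models
            verdict = judgment.strip().split()[-1]  # naive parse of the final verdict
            if verdict == ex["gold_verdict"] and chosen is None:
                chosen = judgment      # verdict agrees with the ground-truth label
            elif verdict != ex["gold_verdict"] and rejected is None:
                rejected = judgment    # verdict disagrees with the ground-truth label
        if chosen and rejected:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```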
#reward-model