June 21, 2024

Claude 3.5 Sonnet

(Anthropic)

Claude 3.5 Sonnet is out. https://x.com/AnthropicAI/status/1803774865473696237 It arrived not long after being teased.

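Between the teaser and the release I briefly wondered how much of an improvement the new model would be, and the result turned out to surpass Claude 3 Opus. 3.5 Opus and 3.5 Haiku are reportedly coming later this year as well.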

On the chat interface side, they added Artifacts, a workspace that saves and runs code and other outputs generated by Claude.

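Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet are all competing in the $7 - $15 per 1M tokens range while achieving scores that surpass the earlier GPT-4. It makes me wonder how large these models are, how much they were trained, and whether the composition of their data differs greatly from that of earlier models. Above all, I suspect that the composition of the data is changing.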

#llm

Optimizing AI Inference at Character.AI

(Character.AI)

Architecture modifications for inference optimization at Character.AI.

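MQA instead of GQA
A combination of 5 Local Attention layers + 1 Global Attention layer
KV cache sharing across layers.
Similar to Radix Attention, the KV cache is organized as a tree so that it can be shared across multiple turns
Int8 quantization for weights, activations, and the KV cache. Surprisingly, they say they trained in Int8 rather than using PTQ.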
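
To get a feel for how the changes listed above compound, here is a rough back-of-the-envelope KV cache size calculation. Every number in it (layer count, head counts, window size, sharing factor) is a hypothetical value chosen for illustration, and `kv_cache_bytes` is a made-up helper, not Character.AI's implementation. Treating cross-layer sharing as a flat division is also a simplifying assumption.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem,
                   local_window=None, global_every=1, share_factor=1):
    """Rough per-sequence KV cache size in bytes (illustrative only)."""
    # Keys and values: 2 tensors per layer, per cached token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem

    if local_window is None:
        # Every layer attends over the full sequence.
        n_global, n_local = n_layers, 0
    else:
        # Assumption: 1 global layer out of every `global_every` layers.
        n_global = n_layers // global_every
        n_local = n_layers - n_global

    cached_tokens = n_global * seq_len
    if n_local:
        # Local layers only need to keep the last `local_window` tokens.
        cached_tokens += n_local * min(seq_len, local_window)

    # Assumption: sharing one cache among `share_factor` layers divides the total.
    return per_token_per_layer * cached_tokens // share_factor


# Hypothetical baseline: GQA with 8 KV heads, 16-bit cache, all-global attention.
baseline = kv_cache_bytes(seq_len=8192, n_layers=36, n_kv_heads=8,
                          head_dim=128, bytes_per_elem=2)

# Hypothetical optimized setup: MQA, int8 cache, 5 local + 1 global layers,
# and pairs of layers sharing a cache.
optimized = kv_cache_bytes(seq_len=8192, n_layers=36, n_kv_heads=1,
                           head_dim=128, bytes_per_elem=1,
                           local_window=1024, global_every=6, share_factor=2)

print(f"baseline:  {baseline / 2**20:.1f} MiB")
print(f"optimized: {optimized / 2**20:.1f} MiB")
print(f"reduction: {baseline / optimized:.1f}x")
```

The point of the sketch is only that the individual savings multiply: fewer KV heads, fewer bytes per element, fewer cached tokens on local layers, and fewer layers that own a cache.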

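There has been talk that the gap with the leading companies is large not only in the recipe for model quality but also in inference optimization, and this post seems to reveal a bit of that. Cross-layer cache sharing is a method that has only recently started appearing in papers, yet they were already using it. Something like Int8 training is barely known publicly.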

#transformer #efficiency

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

(Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, Aviral Kumar)

Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by 8×. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.

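An experiment that instills math problem-solving ability using synthetic data. The key points are: 1. Filtering the positives out of data generated by the model itself and training on them is more data-efficient than using data generated by a stronger model. 2. However, training only on positives risks learning wrong patterns (spurious correlations). 3. Negative samples can be used to highlight the important steps, and this greatly improves data efficiency.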
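
One way to picture the per-step credit assignment behind point 3 is a Monte Carlo estimate: from each prefix of a solution, sample several completions, take the fraction that reach the correct final answer as the value of that prefix, and define a step's advantage as the change in value it causes. The sketch below assumes this simple estimator; the function names and the rollout outcomes are hypothetical, and it is not the authors' exact procedure.

```python
def estimate_q(rollout_outcomes):
    """Fraction of sampled completions from a prefix that end in a correct answer."""
    return sum(rollout_outcomes) / len(rollout_outcomes)


def step_advantages(rollouts_per_prefix):
    """Advantage of step i = value after step i minus value before step i.

    rollouts_per_prefix[0] holds outcomes sampled from the bare problem,
    rollouts_per_prefix[i] holds outcomes sampled after steps 1..i.
    """
    q = [estimate_q(outcomes) for outcomes in rollouts_per_prefix]
    return [q[i] - q[i - 1] for i in range(1, len(q))]


# Hypothetical outcomes (1 = correct final answer) for a 3-step incorrect solution.
rollouts = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # problem only: 4/8 succeed
    [1, 1, 1, 1, 1, 0, 0, 0],  # after step 1: 5/8 succeed
    [1, 0, 0, 0, 0, 0, 0, 0],  # after step 2: 1/8 succeed
    [0, 0, 0, 0, 0, 0, 0, 0],  # after step 3: none succeed
]

advantages = step_advantages(rollouts)
# The step with the most negative advantage is the one that derailed the solution.
worst_step = min(range(len(advantages)), key=advantages.__getitem__)
print(advantages, "most harmful step index:", worst_step)
```

In this toy example the first step is mildly helpful and the second is where the solution goes wrong, so a training signal built from these advantages would penalize step 2 rather than the whole response.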

#rlaif

Consistency Models Made Easy

(Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, J. Zico Kolter)

Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while indeed improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained of hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (https://github.com/locuslab/ect) is available.

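The idea is that the training objective of Consistency Models is df/dt = 0, and that training with dt too small is what makes training difficult. Setting dt = t coincides with a diffusion model, so the algorithm starts from dt = t and trains while gradually decreasing dt.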
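
The "start at dt = t, then shrink dt" idea can be sketched as a schedule that picks the second time point r = t - dt used in the consistency loss. The halving rule, the stage length, and the function names below are assumptions made for illustration; the paper's actual schedule may differ.

```python
def delta_t(t, step, steps_per_stage, shrink=2.0):
    """Gap between the two time points compared by the consistency loss.

    Stage 0 uses dt = t (the paired point is r = 0, the diffusion-like case);
    each later stage divides dt by `shrink` (hypothetical schedule).
    """
    stage = step // steps_per_stage
    return t / (shrink ** stage)


def paired_time(t, step, steps_per_stage, shrink=2.0):
    """Earlier time point r = t - dt that the model output at t is matched to."""
    return t - delta_t(t, step, steps_per_stage, shrink)


# How the pair (t, r) tightens over training for t = 0.8.
for step in (0, 100, 200, 300):
    r = paired_time(0.8, step, steps_per_stage=100)
    print(f"step {step}: t = 0.8, r = {r:.3f}")
```

As the stage index grows, r approaches t, so the finite difference (f(t) - f(r)) / dt moves toward the df/dt = 0 condition while starting from a regime where the pretrained diffusion model is already a good solution.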

#diffusion

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

(Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou)

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.

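Improving instruction-following ability with data generated by the model itself. Admittedly, the targets here are cases like IFEval and FollowBench, where much of the evaluation can be verified with code. Apparently it also went into Qwen 2.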

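It is fairly complicated: create seed instructions, augment the instructions with an LLM, have an LLM write functions that verify responses, and re-check these functions via back translation to build pairs of an instruction and its verification function.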

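Next, they combine user queries with these instructions to form new instructions, verify that the resulting instructions have no problems, and check the responses with the verification functions. This feedback is then used for alignment.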
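
The verification part of the pipeline described above can be sketched in a few lines: an instruction comes with a checker function, the checker is itself validated against unit test cases, and only then is it used to accept or reject sampled responses. The instruction, the checker, the test cases, and the candidate responses below are all hypothetical stand-ins for what an LLM would generate; this is not code from the AutoIF repository.

```python
# Hypothetical instruction: "Answer in at most 5 words."
def verify_max_words(response, max_words=5):
    """Checker for the instruction: True if the response obeys the word limit."""
    return len(response.split()) <= max_words


# Unit test cases for the checker itself: (response, expected verdict).
UNIT_TESTS = [
    ("Paris is the capital.", True),                    # 4 words
    ("The capital city of France is Paris.", False),    # 7 words
    ("", True),                                         # empty response
]


def passes_unit_tests(verifier, cases):
    """Keep a checker only if it agrees with every unit test case."""
    return all(verifier(text) == expected for text, expected in cases)


def split_by_feedback(responses, verifier):
    """Execution feedback: accepted responses for SFT, both lists for preference pairs."""
    accepted = [r for r in responses if verifier(r)]
    rejected = [r for r in responses if not verifier(r)]
    return accepted, rejected


candidates = [
    "Paris.",
    "It is Paris, the capital and largest city of France.",
]

if passes_unit_tests(verify_max_words, UNIT_TESTS):
    accepted, rejected = split_by_feedback(candidates, verify_max_words)
    print("accepted:", accepted)
    print("rejected:", rejected)
```

Accepted responses could serve as SFT data, and accepted/rejected pairs for the same prompt could serve as preference data for DPO-style training.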

#instruction-tuning #alignment #synthetic-data