January 9, 2024
Mixtral of Experts
(Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed)
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
The Mixtral technical report is out. As with Mistral's reports in general, there isn't much here that's genuinely new. The interesting part is the observation of how the MoE routing varies across domains and tokens: there seems to be almost no difference by domain, while the same (consecutive) tokens show a pattern of being routed to the same expert.
That this pattern appears is presumably why token-ID-based routing can work at all. The fact that consecutive tokens end up at the same expert also reminds me of the approach of building a routing token that stands in for several tokens and routing on it (https://arxiv.org/abs/2311.10768). Seems worth thinking about.
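For reference, the per-token top-2 routing described in the abstract looks roughly like the sketch below. The module names and the plain-MLP experts are simplifications of mine; Mixtral's actual experts are SwiGLU blocks, and real implementations batch tokens per expert rather than looping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a sparse MoE feedforward layer: 8 experts, 2 active per token."""
    def __init__(self, dim=4096, hidden=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, dim)
        logits = self.gate(x)                                  # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

The routing analysis in the report is essentially a study of how `idx` behaves across layers, domains, and adjacent tokens.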
#llm #moe
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
(Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang)
We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
A new code benchmark. Given a function, the task is to predict the output for a given input, or to predict an input that yields a given output. The gap between GPT-3.5 and the other code models seems to have narrowed quite a bit, but GPT-4 is still far ahead.
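To make the task format concrete, here is a made-up item in the spirit of the benchmark (not an actual CRUXEval example): the model is shown the function and one side of an input-output pair and has to fill in the other side.

```python
def f(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * x)  # square even numbers
        else:
            out.append(x + 1)  # increment odd numbers
    return out

# Output prediction: given the input, fill in the expected return value.
assert f([1, 2, 3, 4]) == [2, 4, 4, 16]

# Input prediction: find any input that produces the given output.
assert f([3, 0]) == [4, 0]
```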
#benchmark #code
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
(Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal)
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
A method for computing a Minimax Winner from preferences instead of rewards. It connects to Nash Learning (https://arxiv.org/abs/2312.00886). The underlying view is that preferences are more desirable than rewards, and the method learns the minimax solution with a single model.
Looking at the algorithm, sampled trajectories go into a queue; as new samples are generated, preferences against the samples in the queue are computed and the win rate is used as the reward. For now the experiments cover only continuous control tasks, but with preference-based methods appearing one after another, this has my attention.
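My reading of that loop, as a minimal sketch: the function names and the toy preference function are mine, `preference(a, b)` stands in for a rater or preference model returning the probability that trajectory `a` beats `b`, and a real implementation would feed this reward into whatever policy optimizer is being used.

```python
from collections import deque
import random

def spo_reward(new_traj, queue, preference):
    """Win rate of `new_traj` against the trajectories currently in the queue."""
    if not queue:
        return 0.5  # no comparisons yet; assume a neutral reward
    wins = sum(preference(new_traj, old) for old in queue)
    return wins / len(queue)

# Usage sketch: roll out the policy, score each new trajectory against the
# queue, then push it so later samples are compared against it (self-play).
queue = deque(maxlen=64)  # assumed queue size

def toy_preference(a, b):  # toy stand-in: prefer the shorter trajectory
    return 1.0 if len(a) < len(b) else 0.0 if len(a) > len(b) else 0.5

for _ in range(10):
    traj = [random.random() for _ in range(random.randint(3, 8))]  # fake rollout
    reward = spo_reward(traj, queue, toy_preference)  # used as the RL reward
    queue.append(traj)
```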
#rlhf