December 13, 2024
Phi-4 Technical Report
(Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang)
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
Phi-4 has been released. It is clear that Phi-style synthetic data can push benchmark scores very high. If that correlates well with actual usefulness, great; if not, it would mean benchmark performance is not a reliable criterion for evaluation. I do believe synthetic data can play an important role, but if benchmark scores prove to be insufficient indicators, we should be more cautious in judging its effectiveness.
#llm #synthetic-data
Byte Latent Transformer: Patches Scale Better Than Tokens
(Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer)
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
A byte-level LM. To group bytes into patches, a small byte-level LM measures next-byte entropy, and the sequence is split at points where the entropy is high. Unlike BPE, there is no fixed vocabulary: different byte patches can emerge depending on context and prediction difficulty. Interesting.
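The entropy-based segmentation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `entropy_fn` stands in for the small byte-level LM, and the threshold and toy entropy model are purely hypothetical.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def segment_into_patches(byte_seq, entropy_fn, threshold=2.0):
    """Split a byte sequence into patches: whenever the predicted
    next-byte entropy after the current prefix exceeds the threshold,
    close the current patch and start a new one.

    `entropy_fn(prefix)` stands in for the small byte-level LM from
    the paper; the threshold value here is illustrative only."""
    patches, current = [], []
    for i, b in enumerate(byte_seq):
        current.append(b)
        if entropy_fn(byte_seq[: i + 1]) > threshold:
            patches.append(bytes(current))
            current = []
    if current:
        patches.append(bytes(current))
    return patches

# Toy entropy model: pretend the next byte is hard to predict
# right after a space (e.g. the start of a new word).
def toy_entropy(prefix):
    return 3.0 if prefix[-1:] == b" " else 0.5

patches = segment_into_patches(b"hello world", toy_entropy)
```

With this toy model, `b"hello world"` is cut right after the space, so low-entropy (predictable) runs end up in long patches while high-entropy boundaries start new ones, which is the mechanism that lets BLT spend less compute on predictable stretches.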
#tokenizer
Large Concept Models: Language Modeling in a Sentence Representation Space
(The LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk)
LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.
A model that treats sentence embeddings as concepts and autoregressively generates a sequence of these concepts. It reminds me of classical ideas like Skip-Thought Vectors. That said, targeting sentence embeddings directly will likely cause many problems of its own.
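The simplest of the training objectives the abstract mentions, MSE regression, can be sketched as predicting the next sentence embedding from the previous ones. This is a hedged toy version: a single linear map `W` is a hypothetical stand-in for the autoregressive transformer operating in the SONAR embedding space.

```python
import numpy as np

def next_embedding_mse(W, embeddings):
    """MSE-regression objective sketch for an LCM-style setup:
    each sentence embedding e_t predicts the next embedding e_{t+1}.

    `W` (a linear map) is an illustrative stand-in for the paper's
    autoregressive model; `embeddings` is a (T, d) array of T
    sentence embeddings of dimension d."""
    preds = embeddings[:-1] @ W       # predict e_{t+1} from e_t
    targets = embeddings[1:]
    return float(np.mean((preds - targets) ** 2))

# A constant embedding sequence is perfectly predicted by the identity map.
seq = np.ones((5, 4))
zero_loss = next_embedding_mse(np.eye(4), seq)
```

One reason this framing is tricky, as noted above, is that the loss lives entirely in embedding space: a low MSE does not guarantee the predicted vector decodes to a fluent or even valid sentence, which is part of why the paper also explores diffusion-based and quantized variants.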
#embedding #llm