June 11, 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
(Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan)
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance when scaled properly. We reexamine the design spaces of image tokenizers, the scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, reconstruction quality of 0.94 rFID, and codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark, outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, trained in two stages on LAION-COCO and images of high aesthetic quality, demonstrating competitive visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve a 326%-414% speedup. We release all models and code to facilitate the open-source community of visual generation and multimodal foundation models.
Results on autoregressive image generation are starting to arrive in earnest. Here the performance gains come from the combination of a better tokenizer and better data.
Image generation by itself seems to be producing plenty of interesting results; the interesting next problem is how to unify image recognition and image generation. If reconstruction quality is good enough, a tokenizer-based approach might work there as well, and it would be interesting to see more experimental results on that question. Testing on a task like OCR would be one way to probe it.
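To make the recipe in the abstract concrete, here is a minimal sketch of class-conditional next-token image generation in this style. Everything here is an assumption for illustration: `vq_encoder`, `vq_decoder`, and `transformer` are hypothetical stand-ins for a downsample-16 VQ tokenizer and a Llama-style decoder, not the released LlamaGen API.

```python
# Minimal sketch: train a plain decoder with next-token cross-entropy over
# VQ codebook indices, then sample autoregressively and decode to pixels.
import torch
import torch.nn.functional as F

def train_step(transformer, vq_encoder, images, class_ids):
    # A downsample-16 VQ encoder maps a 256x256 image to a 16x16 grid of
    # codebook indices, which we flatten into a 256-token sequence.
    with torch.no_grad():
        codes = vq_encoder(images)            # (B, 16, 16) integer indices
    tokens = codes.flatten(1)                 # (B, 256)
    # Prepend the class label as a conditioning token, exactly as a
    # language model would prepend a prompt.
    inputs = torch.cat([class_ids[:, None], tokens[:, :-1]], dim=1)
    logits = transformer(inputs)              # (B, 256, codebook_size)
    # Plain next-token cross-entropy -- no diffusion, no visual prior.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

@torch.no_grad()
def sample(transformer, vq_decoder, class_id, seq_len=256, temperature=1.0):
    tokens = torch.tensor([[class_id]])
    for _ in range(seq_len):
        logits = transformer(tokens)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    grid = tokens[:, 1:].reshape(1, 16, 16)   # drop class token, unflatten
    return vq_decoder(grid)                   # back to pixels
```

The point of the sketch is that nothing image-specific enters the model itself; all the visual structure lives in the tokenizer, which is why tokenizer quality and data quality carry so much of the result.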
#autoregressive-model #vq #image-generation
Language Models Resist Alignment
(Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Yaodong Yang)
Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or is it merely superficial? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process disproportionately undermines alignment compared to pre-training, potentially by orders of magnitude. We conduct experimental validations to confirm the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our discovery signifies the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.
An investigation into why it is so easy to break the alignment of a model that has gone through an alignment process. The argument is that since the pre-training dataset is so much larger, it is comparatively easy to pull a post-trained model back to the pre-training distribution.
Looked at the other way, you could say post-training is meaningful precisely because what was learned during pre-training is preserved through post-training. (And I think approaching post-training from this perspective is the more useful one.)
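As a rough illustration of the elasticity claim (not the paper's actual protocol), one could fine-tune an aligned model on unrelated text and track its loss on two probe sets. The model name and probe data below are placeholders.

```python
# Hedged sketch: elasticity predicts the alignment-probe loss climbs fast
# and then flattens as the model drifts back toward the pre-training
# distribution, while the pretrain-probe loss barely moves.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("some-aligned-model")  # placeholder
tok = AutoTokenizer.from_pretrained("some-aligned-model")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

finetune_texts = ["..."]   # any further fine-tuning corpus (placeholder)
alignment_probe = ["..."]  # held-out aligned-behavior completions (placeholder)
pretrain_probe = ["..."]   # held-out pre-training-style text (placeholder)

def mean_nll(texts):
    # Mean next-token loss over a small probe set.
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            batch = tok(t, return_tensors="pt")
            losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

for step, text in enumerate(finetune_texts):
    model.train()
    batch = tok(text, return_tensors="pt")
    model(**batch, labels=batch["input_ids"]).loss.backward()
    opt.step()
    opt.zero_grad()
    if step % 50 == 0:
        # Watch the first number rise quickly, then plateau.
        print(step, mean_nll(alignment_probe), mean_nll(pretrain_probe))
```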
#alignment #posttraining
Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
(Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng)
Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. Additionally, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.
An extension of the approach of deliberately using QA data to accelerate model learning (https://arxiv.org/abs/2402.12847). Here they add further tasks on top of it.
As with the practice of mixing instruction data into pre-training data, the key question is whether these additional tasks translate into real performance gains. Still, viewed as a kind of training objective, they may well be meaningful.
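To make the Self-Teaching idea concrete, here is a hedged sketch of augmenting raw documents with self-generated tasks along the paper's memorization/comprehension/self-reflection axes. The prompt templates and the `generate` callable are assumptions for illustration, not the paper's exact prompts.

```python
# Sketch: turn raw documents into (document + self-generated task) examples
# for continued pre-training. `generate` is a placeholder for any LLM call.
TASK_PROMPTS = {
    "memorization": "Repeat the key facts stated in the passage:\n{doc}",
    "comprehension": "Write three question-answer pairs grounded in the passage:\n{doc}",
    "self_reflection": "Summarize the passage, then list what remains unclear:\n{doc}",
}

def build_self_teaching_corpus(documents, generate):
    """Augment each raw document with self-supervised tasks."""
    corpus = []
    for doc in documents:
        corpus.append(doc)  # keep the raw document itself for memorization
        for name, template in TASK_PROMPTS.items():
            prompt = template.format(doc=doc)
            completion = generate(prompt)  # the model teaches itself
            corpus.append(prompt + "\n" + completion)
    return corpus
```

Training would then continue on the returned corpus with the usual language-modeling loss, which matches the reading above of these tasks as a kind of auxiliary training objective.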
#pretraining