2024년 2월 16일

Gemini 1.5 Pro, Sora

Feb 16, 2024

Gemini 1.5 Pro가 등장했습니다! 텍스트로 10M (!) Context Length를 지원하는 모델입니다. 영상과 오디오는 Context Length가 다른데 영상은 초당 1 프레임으로 3시간, 오디오는 22 시간, 텍스트는 1440 페이지 분량이라고 하는군요.

긴 Context Length를 사용해 Kalamang라는 언어에 대한 자료를 넣고 바로 English <-> Kalamang 번역을 해버리는 능력을 보여줍니다.

MoE 모델이고 상당수 벤치마크에서 Gemini 1.0 Pro를 넘어 Gemini 1.0 Ultra를 상회하는 퍼포먼스를 보여줍니다. 무시무시하네요.

OpenAI Sora

OpenAI의 Text-to-Video 모델 Sora가 공개됐습니다.

Unreal 5나 NeRF 같은 도구로 합성한 데이터가 들어갔으리라는 추측이 많았지만 그런 흔적은 없습니다. 언급하지 않는다는 것이 부재한다는 증명은 아니지만요.

대신 보이는 것은 Spatiotemporal Latent에 대한 Diffusion Transformer 뿐이군요. 그리고 그 Diffusion Transformer를 크롭하지도 리사이즈 하지도 않은 비디오로 학습시켰습니다. 그리고 이를 뒷받침하는 충분한 연산력 투입. 결과적으로 최대 1920x1080p

영상에 대한 샘플링이 가능하군요. (2048x2048 이미지 샘플링도 가능하다고 합니다.) 추가적으로 DALL-E 3처럼 Recaptioning을 했습니다.

그리고 이렇게 학습된 모델에서 3D에 대한 어떠한 Inductive Bias도 없이 3D 일관성 장면의 연속성, 대상 영속성이 창발적으로 나타났습니다. 지금 이 시점에 규모의 힘을 의심해서는 안 되죠.

비디오 모델이 World Model이 될 수 있다는 추측이 많이 있었는데 비디오 내에서 행위에 의한 세상의 변화 과정을 시뮬레이션 하는 패턴이 나타났다는 것이 그걸 강력하게 뒷받침하고 있는 것으로 보입니다.

더 많은 데이터, 더 거대한 모델, 더 막대한 연산력이 문제를 해결한다는 것을 다시 보여주는군요.

#video-generation #world-models

Chain-of-Thought Reasoning Without Prompting

(Xuezhi Wang, Denny Zhou)

In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason effectively without prompting? Our findings reveal that, intriguingly, CoT reasoning paths can be elicited from pre-trained LLMs by simply altering the \textit{decoding} process. Rather than conventional greedy decoding, we investigate the top-k alternative tokens, uncovering that CoT paths are frequently inherent in these sequences. This approach not only bypasses the confounders of prompting but also allows us to assess the LLMs' \textit{intrinsic} reasoning abilities. Moreover, we observe that the presence of a CoT in the decoding path correlates with a higher confidence in the model's decoded answer. This confidence metric effectively differentiates between CoT and non-CoT paths. Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding substantially outperforms the standard greedy decoding.

Chain of Thought 프롬프팅 없이 Chain of Thought를 유도하기. LLM에서 응답할 때 Greedy 디코딩을 하는 것이 아니라 Top-K 토큰 중 특정 토큰으로 시작하게 하면 (그리고 나머지 토큰은 그냥 Greedy 디코딩을 하더라도) Chain of Thought 같은 추론이 발생한다는 결과입니다.

그래서 어떤 토큰으로 시작해야 하는가? 이렇게 디코딩을 하면 정답으로 이어지는 경우에는 Confidence가 더 높다고 합니다.

바로 답을 하려고 하는 경향이 LLM의 문제라고 하는데 그걸 살짝 억제해주면 좋은 패턴이 나타날 수 있다는 사례네요.

#prompt #decoding

How to Train Data-Efficient LLMs

(Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng)

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.

LLM 프리트레이닝 데이터셋을 필터링하기 위한 방법 제안. 한 가지는 임베딩을 사용해 분포 밀도로 분포에 대한 커버는 유지하면서 중복을 제거하는 평이한(?) 방법인 Density이고, 다른 방법은 LLM에게 프리트레이닝 데이터로 쓸만하겠냐고 물어보는 Ask-LLM 입니다.

실험이 굉장히 많은데 결론은 Ask-LLM이 (비싸다는 것을 제외하면) 전체 데이터 사용 이상으로 성능을 향상시키는 것에 유망한 방법이라고 보고 있네요.

#dataset

BitDelta: Your Fine-Tune May Only Be Worth One Bit

(James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai)

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.

베이스 모델과 파인튜닝한 모델의 Weight의 차이를 1 bit Quantization해서 크기를 줄이겠다는 아이디어. PEFT로 풀 파인튜닝의 효과를 도달하려고 하는 대신 풀 파인튜닝의 체크포인트 크기를 줄여보면 어떨까 하는 생각이군요.

#quantization #efficiency