October 30, 2024
How Does Critical Batch Size Scale in Pre-training?
(Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade)
Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size, concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control on factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
This paper fits a scaling law for critical batch size. The conclusion is that critical batch size grows primarily with data size rather than model size. There are also many interesting details along the way, such as the relationship between batch size and token length, and the use of a constant learning rate plus EWA as a replacement for learning-rate scheduling.
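The constant-learning-rate-plus-EWA strategy mentioned above amounts to keeping an exponential moving average of the weights alongside training and evaluating the averaged copy instead of decaying the learning rate. A minimal sketch, with parameters as a plain dict of floats and an illustrative decay value (not the paper's setup):

```python
import copy

def ema_update(avg_params, params, decay=0.99):
    """In-place exponential moving average of parameter values.

    With a constant learning rate, evaluating the EMA copy of the
    weights can play the role that learning-rate decay usually plays.
    """
    for k in avg_params:
        avg_params[k] = decay * avg_params[k] + (1.0 - decay) * params[k]
    return avg_params

# Toy usage: the "optimizer" holds the raw weight at 0.5 while the
# average, initialized at 1.0, smoothly converges toward it.
params = {"w": 1.0}
avg = copy.deepcopy(params)
for step in range(1000):
    params["w"] = 0.5  # pretend the optimizer moved the weight here
    ema_update(avg, params)
```

After enough steps the averaged weight tracks the raw weight closely, which is why a constant-LR run with EWA can stand in for a decayed schedule when comparing runs of different lengths.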
#scaling-law #hyperparameter
L3Ms -- Lagrange Large Language Models
(Guneet S. Dhillon, Xingjian Shi, Yee Whye Teh, Alex Smola)
Supervised fine-tuning (SFT) and alignment of large language models (LLMs) are key steps in providing a good user experience. However, the concept of an appropriate alignment is inherently application-dependent, and current methods often rely on heuristic choices to drive the optimization. In this work, we formulate SFT and alignment as a constrained optimization problem, where the LLM is trained on a task while being required to meet application-specific requirements, without resorting to heuristics. To solve this, we propose Lagrange Large Language Models (L3Ms), which employ logarithmic barriers to enforce the constraints. This approach allows for the customization of L3Ms across diverse applications while avoiding heuristic-driven processes. We demonstrate experimentally the versatility and efficacy of L3Ms in achieving tailored alignments for various applications.
This paper performs SFT with rewards imposed as constraints. In that sense, the authors describe it as unifying SFT and RLHF.
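The logarithmic-barrier idea can be sketched in a few lines: minimize the task loss subject to a reward constraint, with the constraint folded into the objective via a log barrier. The scalar `reward >= threshold` form and the barrier coefficient `mu` here are illustrative assumptions, not the paper's exact formulation:

```python
import math

def barrier_loss(task_loss, reward, threshold, mu=0.1):
    """Interior-point style objective for constrained fine-tuning.

    Minimizes task_loss subject to reward >= threshold by adding the
    log-barrier term -mu * log(reward - threshold), which blows up as
    the reward approaches the constraint boundary from above.
    """
    if reward <= threshold:
        return float("inf")  # infeasible point: barrier is undefined
    return task_loss - mu * math.log(reward - threshold)
```

As `mu` is annealed toward zero, minimizers of the barrier objective approach solutions of the original constrained problem, which is how barrier methods avoid hand-tuned penalty heuristics.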
#reward-model #alignment
Task Vectors are Cross-Modal
(Grace Luo, Trevor Darrell, Amir Bar)
We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar and instruction based task vectors produce better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page:
https://task-vectors-are-cross-modal.github.io.
This research shows that embeddings describing the task, extracted from in-context learning prompts, work across modalities. Specifically, an embedding derived from a task specified in text, when applied to an image input, makes the model perform the same task on the image. This might be related to findings suggesting that alignment between images and text is crucial for the performance of image-text models. (https://arxiv.org/abs/2408.16357)
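The extract-and-patch mechanics behind task vectors can be sketched with plain NumPy arrays standing in for per-token hidden states. Averaging over exemplars and patching at the final prompt token are simplifying assumptions here, not the paper's exact procedure:

```python
import numpy as np

def extract_task_vector(hidden_states):
    """Average the hidden state of the last prompt token across
    in-context exemplars (each element is a [tokens, dim] array)."""
    return np.mean([h[-1] for h in hidden_states], axis=0)

def patch_task_vector(hidden, task_vec, position):
    """Overwrite the activation at `position` with the task vector,
    transferring the task to a query from another modality."""
    patched = hidden.copy()
    patched[position] = task_vec
    return patched

# Toy usage: two text exemplars with 4 tokens and dim 3 each.
text_hiddens = [np.zeros((4, 3)), np.ones((4, 3))]
tv = extract_task_vector(text_hiddens)           # shape (3,)
image_hidden = np.zeros((5, 3))                  # query from another modality
patched = patch_task_vector(image_hidden, tv, 2)
```

The cross-modal claim is that a `tv` computed from text exemplars, patched into a forward pass on an image query, steers the model to perform the same task on the image.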
#in-context-learning #multimodal