February 17, 2025
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
(Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini)
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
The authors classify a web corpus by topic and format, then pretrain using an algorithm that determines the sampling ratio for these domains (https://arxiv.org/abs/2407.01492).
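To make the mixing step concrete, here is a minimal Python sketch, assuming documents have already been annotated with (topic, format) domain labels by the distilled classifiers and that target domain weights were fitted upstream by the ratio-fitting algorithm cited above; the function and field names are hypothetical, not the authors' code.

```python
import random
from collections import defaultdict

def resample_by_domain(docs, target_weights, total_docs, seed=0):
    """Resample annotated documents so the output follows the target
    (topic, format) domain mixture. Purely illustrative sketch."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)  # e.g. domain = ("science", "tutorial")

    sampled = []
    for domain, weight in target_weights.items():
        pool = by_domain.get(domain, [])
        k = round(weight * total_docs)
        if pool and k > 0:
            # sample with replacement so under-represented domains can be upsampled
            sampled.extend(rng.choices(pool, k=k))
    rng.shuffle(sampled)
    return sampled
```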
#corpus #pretraining
FoNE: Precise Single-Token Number Embeddings via Fourier Features
(Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan)
Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64× less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3× and 6× fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at
https://fouriernumber.github.io/.
A method for embedding a number as a single token by transforming each digit into Fourier features. Interesting.
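A rough NumPy sketch of the idea as described in the abstract: each digit place contributes a (cos, sin) pair of the number's phase at that scale, giving two embedding dimensions per digit. The function name and exact scaling are my own illustration, not the released code.

```python
import numpy as np

def fone_embedding(x: int, num_digits: int = 6) -> np.ndarray:
    """Map a number to a single 2*num_digits-dimensional vector of Fourier features."""
    feats = []
    for i in range(num_digits):
        period = 10 ** (i + 1)          # the i-th decimal place repeats with this period
        phase = 2 * np.pi * x / period  # so (cos, sin) of this phase encodes that digit
        feats.extend([np.cos(phase), np.sin(phase)])
    return np.array(feats)

# e.g. a 6-digit number fits in a single 12-dimensional vector
print(fone_embedding(123456))
```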
#tokenizer #embedding
Large Language Diffusion Models
(Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li)
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
An 8B-parameter masked diffusion model trained on 2.3T tokens. Some research on discrete diffusion has been appearing lately (https://arxiv.org/abs/2502.09622). Still, the characteristics of masked diffusion, such as token diversity, may need further analysis (https://arxiv.org/abs/2409.02908).
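For context, a rough PyTorch sketch of the masked-diffusion objective the abstract describes: sample a masking ratio t, mask tokens at that rate, have the Transformer predict the originals, and weight the cross-entropy by 1/t to obtain the likelihood bound. The model interface and the normalization here are assumptions on my part, not the authors' code.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_token_id):
    """One masked-diffusion training step (illustrative sketch)."""
    batch, seq_len = tokens.shape
    t = torch.rand(batch, 1).clamp(min=1e-3)       # masking ratio per sequence
    mask = torch.rand(batch, seq_len) < t          # positions to corrupt
    noisy = torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)

    logits = model(noisy)                          # (batch, seq_len, vocab), assumed API
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # only masked positions contribute; the 1/t weight yields a likelihood bound
    return (ce * mask.float() / t).sum() / mask.sum().clamp(min=1)
```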
#diffusion
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
(Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo , Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang)
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
A 30B-parameter text-to-video model. Recent technical reports on video generation models tend to be relatively detailed about their data curation pipelines.
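Since the report trains the DiT with Flow Matching, here is a hedged sketch of that objective on video latents; the dit call signature and latent layout are assumptions, not the paper's code. With the stated 16x16 spatial and 8x temporal compression, a 204-frame clip at H x W pixels becomes latents of roughly (204 // 8, H // 16, W // 16) per channel.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, latents, text_emb):
    """Rectified-flow-style objective on video latents (illustrative sketch)."""
    noise = torch.randn_like(latents)               # latents: (B, C, T', H', W'), assumed layout
    t = torch.rand(latents.shape[0], 1, 1, 1, 1)    # one timestep per sample
    x_t = (1 - t) * noise + t * latents             # straight-line path from noise to data
    target_velocity = latents - noise               # constant velocity along that path
    pred = dit(x_t, t.flatten(), text_emb)          # assumed DiT call signature
    return F.mse_loss(pred, target_velocity)
```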
#text-to-video #diffusion