February 21, 2024
GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
(Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Raileanu)
State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or V*. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.
Similar to Math-Shepherd (https://arxiv.org/abs/2312.08935), this is a pipeline that uses the final answer to generate step-level rewards (a Step-Wise ORM). Here the authors go further: they train refinement models and rerank the results with the ORM.
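A minimal sketch of the step-labeling idea, assuming two hypothetical helpers: `sample_completions(question, prefix, n)` rolls out the current policy n times from a partial solution, and `is_correct(question, completion)` checks the final answer.

```python
def sorm_labels(question, steps, sample_completions, is_correct, k=8):
    """Label each step prefix by whether the policy can still reach a
    correct final answer within k rollouts -- a Monte Carlo estimate of
    V* at that step, versus the ORM's single-rollout signal."""
    labels = []
    for i in range(1, len(steps) + 1):
        rollouts = sample_completions(question, steps[:i], n=k)
        labels.append(any(is_correct(question, r) for r in rollouts))
    return labels
```

The first index where the label flips to False marks the first bad step, which is exactly the kind of critique a local refinement model would take as input.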
#reward-model
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
(Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei)
We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure in human education system, we build the taxonomy by decomposing human knowledge and capabilities to various fields, sub-fields and ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions from mathematical reasoning, coding, academic exams, logical reasoning to general instruction following without using task-specific training data of these tasks. In addition, GLAN allows for easy customization and new fields or skills can be added by simply incorporating a new node into our taxonomy.
A method for generating diverse instructions. This is work from MS, which has been producing the strongest results in extracting data from ChatGPT.
They build a taxonomy of human knowledge and capabilities, generate the subjects to be learned within each category, create a syllabus for each subject, and then, from the class sessions in that syllabus, generate homework questions and their answers to form instruction samples.
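A minimal sketch of the top-down generation flow, assuming a hypothetical `llm` callable that returns text for a prompt (one item per line where lists are requested):

```python
def glan_generate(llm, discipline):
    # Taxonomy node -> subjects -> syllabus -> class sessions -> samples.
    subjects = llm(f"List the key subjects of the discipline: {discipline}").splitlines()
    samples = []
    for subject in subjects:
        sessions = llm(f"Design a syllabus of class sessions for: {subject}").splitlines()
        for session in sessions:
            concepts = llm(f"List the key concepts of this class session: {session}")
            question = llm(f"Write a homework question testing these concepts:\n{concepts}")
            answer = llm(f"Answer this question step by step:\n{question}")
            samples.append({"instruction": question, "output": answer})
    return samples
```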
#instruction
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
(Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan)
The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 850 test samples divided into 6 categories, such as non-existent objects, count of objects, spatial relationship, and visual confusion. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Gemini-Pro, to open-sourced models, such as LLaVA-1.5 and CogVLM. Empirically, we observe significant performance gaps between GPT-4V and other models; and previous robust instruction-tuned models, such as LRV-Instruction and LLaVA-RLHF, are not effective on this new benchmark. While GPT-4V achieves 75.02% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 5% to 35%. We further propose a remedy that adds an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. Surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. We hope MAD-Bench can serve as a valuable benchmark to stimulate further research to enhance models' resilience against deceptive prompts.
A benchmark that induces hallucinations with text that does not match the image. The gap between GPT-4V and the other models is quite large.
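The proposed remedy is purely prompt-level; a minimal sketch, with illustrative wording rather than the paper's exact paragraph:

```python
# Hypothetical wording; the paper prepends a similar "think twice" paragraph.
THINK_TWICE = (
    "Before answering, check whether the premises of the question actually "
    "match the image. If a premise is false (e.g., a mentioned object does "
    "not exist), say so instead of answering as if it were true.\n\n"
)

def harden_prompt(deceptive_prompt: str) -> str:
    return THINK_TWICE + deceptive_prompt
```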
#hallucination #benchmark
Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
(Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin)
The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs' performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
The idea of adding comments to code and then further training on that data. That said, the comments in the samples look more like explanations of individual lines of code; the result also looks like a kind of instruction data.
It also seems it would be interesting to attach comments to code at a more abstract level.
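A minimal sketch of the augmentation-plus-filtering loop, assuming hypothetical helpers `llm_comment` (asks an LLM to add comments to a snippet) and `nl_pl_score` (scores code/natural-language alignment, e.g. via comment density):

```python
def augment_corpus(snippets, llm_comment, nl_pl_score, threshold=0.5):
    augmented = []
    for code in snippets:
        commented = llm_comment(code)
        # Drop samples whose code remains poorly aligned with natural language.
        if nl_pl_score(commented) >= threshold:
            augmented.append(commented)
    return augmented
```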
#code #synthetic-data
VideoPrism: A Foundational Visual Encoder for Video Understanding
(Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong)
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 30 out of 33 video understanding benchmarks.
Video encoder pretraining. Stage 1 applies video-text contrastive learning, and stage 2 applies distillation between an encoder that receives the whole video and an encoder that receives masked video. It is also reminiscent of V-JEPA (https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/).
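A loose PyTorch sketch of a stage-2-style objective under simplifying assumptions (the paper's token shuffling and decoder details are omitted): a student that sees masked video matches a frozen teacher's embeddings of the full video, both per token and globally pooled.

```python
import torch
import torch.nn.functional as F

def global_local_distill_loss(student_tokens, teacher_tokens):
    # student_tokens, teacher_tokens: (batch, num_tokens, dim)
    local = F.mse_loss(student_tokens, teacher_tokens.detach())
    global_ = F.mse_loss(student_tokens.mean(dim=1),
                         teacher_tokens.mean(dim=1).detach())
    return local + global_
```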
#video #pretraining #video-text
Neural Network Diffusion
(Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, Yang You)
Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently with the trained networks. Our results encourage more exploration on the versatile use of diffusion models.
A HyperNetwork using latent diffusion? A subset of the weights is converted into latents with an autoencoder, and a diffusion model is then trained on those latents. The training samples were created by collecting multiple checkpoints from the last epoch of the original model's training run. Since the obvious objection is that this might just copy the trained weights, the authors respond with the diversity of the sampled models.
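A minimal sketch of how the diffusion training set could be gathered, assuming hypothetical `train_one_epoch` and `param_subset` helpers (the paper trains on a small subset of parameters); an autoencoder is then fit on these vectors and a latent diffusion model on its codes:

```python
import torch

def collect_param_vectors(model, train_one_epoch, param_subset, n=300):
    """Flatten the chosen parameter subset after each of the final
    training epochs; these vectors become the diffusion training data."""
    samples = []
    for _ in range(n):
        train_one_epoch(model)
        vec = torch.cat([p.detach().flatten() for p in param_subset(model)])
        samples.append(vec.clone())
    return torch.stack(samples)
```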
#hypernetwork #diffusion
Video ReCap: Recursive Captioning of Hour-Long Videos
(Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius)
Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
Captioning for long videos. The video is split into short clips which are captioned; the video and the captioning results are then combined into embeddings, which are captioned in turn, and this process repeats.
Even for approaches that train on the video as a whole, details such as tuning an LLM to compose captions for long videos out of the captions for short ones in order to produce labels seem worth referencing.
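A minimal sketch of the recursion, assuming a hypothetical `caption(feats, sub_captions)` model call and treating sparse feature sampling as simply keeping one clip per window:

```python
def recursive_caption(clip_feats, caption, levels=3, window=8):
    """Level 1 captions single clips; each higher level captions a window
    of lower-level captions together with sparsely sampled features."""
    captions = [caption(feats=[f], sub_captions=[]) for f in clip_feats]
    feats = clip_feats
    for _ in range(levels - 1):
        starts = range(0, len(captions), window)
        captions = [caption(feats=feats[i:i + window],
                            sub_captions=captions[i:i + window])
                    for i in starts]
        feats = [feats[i] for i in starts]  # crude stand-in for sparse sampling
    return captions
```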
#captioning #video