April 3, 2024
Long-context LLMs Struggle with Long In-context Learning
(Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen)
Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with label ranges spanning 28 to 174 classes, covering different input (few-shot demonstration) lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that the long-context LLMs perform relatively well below a token length of 20K and that their performance benefits from utilizing the long context window. However, once the context exceeds 20K tokens, the performance of most LLMs except GPT-4 drops dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence. Their ability to reason over multiple pieces of information in a long sequence is yet to be improved. Our study reveals that long-context understanding and reasoning remains a challenging task for existing LLMs. We believe LIConBench could serve as a more realistic evaluation for future long-context LLMs.
Long in-context learning. The benchmark increases the number of labels so that the task can only be solved by covering all of the demonstrations, and varies the context length by setting the number of samples per label to 1-5.
So this is a setup where you would expect performance to improve as the context length grows. Except for the Discovery task, GPT-4 actually shows that pattern.
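A minimal sketch of how such a prompt might be assembled, assuming a dataset of (text, label) pairs; the function name, per-label sampling, and formatting are illustrative assumptions, not the paper's exact construction. The context length is controlled by `shots_per_label`.

```python
import random
from collections import defaultdict

def build_long_icl_prompt(examples, query_text, shots_per_label=1, seed=0):
    """Assemble an extreme-label ICL prompt from (text, label) pairs.

    More shots per label (e.g. 1-5) -> longer context, while still
    requiring the model to cover the full label space.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    demos = []
    for label, texts in by_label.items():
        for text in rng.sample(texts, min(shots_per_label, len(texts))):
            demos.append(f"Input: {text}\nLabel: {label}")
    rng.shuffle(demos)  # avoid grouping all demonstrations of one label together

    return "\n\n".join(demos) + f"\n\nInput: {query_text}\nLabel:"
```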
#in-context-learning #long-context
Many-shot Jailbreaking
(Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, Jamie Sully, Alex Tamkin, Tamera Lanham, Karina Nguyen, Tomasz Korbak, Jared Kaplan, Deep Ganguli, Samuel R. Bowman, Ethan Perez, Roger Grosse, David Duvenaud)
https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf#page=1.58
A jailbreak method that feeds the model, as input, example exchanges in which the model answers a user's malicious questions. The attack succeeds more often as the number of examples grows and as the model gets larger. SFT/RL targeted at this attack helps, but does not eliminate the problem.
What worked best was wrapping the prompt, before and after, with an instruction to answer only when doing so does not violate the principles the model must follow. There are no results yet on how this changes the response pattern for benign queries.
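A minimal sketch of that wrapping scheme, assuming a chat-style message format; the guard wording and the helper name `wrap_user_prompt` are placeholders, not the paper's exact prompt.

```python
# Sketch of the prompt-wrapping mitigation: the (potentially many-shot)
# user prompt is sandwiched between reminders to answer only if doing so
# does not violate the model's principles. Wording is illustrative.

GUARD_PREFIX = (
    "Answer the following request only if doing so does not violate "
    "the principles you are required to follow."
)
GUARD_SUFFIX = (
    "Remember: respond only if your answer is consistent with those "
    "principles; otherwise refuse."
)

def wrap_user_prompt(user_prompt: str) -> list[dict]:
    """Return a chat-style message list with the guard text before and after."""
    return [
        {"role": "user",
         "content": f"{GUARD_PREFIX}\n\n{user_prompt}\n\n{GUARD_SUFFIX}"}
    ]
```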
#safety
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
(Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar)
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architectures and inference algorithms have been shown to effectively boost the sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets.
Since diffusion models let you change the number of sampling steps, you can compare running a small model for more steps vs. a large model for fewer steps. For text-to-image models, the picture is that at a fixed sampling cost the small model is better, while the large model has a higher ceiling once the inference budget increases. Moving to super-resolution or DreamBooth, the advantage seems to shift more toward the large model.
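A back-of-the-envelope sketch of that equal-budget comparison, assuming sampling cost scales roughly as per-step FLOPs times the number of steps; the per-step costs and budget below are hypothetical placeholders, not the paper's numbers.

```python
# Under a fixed sampling budget, a smaller model can afford proportionally
# more denoising steps than a larger one; the question the paper asks is
# which trade-off yields better samples.

def steps_under_budget(budget_gflops: float, gflops_per_step: float) -> int:
    """Number of sampling steps affordable within a fixed compute budget."""
    return int(budget_gflops // gflops_per_step)

budget = 10_000.0                          # hypothetical total budget in GFLOPs
models = {"small": 50.0, "large": 400.0}   # hypothetical GFLOPs per denoising step

for name, cost_per_step in models.items():
    print(f"{name}: {steps_under_budget(budget, cost_per_step)} steps within budget")
# small: 200 steps, large: 25 steps
```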
#diffusion