March 11, 2024
Denoising Autoregressive Representation Learning
(Yazhe Li, Jorg Bornschein, Ting Chen)
In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
On top of patch-level autoregressive training with MSE, they try to improve generation ability by adding a denoising diffusion objective on the patches. If you want generation capability in addition to recognition, and then also integration with text, there still seems to be plenty of room for architectural exploration.
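A minimal PyTorch sketch of that combination as I read it (the class name, layer sizes, and the linear noise schedule are my own placeholders, not taken from the paper): a causally masked Transformer produces a context feature per position, and the next patch is either regressed directly with MSE or reconstructed by a small denoising head conditioned on the context and a noised copy of the target patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DARLSketch(nn.Module):
    """Hypothetical sketch of DARL-style training (not the authors' code).

    A causally masked Transformer reads patches x_1..x_t and predicts x_{t+1},
    either directly with MSE or through a denoising head that reconstructs the
    clean patch from a noised copy plus the context feature.
    """

    def __init__(self, patch_dim, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, n_layers)
        self.mse_head = nn.Linear(d_model, patch_dim)  # plain MSE variant
        # denoising patch decoder: (noised patch, context, noise level) -> clean patch
        self.denoise_head = nn.Sequential(
            nn.Linear(patch_dim + d_model + 1, d_model), nn.GELU(),
            nn.Linear(d_model, patch_dim),
        )

    def contexts(self, patches):
        # causal mask: position t attends only to patches <= t
        n = patches.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=patches.device), 1)
        return self.backbone(self.embed(patches), mask=mask)

    def loss(self, patches, use_diffusion=False):
        ctx, target = self.contexts(patches[:, :-1]), patches[:, 1:]
        if not use_diffusion:
            return F.mse_loss(self.mse_head(ctx), target)
        # diffusion-style objective with a simple (assumed) linear noise schedule
        t = torch.rand(target.size(0), target.size(1), 1, device=target.device)
        noised = (1 - t) * target + t * torch.randn_like(target)
        return F.mse_loss(self.denoise_head(torch.cat([noised, ctx, t], -1)), target)
```

Training would call `loss(patch_sequence)` for the plain MSE variant or pass `use_diffusion=True` for the denoising one; representation quality is then measured by fine-tuning the backbone, as in the paper's evaluation protocol.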
#pretraining #diffusion #autoregressive_model
DeepSeek-VL: Towards Real-World Vision-Language Understanding
(Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan)
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
DeepSeek's vision-language model. Some interesting points:
Construction and addition of image-to-code and OCR datasets
A low-resolution SigLIP encoder combined with a high-resolution SAM encoder (a rough fusion sketch follows this list)
The usual stages of adapter training - full training - SFT
Difficulties in scaling from the 1.3B model to 7B: because of the 1.3B model's limited capability, metrics fluctuate heavily during training; they address this by mixing in a small amount of SFT data and selecting answers based on perplexity (see the second sketch below)
The text embedding and the vision encoder are bundled together and treated as a single pipeline block
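For the hybrid encoder, here is a rough sketch of how the two branches could be fused, assuming hypothetical encoder modules and placeholder dimensions; the actual DeepSeek-VL fusion and token handling may differ.

```python
import torch
import torch.nn as nn

class HybridVisionAdapterSketch(nn.Module):
    """Rough sketch of a hybrid vision encoder + adapter (not DeepSeek-VL's code).

    A low-resolution SigLIP branch provides semantic tokens and a high-resolution
    SAM-style ViT branch provides fine-grained detail; their features are
    concatenated and projected into the LLM embedding space. Dimensions are
    placeholders.
    """

    def __init__(self, siglip, sam, siglip_dim=1024, sam_dim=256, llm_dim=4096):
        super().__init__()
        self.siglip = siglip  # assumed to return (B, N, siglip_dim) tokens for the low-res image
        self.sam = sam        # assumed to return (B, N, sam_dim) tokens for the 1024px image
        self.proj = nn.Sequential(
            nn.Linear(siglip_dim + sam_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_lowres, image_highres):
        sem = self.siglip(image_lowres)    # global semantics
        det = self.sam(image_highres)      # high-resolution detail
        # assume both branches have been pooled/resampled to the same token count N
        return self.proj(torch.cat([sem, det], dim=-1))  # visual tokens for the LLM
```

The appeal of this design is that the SAM branch contributes 1024 x 1024 detail without having to run a CLIP-style contrastive encoder at that resolution.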
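And a small sketch of the perplexity-based answer selection mentioned for the 1.3B model: score each candidate answer by the mean negative log-likelihood of its tokens given the question and pick the lowest. The model and tokenizer names below are placeholders for illustration, not what DeepSeek used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_answer_by_perplexity(model, tokenizer, question, options):
    """Score each option by mean NLL of its tokens given the question; lowest wins."""
    scores = []
    for option in options:
        prompt_len = tokenizer(question, return_tensors="pt").input_ids.size(1)
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, :prompt_len] = -100  # only the answer tokens contribute to the loss
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss  # mean NLL over answer tokens
        scores.append(loss.item())
    return options[scores.index(min(scores))]

# placeholder model purely for illustration
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(pick_answer_by_perplexity(lm, tok, "Q: What color is a clear daytime sky? A:", ["blue", "seventeen"]))
```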
Finally, a case has appeared that properly tests using an additional encoder alongside CLIP.
As an aside, they explicitly state that they used book data from Anna's Archive. There have been rumors that this book data is quietly being used, and this confirms it.
#vision-language