April 2, 2025
Command A: An Enterprise-Ready Large Language Model
(Cohere)
In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top-of-the-range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B, which shares capability and architectural similarities with Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.
Cohere has published a report on Command A, which they released a while back. Interestingly, training was done with JAX on H100 clusters.
The report is relatively detailed about post-training; one noteworthy point is the mention that code execution results were used as pretraining data.
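The abstract also mentions model merging as part of the decentralised training approach. As a rough, hedged sketch of what merging expert checkpoints can look like (the checkpoint paths, expert names, and merge weights below are hypothetical, not Cohere's actual recipe):

```python
# Minimal sketch of linear model merging across expert checkpoints.
# Assumes PyTorch state dicts with identical architectures; the paths,
# expert names, and merge weights are hypothetical.
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of parameters across expert checkpoints."""
    total = sum(weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(
            (w / total) * sd[name].float() for sd, w in zip(state_dicts, weights)
        )
    return merged

# Hypothetical expert checkpoints (e.g. code, multilingual, RAG experts).
paths = ["expert_code.pt", "expert_multilingual.pt", "expert_rag.pt"]
experts = [torch.load(p, map_location="cpu") for p in paths]
merged = merge_state_dicts(experts, weights=[0.4, 0.3, 0.3])
torch.save(merged, "merged_model.pt")
```

Linear weight averaging like this only makes sense when the experts share an architecture and a common initialization, which is the usual setting for checkpoint merging.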
#llm #pretraining #post-training
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
(Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen)
The rapid escalation in difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, has created the impression that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs' recitation behavior when asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community, compelling us to re-evaluate the true intelligence level of cutting-edge LLMs.
The study tests reasoning models using slightly modified versions of problems found on the internet. Performance drops sharply, and at the same time these models end up performing similarly to non-reasoning models. This seems like quite an important issue.
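As a hedged illustration of this kind of recitation check (not the authors' code; `ask_model` and the sample item are placeholders), one can compare accuracy on an original internet problem against the same problem with a single phrase changed:

```python
# Sketch of a recitation check: compare accuracy on original problems vs.
# versions with one condition subtly changed. `ask_model` is a stand-in
# for any LLM API call; the sample item is a hypothetical illustration.
def ask_model(question: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

problems = [
    {
        "original": ("A snail sits at the bottom of a 10 m well. Each day it climbs "
                     "3 m and each night it slips back 2 m. After how many days does "
                     "it escape?"),
        "original_answer": "8",
        # One phrase changed: the slip now cancels the climb, so it never escapes.
        "shifted": ("A snail sits at the bottom of a 10 m well. Each day it climbs "
                    "3 m and each night it slips back 3 m. After how many days does "
                    "it escape?"),
        "shifted_answer": "never",
    },
]

def accuracy(items, question_key, answer_key):
    correct = 0
    for item in items:
        prediction = ask_model(item[question_key]).strip().lower()
        correct += int(item[answer_key].lower() in prediction)
    return correct / len(items)

print("original:", accuracy(problems, "original", "original_answer"))
print("shifted :", accuracy(problems, "shifted", "shifted_answer"))
```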
#reasoning
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
(Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou)
Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
A process reward model that evaluates each step through reasoning. One could likewise imagine an outcome reward model that evaluates the final result through reasoning.
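A minimal sketch of the generative-PRM idea, under my own assumptions about the prompt format and LLM client (not GenPRM's actual implementation): sample several chain-of-thought judgments per step, score the step by the fraction ending in a "correct" verdict, and scale test-time compute by raising the number of samples.

```python
# Sketch of a generative process reward model: for each reasoning step,
# sample several chain-of-thought judgments and score the step by the
# fraction that end with a "correct" verdict. `generate` is a placeholder
# for any LLM call; the prompt wording is an assumption.
def generate(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("plug in your LLM client here")

JUDGE_PROMPT = """Problem: {problem}
Solution steps so far:
{steps}
Reason step by step (you may verify with code) about whether the LAST step
is correct, then finish with "Verdict: correct" or "Verdict: incorrect"."""

def step_score(problem: str, steps: list[str], n_samples: int = 8) -> float:
    prompt = JUDGE_PROMPT.format(problem=problem, steps="\n".join(steps))
    votes = [generate(prompt) for _ in range(n_samples)]
    return sum("verdict: correct" in v.lower() for v in votes) / n_samples

def solution_score(problem: str, steps: list[str]) -> float:
    # Aggregate per-step scores with a min, a common PRM aggregation choice
    # (an assumption here, not necessarily what GenPRM uses).
    return min(step_score(problem, steps[:i + 1]) for i in range(len(steps)))
```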
#reasoning #reward-model
JudgeLRM: Large Reasoning Models as a Judge
(Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He)
The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
Development of a reasoning judge model trained with outcome rewards.
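As a speculative sketch of what a judge-wise, outcome-driven reward could look like (my reading of the setup, not the paper's exact formulation; the score format and margin shaping are assumptions):

```python
# Sketch of an outcome-driven reward for an RL-trained LLM judge: parse the
# judge's rollout into two scores and reward agreement with the human label.
# The "Scores: a b" format and margin shaping are assumptions, not the
# paper's exact reward.
import re

def parse_scores(judge_output: str):
    """Expects the judge to end with e.g. 'Scores: 8 3' (assumed format)."""
    m = re.search(r"Scores:\s*(\d+)\s+(\d+)", judge_output)
    return (float(m.group(1)), float(m.group(2))) if m else None

def judge_reward(judge_output: str, human_pref: int, margin_weight: float = 0.1) -> float:
    """human_pref: 0 if answer A is preferred by humans, 1 if answer B is."""
    parsed = parse_scores(judge_output)
    if parsed is None:
        return -1.0  # format penalty, common in R1-style RL recipes
    score_a, score_b = parsed
    predicted = 0 if score_a > score_b else 1
    agreement = 1.0 if predicted == human_pref else -1.0
    # Optional shaping: larger reward (or penalty) for more confident margins.
    return agreement * (1.0 + margin_weight * abs(score_a - score_b))
```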
#reward-model #rl #reasoning