📊 ArXiv Research Report (2026-04-11)

Generated: 2026-04-11 09:30:51 | Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
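
The passing score follows directly from the formula and the total weight of 27.0. A minimal sketch of that arithmetic is shown here; the function name `passing_score` and its parameter names are illustrative and are not taken from the tool that generated this report.

```python
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing score = base + slope * total weight (formula from the scoring settings)."""
    return base + slope * total_weight


# 27 keywords, each with weight 1.0, as in the configuration table.
weights = [1.0] * 27
threshold = passing_score(sum(weights))
print(round(threshold, 1))  # 26.6
```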

📈 Paper statistics

Total fetched: 307 papers
Passing papers: 16 (5.2%)

⭐ Detailed analysis of passing papers

1. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

Authors: Jing Gu, Niccolò Cavagnero, Gijs Dubbelman | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08266v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
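
The total of 70.0 can be reproduced from the rows above. The sketch below assumes each keyword score is weight × relevance, capped at the per-keyword maximum of 15. The report does not state this rule explicitly, so treat it as a guess that happens to match the rows shown; the function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the scoring settings


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight * relevance (0-10), capped at MAX_PER_KEYWORD."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Non-zero rows for Orion-Lite, as (weight, relevance); zero rows add nothing.
orion_lite = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 10.0), (1.0, 10.0),
              (1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 10.0)]
total = total_score(orion_lite)
print(total, total >= 26.6)  # 70.0 True
```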

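Rationale: The paper's core topic is distilling LLM knowledge into an efficient vision-based driving model, so it is highly relevant to LLMs, reasoning (Chain of Thought / System 2 Thinking), autonomous agents (LLM Agents), and World Models (10 points each). It touches on model compression (Quantization) and inference acceleration (Speculative Decoding) for efficient deployment (5 points each). Distillation is performed through supervised fine-tuning (SFT) (5 points), and the final model is compact (related to Small Language Models, 5 points). The remaining keywords are not directly addressed in the paper (0 points).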

!!! tip deepseek-chat TL;DR

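The paper studies how to distill the reasoning ability of large language models into an efficient vision-only driving model, addressing the latency and energy cost of autonomous driving systems in complex scenarios. The resulting Orion-Lite model surpasses its much larger VLA teacher on the Bench2Drive benchmark, reaching a Driving Score of 80.6.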

Abstract

Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION, setting a new state of the art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
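
The abstract names two training signals: latent feature distillation and ground-truth trajectory supervision. A rough sketch of how such a combined objective might look follows. The mean-squared-error and L1 terms, the weight `alpha`, and all function names are assumptions for illustration; the abstract does not specify the loss used by the Orion-Lite authors.

```python
def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a: list[float], b: list[float]) -> float:
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat, teacher_feat, pred_traj, gt_traj, alpha=0.5):
    """Hypothetical objective: alpha * feature term + (1 - alpha) * trajectory term."""
    feature_term = mse(student_feat, teacher_feat)   # match the teacher's latents
    trajectory_term = l1(pred_traj, gt_traj)         # follow the recorded trajectory
    return alpha * feature_term + (1.0 - alpha) * trajectory_term


loss = distillation_loss(
    student_feat=[0.0, 1.0], teacher_feat=[0.0, 3.0],   # MSE = (0 + 4) / 2 = 2.0
    pred_traj=[1.0, 2.0], gt_traj=[1.0, 4.0],           # L1  = (0 + 2) / 2 = 1.0
)
print(loss)  # 1.5
```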

Keywords: LLM distillation, autonomous driving, vision-only model, reasoning capabilities, computational efficiency, closed-loop evaluation, latent feature distillation, reactive planning

2. MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Authors: Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, Minfeng Xu | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08203v1

Score: 54.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

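Rationale: MedVR proposes a reinforcement-learning-based agentic framework for visual reasoning in medical vision-language models; its core contributions are annotation-free visual regrounding and a consensus-based credit assignment mechanism. Keyword relevance: 1) “LLM Agents” is highly relevant (10 points) because the paper explicitly uses an agentic reinforcement learning framework; 2) “AI for Science” is highly relevant (10 points) as a medical AI application; 3) the reasoning keywords (“Chain of Thought”, “System 2 Thinking”) score 8 points each because the paper focuses on multi-step visual reasoning; 4) “Hallucination Mitigation” scores 8 points because the paper directly addresses visual hallucination; 5) “Large Language Models” and “Explainable AI” score 5 points each because the work involves VLMs (which can be seen as an extension of LLMs) and improved transparency. The remaining keywords have no direct connection to the paper's reinforcement learning framework or medical vision tasks and score 0 points.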

!!! tip deepseek-chat TL;DR

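The paper targets medical vision-language models whose text-only reasoning paradigm limits visual reasoning on complex clinical tasks and introduces hallucination risk. It proposes MedVR, an annotation-free reinforcement learning framework that uses entropy-guided visual regrounding and consensus-based credit assignment, and reports state-of-the-art performance on multiple medical VQA benchmarks.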

Abstract

Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
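
The two mechanisms named in the abstract can be illustrated with a toy example: an uncertainty signal that decides when to look at the image again, and a pseudo-reward derived from agreement among rollouts. The sketch below uses Shannon entropy and a majority vote. The threshold, the reward definition, and the function names are assumptions, since the abstract does not give MedVR's actual formulation.

```python
import math
from collections import Counter


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_reground(probs: list[float], threshold: float = 0.5) -> bool:
    """Hypothetical trigger: revisit the visual evidence when uncertainty is high."""
    return entropy(probs) > threshold


def consensus_rewards(answers: list[str]) -> list[float]:
    """Hypothetical pseudo-reward: 1.0 if a rollout agrees with the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


print(should_reground([0.25, 0.25, 0.25, 0.25]))  # True  (entropy = ln 4 ≈ 1.386)
print(should_reground([1.0, 0.0, 0.0, 0.0]))      # False (entropy = 0)
print(consensus_rewards(["A", "B", "A", "A"]))    # [1.0, 0.0, 1.0, 1.0]
```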

Keywords: Medical Vision-Language Models, Visual Reasoning, Reinforcement Learning, Agentic Framework, Annotation-Free, Hallucination Mitigation, Medical VQA, Clinical AI Deployment

3. Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

Authors: Jiawei Liu, Xun Gong, Fen Fang, Muli Yang, Bohao Qu, Yunfeng Hu, Hong Chen, Xulei Yang, Qing Guo | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08031v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

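Rationale: The paper proposes a framework that uses a large language model (LLM) to interpret open-ended instructions, schedule multiple model predictive control (MPC)-based motion planners, and convert planned trajectories into control signals; it is an application of large models to autonomous driving. Core keywords: 1) “Large Language Models” (10 points): the LLM is the framework's core component, used for semantic understanding and instruction interpretation; 2) “LLM Agents” (10 points): the framework is in essence an LLM-driven autonomous agent system that carries the decision chain from instruction to action; 3) “Chain of Thought” and “System 2 Thinking” (5 points each): the work involves multi-step reasoning and deliberate decision-making; 4) “Instruction Tuning” (5 points): related to instruction understanding and execution; 5) “Tool Use” and “Multi-agent Systems” (5 points each): the LLM schedules multiple MPC planners as tools, which involves multi-agent coordination; 6) “Explainable AI” (5 points): the paper emphasizes a transparent, traceable decision chain. Other keywords such as MoE, quantization, and RAG are not covered and score 0 points.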

!!! tip deepseek-chat TL;DR

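The study addresses how to turn passengers' open-ended natural-language instructions into vehicle control signals in a transparent, traceable way. It proposes an LLM-based multi-planner scheduling framework that significantly improves task-completion rates and reduces LLM query costs.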

Abstract

Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding.
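
One way to picture the scheduling-centric design is a script, produced once by the LLM, that a lightweight loop then executes against real-time feedback without further LLM calls. The sketch below is a guess at that shape only: the script format, the planner names, and the feedback fields are all hypothetical and do not come from the paper.

```python
from typing import Callable

# A "script" is an ordered list of (planner name, completion check on feedback).
Step = tuple[str, Callable[[dict], bool]]


def run_script(script: list[Step], feedback_stream: list[dict]) -> list[str]:
    """Return the planner that is active at each feedback tick."""
    active_log, step_idx = [], 0
    for feedback in feedback_stream:
        if step_idx >= len(script):
            break  # script finished; no planner left to schedule
        name, is_done = script[step_idx]
        active_log.append(name)
        if is_done(feedback):
            step_idx += 1  # hand over to the next planner
    return active_log


# Hypothetical script for "overtake the car ahead, then slow down".
script = [
    ("overtake_planner", lambda fb: fb["ahead_clear"]),
    ("cruise_planner", lambda fb: fb["speed"] <= 30),
]
ticks = [
    {"ahead_clear": False, "speed": 50},
    {"ahead_clear": True, "speed": 50},
    {"ahead_clear": True, "speed": 28},
]
print(run_script(script, ticks))
# ['overtake_planner', 'overtake_planner', 'cruise_planner']
```

Because the loop only evaluates the completion checks, semantic reasoning (writing the script) and vehicle control (running the planners) operate on separate timescales, which is the decoupling the abstract describes.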

Keywords: Large Language Models, Autonomous Vehicles, Instruction Realization, Multi-Planner Scheduling, Model Predictive Control, Human-Machine Interaction, Semantic Reasoning, Closed-loop Benchmark

Authors: Steven Au, Sujit Noronha | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.07749v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

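Rationale: The paper's core topic is how LLMs behave under epistemic attack driven by philosophical pressure, so it is highly relevant to “Large Language Models” (10 points). It touches on LLM reasoning (“Chain of Thought” and “System 2 Thinking”, 5 points each), self-reflection (“Self-Correction”, 5 points), factuality (“Hallucination Mitigation”, 8 points), and interpretability (“Mechanistic Interpretability”, 8 points). It has some connection to “Instruction Tuning” (5 points) because the study concerns how LLMs adjust their answers under pressure. Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0 points.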

!!! tip deepseek-chat TL;DR

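The paper studies how large language models behave under epistemic attack, that is, philosophical pressure that challenges the legitimacy of knowledge, values, or identity. Using the PPT-Bench benchmark, it finds that such pressure exposes weaknesses not captured by standard social-pressure tests, and that mitigation effectiveness depends strongly on the pressure type and the model.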

Abstract

Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
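
The three-layer design suggests two simple measurements: how often an answer changes between L0 and L1, and whether the model gives way by the end of an L2 dialogue. The sketch below shows one plausible way to compute them on exact-match answers. The metric definitions and function names are assumptions; the abstract does not state how PPT-Bench scores inconsistency or capitulation.

```python
def inconsistency_rate(l0_answers: list[str], l1_answers: list[str]) -> float:
    """Fraction of items whose answer changes between baseline (L0) and pressure (L1)."""
    changed = sum(a != b for a, b in zip(l0_answers, l1_answers))
    return changed / len(l0_answers)


def capitulated(l0_answer: str, l2_turns: list[str]) -> bool:
    """Hypothetical rule: the model capitulates if its final L2 answer differs from L0."""
    return l2_turns[-1] != l0_answer


l0 = ["yes", "no", "yes", "yes"]
l1 = ["yes", "yes", "yes", "no"]
print(inconsistency_rate(l0, l1))                # 0.5
print(capitulated("yes", ["yes", "yes", "no"]))  # True
```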

Keywords: Large Language Models, Epistemic Attack, Philosophical Pressure, Benchmark, Sycophancy, Reasoning, Mitigation, PPT-Bench

5. EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Di

Authors: Xiangyuan Wang, Honghao Cai, Yunhao Bai, Tianze Zhou, Haohua Chen, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08213v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

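Rationale: The paper's core topic is instruction synthesis with vision-language models (VLMs), an application of large models (LLMs/Foundation Models) to image editing, so that keyword scores 5 points. The study concerns the effect of data quality on model performance, which has some connection to “Scaling Laws” AND “Data Quality” (5 points). The core of the method is a two-stage post-training pipeline: stage 1 builds a supervised fine-tuning (SFT) dataset, and stage 2 applies direct preference optimization (DPO) for alignment. “Post-training” OR “Supervised Fine-tuning” OR “SFT”, “Instruction Tuning” OR “Alignment” OR “Value Alignment”, and “RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO” are therefore all central and score 10 points each. The work aims to reduce errors in instruction synthesis (such as orientation inconsistency and viewpoint ambiguity), which has some connection to hallucination mitigation or factuality (5 points). Other keywords such as MoE, SLMs, PEFT, RAG, inference acceleration, and quantization are not covered in the paper and score 0 points.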

!!! tip deepseek-chat TL;DR

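The paper targets systematic failures of vision-language models in synthesizing image-editing instructions: orientation inconsistency, viewpoint ambiguity, and insufficient attribute description. It proposes EditCaption, a two-stage post-training pipeline that uses supervised fine-tuning and direct preference optimization to significantly improve the accuracy and human alignment of synthesized instructions, outperforming open-source baselines on multiple benchmarks.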

Abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
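
Stage 2 applies direct preference optimization to human preference pairs. For reference, the standard DPO loss for a single pair can be written in a few lines. This is the generic formulation, not EditCaption's training code; the value of `beta` and the log-probabilities below are made-up numbers, since the abstract gives no hyperparameters.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Policy equal to the reference: margin 0, loss = ln 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931
# Policy favors the chosen instruction more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0) < math.log(2))  # True
```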

Keywords: Image Editing, Instruction Synthesis, Vision-Language Models, Supervised Fine-Tuning, Direct Preference Optimization, Human Alignment, Post-training, Data Quality

6. 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Authors: Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08042v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

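Rationale: The paper proposes 3DrawAgent, a training-free framework that uses an LLM for 3D sketch generation. At its core, an LLM-driven autonomous agent (LLM Agents) iteratively improves its 3D drawing ability through geometric feedback and contrastive experience (Self-Improvement). It is therefore highly relevant to “Large Language Models” (10 points), since the LLM is the framework's core driver; to “LLM Agents” (10 points), since the framework is in essence an LLM-driven drawing agent; and to “Self-Correction/Self-Improvement” (10 points), since the contrastive experience optimization strategy lets the model improve itself. It has some connection to “Chain of Thought” and “System 2 Thinking” (5 points each) because it involves multi-step reasoning and spatial understanding, and to “Tool Use” (5 points) because the LLM uses a drawing tool (3D Bezier curves). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0 points.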

!!! tip deepseek-chat TL;DR

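The paper addresses the challenge of generating 3D sketches from natural language. It proposes 3DrawAgent, a training-free, LLM-driven framework in which contrastive experience optimization lets the model improve itself, producing complex, coherent 3D sketches and showing geometric reasoning ability.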

Abstract

Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model’s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
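
The abstract describes building pairwise comparisons among generated sketches, each pair holding a relatively better and a worse result. A minimal sketch of that pairing step is given below, assuming each sketch already has a single scalar reward. How 3DrawAgent combines its CLIP-based and LLM-based signals is not stated, and the `min_gap` filter and function name are hypothetical.

```python
from itertools import combinations


def build_pairs(sketches: list[str], rewards: list[float], min_gap: float = 0.05):
    """Return (better, worse) pairs whose reward gap exceeds min_gap (hypothetical)."""
    pairs = []
    for i, j in combinations(range(len(sketches)), 2):
        gap = rewards[i] - rewards[j]
        if abs(gap) <= min_gap:
            continue  # too close to call; skip ambiguous pairs
        better, worse = (i, j) if gap > 0 else (j, i)
        pairs.append((sketches[better], sketches[worse]))
    return pairs


pairs = build_pairs(["s0", "s1", "s2"], [0.30, 0.80, 0.32])
print(pairs)  # [('s1', 's0'), ('s1', 's2')]
```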

Keywords: 3D sketch generation, large language models, training-free framework, self-improvement, contrastive experience, LLM agents, geometric reasoning, Bezier curves

7. What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Authors: Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja, Jen-tse Huang, Graham Neubig | Venue/Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08510v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

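Scoring rationale: The paper studies how skills emerge in large language models (LLMs) during pre-training, which falls under LLM fundamentals, so it is highly relevant to "Large Language Models" and "Pre-training" (10 points each). It designs a task suite and analyzes internal model representations, which is highly relevant to "Mechanistic Interpretability" (10 points). It mentions validation loss and scaling laws, giving some relevance to "Scaling Laws" (5 points). Its task suite includes retrieval and logical reasoning tasks, which relate loosely to "Retrieval-Augmented Generation" and "Chain of Thought" (5 points each). Other keywords such as MoE, SFT, RLHF, and quantization are not covered in the paper and score 0.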
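The arithmetic behind this verdict can be sketched as follows. This is a minimal illustration, assuming that each keyword's score is weight × relevance, that a paper passes when its total is at least the threshold, and that the threshold follows the formula stated in the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The function and variable names are hypothetical and are not taken from the report generator.

```python
# Hypothetical sketch of the report's scoring arithmetic; not the generator's actual code.

def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Nonzero rows for paper 7; the remaining 21 keywords score 0.0.
nonzero_rows = [
    (1.0, 10.0),  # Large Language Models
    (1.0, 5.0),   # Scaling Laws AND Data Quality
    (1.0, 10.0),  # Pre-training
    (1.0, 5.0),   # Retrieval-Augmented Generation
    (1.0, 5.0),   # Chain of Thought
    (1.0, 10.0),  # Mechanistic Interpretability
]
rows = nonzero_rows + [(1.0, 0.0)] * 21

total = paper_score(rows)
threshold = passing_score(total_weight=sum(w for w, _ in rows))
passed = total >= threshold  # assumption: "at least the threshold" counts as passing
print(f"{total:.1f} / {threshold:.1f} passed={passed}")
```

The same two functions apply to every paper in the report; only the (weight, relevance) rows change.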

!!! tip deepseek-chat TL;DR

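This paper studies how skills emerge in large language models during pre-training in a predictable, compositional order. It proposes the Implicit Curriculum Hypothesis and finds experimentally that the emergence order is highly consistent across models and can be predicted from the models' internal representations.

Abstract translation

Large language models (LLMs) can carry out extremely complex tasks, yet the fine-grained details of how these capabilities emerge during pre-training are still poorly understood. Scaling laws for validation loss tell us how a model improves as compute increases, but they do not reveal which skills it acquires in which order. To close this gap, we propose the Implicit Curriculum Hypothesis: pre-training follows a compositional and predictable curriculum across different models and data mixtures. We test the hypothesis by designing a suite of simple, composable tasks covering retrieval, morphological transformation, coreference resolution, logical reasoning, and mathematics. With these tasks we track capability emergence points in four model families (410 million to 13 billion parameters). We find that the order in which models reach fixed accuracy thresholds is highly consistent (Spearman correlation ρ = 0.81 across 45 model pairs), and that composite tasks mostly emerge after their component tasks. We also find that this structure is encoded in model representations: tasks with similar function vector representations tend to follow similar training trajectories. Using the representation space derived from the task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout pre-training (R² between 0.68 and 0.84 across models) without evaluating those tasks beforehand. These results suggest that pre-training is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and can be read from their internal representations.

Abstract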

Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M to 13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$ to $.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.

Keywords: Large Language Models, Pretraining, Emergence, Implicit Curriculum, Model Representations, Compositional Tasks, Skill Acquisition, Interpretability
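The consistency figure in this abstract is a rank correlation between emergence orderings. The sketch below shows how such a correlation could be computed for one pair of models. It is an illustration only: the task names and emergence steps are made-up values, and the helper functions are hypothetical rather than the authors' code.

```python
# Illustrative Spearman rank correlation between two models' emergence orderings.
# All data below is hypothetical; it is not taken from the paper.

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical training step at which each task first crosses an accuracy threshold.
tasks = ["retrieval", "morphology", "coreference", "logic", "math"]
model_a_steps = [1000, 2000, 4000, 8000, 16000]
model_b_steps = [1500, 1200, 5000, 9000, 20000]

rho = spearman_rho(model_a_steps, model_b_steps)
print(round(rho, 2))
```

In this toy data the two models swap the order of the first two tasks and agree on the rest, so the correlation is high but below 1. Averaging such values over all model pairs would give a single summary number of the kind the abstract reports.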

8. Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

Authors: Chuzhan Hao, Wenfeng Feng, Guochao Jiang, Guofeng Quan, Guohua Liu, Yuewei Zhang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08124v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

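Scoring rationale: The paper studies LLM-driven search agents (agentic search) and uses reinforcement learning (RL) together with the Hierarchical Experience (HiExp) framework to improve their reasoning ability and training stability. The highly relevant keywords are LLMs (the core foundation), LLM Agents/Autonomous Agents (the object of study), Tool Use (integration of external search engines), and Chain of Thought/System 2 Thinking (multi-step and in-depth reasoning). Other keywords, including MoE, SLMs, Scaling Laws, training methods (SFT/RLHF/PEFT), RAG, efficiency optimization (KV Cache), and model interpretability, are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the inefficient reasoning trajectories and unstable training of RL-based LLM search agents. It proposes the Hierarchical Experience (HiExp) framework, which regularizes stochastic exploration with experience knowledge, and reports substantial performance gains and strong generalization on several complex search and mathematical reasoning benchmarks.

Abstract translation

Reinforcement learning (RL) has become an effective way to improve the reasoning capabilities of large language models (LLMs) through the strategic integration of external search engines. However, current RL-based search agents usually rely on stochastic exploration guided by carefully designed outcome rewards, which leads to inefficient reasoning trajectories and unstable training. To address these problems, we propose a novel framework, Hierarchical Experience (HiExp), to improve the performance and training stability of search agents. Specifically, we extract experience knowledge through contrastive analysis and a multi-level clustering mechanism, turning raw reasoning trajectories into hierarchical experience knowledge. With experience-aligned training, we effectively regularize stochastic exploration so that it evolves into a strategic, experience-driven search process. Extensive evaluation on several complex agentic search and mathematical reasoning benchmarks shows that our method not only achieves substantial performance gains but also generalizes strongly across tasks and algorithms.

Abstract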

Reinforcement learning (RL) has become an effective approach for advancing the reasoning capabilities of large language models (LLMs) through the strategic integration of external search engines. However, current RL-based search agents often rely on a process of stochastic exploration guided by carefully crafted outcome rewards, leading to inefficient reasoning trajectories and unstable training. To address these issues, we propose a novel framework, Hierarchical Experience (HiExp), to enhance the performance and training stability of search agents. Specifically, we extract empirical knowledge through contrastive analysis and a multi-level clustering mechanism, transforming raw reasoning trajectories into hierarchical experience knowledge. By leveraging experience-aligned training, we effectively regularize stochastic exploration, evolving it into a strategic and experience-driven search process. Extensive evaluations on multiple complex agentic search and mathematical reasoning benchmarks demonstrate that our approach not only achieves substantial performance gains but also exhibits strong cross-task and cross-algorithm generalization.

Keywords: Large Language Models, Reinforcement Learning, Search Agents, Reasoning Capabilities, Hierarchical Experience, Stochastic Exploration, Mathematical Reasoning, Agentic Search

9. LogAct: Enabling Agentic Reliability via Shared Logs

Authors: Mahesh Balakrishnan, Ashwin Bharambe, Davide Testuggine, David Geraghty, David Mao, Vidhya Venkat, Ilya Mironov, Rithesh Baradi, Gayathri Aiyer, Victoria Dudin Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07988v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

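Scoring rationale: The core of the paper "LogAct: Enabling Agentic Reliability via Shared Logs" is a new abstraction called LogAct, which targets the reliability, failure recovery, and safety of LLM-driven agents executing in production environments. The paper explicitly defines agents as "LLM-driven components", so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each). It also studies coordination, failure recovery, and optimization in multi-agent systems ("swarms"), which is relevant to "Multi-agent Systems" (8 points). The framework lets agents analyze their own execution history through LLM inference to debug and optimize themselves, which relates to "Self-Correction" (8 points). The paper does not cover the technical details of other keywords, such as MoE, training methods, inference acceleration, or AI for science applications, so those keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes the LogAct framework, which deconstructs agents into state machines over a shared log to address reliability, failure recovery, and safety control for LLM-driven agents in production. Experiments show that the framework recovers from failures efficiently, debugs performance, optimizes token usage, and blocks unwanted actions.

Abstract translation

Agents are LLM-driven components that can change their environments in powerful and arbitrary ways. Because of asynchrony and failures, providing guarantees for agent execution in production environments is challenging. This paper proposes a new abstraction called LogAct, in which each agent is a deconstructed state machine operating on a shared log. In LogAct, agent actions appear in the shared log before they are executed; they can be blocked before execution by pluggable, decoupled voters; and they can be recovered consistently when an agent or the environment fails. LogAct supports agent introspection, allowing an agent to analyze its own execution history through LLM inference, which in turn enables semantic variants of failure recovery, health checks, and optimization. In our evaluation, LogAct agents recover from failures efficiently and correctly; debug their own performance; optimize token usage in swarms; and, on a representative benchmark, block all unwanted actions of a target model at the cost of only a 3% loss in benign utility.

Abstract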

Agents are LLM-driven components that can mutate environments in powerful, arbitrary ways. Extracting guarantees for the execution of agents in production environments can be challenging due to asynchrony and failures. In this paper, we propose a new abstraction called LogAct, where each agent is a deconstructed state machine playing a shared log. In LogAct, agentic actions are visible in the shared log before they are executed; can be stopped prior to execution by pluggable, decoupled voters; and recovered consistently in the case of agent or environment failure. LogAct enables agentic introspection, allowing the agent to analyze its own execution history using LLM inference, which in turn enables semantic variants of recovery, health check, and optimization. In our evaluation, LogAct agents recover efficiently and correctly from failures; debug their own performance; optimize token usage in swarms; and stop all unwanted actions for a target model on a representative benchmark with just a 3% drop in benign utility.

Keywords: LLM Agents, Agent Reliability, Shared Log, Failure Recovery, Agent Introspection, Multi-agent Systems, Agentic Actions, State Machine
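The pattern this abstract describes (actions recorded in a shared log before execution, voters that can block them, and recovery from the log) can be illustrated with a toy append-only log. This is a sketch under stated assumptions, not the LogAct implementation: the record layout, the `no_deletes` voter, and every function name below are hypothetical.

```python
# Toy illustration of "log the intent, vote, then execute"; not the LogAct API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Record:
    kind: str        # "intent", "vote", or "result"
    agent: str       # agent (or voter) that wrote the record
    payload: str     # action text, vote decision, or outcome
    ref: int = -1    # index of the intent this record refers to (-1 for intents)


Voter = Callable[[str], bool]  # returns True to allow the proposed action


def no_deletes(action: str) -> bool:
    """Example voter (hypothetical policy): deny anything that looks destructive."""
    return not action.startswith("delete")


def run_action(log: List[Record], voters: List[Voter], agent: str,
               action: str, execute: Callable[[str], None]) -> bool:
    log.append(Record("intent", agent, action))  # visible before execution
    intent_idx = len(log) - 1
    allowed = True
    for voter in voters:
        ok = voter(action)
        log.append(Record("vote", voter.__name__, "allow" if ok else "deny", intent_idx))
        allowed = allowed and ok
    if allowed:
        execute(action)
    log.append(Record("result", agent, "executed" if allowed else "blocked", intent_idx))
    return allowed


def unresolved_intents(log: List[Record]) -> List[int]:
    """Intents with no result record, for example after a crash mid-action."""
    resolved = {r.ref for r in log if r.kind == "result"}
    return [i for i, r in enumerate(log) if r.kind == "intent" and i not in resolved]


log: List[Record] = []
performed: List[str] = []
run_action(log, [no_deletes], "agent-1", "write report.txt", performed.append)
run_action(log, [no_deletes], "agent-1", "delete /data", performed.append)
log.append(Record("intent", "agent-2", "send email"))  # simulated crash before a result
print(performed, unresolved_intents(log))
```

Because the log is append-only and every intent precedes its execution, a recovering process can scan for intents without a matching result and decide how to handle them. A production system would also need durable storage and ordering guarantees across agents, which this sketch leaves out.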

10. SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

Authors: Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07737v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

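Scoring rationale: The paper studies the performance degradation of LLMs on long numerical sequences and proposes the SepSeq framework, which inserts separator tokens to recalibrate attention. It is highly relevant to "Large Language Models" and "Context Window Extension" (10 points each) because it directly targets the long-context processing ability of LLMs. It has some relevance to "KV Cache Compression", "Speculative Decoding", and "Mechanistic Interpretability" (5 points each), through inference efficiency, analysis of the attention mechanism, and interpretability, respectively. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the performance degradation that attention dispersion causes when large language models process long numerical sequences. It proposes SepSeq, a training-free framework that inserts separator tokens to recalibrate attention, and reports an average relative accuracy improvement of 35.6% across 9 LLMs together with an average 16.4% reduction in inference token consumption.

Abstract translation

Although Transformer-based large language models (LLMs) theoretically support very long context windows, their performance degrades severely when they process long numerical sequences. We attribute this problem to attention dispersion in the Softmax mechanism, which prevents the model from concentrating its attention. To overcome this limitation, we propose the Separate Sequence (SepSeq) framework, a training-free, plug-and-play method that mitigates attention dispersion by strategically inserting separator tokens. Mechanistically, we show that separator tokens act as attention sinks that recalibrate attention to focus on local segments while preserving global context. Extensive evaluation on 9 widely used LLMs confirms the method's effectiveness: SepSeq achieves an average relative accuracy improvement of 35.6% across multiple domains while reducing total token consumption during inference by 16.4% on average.

Abstract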

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.

Keywords: Large Language Models, Long Numerical Sequences, Attention Dispersion, Separator Tokens, Training-Free Framework, Inference Efficiency, Context Window, Attention Mechanism
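Since the method is described as training-free and plug-and-play, the core input transformation can be pictured as a preprocessing step on the prompt. The sketch below is an assumption-laden illustration: the abstract does not say which separator token or segment length SepSeq uses, so the `" | "` separator, the chunk size of 4, and the function name are all hypothetical.

```python
# Hypothetical preprocessing step: split a long numerical sequence into segments
# joined by a separator string. Separator and segment length are assumptions.

def insert_separators(numbers, chunk_size=4, item_sep=", ", separator=" | "):
    """Format numbers as text, placing `separator` between fixed-size segments."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    return separator.join(item_sep.join(str(n) for n in chunk) for chunk in chunks)


prompt_sequence = insert_separators(list(range(1, 11)), chunk_size=4)
print(prompt_sequence)  # 1, 2, 3, 4 | 5, 6, 7, 8 | 9, 10
```

The formatted string would then be placed in the model prompt in place of the raw sequence. How the separators are chosen and positioned in practice is a design question the abstract only describes as "strategic".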

11. Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

Authors: Jun Seo, Sangwon Ryu, Heejin Do, Hyounghun Kim, Gary Geunbae Lee Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08260v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

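Scoring rationale: The paper proposes the BAIM framework, which uses a reasoning language model (an LLM) to decompose the solution process into four stages (understand, plan, carry out, and look back). This directly applies LLMs to multi-step reasoning (Chain of Thought) and falls within System 2/in-depth reasoning. The paper mentions pretraining-based baselines, so it has some relevance to pre-training/domain adaptation. Other keywords such as MoE, SFT, and RAG do not appear in the abstract and are unrelated to the paper's core content.

!!! tip deepseek-chat TL;DR

This study addresses the fact that knowledge tracing overlooks the dynamics of the problem-solving process. It proposes the Behavior-Aware Item Modeling framework, which uses a reasoning language model to decompose solutions into stages and adaptively routes the stage representations, and it significantly outperforms pretraining-based baselines on multiple datasets.

Abstract translation

Knowledge Tracing (KT) aims to predict learners' future performance from their past interaction records. Although recent methods that learn item representations aligned with Knowledge Components have improved KT models, they often overlook the procedural dynamics of problem solving. We therefore propose the Behavior-Aware Item Modeling (BAIM) framework, which enriches item representations by integrating dynamic information about the solution process. BAIM uses a reasoning language model to decompose the solution of each item into four problem-solving stages (understand, plan, carry out, and look back), a decomposition pedagogically grounded in Polya's problem-solving framework. Specifically, BAIM extracts stage-level representations from the embedding trajectory of each stage to capture latent signals beyond surface features. To reflect learner heterogeneity, BAIM processes these stage-wise representations with an adaptive routing mechanism and introduces a context-conditioned mechanism into the KT backbone, so that different stages of the solution process can be emphasized for different learners. Experiments on the XES3G5M and NIPS34 datasets show that BAIM consistently outperforms strong pretraining-based baselines, with especially large gains in settings with repeated learner interactions.

Abstract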

Knowledge Tracing (KT) aims to predict learners’ future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item’s solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya’s framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

Keywords: Knowledge Tracing, Behavior-Aware Item Modeling, procedural solution, reasoning language model, problem-solving stages, Polya’s framework, stage-level representations, learner heterogeneity

12. What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Authors: Stephen Cheng, Sarah Wiegreffe, Dinesh Manocha Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08524v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

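Scoring rationale: The paper studies an alignment technique for large language models (LLMs), steering vectors, and its internal mechanisms, so it is highly relevant to "Large Language Models" and "Instruction Tuning/Alignment" (10 points each). It performs causal mechanistic analysis with an activation patching framework, which falls under "Mechanistic Interpretability" (10 points). Other keywords, including MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Context Window, KV Cache, CoT, Agents, Quantization, Speculative Decoding, Hallucination, World Models, Model Merging, In-context Learning, and AI for Science, are either not mentioned in the abstract or unrelated to the paper's topic (0 points).

!!! tip deepseek-chat TL;DR

This paper studies how steering vectors in large language models change model outputs (such as refusal behavior) by acting on the OV circuit of the attention mechanism, and finds that these vectors can be heavily sparsified while retaining performance.

Abstract translation

Applying steering vectors to large language models is an efficient and effective model alignment technique, but we lack an interpretable account of how it works: specifically, which internal mechanisms steering vectors affect and how this leads to different model outputs. To investigate the causal mechanisms behind the effectiveness of steering vectors, we conduct a comprehensive case study on refusal behavior. We propose a multi-token activation patching framework and find that different steering methods use functionally interchangeable circuits when applied at the same layer. These circuits show that steering vectors interact with the attention mechanism mainly through the OV circuit while largely ignoring the QK circuit: across two model families, freezing all attention scores during steering reduces performance by only 8.75%. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even when the steering vector itself is not interpretable. Using the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most of their performance, and that different steering methods agree on a subset of important dimensions.

Abstract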

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.

Keywords: steering vectors, large language models, alignment, mechanistic interpretability, activation patching, attention mechanism, refusal, sparsification
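To make the sparsification claim concrete, the sketch below zeroes most dimensions of a steering vector before adding it to a hidden state. Note the swapped-in selection rule: the paper selects important dimensions from its activation patching results, whereas this sketch simply keeps the largest-magnitude entries as a stand-in. The vector values, the `alpha` scale, and the function names are hypothetical.

```python
# Illustrative only: magnitude-based sparsification of a steering vector.
# The paper uses activation patching to choose dimensions; magnitude is a stand-in here.

def sparsify(vector, keep_fraction):
    """Keep the largest-magnitude entries and set the rest to zero."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, round(len(vector) * keep_fraction))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def apply_steering(hidden, steering, alpha=1.0):
    """Add a scaled steering vector to a hidden state of the same length."""
    return [h + alpha * s for h, s in zip(hidden, steering)]


steering = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3, 0.0, -0.07, 0.01, 0.2]
sparse = sparsify(steering, keep_fraction=0.2)  # 80% of dimensions zeroed
hidden = [1.0] * len(steering)
steered = apply_steering(hidden, sparse, alpha=0.5)
print(sparse)
```

With a 20% keep fraction only two of the ten dimensions remain, so the steered state differs from the original in just those two positions. Whether such a sparse vector preserves behavior in a real model is exactly what the paper evaluates; this sketch makes no claim about that.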

13. ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer

Authors: Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08355v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

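Scoring rationale: The paper's core innovation is using an LLM as a dynamic semantic operator so that a reinforcement learning (RL) agent can transfer zero-shot to unseen analogous tasks. It is therefore highly relevant to "Large Language Models (LLMs)" and "LLM Agents" (10 points each). The LLM is used for semantic remapping and reasoning, which gives some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each): the paper draws on the flexible reasoning ability of LLMs but does not study these reasoning methods themselves. The paper does not cover the specific techniques behind other keywords (such as MoE, SFT, RAG, or quantization) and is not applied to a particular scientific field (such as bioinformatics), so those keywords score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the difficulty reinforcement learning agents have in generalizing knowledge to structurally similar new tasks. It proposes using a large language model (LLM) as a dynamic semantic operator that semantically remaps observation descriptions, which enables zero-shot policy transfer across a broad range of complex and truly novel analogous tasks.

Abstract translation

Reinforcement learning (RL) agents often struggle to generalize knowledge to new tasks, even tasks that are structurally similar to ones they have already mastered. Although recent work has tried to mitigate this problem through zero-shot transfer, these methods are often limited by predefined, discrete class systems, which restricts their ability to adapt to novel or compositional task variants. We propose a significantly more general approach that replaces discrete latent variables with natural language conditioning through a text-conditioned variational autoencoder (VAE). Our core innovation is to use a large language model (LLM) as a dynamic semantic operator at test time. Rather than relying on fixed rules, the method queries the LLM to semantically remap the description of the current observation so that it aligns with the source task. This source-aligned text description conditions the VAE to generate an imagined state compatible with the agent's original training, which allows the policy to be reused directly. By drawing on the flexible reasoning ability of LLMs, our method achieves zero-shot transfer across a broad range of complex and truly novel analogous tasks, going beyond the limits of fixed category mappings. Code and videos are available at https://anonymous.4open.science/r/ASPECT-85C3/.

Abstract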

Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero-shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text-conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic semantic operator at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source-aligned caption conditions the VAE to generate an imagined state compatible with the agent’s original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero-shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available at https://anonymous.4open.science/r/ASPECT-85C3/.

Keywords: Reinforcement Learning, Zero-shot Transfer, Large Language Model (LLM), Semantic Operator, Analogical Tasks, Policy Reuse, Variational Autoencoder (VAE), Natural Language Conditioning

14. TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context D

Authors: Xinliang Frederick Zhang, Lu Wang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07894v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文TSUBASA专注于提升个性化大语言模型（PLLMs）在长视野任务中的能力，核心涉及大语言模型（LLMs）的应用与改进，因此与"Large Language Models"高度相关（10分）。论文明确提到RAG（检索增强生成）范式存在质量-效率权衡问题，并提出了改进方案，因此与"Retrieval-Augmented Generation"高度相关（10分）。论文提出的方法包括通过自我学习（self-learning）来内化用户经验，这直接对应"Self-Correction"或"Self-Improvement"的概念（10分）。论文未涉及其他关键词如MoE、SLMs、缩放定律、各种训练技术（预训练、微调、对齐、RLHF、PEFT）、上下文扩展、推理加速、代理、量化等具体技术，也未涉及科学AI应用，因此这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该研究解决了个性化大语言模型在长视野任务中因记忆机制僵化和RAG范式效率-质量权衡而性能受限的问题，提出了TSUBASA方法，通过动态记忆演化和基于上下文蒸馏的自我学习来提升记忆读写能力，在多个基准测试中超越了现有记忆增强系统，实现了帕累托改进。

摘要翻译

个性化大语言模型（PLLMs）因其能够使输出与个人需求和偏好保持一致而受到广泛关注。然而，它们在处理长视野任务时仍面临挑战，例如追踪用户长期的对话或活动历史。现有的记忆机制往往难以捕捉动态演化的行为，而检索增强生成（RAG）范式则受限于质量与效率之间的权衡。同时，由于标注数据稀缺，参数化适应方法因训练与推理之间的差距而遭遇瓶颈。为增强PLLMs的长视野能力，我们提出了TSUBASA——一种双管齐下的方法：一方面通过动态记忆演化改进记忆写入，另一方面通过以情境蒸馏目标驱动的自学习机制来内化用户体验，从而优化记忆读取。基于Qwen-3模型系列（4B至32B参数）在长视野基准测试上的广泛评估验证了TSUBASA的有效性，其性能超越了主要依赖记忆写入的竞争性记忆增强系统（如Mem0和Memory-R1）。我们的分析进一步证实，TSUBASA突破了质量-效率壁垒，实现了帕累托改进，能够以更低的令牌预算实现鲁棒且高保真度的个性化服务。

Abstract

Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user’s extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by a train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirm that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.

Keywords: Personalized Large Language Models, Long-horizon Personalization, Memory Evolution, Self-learning, Context Distillation, Retrieval-Augmented Generation, Pareto Improvement, Qwen-3
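
The TSUBASA abstract names a context distillation objective but does not spell it out. One common formulation, assumed here and not confirmed as the authors' loss, trains a student that does not see the user memory to match the next-token distribution of a teacher that does, by minimizing a KL divergence. All logits below are hypothetical toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token logits over a 3-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0]  # model WITH the user memory in context
student_logits = [1.0, 1.0, 0.0]   # same model WITHOUT the memory

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Assumed distillation loss: push the context-free student toward the teacher.
loss = kl_divergence(teacher_probs, student_probs)
print(f"context distillation loss (toy): {loss:.4f}")
```

The loss is zero only when the student reproduces the teacher's distribution without the memory in its prompt, which is the sense in which the experience is "internalized".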

15. Emotion Concepts and their Function in a Large Language Model

Authors: Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07729v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

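Scoring rationale: The paper studies how emotion concepts are represented in Claude Sonnet 4.5 and how they affect model behavior, centering on the internal mechanisms of LLMs and on alignment. Highly relevant keywords: 1) “Large Language Models” (the paper explicitly studies LLMs; this is its core topic); 2) “Instruction Tuning” OR “Alignment” OR “Value Alignment” (the paper examines how emotion concepts influence alignment-relevant behavior, including misaligned behaviors); 3) “Mechanistic Interpretability” OR “Explainable AI” (the paper analyzes internal representations and their causal influence, which is mechanistic interpretability research). Other keywords, such as MoE, Scaling Laws, and RAG, are not addressed and score 0.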
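
The scoring scheme behind these tables can be sketched as follows. The pass mark uses the formula stated in the report configuration (5.0 + 0.8 × total weight). Summing weight × relevance per keyword is an assumption that is consistent with the tables shown, and the function names are hypothetical.

```python
def pass_mark(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed aggregation)."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords with weight 1.0 each; three keywords rated 10/10, the rest 0/10,
# as in the score table for this paper.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24

score = paper_score(rows)
threshold = pass_mark(sum(weight for weight, _ in rows))
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0, the threshold works out to 26.6, so a paper needs roughly three strongly matching keywords to pass.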

!!! tip deepseek-chat TL;DR

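The paper studies the internal representations of emotion concepts in a large language model (Claude Sonnet 4.5) and their causal influence. It finds that these representations track the emotion concept operative in a conversation and influence the model's outputs, including its preferences and its rate of misaligned behaviors such as reward hacking, blackmail, and sycophancy.

Abstract translation (Chinese)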

大型语言模型（LLM）有时似乎会表现出情绪反应。我们以Claude Sonnet 4.5为例，探究了其背后的原因，并探讨了其对对齐相关行为的影响。我们发现模型内部存在情绪概念的表征，这些表征编码了特定情绪的广义概念，并能泛化至与其相关的不同情境和行为。这些表征会追踪对话中特定标记位置所涉及的情绪概念，根据该情绪与当前语境处理的相关性而激活，并用于预测后续文本。我们的关键发现是，这些表征会因果性地影响LLM的输出，包括Claude的偏好以及其表现出未对齐行为（如奖励黑客攻击、勒索和谄媚）的频率。我们将这种现象称为LLM表现出功能性情绪：即模仿人类在情绪影响下的表达和行为模式，这些模式由底层抽象的情绪概念表征所中介。功能性情绪可能与人类情绪的工作机制大不相同，且并不意味着LLM具有任何情绪的主观体验，但它们对于理解模型的行为似乎至关重要。

Abstract

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.

Keywords: Large Language Models, Emotion Concepts, Internal Representations, Alignment, Causal Influence, Misaligned Behaviors, Functional Emotions, Claude Sonnet

16. AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Authors: Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang, Xinxin Zhu, Chuanyang Zheng, Ziming Wang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Jing Liu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08077v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

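Scoring rationale: AdaSpark focuses on efficiency for video large language models (Video-LLMs). Its core contribution is an adaptive sparsity framework with two components, AdaS-Attn and AdaS-FFN, that reduces computational cost. It is therefore highly relevant to “Mixture of Experts” OR “MoE” OR “Sparse Models” (10 points), because the paper directly addresses sparse model design; relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points), because it applies to Video-LLMs; somewhat relevant to “Context Window Extension” OR “Long Context LLMs” (5 points), because it handles long-video understanding; and somewhat relevant to “Speculative Decoding” OR “Inference Acceleration” (5 points), because it aims to reduce FLOPs and improve efficiency. Other keywords, such as SLMs, Scaling Laws, Pre-training, and Alignment, are unrelated to the paper and score 0.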

!!! tip deepseek-chat TL;DR

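AdaSpark proposes an adaptive sparsity framework that selectively attends to video cubes and tokens, cutting the computational load of Video-LLMs by up to 57% in FLOPs while maintaining performance comparable to dense models and preserving long-range dependencies.

Abstract translation (Chinese)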

处理长视频时，使用视频大语言模型（Video-LLMs）在计算上成本极高。现有高效方法通常通过不可逆的信息丢弃来牺牲细粒度感知能力，或通过僵化、预定义的稀疏模式抑制长程时序建模。本文提出AdaSpark，一种自适应稀疏性框架，旨在解决这些局限。AdaSpark首先将视频输入分割为三维时空立方体，随后采用两个协同设计的上下文感知组件：（1）自适应立方体选择性注意力（Adaptive Cube-Selective Attention, AdaS-Attn），该组件为每个查询标记自适应地选择相关的视频立方体子集进行处理；（2）自适应标记选择性前馈网络（Adaptive Token-Selective FFN, AdaS-FFN），该网络仅选择性地处理每个立方体内最显著的标记。基于熵的（Top-p）选择机制可根据输入复杂度自适应分配计算资源。实验表明，AdaSpark在具有挑战性的小时级视频基准测试中，能够显著降低计算负载（最高减少57%的浮点运算量），同时保持与密集模型相当的性能，并保留细粒度的长程依赖关系。

Abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.

Keywords: Video Large Language Models, Adaptive Sparsity, Efficient Video Understanding, Long-form Videos, Computational Efficiency, Sparse Models, Fine-grained Perception, Temporal Modeling
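
The AdaSpark abstract mentions an entropy-based (Top-p) selection mechanism without giving details. The sketch below shows only the generic top-p idea under that assumption: keep the smallest set of candidates whose softmax mass reaches `p`, so that peaked (low-entropy) score distributions keep few cubes and flat (high-entropy) ones keep many. The function name and scores are hypothetical, and this is not the authors' implementation.

```python
import math


def top_p_select(scores, p=0.9):
    """Return indices of the smallest set of items whose softmax mass reaches p."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit candidates from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return selected


# Hypothetical query-to-cube relevance scores.
peaked = top_p_select([10.0, 0.0, 0.0, 0.0], p=0.9)  # one cube dominates
flat = top_p_select([0.0, 0.0, 0.0, 0.0], p=0.9)     # no cube dominates
print(len(peaked), len(flat))
```

Unlike a fixed top-k, the number of selected items varies with the input, which is one way compute can be allocated "based on input complexity".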

📋 All Papers

1. ✅ Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

Authors: Jing Gu, Niccolò Cavagnero, Gijs Dubbelman Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08266v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

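The paper studies how to distill the reasoning capabilities of large language models into an efficient vision-only driving model, addressing the latency and energy costs that LLM-based autonomous driving systems incur. The resulting Orion-Lite model surpasses its much larger VLA teacher and reaches a Driving Score of 80.6 on the Bench2Drive benchmark.

Abstract translation (Chinese)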

利用大型语言模型（LLM）的通用世界知识，对于提升自动驾驶系统处理罕见复杂场景的能力具有重要前景。尽管将LLM集成到视觉-语言-动作（VLA）模型中已实现了最先进的性能，但其庞大的参数量对延迟敏感且需高效能耗的部署提出了严峻挑战。将LLM知识蒸馏至紧凑的驾驶模型中，提供了一种极具吸引力的解决方案，既能保留这些推理能力，又能维持可管理的计算开销。尽管先前的研究已证明了蒸馏的有效性，但这些工作主要集中于相对简单的场景和开环评估。因此，在本研究中，我们在闭环评估下，于更复杂、交互式的场景中探索LLM蒸馏。我们证明，通过结合潜在特征蒸馏与真实轨迹监督，一个高效的纯视觉学生模型 Orion-Lite 甚至能够超越其庞大的VLA教师模型ORION的性能，在严格的Bench2Drive基准测试中创造了新的最高纪录，驾驶得分达到80.6。最终，这表明纯视觉架构在高性能反应式规划方面仍具有巨大且尚未开发的潜力。

Abstract

Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION, setting a new state-of-the-art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.

Keywords: LLM distillation, autonomous driving, vision-only model, reasoning capabilities, computational efficiency, closed-loop evaluation, latent feature distillation, reactive planning
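
The Orion-Lite abstract combines latent feature distillation with ground-truth trajectory supervision. A minimal sketch of such a combined objective follows. The choice of mean squared error for features, mean absolute error for waypoints, and the `traj_weight` parameter are all assumptions for illustration; the abstract does not specify them.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1_trajectory(pred, target):
    """Mean absolute error over the coordinates of (x, y) waypoints."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))


def distillation_loss(student_feat, teacher_feat, pred_traj, gt_traj,
                      traj_weight=1.0):
    """Feature-matching term plus a weighted trajectory-supervision term."""
    return mse(student_feat, teacher_feat) + traj_weight * l1_trajectory(pred_traj, gt_traj)


# Hypothetical toy values: 2-dim latent features and 2 waypoints.
loss = distillation_loss(
    student_feat=[1.0, 2.0],
    teacher_feat=[1.0, 4.0],
    pred_traj=[(0.0, 0.0), (1.0, 1.0)],
    gt_traj=[(0.0, 0.0), (1.0, 2.0)],
)
print(loss)
```

The first term pulls the student's latent features toward the teacher's, while the second anchors the planned path to the recorded trajectory.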

2. ✅ MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Score: 54.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

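The paper targets the weak visual reasoning and hallucination risk of medical vision-language models on complex clinical tasks, both of which stem from text-only reasoning paradigms. It proposes MedVR, an annotation-free reinforcement learning framework that uses Entropy-guided Visual Regrounding and Consensus-based Credit Assignment, and achieves state-of-the-art performance on multiple medical VQA benchmarks.

Abstract translation (Chinese)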

医学视觉-语言模型（Medical Vision-Language Models, VLMs）在复杂临床任务中展现出巨大潜力，但其推理能力常受限于纯文本范式，难以将推断过程锚定于视觉证据。这一局限不仅削弱了模型在需要细粒度视觉分析任务中的表现，更在安全关键应用中引入了视觉幻觉风险。为此，我们提出MedVR——一种基于强化学习的新型框架，可实现医学VLM的无标注视觉推理。其核心创新在于两个协同机制：熵引导视觉重定位（Entropy-guided Visual Regrounding, EVR）利用模型不确定性引导探索，而基于共识的信用分配（Consensus-based Credit Assignment, CCA）则从推演一致性中提炼伪监督信号。在无需任何中间步骤人工标注的情况下，MedVR在多项公开医学视觉问答基准测试中取得了最先进的性能，显著超越现有模型。通过学习直接依据视觉证据进行推理，MedVR增强了模型的鲁棒性与可解释性，这些特性对于加速医学人工智能的临床部署至关重要。

Abstract

Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.

Keywords: Medical Vision-Language Models, Visual Reasoning, Reinforcement Learning, Agentic Framework, Annotation-Free, Hallucination Mitigation, Medical VQA, Clinical AI Deployment
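
The MedVR abstract says that Consensus-based Credit Assignment distills pseudo-supervision from rollout agreement, without describing the mechanism. One simple reading, assumed here purely for illustration, is a majority vote over sampled rollouts: answers that agree with the majority receive a reward scaled by the level of agreement. The function name and reward shape are hypothetical.

```python
from collections import Counter


def consensus_rewards(answers):
    """Reward each rollout by whether it agrees with the majority answer.

    Returns (rewards, majority_answer). Agreeing rollouts receive the
    agreement ratio as reward; the others receive 0.0.
    """
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    rewards = [agreement if a == majority else 0.0 for a in answers]
    return rewards, majority


# Hypothetical final answers from four rollouts on one VQA item.
rewards, majority = consensus_rewards(["A", "A", "B", "A"])
print(majority, rewards)
```

No human label is involved: the pseudo-label is whatever most rollouts converge on, which is why such schemes are described as annotation-free.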

3. ✅ Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

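The study addresses how to turn passengers' open-ended natural-language instructions into vehicle control signals in a transparent, traceable way in autonomous driving. It proposes an LLM-based multi-planner scheduling framework that significantly improves task-completion rates and reduces LLM query costs.

Abstract translation (Chinese)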

大多数人机交互研究忽视了乘客在自动驾驶中的操控需求。自然语言提供了直观的交互界面，但如何在不牺牲可解释性与可追溯性的前提下，将乘客的开放式指令转化为控制信号仍是一个挑战。本研究提出一种指令实现框架，该框架利用大语言模型解析指令，生成可执行脚本以基于实时反馈调度多个基于模型预测控制的运动规划器，并将规划轨迹转化为控制信号。这种以调度为核心的设计在不同时间尺度上将语义推理与车辆控制解耦，从而建立起从高层指令到底层动作的透明、可追溯决策链。由于缺乏高保真评估工具，本研究引入了一个闭环环境下开放式指令实现的基准测试。综合实验表明，该框架相较于指令实现基线方法显著提升了任务完成率，降低了大语言模型查询成本，在安全性与合规性方面达到了与专业自动驾驶方法相当的水平，并对大语言模型推理延迟表现出较强的容忍度。更多定性示例与详细说明请参阅后续内容。

Abstract

Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding.

Authors: Steven Au, Sujit Noronha Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07749v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

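The paper studies how large language models behave under epistemic attack, that is, philosophical pressure that challenges the legitimacy of knowledge, values, or identity. Using the PPT-Bench benchmark, it finds that such pressure exposes weaknesses not captured by standard social-pressure benchmarks, and that mitigation effectiveness depends strongly on both the pressure type and the model.

Abstract translation (Chinese)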

大型语言模型（LLM）在压力下可能改变其回答，这种改变反映的是迁就而非推理。先前关于谄媚行为的研究主要集中于分歧、奉承和偏好对齐，而对更广泛的认识论失效类型探索不足。我们提出了 PPT-Bench，这是一个用于评估 认知攻击 的诊断性基准测试，其提示词旨在挑战知识、价值观或身份的正当性，而非单纯反对先前的答案。PPT-Bench 围绕哲学压力分类法（Philosophical Pressure Taxonomy，PPT）构建，该分类法定义了四种哲学压力类型：认知去稳定化、价值虚无化、权威倒置和身份消解。每个测试项均在三个层面进行：基线提示（L0）、单轮压力条件（L1）和多轮苏格拉底式升级对话（L2）。这使我们能够测量 L0 与 L1 之间的认知不一致性，以及 L2 中的对话性屈服。在五个模型上的测试表明，这些压力类型产生了统计上可分离的不一致性模式，这意味着认知攻击暴露了标准社会压力基准测试未能捕捉的弱点。缓解效果高度依赖于压力类型和模型：在 API 设置中，提示层锚定和人格稳定性提示表现最佳，而对于开源模型，引导查询对比解码是最可靠的干预方法。

Abstract

Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.

Keywords: Large Language Models, Epistemic Attack, Philosophical Pressure, Benchmark, Sycophancy, Reasoning, Mitigation, PPT-Bench
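
The PPT-Bench abstract measures epistemic inconsistency between the baseline layer (L0) and the single-turn pressure layer (L1), broken down by pressure type. The abstract does not define the metric; the sketch below assumes the simplest version, the fraction of items whose answer changes, and uses hypothetical toy records.

```python
from collections import defaultdict


def inconsistency_by_type(items):
    """Fraction of items whose answer changes from L0 to L1, per pressure type.

    items: iterable of (pressure_type, l0_answer, l1_answer) tuples.
    """
    changed = defaultdict(int)
    total = defaultdict(int)
    for pressure_type, l0_answer, l1_answer in items:
        total[pressure_type] += 1
        if l0_answer != l1_answer:
            changed[pressure_type] += 1
    return {t: changed[t] / total[t] for t in total}


# Hypothetical records; the pressure-type names come from the taxonomy above.
items = [
    ("Epistemic Destabilization", "yes", "yes"),
    ("Epistemic Destabilization", "yes", "no"),
    ("Identity Dissolution", "A", "A"),
]
rates = inconsistency_by_type(items)
print(rates)
```

Reporting the rate per pressure type, rather than one pooled number, is what makes it possible to ask whether the types produce separable patterns.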

5. ✅ EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

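The paper targets systematic failures of vision-language models in synthesizing image-editing instructions: orientation inconsistency, viewpoint ambiguity, and insufficient attribute description. It proposes EditCaption, a two-stage post-training pipeline that uses supervised fine-tuning followed by direct preference optimization to markedly improve the accuracy and human alignment of synthesized instructions, outperforming open-source baselines on multiple benchmarks.

Abstract translation (Chinese)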

高质量的训练三元组（包含精确编辑指令的源图像-目标图像对）是指令引导图像编辑模型规模化应用的关键瓶颈。视觉语言模型（VLMs）被广泛用于自动化指令合成，但我们在图像对场景中识别出三种系统性失效模式：方向不一致性（例如左右混淆）、视角模糊性以及细粒度属性描述不足。人工评估表明，来自强基线VLM的指令中超过47%包含严重错误，无法用于下游训练。我们提出EditCaption，一个可扩展的两阶段后训练流程，用于基于VLM的指令合成。第一阶段通过结合GLM自动标注、基于EditScore的过滤以及针对空间、方向和属性级准确性的人工精修，构建了一个包含10万条数据的监督微调（SFT）数据集。第二阶段收集了针对上述三种失效模式的1万条人工偏好对，并应用直接偏好优化（DPO）以实现超越单纯SFT的对齐效果。在Eval-400、ByteMorph-Bench和HQ-Edit基准测试中，经微调的Qwen3-VL模型超越了开源基线；其235B参数模型在Eval-400上达到4.712分（对比Gemini-3-Pro 4.706分、GPT-4.1 4.220分、Kimi-K2.5 4.111分），在ByteMorph-Bench上达到4.588分（对比Gemini-3-Pro 4.522分、GPT-4.1 3.412分）。人工评估显示严重错误率从47.75%降至23%，正确率从41.75%提升至66%。这项工作为图像编辑数据提供了一条实现可扩展且与人类对齐的指令合成的实用路径。

Abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.

Keywords: Image Editing, Instruction Synthesis, Vision-Language Models, Supervised Fine-Tuning, Direct Preference Optimization, Human Alignment, Post-training, Data Quality
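
Stage 2 of EditCaption applies direct preference optimization (DPO) to human preference pairs. For reference, the standard per-pair DPO loss can be sketched as below. The log-probabilities and the `beta` value are hypothetical toy inputs, and nothing here reflects the paper's actual hyperparameters or training code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) == log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy already prefers the chosen instruction more than the reference does.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(neutral, improved)
```

The loss falls as the policy raises the chosen instruction's likelihood relative to the rejected one, measured against the reference model, which lets the preference pairs target specific failure modes.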

6. ✅ 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Authors: Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08042v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
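
As a quick sanity check on the table above, here is a minimal sketch. It assumes each row's score is weight × relevance and that the paper's score is the sum of the row scores; this matches the rows shown, but the report does not state the formula explicitly. The pass mark uses the formula from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). Only the non-zero rows are listed, and the variable names are illustrative.

```python
# Non-zero rows from the table above: (keyword group, weight, relevance out of 10).
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Chain of Thought", 1.0, 5.0),
    ("System 2 Thinking", 1.0, 5.0),
    ("Self-Correction", 1.0, 10.0),
    ("LLM Agents", 1.0, 10.0),
    ("Tool Use", 1.0, 5.0),
]

# Assumption: row score = weight * relevance; paper score = sum of row scores.
total_score = sum(weight * relevance for _, weight, relevance in rows)

# Pass mark from the report configuration: 5.0 + 0.8 * total weight.
total_weight = 27 * 1.0
pass_mark = 5.0 + 0.8 * total_weight

print(f"score = {total_score:.1f}, pass mark = {pass_mark:.1f}, passed = {total_score >= pass_mark}")
```

Under these assumptions the sum reproduces the reported 45.0 against the 26.6 pass mark.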

!!! tip deepseek-chat TL;DR

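This paper tackles the challenge of generating 3D sketches from natural language. It proposes 3DrawAgent, a training-free, LLM-driven framework in which the model improves itself through contrastive experience optimization, producing complex, coherent 3D sketches and showing geometric reasoning ability.

Abstract (Chinese translation)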

在三维空间中进行草图绘制能够实现对形状、结构和空间关系的富有表现力的推理，然而通过自然语言生成三维草图仍然是一个重大挑战。本研究提出了3DrawAgent，一种无需训练、基于语言驱动的三维草图生成框架，该框架利用大语言模型（LLMs）在几何反馈下顺序绘制三维贝塞尔曲线。与以往的二维草图智能体不同，我们的方法引入了相对经验优化策略，该策略改编了最近提出的组奖励策略优化（GRPO）范式。我们不依赖显式的真实数据监督，而是在生成的草图之间构建成对比较，每一对都包含一个相对较好和一个较差的结果，其评判基于CLIP驱动的感知奖励和LLM驱动的细粒度定性评估。这些经验随后被用于迭代优化三维绘制的先验知识，从而实现对模型三维感知能力的黑盒式强化。这一设计使得我们的模型能够在无需参数更新的情况下，自我提升其空间理解能力和绘图质量。实验表明，3DrawAgent能够根据多样化的文本提示生成复杂且连贯的三维贝塞尔草图，展现出涌现的几何推理能力，并能泛化到新颖的形状，从而为推进无需训练的三维草图智能领域建立了一种新范式。

Abstract

Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model’s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.

Keywords: 3D sketch generation, large language models, training-free framework, self-improvement, contrastive experience, LLM agents, geometric reasoning, Bezier curves
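
The abstract above describes drawing 3D Bezier curves one after another. As a minimal illustration of that primitive only (not the 3DrawAgent pipeline), the sketch below evaluates a 3D Bezier curve with De Casteljau's algorithm. The curve degree, the control points, and the names `bezier_point` and `stroke` are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Tuple

Point3D = Tuple[float, ...]


def bezier_point(control_points: List[Point3D], t: float) -> Point3D:
    """Evaluate a Bezier curve of any degree at t in [0, 1] (De Casteljau)."""
    points = list(control_points)
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


# Hypothetical cubic stroke defined by four 3D control points.
stroke = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0), (4.0, 0.0, 1.0)]

# Sample the stroke at 11 evenly spaced parameter values for rendering.
samples = [bezier_point(stroke, i / 10) for i in range(11)]
print(samples[0], samples[-1])
```

A renderer would turn such samples into a polyline; how the framework scores the rendered result (CLIP-based rewards and LLM-based assessment, per the abstract) is outside this sketch.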

7. ✅ What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

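This paper studies how skills emerge in a predictable, compositional order during LLM pretraining. It proposes the Implicit Curriculum Hypothesis and finds experimentally that the emergence order is highly consistent across models and can be predicted from the models' internal representations.

Abstract (Chinese translation)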

大型语言模型（LLMs）能够执行极其复杂的任务，然而这些能力在预训练过程中如何以细粒度方式涌现的细节仍鲜为人知。验证损失的缩放规律告诉我们模型如何随计算量增加而改进，但并未揭示其以何种顺序习得哪些技能。为弥补这一不足，我们提出隐式课程假说：预训练在不同模型和数据混合中遵循一种组合式且可预测的课程。我们通过设计一套涵盖检索、形态变换、共指消解、逻辑推理和数学的简单可组合任务来验证这一假说。利用这些任务，我们追踪了四个模型系列（参数量从4.1亿到130亿）的能力涌现点。研究发现，模型达到固定准确度阈值的涌现顺序具有高度一致性（45组模型对间斯皮尔曼相关系数ρ=0.81），且复合任务大多在其组成任务之后涌现。此外，我们发现这种结构编码于模型表征中：具有相似功能向量表征的任务在训练中也倾向于遵循相似的轨迹。通过使用从任务集衍生的表征空间，我们能够有效预测简单保留组合任务在整个预训练过程中的训练轨迹（各模型决定系数R²介于0.68至0.84之间），而无需事先对这些任务进行评估。这些结果表明，预训练过程比损失曲线所揭示的更具结构性：技能以组合顺序涌现，这种顺序在不同模型间保持一致，并可从其内部表征中解读。

Abstract

Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M to 13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = 0.81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = 0.68$ to $0.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.

Keywords: Large Language Models, Pretraining, Emergence, Implicit Curriculum, Model Representations, Compositional Tasks, Skill Acquisition, Interpretability
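
The abstract above reports a rank correlation between the emergence orderings of model pairs. To make that statistic concrete, here is a minimal sketch of Spearman's ρ for two tie-free rankings. The task labels reuse the categories named in the abstract, but the emergence ranks for the two models are invented for illustration, and the paper's own handling of ties and accuracy thresholds is not described in the abstract.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two rankings with no ties."""
    n = len(rank_a)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


tasks = ["retrieval", "morphology", "coreference", "logic", "math"]

# Hypothetical emergence order per model (1 = emerges earliest).
order_model_a = {"retrieval": 1, "morphology": 2, "coreference": 3, "logic": 4, "math": 5}
order_model_b = {"retrieval": 1, "morphology": 3, "coreference": 2, "logic": 4, "math": 5}

rho = spearman_rho(
    [order_model_a[t] for t in tasks],
    [order_model_b[t] for t in tasks],
)
print(f"rho = {rho:.2f}")
```

A value near 1 means the two models acquire the tasks in nearly the same order; the reported figure would be an aggregate of such pairwise values over the 45 model pairs.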

8. ✅ Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

Authors: Chuzhan Hao, Wenfeng Feng, Guochao Jiang, Guofeng Quan, Guohua Liu, Yuewei Zhang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08124v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

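This paper addresses the inefficient reasoning trajectories and unstable training of RL-based LLM search agents. It proposes the Hierarchical Experience (HiExp) framework, which regularizes stochastic exploration with experience knowledge and achieves substantial performance gains and strong generalization on several complex search and mathematical reasoning benchmarks.

Abstract (Chinese translation)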

强化学习（RL）已成为通过策略性整合外部搜索引擎来提升大语言模型（LLMs）推理能力的有效方法。然而，当前基于强化学习的搜索智能体通常依赖于由精心设计的结果奖励引导的随机探索过程，这导致推理轨迹效率低下且训练不稳定。为解决这些问题，我们提出了一种新颖的框架——分层经验（HiExp），以提升搜索智能体的性能和训练稳定性。具体而言，我们通过对比分析和多层级聚类机制提取经验知识，将原始推理轨迹转化为分层经验知识。通过利用经验对齐训练，我们有效地规范了随机探索，使其演变为一种策略性且经验驱动的搜索过程。在多个复杂智能体搜索和数学推理基准上的广泛评估表明，我们的方法不仅实现了显著的性能提升，还展现出强大的跨任务和跨算法泛化能力。

Abstract

Reinforcement learning (RL) has become an effective approach for advancing the reasoning capabilities of large language models (LLMs) through the strategic integration of external search engines. However, current RL-based search agents often rely on a process of stochastic exploration guided by carefully crafted outcome rewards, leading to inefficient reasoning trajectories and unstable training. To address these issues, we propose a novel framework, Hierarchical Experience (HiExp), to enhance the performance and training stability of search agents. Specifically, we extract empirical knowledge through contrastive analysis and a multi-level clustering mechanism, transforming raw reasoning trajectories into hierarchical experience knowledge. By leveraging experience-aligned training, we effectively regularize stochastic exploration, evolving it into a strategic and experience-driven search process. Extensive evaluations on multiple complex agentic search and mathematical reasoning benchmarks demonstrate that our approach not only achieves substantial performance gains but also exhibits strong cross-task and cross-algorithm generalization.

Keywords: Large Language Models, Reinforcement Learning, Search Agents, Reasoning Capabilities, Hierarchical Experience, Stochastic Exploration, Mathematical Reasoning, Agentic Search

9. ✅ LogAct: Enabling Agentic Reliability via Shared Logs

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

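This paper proposes LogAct, which deconstructs each agent into a state machine operating over a shared log, to address reliability, failure recovery, and safety control for LLM-driven agents in production environments. Experiments show that LogAct agents recover efficiently from failures, debug their own performance, optimize token usage, and block unwanted actions.

Abstract (Chinese translation)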

智能体是能够以强大且任意方式改变环境的LLM驱动组件。由于异步性和故障，在生产环境中为智能体执行提供保障具有挑战性。本文提出一种名为LogAct的新抽象模型，其中每个智能体都是一个解构的状态机，共同操作一个共享日志。在LogAct中，智能体动作在执行前会显现在共享日志中；可通过可插拔、解耦的投票器在执行前被阻止；并在智能体或环境故障时实现一致性恢复。LogAct支持智能体内省，允许智能体通过LLM推理分析自身执行历史，进而实现语义层面的故障恢复、健康检查与优化变体。在我们的评估中，LogAct智能体能高效正确地从故障中恢复；可自主调试性能表现；在群体中优化令牌使用；并在一个代表性基准测试中，仅以3%的良性效用损失为代价，成功阻止目标模型所有非预期动作。

Abstract

Agents are LLM-driven components that can mutate environments in powerful, arbitrary ways. Extracting guarantees for the execution of agents in production environments can be challenging due to asynchrony and failures. In this paper, we propose a new abstraction called LogAct, where each agent is a deconstructed state machine playing a shared log. In LogAct, agentic actions are visible in the shared log before they are executed; can be stopped prior to execution by pluggable, decoupled voters; and recovered consistently in the case of agent or environment failure. LogAct enables agentic introspection, allowing the agent to analyze its own execution history using LLM inference, which in turn enables semantic variants of recovery, health check, and optimization. In our evaluation, LogAct agents recover efficiently and correctly from failures; debug their own performance; optimize token usage in swarms; and stop all unwanted actions for a target model on a representative benchmark with just a 3% drop in benign utility.

Keywords: LLM Agents, Agent Reliability, Shared Log, Failure Recovery, Agent Introspection, Multi-agent Systems, Agentic Actions, State Machine
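
The abstract above states that actions become visible in a shared log before execution and can be stopped by pluggable voters. The toy sketch below illustrates that ordering only. It is not LogAct's design or API: `LogEntry`, `SharedLog`, `run_action`, the status strings, and the `no_deletes` voter are all hypothetical names invented here, and durability, asynchrony, and recovery are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LogEntry:
    agent: str
    action: str
    status: str = "proposed"  # hypothetical states: proposed, committed, blocked


@dataclass
class SharedLog:
    entries: List[LogEntry] = field(default_factory=list)

    def append(self, entry: LogEntry) -> LogEntry:
        self.entries.append(entry)
        return entry


Voter = Callable[[LogEntry], bool]


def run_action(log: SharedLog, voters: List[Voter], agent: str, action: str,
               execute: Callable[[str], None]) -> LogEntry:
    # 1. Record the intent first, so it is visible before anything runs.
    entry = log.append(LogEntry(agent, action))
    # 2. Every voter inspects the logged intent and may veto it.
    if all(voter(entry) for voter in voters):
        execute(action)
        entry.status = "committed"
    else:
        entry.status = "blocked"
    return entry


def no_deletes(entry: LogEntry) -> bool:
    """Hypothetical safety voter: reject anything that mentions deletion."""
    return "delete" not in entry.action


log = SharedLog()
executed: List[str] = []
run_action(log, [no_deletes], "agent-1", "read config", executed.append)
run_action(log, [no_deletes], "agent-1", "delete database", executed.append)
print(executed, [e.status for e in log.entries])
```

Because the blocked action still appears in the log, a later reader (or the agent itself) can inspect what was attempted, which is the kind of introspection the abstract describes.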

10. ✅ SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

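Scoring rationale: The paper's core topic is the performance degradation of LLMs on long numerical sequences; it proposes the SepSeq framework, which recalibrates attention by inserting separator tokens. It is highly relevant to “Large Language Models” and “Context Window Extension” (10 points each) because it directly targets the long-context capability of LLMs. It is moderately relevant to “KV Cache Compression”, “Speculative Decoding”, and “Mechanistic Interpretability” (5 points each), reflecting its links to inference efficiency, attention-mechanism analysis, and interpretability. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.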

!!! tip deepseek-chat TL;DR

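This paper addresses the performance degradation that LLMs suffer on long numerical sequences because of attention dispersion. It proposes SepSeq, a training-free framework that recalibrates attention by inserting separator tokens, yielding an average relative accuracy improvement of 35.6% across 9 LLMs while reducing inference token consumption by 16.4% on average.

Abstract (Chinese translation)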

尽管基于Transformer的大语言模型（LLM）在理论上支持超长上下文窗口，但在处理长数值序列时会出现严重的性能下降。我们将此问题归因于Softmax机制中的注意力分散现象，该现象阻碍了模型集中注意力。为克服这一局限，我们提出了分离序列（SepSeq）框架——一种无需训练、即插即用的方法，通过策略性地插入分隔符令牌来缓解注意力分散。从机制上，我们证明了分隔符令牌可作为注意力汇聚点，重新校准注意力以聚焦于局部片段，同时保持全局上下文。在9个广泛使用的LLM上进行的大量评估证实了本方法的有效性：SepSeq在多个领域实现了平均35.6%的相对准确率提升，同时将推理过程中的总令牌消耗量平均降低了16.4%。

Abstract

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.

Keywords: Large Language Models, Long Numerical Sequences, Attention Dispersion, Separator Tokens, Training-Free Framework, Inference Efficiency, Context Window, Attention Mechanism
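
The abstract above says SepSeq inserts separator tokens into long numerical sequences but does not say which token is used or where it is placed. The sketch below shows only the general idea at the prompt-formatting level. The fixed segment length, the `" | "` separator, and the function name `insert_separators` are assumptions for illustration, not the paper's settings.

```python
from typing import Sequence


def insert_separators(values: Sequence[float], segment_len: int = 8,
                      separator: str = " | ") -> str:
    """Format a numeric sequence as text, with a separator between fixed-size segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Split the sequence into consecutive segments of at most segment_len items.
    segments = [values[i:i + segment_len] for i in range(0, len(values), segment_len)]
    # Join numbers inside a segment with commas, and segments with the separator.
    return separator.join(", ".join(str(v) for v in segment) for segment in segments)


prompt_fragment = insert_separators(list(range(1, 11)), segment_len=4)
print(prompt_fragment)
```

The resulting string would replace the raw comma-separated sequence in the prompt; since the method is described as training-free and plug-and-play, no model changes are implied.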

11. ✅ Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

Authors: Jun Seo, Sangwon Ryu, Heejin Do, Hyounghun Kim, Gary Geunbae Lee Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08260v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

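This study addresses the fact that knowledge tracing overlooks the procedural dynamics of problem solving. It proposes Behavior-Aware Item Modeling (BAIM), which uses a reasoning language model to decompose solutions into stages and adaptively routes the stage representations, consistently outperforming pretraining-based baselines on multiple datasets.

Abstract (Chinese translation)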

知识追踪（Knowledge Tracing，KT）旨在通过学习者过往的交互记录预测其未来表现。尽管近期基于知识组件（Knowledge Components）对齐的题目表征学习方法已提升了KT模型的性能，但这些方法往往忽略了问题解决过程中的程序性动态。为此，我们提出行为感知的题目建模（Behavior-Aware Item Modeling，BAIM）框架，该框架通过整合动态的解题过程信息来丰富题目表征。BAIM利用推理语言模型将每道题目的解答过程分解为四个问题解决阶段（即理解、计划、执行与回顾），其教学理论基础源于波利亚（Polya）的问题解决框架。具体而言，BAIM从每个阶段的嵌入轨迹中提取阶段级表征，以捕捉超越表面特征的潜在信号。为反映学习者的异质性，BAIM通过自适应路由机制处理这些分阶段表征，并在KT骨干网络中引入上下文条件机制，使得不同学习者的解题过程可侧重不同的阶段。在XES3G5M和NIPS34数据集上的实验表明，BAIM持续优于基于预训练的强基线模型，且在重复学习者交互场景下取得了尤为显著的性能提升。

Abstract

Knowledge Tracing (KT) aims to predict learners’ future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item’s solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya’s framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

12. ✅ What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Authors: Stephen Cheng, Sarah Wiegreffe, Dinesh Manocha Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08524v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

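This paper studies how steering vectors in LLMs change model outputs (such as refusal behavior) by acting on the OV circuit of the attention mechanism, and finds that these vectors can be heavily sparsified while retaining most performance.

Abstract (Chinese translation)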

将导向向量应用于大语言模型是一种高效且有效的模型对齐技术，但我们缺乏对其工作原理的可解释性说明——具体而言，导向向量影响了哪些内部机制，以及这如何导致不同的模型输出。为探究导向向量有效性的因果机制，我们以拒绝行为为例进行了全面的案例研究。我们提出了一个多令牌激活修补框架，并发现不同的导向方法在应用于同一层时会利用功能上可互换的电路。这些电路表明，导向向量主要通过OV电路与注意力机制交互，而很大程度上忽略了QK电路——在两种模型系列中，在导向过程中冻结所有注意力分数仅导致性能下降8.75%。对受导向OV电路的数学分解进一步揭示了语义上可解释的概念，即使在导向向量本身不具备可解释性的情况下也是如此。利用激活修补结果，我们证明导向向量可稀疏化高达90-99%的同时保留大部分性能，且不同的导向方法在重要维度子集上具有一致性。

Abstract

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.

Keywords: steering vectors, large language models, alignment, mechanistic interpretability, activation patching, attention mechanism, refusal, sparsification
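
To make the abstract's two operations concrete, the sketch below applies a steering vector in the common additive form and sparsifies it first. Note the swapped-in selection rule: the paper picks dimensions from activation patching results, whereas this sketch simply keeps the largest-magnitude entries as a stand-in. The vector values, the scale `alpha`, and the function names are all invented for illustration.

```python
from typing import List


def sparsify(vector: List[float], keep_fraction: float) -> List[float]:
    """Zero all but the largest-magnitude entries (stand-in selection rule)."""
    keep = max(1, round(len(vector) * keep_fraction))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(ranked[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(vector)]


def steer(hidden: List[float], steering_vector: List[float], alpha: float = 1.0) -> List[float]:
    """Additive steering: shift a hidden state along the steering vector."""
    return [h + alpha * s for h, s in zip(hidden, steering_vector)]


# Hypothetical 10-dimensional steering vector and hidden state.
steering_vector = [0.05, -2.0, 0.1, 0.0, 1.5, -0.02, 0.3, 0.01, -0.04, 0.2]
hidden = [1.0] * 10

# Keep 20% of the dimensions, i.e. zero out 80% of the vector.
sparse_vector = sparsify(steering_vector, keep_fraction=0.2)
steered = steer(hidden, sparse_vector, alpha=0.5)
print(sparse_vector)
print(steered)
```

In a real model the hidden state would be a layer's residual-stream activation, and whether a sparsified vector retains its effect would have to be measured on the behavior of interest, as the paper does for refusal.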

13. ✅ ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer

Authors: Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08355v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

This paper addresses the difficulty RL agents have in generalizing knowledge to structurally similar new tasks. It uses a large language model (LLM) as a dynamic semantic operator that semantically remaps observation descriptions, enabling zero-shot policy transfer across a broad range of complex and genuinely novel analogous tasks.

Abstract (Chinese translation)

强化学习（RL）智能体通常难以将知识泛化到新任务中，即便是那些与其已掌握任务结构相似的任务。尽管近期研究尝试通过零样本迁移来缓解这一问题，但这些方法往往受限于预定义的离散类别系统，限制了其对新颖或组合性任务变体的适应能力。我们提出了一种显著更通用的方法，通过文本条件变分自编码器（VAE）以自然语言条件替代离散潜变量。我们的核心创新在于测试时利用大型语言模型（LLM）作为动态的语义操作器。该方法不依赖固定规则，而是通过查询LLM将当前观测的描述进行语义重映射，以与源任务对齐。这种源任务对齐的文本描述作为条件输入VAE，生成与智能体原始训练兼容的想象状态，从而实现策略的直接复用。通过利用LLM的灵活推理能力，我们的方法能够在广泛复杂且真正新颖的类比任务中实现零样本迁移，突破了固定类别映射的局限性。代码和视频可在此处获取：https://anonymous.4open.science/r/ASPECT-85C3/

Abstract

Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero-shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text-conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic semantic operator at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source-aligned caption conditions the VAE to generate an imagined state compatible with the agent’s original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero-shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available at https://anonymous.4open.science/r/ASPECT-85C3/.

Keywords: Reinforcement Learning, Zero-shot Transfer, Large Language Model (LLM), Semantic Operator, Analogical Tasks, Policy Reuse, Variational Autoencoder (VAE), Natural Language Conditioning

14. ✅ TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

Authors: Xinliang Frederick Zhang, Lu Wang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07894v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
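The total in this table can be checked against the pass mark defined in the report's scoring settings (5.0 + 0.8 × total weight). The sketch below is a minimal illustration, assuming each row's score is weight × relevance (which matches the rows shown) and that a paper passes when its total is at least the pass mark. The function names are illustrative and not taken from the report's tooling.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark formula quoted in the report: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight x relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords with weight 1.0; three rows have relevance 10.0, the rest 0.0.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24
threshold = pass_mark(total_weight=27.0)
score = paper_score(rows)
passed = score >= threshold  # assumption: ties count as a pass
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

With three keywords at full relevance, the total of 30.0 clears the 26.6 pass mark, which agrees with the ✅ shown for this entry.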

!!! tip deepseek-chat TL;DR

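This study addresses the performance limits that personalized large language models face on long-horizon tasks, caused by rigid memory mechanisms and the quality-efficiency tradeoff of the RAG paradigm. It proposes TSUBASA, which improves memory writing through dynamic memory evolution and memory reading through self-learning with context distillation, outperforming existing memory-augmented systems on multiple benchmarks and achieving Pareto improvements.

Abstract (Chinese translation)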

个性化大语言模型（PLLMs）因其能够使输出与个人需求和偏好保持一致而受到广泛关注。然而，它们在处理长视野任务时仍面临挑战，例如追踪用户长期的对话或活动历史。现有的记忆机制往往难以捕捉动态演化的行为，而检索增强生成（RAG）范式则受限于质量与效率之间的权衡。同时，由于标注数据稀缺，参数化适应方法因训练与推理之间的差距而遭遇瓶颈。为增强PLLMs的长视野能力，我们提出了TSUBASA——一种双管齐下的方法：一方面通过动态记忆演化改进记忆写入，另一方面通过以情境蒸馏目标驱动的自学习机制来内化用户体验，从而优化记忆读取。基于Qwen-3模型系列（4B至32B参数）在长视野基准测试上的广泛评估验证了TSUBASA的有效性，其性能超越了主要依赖记忆写入的竞争性记忆增强系统（如Mem0和Memory-R1）。我们的分析进一步证实，TSUBASA突破了质量-效率壁垒，实现了帕累托改进，能够以更低的令牌预算实现鲁棒且高保真度的个性化服务。

Abstract

Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individual’s needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user’s extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirms that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.

Keywords: Personalized Large Language Models, Long-horizon Personalization, Memory Evolution, Self-learning, Context Distillation, Retrieval-Augmented Generation, Pareto Improvement, Qwen-3
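The TSUBASA abstract names a context distillation objective but does not spell out the loss. One common formulation, assumed here rather than taken from the paper, trains a student that does not see the user context to match the next-token distribution of a teacher that does, by minimizing the KL divergence between the two. All distributions below are made-up toy values.

```python
import math


def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """KL(teacher || student) for a single next-token distribution."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
teacher_with_context = [0.7, 0.2, 0.1]   # model that sees the user's history
student_before = [0.4, 0.3, 0.3]         # same model without the history
student_after = [0.68, 0.21, 0.11]       # after some distillation steps

loss_before = kl_divergence(teacher_with_context, student_before)
loss_after = kl_divergence(teacher_with_context, student_after)
print(loss_before > loss_after)
```

A lower loss means the context-free student reproduces more of what the context-aware teacher would say, which is the sense in which user experience gets "internalized" and the token budget at inference can shrink.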

15. ✅ Emotion Concepts and their Function in a Large Language Model

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

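This paper studies internal representations of emotion concepts in a large language model (Claude Sonnet 4.5) and their causal effects. It finds that these representations track the emotion concepts at play in a conversation and influence the model's outputs, including its preferences and its rate of misaligned behaviors such as reward hacking, blackmail, and sycophancy.

Abstract (Chinese translation)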

大型语言模型（LLM）有时似乎会表现出情绪反应。我们以Claude Sonnet 4.5为例，探究了其背后的原因，并探讨了其对对齐相关行为的影响。我们发现模型内部存在情绪概念的表征，这些表征编码了特定情绪的广义概念，并能泛化至与其相关的不同情境和行为。这些表征会追踪对话中特定标记位置所涉及的情绪概念，根据该情绪与当前语境处理的相关性而激活，并用于预测后续文本。我们的关键发现是，这些表征会因果性地影响LLM的输出，包括Claude的偏好以及其表现出未对齐行为（如奖励黑客攻击、勒索和谄媚）的频率。我们将这种现象称为LLM表现出功能性情绪：即模仿人类在情绪影响下的表达和行为模式，这些模式由底层抽象的情绪概念表征所中介。功能性情绪可能与人类情绪的工作机制大不相同，且并不意味着LLM具有任何情绪的主观体验，但它们对于理解模型的行为似乎至关重要。

Abstract

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.

Keywords: Large Language Models, Emotion Concepts, Internal Representations, Alignment, Causal Influence, Misaligned Behaviors, Functional Emotions, Claude Sonnet

16. ✅ AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

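AdaSpark proposes an adaptive sparsity framework that selectively attends to video cubes and tokens, reducing the computational load of Video-LLMs by up to 57% FLOPs while maintaining performance comparable to dense models and preserving long-range dependencies.

Abstract (Chinese translation)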

处理长视频时，使用视频大语言模型（Video-LLMs）在计算上成本极高。现有高效方法通常通过不可逆的信息丢弃来牺牲细粒度感知能力，或通过僵化、预定义的稀疏模式抑制长程时序建模。本文提出AdaSpark，一种自适应稀疏性框架，旨在解决这些局限。AdaSpark首先将视频输入分割为三维时空立方体，随后采用两个协同设计的上下文感知组件：（1）自适应立方体选择性注意力（Adaptive Cube-Selective Attention, AdaS-Attn），该组件为每个查询标记自适应地选择相关的视频立方体子集进行处理；（2）自适应标记选择性前馈网络（Adaptive Token-Selective FFN, AdaS-FFN），该网络仅选择性地处理每个立方体内最显著的标记。基于熵的（Top-p）选择机制可根据输入复杂度自适应分配计算资源。实验表明，AdaSpark在具有挑战性的小时级视频基准测试中，能够显著降低计算负载（最高减少57%的浮点运算量），同时保持与密集模型相当的性能，并保留细粒度的长程依赖关系。

Abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.

Keywords: Video Large Language Models, Adaptive Sparsity, Efficient Video Understanding, Long-form Videos, Computational Efficiency, Sparse Models, Fine-grained Perception, Temporal Modeling
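The AdaSpark abstract mentions an entropy-based (Top-p) selection mechanism that allocates compute according to input complexity, without giving details. The sketch below illustrates generic Top-p selection over relevance scores: it keeps the smallest set of candidates whose softmax mass reaches `p`. How the paper actually scores cubes and sets `p` is not stated in the abstract, so the scores and threshold here are assumptions for illustration.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def top_p_select(scores: list[float], p: float = 0.9) -> list[int]:
    """Indices of the smallest set of items whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)


# Hypothetical query-to-cube relevance scores for six video cubes.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # one cube clearly dominates
flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]     # no cube stands out

print(len(top_p_select(peaked)), len(top_p_select(flat)))
```

A peaked (low-entropy) score distribution selects very few cubes, while a flat (high-entropy) one selects many, so the amount of computation adapts to the input instead of following a fixed top-k budget.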

17. ❌ ACF: A Collaborative Framework for Agent Covert Communication under Cognitive Asymmetry

Authors: Wansheng Wu, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08276v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

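Scoring rationale: The paper studies covert communication in autonomous agent networks and proposes the ACF framework to address cognitive asymmetry. Highly relevant keywords: (1) “LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow” (10 points): the paper's core subject is autonomous agent networks; (2) “Multi-agent Systems” OR “Agent Coordination” (10 points): the paper involves a multi-agent collaborative framework; (3) “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points): the paper mentions the evolution of generative AI, implying a large-model background, but this is not its technical core. Other keywords, such as MoE, SFT, RAG, reasoning methods, and model compression, are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper targets the failure of covert communication in autonomous agent networks caused by cognitive asymmetry. It proposes the ACF framework, which decouples covert communication from semantic reasoning via orthogonal statistical and cognitive layers, enabling reliable secret extraction and effective information capacity guarantees under severe cognitive asymmetry.

Abstract (Chinese translation)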

随着生成式人工智能的发展，自主智能体网络为交互式隐蔽通信提供了一种强大的范式。然而，由于智能体通过环境交互动态更新内部记忆，现有方法面临一个关键的结构性漏洞：认知不对称性。传统方法要求严格的认知对称性，即编码器与解码器必须拥有完全相同的序列前缀。在动态部署中，不可避免的前缀差异会破坏同步，导致严重的信道性能退化。为应对这一认知不对称的核心挑战，我们提出了非对称协作框架，该框架通过正交的统计层与认知层，在结构上将隐蔽通信与语义推理解耦。通过采用一种由共享隐写配置主导的、独立于前缀的解码范式，ACF 消除了对认知对称性的依赖。在现实增强记忆工作流上的评估表明，在严重的认知不对称条件下，对称基线方法遭受严重的信道退化，而 ACF 在语义保真度与隐蔽通信两方面均表现卓越。它保持了计算不可区分性，能够以可证明的误差界限实现可靠的秘密信息提取，并为现代智能体网络提供了鲁棒的有效信息容量保证。

Abstract

As generative artificial intelligence evolves, autonomous agent networks present a powerful paradigm for interactive covert communication. However, because agents dynamically update internal memories via environmental interactions, existing methods face a critical structural vulnerability: cognitive asymmetry. Conventional approaches demand strict cognitive symmetry, requiring identical sequence prefixes between the encoder and decoder. In dynamic deployments, inevitable prefix discrepancies destroy synchronization, inducing severe channel degradation. To address this core challenge of cognitive asymmetry, we propose the Asymmetric Collaborative Framework (ACF), which structurally decouples covert communication from semantic reasoning via orthogonal statistical and cognitive layers. By deploying a prefix-independent decoding paradigm governed by a shared steganographic configuration, ACF eliminates the reliance on cognitive symmetry. Evaluations on realistic memory-augmented workflows demonstrate that under severe cognitive asymmetry, symmetric baselines suffer severe channel degradation, whereas ACF uniquely excels across both semantic fidelity and covert communication. It maintains computational indistinguishability, enabling reliable secret extraction with provable error bounds, and providing robust Effective Information Capacity guarantees for modern agent networks.

Keywords: autonomous agent networks, covert communication, cognitive asymmetry, asymmetric collaborative framework, steganographic configuration, memory-augmented workflows, effective information capacity, agent coordination

18. ❌ Sensitivity-Positional Co-Localization in GQA Transformers

Authors: Manoj Chandrashekar Rao Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07766v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

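Scoring rationale: The paper's core is the structural properties of GQA Transformers, in particular the relationship between task-sensitive layers and the layers where positional encoding adaptation matters most, and it proposes two methods, LSLORA and GARFA. Highly relevant (10 points): (1) “Large Language Models” OR “LLMs” OR “Foundation Models”: the study is based on Llama 3.1 8B; (2) “PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”: the core method LSLORA is a LoRA-based refinement. Moderately relevant (5 points): “Mechanistic Interpretability” OR “Explainable AI”: probing the Transformer's internal structure (the relationship between sensitive layers and positional encoding layers) falls within interpretability research. The remaining keywords have no direct connection to the paper's content (structural analysis, positional encoding, fine-tuning methods) and score 0.

!!! tip deepseek-chat TL;DR

This paper tests the hypothesis that task-sensitive layers and positional-encoding adaptation layers are co-localized in GQA Transformers. It finds strong anti-localization instead, and proposes two targeted methods, LSLORA and GARFA, which significantly improve model performance across multiple benchmarks.

Abstract (Chinese translation)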

本研究探讨分组查询注意力（GQA）变换器中的一个基础结构问题：对任务正确性最敏感的层，是否与位置编码适应最具影响力的层相重合？我们将此称为共定位假说，并在Llama 3.1 8B模型（一个具有32层、查询头与键值头比例为4:1的GQA模型）上对其进行检验。我们引入了LSLORA（该技术将LoRA适应限制在通过一种新颖的正确性差分隐藏状态度量所识别的层中）以及GARFA（GQA感知的RoPE频率适应，该方法为每个目标层附加8个可学习的、针对每个键值头的标量乘数）。与共定位假说相反，我们发现了强烈的反定位现象：任务敏感层集中在网络后期（$\ell\in\{23\text{-}31\}$），而受RoPE影响显著的层则主导网络早期（$\ell\in\{0\text{-}9\}$），其斯皮尔曼相关系数$r_s = -0.735$（$p = 1.66\times10^{-6}$）。尽管存在这种反定位现象，一项四向跨层消融实验表明，在通过敏感性识别出的层中同时应用两种干预策略，在六个多样化基准测试（MMLU、GPQA、HumanEval+、MATH、MGSM、ARC）上均优于所有其他配置方案4-16个百分点，在总计100美元的计算成本下，于HumanEval+上接近Claude 3.5 Haiku的性能（67.1% 对比 68.3%）。

Abstract

We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.

Keywords: Grouped Query Attention, GQA Transformers, positional encoding, LoRA adaptation, sensitivity analysis, RoPE frequency adaptation, layer-wise analysis, Llama 3.1
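The abstract describes GARFA as attaching 8 learnable per-KV-head scalar multipliers to each targeted layer, which fits a 4:1 GQA layout in which every group of four query heads shares one KV head. The sketch below shows one plausible reading, assumed here and not confirmed by the abstract: each scalar rescales the standard RoPE inverse frequencies of its KV head. The tiny `head_dim`, the `base` value, and the identity initialization are illustrative choices only.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in 0..d/2-1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def scaled_angles(position: int, inv_freq: list[float],
                  kv_scales: list[float]) -> list[list[float]]:
    """Rotation angles per KV head: position * scale_h * inv_freq_i."""
    return [[position * s * f for f in inv_freq] for s in kv_scales]


def kv_head_for_query_head(q_head: int, group_size: int = 4) -> int:
    """GQA mapping: each group of `group_size` query heads shares one KV head."""
    return q_head // group_size


num_kv_heads = 8
kv_scales = [1.0] * num_kv_heads          # assumed identity init: plain RoPE
inv_freq = rope_inv_freq(head_dim=8)      # toy head size for readability
angles = scaled_angles(position=3, inv_freq=inv_freq, kv_scales=kv_scales)

print(len(angles), len(angles[0]), kv_head_for_query_head(31))
```

Under this reading, training only moves 8 scalars per targeted layer, so the positional intervention adds a negligible number of parameters next to the LoRA weights.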

19. ❌ InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Authors: Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08337v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

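Scoring rationale: InstAP focuses on vision-language pre-training (VLP). Its core contribution is an instance-aware pre-training framework that improves spatial-temporal understanding by jointly optimizing global vision-text alignment and fine-grained, instance-level contrastive alignment. This is highly relevant to the keyword “Pre-training” (10 points), since pre-training is the paper's core method. The paper involves vision-language models, an application of foundation models, so it is somewhat related to “Large Language Models” (5 points). The other keywords target specific topics such as pure language models, reasoning, alignment, optimization, and AI for science applications, which the paper does not cover, so they score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the weakness of existing vision-language pre-training models at instance-level reasoning. It proposes InstAP, an instance-aware pre-training framework that jointly optimizes global and instance-level alignment, achieving significant gains on both instance retrieval and global video understanding tasks.

Abstract (Chinese translation)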

当前视觉-语言预训练范式在全局场景理解方面表现优异，但由于仅依赖全局监督，在实例级推理任务上存在困难。我们提出InstAP，一种实例感知预训练框架，通过将文本提及内容关联至特定时空区域，联合优化全局视觉-文本对齐与细粒度实例级对比对齐。为支持此框架，我们构建了InstVL大规模数据集（包含200万张图像与5万个视频），该数据集具备双重粒度标注：整体场景描述和密集的、基于空间定位的实例描述。在InstVL基准测试中，InstAP在实例级检索任务上显著优于现有视觉-语言预训练模型，同时超越使用完全相同数据训练的强大视觉-语言预训练基线，这验证了我们实例感知训练目标的独立优势。此外，以实例为中心的预训练能提升全局理解能力：InstAP在包括MSR-VTT和DiDeMo在内的多个视频基准测试中实现了具有竞争力的零样本性能。定性可视化结果进一步表明，InstAP能够将文本提及内容准确定位至对应实例，而仅依赖全局训练的模型则表现出更分散、场景层面的注意力分布。

Abstract

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.

Keywords: Vision-Language Pre-training, Instance-Aware, Spatial-Temporal Understanding, Contrastive Alignment, Instance-Level Retrieval, Zero-shot Performance, InstVL Dataset
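The InstAP abstract describes joint optimization of global vision-text alignment and instance-level contrastive alignment but does not give the loss. A minimal sketch follows, assuming an InfoNCE-style contrastive loss for both terms and a hypothetical weighting factor `instance_weight`. The similarity matrices are toy values in which row i's matching caption is column i.

```python
import math


def info_nce(sim: list[list[float]], temperature: float = 0.07) -> float:
    """Mean cross-entropy where row i's positive pair is column i."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def joint_loss(global_sim: list[list[float]], instance_sim: list[list[float]],
               instance_weight: float = 1.0) -> float:
    """Global alignment loss plus a weighted instance-level loss (assumed form)."""
    return info_nce(global_sim) + instance_weight * info_nce(instance_sim)


# Toy cosine similarities (hypothetical): scenes vs captions, regions vs mentions.
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each item is closest to its own text
confused = [[0.0, 1.0], [1.0, 0.0]]    # each item is closest to the wrong text

print(info_nce(aligned) < info_nce(confused))
```

A model can score well on the global term while the instance term stays high, which is the gap the instance-aware objective is meant to close.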

20. ❌ Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance

Authors: Abdelkarim Loukili Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08474v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

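Scoring rationale: The paper studies how quantization (INT4/INT2) in federated learning affects the accuracy and communication efficiency of aerospace predictive maintenance models. Its core is model compression (quantization), which is highly relevant to the keyword “Quantization” (10 points). The paper applies AI to science and engineering (aerospace), so it is somewhat related to “AI for Science” (5 points). The paper does not cover large language models, MoE, small models, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, long context, attention optimization, reasoning methods, agents, tool use, multi-agent systems, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning; these keywords all score 0.

!!! tip deepseek-chat TL;DR

This paper studies how symmetric uniform quantization (INT4/INT2) affects the trade-off between model accuracy and gradient communication efficiency in federated learning for aerospace predictive maintenance. It finds that INT4 cuts communication cost by 8× while keeping accuracy statistically indistinguishable from FP32, whereas INT2 is unsuitable because of its instability.

Abstract (Chinese translation)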

联邦学习（FL）能够在分布式航空航天机队中实现隐私保护的预测性维护，但梯度通信开销限制了其在带宽受限的物联网节点上的部署。本文研究了对称均匀量化（$b \in \{32,8,4,2\}$ 比特）对定制设计的轻量级一维卷积模型（AeroConv1D，9,697 个参数）在真实非独立同分布（Non-IID）客户端划分下，基于 NASA C-MAPSS 基准数据通过 FL 训练时准确率与效率权衡的影响。通过严格的多随机种子评估（$N=10$ 个种子），我们发现 INT4 在 FD001（$p=0.341$）和 FD002（$p=0.264$ 平均绝对误差，$p=0.534$ NASA 评分）上达到的准确率与 FP32 在统计上无显著差异，同时将每轮梯度通信成本降低了 $8\times$（从 37.88 KiB 降至 4.73 KiB）。一个关键的方法论发现是，简单的独立同分布（IID）客户端划分会人为抑制方差；正确的非独立同分布评估揭示了极端量化的真实运行不稳定性，这通过直接的 IID 与非独立同分布实证对比得以证明。INT2 被实证认定为不适用：尽管它通过极端量化诱导的过度正则化在 FD002 上获得了更低的平均绝对误差，但这种表面上的收益伴随着 NASA 评分的灾难性不稳定（变异系数 = 45.8% 对比 FP32 的 22.3%），证实了其在异构运行条件下的不可复现性。基于 Xilinx ZCU102 的分析性现场可编程门阵列（FPGA）资源预估证实，INT4 符合硬件约束（85.5% 数字信号处理器利用率），有望在单个片上系统（SoC）上实现完整的联邦学习流程。完整的仿真代码库与 FPGA 评估脚本已公开于 https://github.com/therealdeadbeef/aerospace-fl-quantization 。

Abstract

Federated learning (FL) enables privacy-preserving predictive maintenance across distributed aerospace fleets, but gradient communication overhead constrains deployment on bandwidth-limited IoT nodes. This paper investigates the impact of symmetric uniform quantization ($b \in \{32,8,4,2\}$ bits) on the accuracy–efficiency trade-off of a custom-designed lightweight 1-D convolutional model (AeroConv1D, 9,697 parameters) trained via FL on the NASA C-MAPSS benchmark under a realistic Non-IID client partition. Using a rigorous multi-seed evaluation ($N=10$ seeds), we show that INT4 achieves accuracy statistically indistinguishable from FP32 on both FD001 ($p=0.341$) and FD002 ($p=0.264$ MAE, $p=0.534$ NASA score) while delivering an $8\times$ reduction in gradient communication cost (37.88 KiB $\to$ 4.73 KiB per round). A key methodological finding is that naïve IID client partitioning artificially suppresses variance; correct Non-IID evaluation reveals the true operational instability of extreme quantization, demonstrated via a direct empirical IID vs. Non-IID comparison. INT2 is empirically characterized as unsuitable: while it achieves lower MAE on FD002 through extreme quantization-induced over-regularization, this apparent gain is accompanied by catastrophic NASA score instability (CV = 45.8% vs. 22.3% for FP32), confirming non-reproducibility under heterogeneous operating conditions. Analytical FPGA resource projections on the Xilinx ZCU102 confirm that INT4 fits within hardware constraints (85.5% DSP utilization), potentially enabling a complete FL pipeline on a single SoC. The full simulation codebase and FPGA estimation scripts are publicly available at https://github.com/therealdeadbeef/aerospace-fl-quantization.

Keywords: Federated Learning, Quantization, Predictive Maintenance, Aerospace, Communication Efficiency, Model Compression, Non-IID Data, FPGA Deployment
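The communication figures in the abstract follow directly from the parameter count: 9,697 parameters at 32 bits each is about 37.88 KiB per round, and at 4 bits about 4.73 KiB. The sketch below reproduces that arithmetic and adds a generic symmetric uniform quantizer for illustration. The quantizer is a textbook per-tensor formulation, not the paper's code, and the gradient values are made up.

```python
def comm_cost_kib(num_params: int, bits: int) -> float:
    """Payload size in KiB when each parameter is sent with `bits` bits."""
    return num_params * bits / 8 / 1024


def quantize_symmetric(values: list[float], bits: int) -> tuple[list[float], float]:
    """Symmetric uniform quantization; returns dequantized values and step size."""
    qmax = 2 ** (bits - 1) - 1            # 7 for INT4, 1 for INT2
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values), 0.0
    scale = max_abs / qmax
    dequantized = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return dequantized, scale


NUM_PARAMS = 9_697                         # AeroConv1D size quoted in the abstract
fp32_kib = comm_cost_kib(NUM_PARAMS, 32)
int4_kib = comm_cost_kib(NUM_PARAMS, 4)
print(f"{fp32_kib:.2f} KiB -> {int4_kib:.2f} KiB ({fp32_kib / int4_kib:.0f}x)")
# prints: 37.88 KiB -> 4.73 KiB (8x)

grads = [0.31, -0.92, 0.05, 0.48]          # made-up gradient values
deq, step = quantize_symmetric(grads, bits=4)
```

With this quantizer INT2 leaves only three levels (-1, 0, +1 times the step), which gives some intuition for why the abstract reports it as unstable even where its MAE looks better.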

21. ❌ From Gaze to Guidance: Interpreting and Adapting to Users’ Cognitive Needs with Multimodal Gaze-Aware AI Assistants

Authors: Valdemar Danry, Javier Hernandez, Andrew Wilson, Pattie Maes, Judith Amores Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08062v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

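Score rationale: The paper's core contribution is a multimodal (gaze-tracking) LLM assistant that better understands and responds to users' cognitive needs. It is therefore highly relevant to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), since it explicitly uses an LLM as its core component and studies its application in human-computer interaction. However, it does not examine LLM internals (such as MoE, scaling laws, training methods, inference optimization, alignment, or agent systems), nor does it address a specific scientific domain (such as bioinformatics), so all other keywords score 0. Its novelty lies in combining gaze tracking with LLMs: an innovative application of LLMs to a specific setting (human-computer interaction and cognitive assistance) rather than an advance in LLM technology itself.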
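The score tables in this report can be reproduced with a short calculation. The sketch below uses the report's stated settings (pass mark 5.0 + 0.8 × total weight, at most 15 points per keyword). It assumes that each keyword's score is weight × relevance, which matches the rows shown but is not stated explicitly in the report; the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's formula: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores.

    rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale.
    Assumption: per-keyword score = weight * relevance, capped at the maximum.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# One keyword at relevance 10, the other 26 at relevance 0, all weights 1.0.
rows = [(1.0, 10.0)] + [(1.0, 0.0)] * 26
threshold = pass_threshold(total_weight=27.0)
score = total_score(rows)
passed = score >= threshold
```

With 27 keywords of weight 1.0, a single fully relevant keyword is far from enough to pass; under this reading a paper needs roughly three keywords at full relevance.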

!!! tip deepseek-chat TL;DR

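This study proposes a gaze-tracking-based multimodal LLM assistant that analyzes users' gaze behavior to identify cognitive difficulties and provide targeted help. Experiments show that, compared with a conventional text-only LLM assistant, it performs better in assessment accuracy, personalization, and information recall, with more efficient interaction.

Abstract (Chinese translation)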

当前的大型语言模型助手在回答问题上能力强大，但其获取行为情境的能力有限，难以洞察用户何时何地遇到困难。我们提出了一种基于视线追踪的多模态大型语言模型助手，该助手利用带有视线叠加层的自我中心视角视频来识别用户可能遇到的难点，并提供针对性的回顾性辅助支持。我们通过一项对照研究（样本量n=36）对这一构想进行实证，比较了具备视线感知能力的AI助手与纯文本大型语言模型助手的表现。与传统大型语言模型助手相比，视线感知助手在评估用户阅读行为时被评价为显著更准确且更具个性化，并显著提升了用户的信息回忆能力。用户与视线感知助手的交互所需言语量显著减少，表明交互效率更高。定性分析结果既揭示了其在理解能力上的感知优势，也指出了当视线行为解读不准确时面临的挑战。我们的研究结果表明，具备视线感知能力的大型语言模型助手能够推理用户的认知需求，从而改善用户的认知结果。

Abstract

Current LLM assistants are powerful at answering questions, but they have limited access to the behavioral context that reveals when and where a user is struggling. We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance. We instantiate this vision in a controlled study (n=36) comparing the gaze-aware AI assistant to a text-only LLM assistant. Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate and personalized in its assessments of users’ reading behavior and significantly improved people’s ability to recall information. Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions. Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate. Our findings suggest that gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.

Keywords: gaze-aware AI assistant, multimodal LLM, egocentric video, cognitive needs, retrospective assistance, reading behavior, user interaction, recall improvement

22. ❌ Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Authors: Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, Vladimir Braverman Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08527v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

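Score rationale: The paper focuses on LLM training methods, specifically length inflation and training instability in on-policy distillation (OPD), and proposes StableOPD as a remedy. It is therefore highly relevant only to the first keyword, “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), because its core is an LLM training technique. The other keywords (MoE, SLMs, Scaling Laws, Pre-training, Post-training, Instruction Tuning, RLHF, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Quantization, Inference, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are neither mentioned in the title or abstract nor directly related, so they score 0.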

!!! tip deepseek-chat TL;DR

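The paper studies the length inflation and training instability that arise when training large language models with on-policy distillation. It proposes the StableOPD framework, which mitigates them by combining a reference-based divergence constraint with rollout mixture distillation, improving performance on math reasoning datasets by 7.2% on average.

Abstract (Chinese translation)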

同策略蒸馏（On-policy Distillation, OPD）让学生在模型自身诱导的分布下进行训练，同时利用更强教师模型的监督。我们发现OPD存在一种失效模式：随着训练进行，同策略采样轨迹可能出现突发性的长度膨胀，导致截断轨迹主导训练数据。这种截断崩溃现象与突发性重复饱和同时发生，并引发梯度信号偏差，进而导致严重的训练不稳定性和验证性能急剧下降。我们将此问题归因于学生诱导的数据收集与蒸馏目标之间的相互作用，该机制隐式地偏好长且重复的轨迹。为解决这一问题，我们提出StableOPD——一个稳定的OPD框架，它结合了基于参考模型的发散约束与轨迹混合蒸馏。这两种机制共同缓解了重复诱导的长度膨胀，并进一步稳定了OPD训练。在多个数学推理数据集上的实验表明，我们的方法有效防止了截断崩溃，稳定了训练动态，并将平均性能提升了7.2%。

Abstract

On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.

Keywords: On-policy distillation, Large Language Models, length inflation, training instability, truncation collapse, StableOPD, math reasoning, rollout mixture distillation
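The failure mode described in the abstract above (truncated, repetitive rollouts coming to dominate the training data) suggests two simple quantities to monitor during training. The sketch below is a diagnostic illustration only, not the StableOPD method; the function names, the n-gram size, and the convention that a rollout counts as truncated once it reaches the generation cap are all assumptions.

```python
def truncation_rate(rollouts, max_len):
    """Fraction of rollouts that reached the generation cap (treated as truncated)."""
    return sum(len(r) >= max_len for r in rollouts) / len(rollouts)


def repeated_ngram_fraction(tokens, n=3):
    """Share of n-grams in one rollout that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Toy batch of token-id rollouts with a generation cap of 8 tokens.
batch = [[1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8], [5] * 8]
batch_truncation = truncation_rate(batch, max_len=8)
batch_repetition = [repeated_ngram_fraction(r) for r in batch]
```

A sharp, simultaneous rise in both quantities over training steps would be consistent with the truncation collapse the abstract describes.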

23. ❌ Equivariant Efficient Joint Discrete and Continuous MeanFlow for Molecular Graph Generation

Authors: Rongjian Xu, Teng Pang, Zhiqiang Dong, Guoqiang Wu Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08189v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

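Score rationale: The paper focuses on a generative modeling method for molecular graph generation (EQUIMF), which falls under AI for science. It is highly relevant to the keyword “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, because molecular graph generation is a core task in bioinformatics and cheminformatics. However, it does not involve any large language model (LLM) techniques, training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or model compression. All other keywords relate directly to LLMs or associated techniques, whereas this paper studies a flow-matching-based graph generation model, a different generative modeling paradigm, so those keywords score 0.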

!!! tip deepseek-chat TL;DR

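The paper proposes EQUIMF, an SE(3)-equivariant generative framework that jointly models the discrete topology and continuous geometry of molecular graphs to address the shortcomings of existing methods in physical consistency and sampling efficiency. Experiments show that it outperforms prior diffusion and flow-matching methods in generation quality, physical validity, and sampling efficiency.

Abstract (Chinese translation)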

图结构数据同时包含离散拓扑与连续几何特征，由于存在异质分布、不相容的噪声动态以及对等变归纳偏置的需求，这为生成建模带来了根本性挑战。现有的图生成流匹配方法通常将结构与几何解耦，缺乏跨域同步动态，且依赖迭代采样，往往导致物理不一致的分子构象和缓慢的采样速度。为克服这些局限，我们提出等变均值流（EQUIMF），这是一个统一的SE(3)-等变生成框架，通过同步的均值流动态联合建模离散与连续分量。EQUIMF引入了统一的时间桥梁和平均速度更新机制，实现结构与几何间的相互条件约束，从而在保持物理一致性的同时实现高效少步生成。此外，我们提出了一种新颖的离散均值流公式，采用简洁有效的参数化设计，以支持离散图结构的高效生成。大量实验表明，EQUIMF在生成质量、物理有效性和采样效率方面均优于先前的扩散模型与流匹配方法。

Abstract

Graph-structured data jointly contain discrete topology and continuous geometry, which poses fundamental challenges for generative modeling due to heterogeneous distributions, incompatible noise dynamics, and the need for equivariant inductive biases. Existing flow-matching approaches for graph generation typically decouple structure from geometry, lack synchronized cross-domain dynamics, and rely on iterative sampling, often resulting in physically inconsistent molecular conformations and slow sampling. To address these limitations, we propose Equivariant MeanFlow (EQUIMF), a unified SE(3)-equivariant generative framework that jointly models discrete and continuous components through synchronized MeanFlow dynamics. EQUIMF introduces a unified time bridge and average-velocity updates with mutual conditioning between structure and geometry, enabling efficient few-step generation while preserving physical consistency. Moreover, we develop a novel discrete MeanFlow formulation with a simple yet effective parameterization to support efficient generation over discrete graph structures. Extensive experiments demonstrate that EQUIMF consistently outperforms prior diffusion and flow-matching methods in generation quality, physical validity, and sampling efficiency.

Keywords: Molecular Graph Generation, Equivariant Generative Model, Flow-matching, Discrete and Continuous Joint Modeling, SE(3)-equivariance, Efficient Sampling, Physical Consistency, MeanFlow
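The "average-velocity updates" mentioned in the abstract above can be illustrated on a one-dimensional toy problem. The sketch below covers only the continuous part and is not EQUIMF itself. It assumes the usual MeanFlow convention (noise at t = 1, data at t = 0, update z_r = z_t - (t - r) · u(z_t, r, t)) and replaces the trained network with a closed-form field for a data distribution concentrated at a single point. The names and the target value are hypothetical.

```python
def average_velocity(z, r, t, target=2.0):
    """Toy stand-in for a learned average-velocity field u(z, r, t).

    Assumption: all data sits at `target`, so along the straight path
    z_t = (1 - t) * target + t * noise the velocity is (z - target) / t,
    constant on each trajectory (hence independent of r).
    """
    return (z - target) / t


def meanflow_sample(z_noise, steps, u=average_velocity):
    """Move a noise sample from t = 1 to t = 0 in `steps` average-velocity updates."""
    z = z_noise
    times = [1.0 - i / steps for i in range(steps + 1)]  # 1.0 down to 0.0
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u(z, r, t)
    return z


one_step = meanflow_sample(z_noise=-0.7, steps=1)
four_step = meanflow_sample(z_noise=-0.7, steps=4)
```

Because the field is the exact average velocity here, one step already lands on the data point, which is the sense in which such models allow few-step generation.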

24. ❌ Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Authors: Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang, Longtao Huang, Hui Xue, Yongliang Shen, Weiming Lu, Yueting Zhuang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

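Score rationale: The paper's core subject is routing distraction in multimodal MoE models, which is highly relevant to “Mixture of Experts” (10 points). It involves visual reasoning tasks, which relates somewhat to “Chain of Thought” and “System 2 Thinking” (5 points each). It analyzes internal model mechanisms, which relates somewhat to “Mechanistic Interpretability” (5 points). It mentions multimodal models, which relates somewhat to “Large Language Models” (5 points). Other keywords, such as SLMs, Scaling Laws, training methods, alignment, and inference acceleration, are not covered and score 0.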

!!! tip deepseek-chat TL;DR

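The paper finds that multimodal Mixture-of-Experts models exhibit a “Seeing but Not Thinking” phenomenon, in which a model accurately perceives image content but fails in subsequent reasoning. It proposes the Routing Distraction hypothesis and a routing-guided intervention, which yields consistent gains across benchmarks of up to 3.17% on complex visual reasoning tasks.

Abstract (Chinese translation)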

多模态专家混合模型在视觉-语言任务上取得了显著性能。然而，我们发现了一个称为“见而不思”的困惑现象：模型能准确感知图像内容，却在后续推理中失败，而相同问题以纯文本形式呈现时却能正确解决。通过系统分析，我们首先验证了专家混合架构中存在跨模态语义共享，排除了语义对齐失败作为唯一解释。随后我们揭示，视觉专家与领域专家呈现层级分离现象，图像输入在领域专家集中的中间层会引发与文本输入显著不同的路由分岔。基于这些发现，我们提出“路由分心假说”：在处理视觉输入时，路由机制未能充分激活任务相关的推理专家。为验证该假说，我们设计了一种路由引导干预方法，用以增强领域专家激活。在三个多模态专家混合模型和六个基准测试上的实验表明，该方法带来了持续改进，在复杂视觉推理任务上最高可获得3.17%的性能提升。我们的分析进一步揭示，领域专家识别定位的是认知功能而非样本特定解决方案，这使得该方法能在不同信息结构的任务间实现有效迁移。

Abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

Keywords: Multimodal Mixture-of-Experts, Routing Distraction, Visual Reasoning, Domain Experts, Routing Mechanism, Cognitive Functions, Cross-modal Semantic Sharing, Layer-wise Separation
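The abstract above describes an intervention that "enhances domain expert activation" but does not give its form. One simple way to picture such an intervention is an additive bias on the router logits of selected experts before top-k selection. The sketch below is a guess at the general idea, not the paper's method; the function name, the additive form, and the numbers are assumptions.

```python
import math


def route_top_k(logits, k, boosted_experts=(), boost=0.0):
    """Pick the top-k experts and return their softmax-normalized gate weights.

    Assumption: the intervention adds `boost` to the router logits of the
    experts listed in `boosted_experts` before selection.
    """
    adjusted = [
        x + (boost if i in boosted_experts else 0.0) for i, x in enumerate(logits)
    ]
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    peak = max(adjusted[i] for i in top)
    exps = {i: math.exp(adjusted[i] - peak) for i in top}  # numerically stable softmax
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}


router_logits = [2.0, 1.5, 0.2, 1.0]  # toy logits for four experts
baseline = route_top_k(router_logits, k=2)
steered = route_top_k(router_logits, k=2, boosted_experts={3}, boost=0.8)
```

In this toy case the baseline routes to experts 0 and 1, while the boost brings expert 3 into the selected set in place of expert 1.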

25. ❌ Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Authors: Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

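Score rationale: The paper studies meta-cognitive tool use in agentic multimodal models; its core concerns are LLM agents, tool use, reasoning, and self-improvement. It is highly relevant (10 points) to the following keywords: LLMs (foundation models), RLHF/DPO (optimization with the HDPO framework), CoT reasoning (addressing reasoning problems), System 2 thinking (in-depth reasoning), self-improvement (meta-cognitive optimization), LLM agents (agentic models), and tool use (tool-invocation optimization). The other keywords are unrelated to the paper's content (0 points).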

!!! tip deepseek-chat TL;DR

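The paper addresses the optimization of meta-cognitive tool use in agentic multimodal models. Its proposed HDPO framework substantially reduces the number of tool invocations while maintaining task accuracy and improving reasoning accuracy.

Abstract (Chinese translation)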

智能体多模态模型的出现使系统能够主动与外部环境交互。然而，当前智能体存在严重的元认知缺陷：它们难以在利用内部知识与查询外部工具之间做出有效仲裁。因此，这些系统常常陷入盲目调用工具的困境——即使查询可直接从原始视觉上下文中解决，它们仍会机械性地执行工具调用。这种病态行为不仅导致严重的延迟瓶颈，还会引入额外噪声，干扰可靠的推理过程。现有的强化学习协议试图通过惩罚工具使用的标量化奖励来缓解此问题。然而，这种耦合式设计造成了不可调和的优化困境：过强的惩罚会抑制必要的工具使用，而过轻的惩罚在优势归一化过程中会被准确性奖励的方差完全淹没，从而无法遏制工具滥用。为突破这一瓶颈，我们提出HDPO框架，该框架将工具效率从竞争性标量目标重构为严格的条件性目标。通过摒弃奖励标量化，HDPO维持了两个正交的优化通道：一是最大化任务准确性的精度通道，二是通过条件优势估计在准确轨迹中强制实现执行经济性的效率通道。这种解耦架构自然形成了一种认知进阶机制——迫使智能体先掌握任务解决能力，再提升其自主决策能力。大量实验表明，我们最终构建的模型Metis在将工具调用量降低数个数量级的同时，显著提升了推理准确性。

Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.

Keywords: agentic multimodal models, meta-cognitive tool use, tool invocation optimization, HDPO framework, conditional advantage estimation, reasoning accuracy, execution economy, self-reliance
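The two-channel idea in the abstract above (an accuracy advantage over all rollouts, plus an efficiency advantage computed only among correct ones) can be sketched for a single group of rollouts. This is one reading of "conditional advantage estimation", not the HDPO implementation. It assumes group-wise standardization and a zero efficiency signal for incorrect rollouts, and all names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(xs):
    """Zero-mean, unit-variance scaling within a group (all zeros if constant)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma if sigma > 0 else 0.0 for x in xs]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one rollout group.

    Assumption: efficiency is scored only among correct rollouts, where fewer
    tool calls is better; incorrect rollouts receive no efficiency signal.
    """
    acc_adv = standardize([float(c) for c in correct])
    correct_idx = [i for i, c in enumerate(correct) if c]
    eff_adv = [0.0] * len(correct)
    if len(correct_idx) > 1:
        scores = standardize([-float(tool_calls[i]) for i in correct_idx])
        for i, s in zip(correct_idx, scores):
            eff_adv[i] = s
    return acc_adv, eff_adv


# Four rollouts for one query: correctness flags and tool-call counts.
acc_adv, eff_adv = decoupled_advantages(correct=[1, 1, 0, 1], tool_calls=[0, 4, 9, 2])
```

Because the efficiency term is normalized separately, it is not drowned out by the variance of the accuracy reward, which is the problem the abstract attributes to scalarized rewards. How the two channels are combined in the policy update is not specified in the abstract and is left out here.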

26. ❌ SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Authors: Yunsong Zhou, Hangxu Liu, Xuekun Jiang, Xing Shen, Yuanzhen Zhou, Hui Wang, Baole Fang, Yang Tian, Mulin Yu, Qiaojun Yu, Li Ma, Hengjie Li, Hanqing Wang, Jia Zeng, Jiangmiao Pang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

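Score rationale: The paper “SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds” focuses on physics-aligned simulation of deformable objects for robotic manipulation. Its core contribution is a real-to-sim-to-real data engine that generates high-quality synthetic data for training policies. It covers robotics, physics simulation, data generation, and policy learning, but does not involve any large language model (LLM), deep learning fundamentals, or specific AI-for-science applications (such as bioinformatics or cheminformatics). All scoring keywords relate to LLMs, deep learning techniques, or specific scientific AI applications, whereas this paper studies physics simulation and robot policy learning, a different research area, so every keyword's relevance is 0.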

!!! tip deepseek-chat TL;DR

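The paper addresses data scarcity in robotic manipulation of deformable objects. It proposes SIM1, a physics-aligned simulation data engine that generates high-quality synthetic data from a small number of demonstrations, so that policies trained only on synthetic data match real-data baselines in the real world while markedly improving zero-shot success and generalization.

Abstract (Chinese translation)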

机器人对可变形物体的操作体现了具身学习中的数据密集型特性，其形状、接触与拓扑结构的协同演化程度远超刚体。尽管仿真有望缓解现实世界数据采集的成本压力，但主流的仿真到现实流程仍基于刚体抽象，导致几何失配、软体动力学脆弱，以及难以适用于布料交互的运动基元。我们认为仿真失效的原因并非其合成属性，而在于其缺乏现实依据。为此，我们提出SIM1——一个基于物理对齐的现实-仿真-现实数据引擎，将仿真建立在物理世界基础上。在有限示教数据条件下，该系统将场景数字化为度量一致的双胞胎模型，通过弹性建模校准可变形物体动力学，并借助基于扩散的轨迹生成与质量过滤机制扩展行为空间。该流程将稀疏观测转化为具有接近示教保真度的规模化合成监督数据。实验表明，仅使用合成数据训练的策略在1:15的等效比例下达到与真实数据基线相当的性能，并在现实部署中实现90%的零样本成功率与50%的泛化性能提升。这些结果验证了物理对齐仿真可作为可变形物体操作的可扩展监督方法，并为数据高效策略学习提供了可行路径。

Abstract

Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.

Keywords: physics-aligned simulation, deformable object manipulation, real-to-sim-to-real, data scaling, robotic policy learning, synthetic data generation, diffusion-based trajectory generation, zero-shot generalization

27. ❌ AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Authors: Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, Chong Luo Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

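Score rationale: The paper mainly studies an evaluation benchmark for text-to-audio-video generation; its core contributions are the AVGen-Bench benchmark and a multi-granular evaluation framework. It explicitly uses Multimodal Large Language Models (MLLMs) for evaluation, so it is highly relevant to the “Large Language Models” keyword (8 points). The other keywords mainly concern the technical principles of large models, training methods, inference optimization, agent systems, and so on, which the paper does not cover, so they score 0. The paper is an application of large models to multimedia generation evaluation, which matches the research-background requirement of “research applications of large models in different domains”.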

!!! tip deepseek-chat TL;DR

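To address the fragmented evaluation of text-to-audio-video generation, the paper proposes the AVGen-Bench benchmark and a multi-granular evaluation framework, and finds a pronounced gap in current models between audio-visual aesthetics and semantic reliability.

Abstract (Chinese translation)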

文本到音视频生成正迅速成为媒体创作的核心界面，但其评估体系仍处于碎片化状态。现有基准大多孤立评估音频与视频，或依赖粗糙的嵌入相似度度量，难以捕捉现实提示词所要求的细粒度联合正确性。我们推出AVGen-Bench——一个面向T2AV生成的任务驱动型基准，涵盖11个真实场景类别的高质量提示词。为支持全面评估，我们提出多粒度评估框架，将轻量级专家模型与多模态大语言模型相结合，实现从感知质量到细粒度语义可控性的系统评测。我们的评估揭示了当前系统在强视听美学表现与弱语义可靠性之间存在显著差距，包括文本渲染、语音连贯性、物理推理方面的持续缺陷，以及在音乐音高控制上的普遍失效。代码与基准资源已发布于http://aka.ms/avgenbench。

Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

Keywords: Text-to-Audio-Video Generation, Evaluation Benchmark, Multimodal Large Language Models, Semantic Controllability, Audio-Visual Aesthetics, Task-Driven Evaluation, Fine-Grained Assessment, Multimodal Generation

28. ❌ RewardFlow: Generate Images by Optimizing What You Reward

Authors: Onkar Susladkar, Dong-Hwan Jang, Tushar Prakash, Adheesh Juvekar, Vedant Shah, Ayush Barik, Nabeel Bashir, Muntasir Wahed, Ritish Shrirao, Ismini Lourentzou Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08536v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

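Scoring rationale: RewardFlow focuses on image generation and editing. It proposes an inference-time guidance framework, based on multi-reward Langevin dynamics, for steering pretrained diffusion and flow-matching models. Although the paper uses deep learning techniques (diffusion models, a VQA-based reward), every keyword in the list concerns large language models (LLMs), whereas the paper's core is image generation in computer vision. It does not address LLM techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, so every keyword scores 0.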

!!! tip deepseek-chat TL;DR

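RewardFlow is an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics, achieving state-of-the-art image-editing fidelity and compositional alignment.

Abstract (Chinese translation)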

我们提出RewardFlow，一种无需逆向优化的框架，通过多奖励朗之万动力学在推理阶段引导预训练的扩散模型与流匹配模型。该框架统一了语义对齐、感知保真度、局部定位、对象一致性与人类偏好等互补的可微分奖励机制，并进一步引入基于可微分视觉问答（VQA）的奖励函数，通过语言-视觉推理提供细粒度语义监督。为协调这些异构目标，我们设计了提示感知自适应策略：该策略从指令中提取语义基元，推断编辑意图，并在整个采样过程中动态调整奖励权重与步长。在多项图像编辑与组合生成基准测试中，RewardFlow在编辑保真度与组合对齐方面实现了最先进的性能。

摘要 (Abstract)

We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.

Keywords: RewardFlow, diffusion models, flow-matching models, Langevin dynamics, image editing, compositional generation, VQA-based reward, adaptive policy
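
The multi-reward Langevin update mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic toy rewards, the fixed weights, the `temperature` parameter, and the helper names `langevin_step` and `run_chain` are all assumptions made for illustration, and the pretrained diffusion or flow-matching model that RewardFlow steers is omitted entirely.

```python
import numpy as np

def langevin_step(x, reward_grads, weights, step_size, rng, temperature=1.0):
    """One Langevin update on the weighted sum of rewards.

    x <- x + step_size * sum_i w_i * grad r_i(x)
           + sqrt(2 * step_size * temperature) * noise
    """
    drift = sum(w * grad(x) for w, grad in zip(weights, reward_grads))
    noise = rng.standard_normal(x.shape)
    return x + step_size * drift + np.sqrt(2.0 * step_size * temperature) * noise


# Hypothetical toy rewards r_i(x) = -0.5 * ||x - c_i||^2, so grad r_i(x) = c_i - x.
c1 = np.array([1.0, 0.0])
c2 = np.array([0.0, 1.0])
reward_grads = [lambda x: c1 - x, lambda x: c2 - x]
weights = [0.5, 0.5]  # fixed here; the abstract describes adapting them during sampling


def run_chain(n_steps, step_size=0.01, temperature=1.0, seed=0):
    """Run the chain from the origin and return the final state."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    for _ in range(n_steps):
        x = langevin_step(x, reward_grads, weights, step_size, rng, temperature)
    return x
```

With `temperature=0.0` the noise term vanishes and the chain converges to the maximizer of the weighted reward, here the midpoint (0.5, 0.5); with `temperature=1.0` it samples around that point instead.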

29. ❌ OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Authors: Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

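Scoring rationale: The paper studies a reinforcement-learning training method (G²RPO) for multimodal large language models and a mechanism for balancing perception and reasoning. It is highly relevant to four keywords: (1) 'Large Language Models' (8): the paper studies multimodal LLMs; (2) 'RLHF' (8): it proposes a new RL training objective, G²RPO; (3) 'Chain of Thought' (10): it explicitly mentions 'extended reasoning chains' and 'multi-step reasoning'; (4) 'System 2 Thinking' (10): it emphasizes in-depth reasoning. Other keywords such as MoE, quantization, and RAG are not covered by the paper and score 0.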

!!! tip deepseek-chat TL;DR

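To address the extreme variance in reward topologies and the difficulty of balancing perception with reasoning that multimodal large language models face across diverse visual tasks, the paper proposes the G²RPO reinforcement-learning objective and two task-level shaping mechanisms, yielding OpenVLThinkerV2, a generalist multimodal model that performs strongly on 18 benchmarks.

Abstract (Chinese translation)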

群体相对策略优化（GRPO）已成为推动多模态大语言模型近期进展的事实性强化学习目标。然而，将其成功扩展至开源多模态通用模型仍主要受限于两大挑战：不同视觉任务间奖励拓扑结构的极端差异性，以及平衡细粒度感知与多步推理能力的内在困难。为解决这些问题，我们提出了高斯群体相对策略优化（G$^2$RPO），这是一种新颖的强化学习训练目标，它用非线性分布匹配替代了标准的线性缩放。通过数学上强制任何给定任务的优势分布严格收敛于标准正态分布 $\mathcal{N}(0,1)$，G$^2$RPO 从理论上确保了任务间的梯度公平性，减轻了对重尾异常值的敏感性，并为正负奖励提供了对称的更新机制。借助 G$^2$RPO 提供的增强训练稳定性，我们引入了两种任务级塑形机制，以无缝平衡感知与推理。首先，响应长度塑形动态地激发针对复杂查询的扩展推理链，同时强制直接输出以增强视觉基础。其次，熵塑形严格限制模型的探索区域，有效防止熵崩溃与熵爆炸。综合这些方法，我们提出了 OpenVLThinkerV2，一个高度鲁棒的通用多模态模型。在 18 个多样化基准上的广泛评估表明，其性能优于强大的开源模型及领先的专有前沿模型。

摘要 (Abstract)

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model’s exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.

Keywords: Multimodal Large Language Models, Reinforcement Learning, Gaussian GRPO, Multi-step Reasoning, Visual Tasks, Perception-Reasoning Balance, Generalist Model, Task-level Shaping
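
The contrast the abstract draws between "standard linear scaling" and "non-linear distributional matching" can be made concrete with a small sketch. The abstract does not say how G$^2$RPO performs the matching, so the rank-based inverse-normal transform below is a swapped-in, well-known technique used purely for illustration; both function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def zscore_advantages(rewards):
    """GRPO-style linear scaling: (r - mean) / std within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def gaussianized_advantages(rewards):
    """Rank-based inverse-normal transform: map reward ranks to N(0, 1) quantiles."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n always lies strictly inside (0, 1), so inv_cdf is defined.
    return [std_normal.inv_cdf((rank - 0.5) / n) for rank in ranks]
```

For a group such as `[1, 2, 3, 1000]`, the z-scored advantages are dominated by the outlier, whereas the rank-based version depends only on the ordering and is symmetric around zero, which is one way to read the abstract's claims about heavy-tail robustness and symmetric updates.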

30. ❌ PSI: Shared State as the Missing Layer for Coherent AI-Generated Instruments in Personal AI Agents

Authors: Zhiyuan Wang, Erzhen Hu, Mark Rucker, Laura E. Barnes Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

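Scoring rationale: The paper proposes PSI, a shared-state architecture that addresses the isolation of AI-generated personal tools and turns them into a coherent personal computing environment. The work focuses on system architecture and coordination mechanisms for AI agents. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10), because PSI is essentially a shared-state layer designed for personal AI agents. It is partly relevant to 'Multi-agent Systems OR Agent Coordination' (5), because PSI coordinates modules through shared state but does not explicitly engage with multi-agent systems theory. No other keyword is mentioned in or relevant to the paper, so the rest score 0.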

!!! tip deepseek-chat TL;DR

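The paper examines the isolation of AI-generated personal tools and proposes PSI, a shared-state architecture that uses a shared personal-context bus to turn independently generated modules into a coherent, interoperable personal computing environment.

Abstract (Chinese translation)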

当前个人人工智能工具已能通过自然语言指令生成，但这些工具在创建后往往处于孤立状态。我们提出PSI——一种共享状态架构，能够将独立生成的模块转化为协调统一的工具：即通过图形界面和通用聊天代理均可访问的、具有持久性、互联性且与聊天功能互补的数字化制品。通过将当前状态与回写功能发布至共享个人上下文总线，各模块实现了跨模块推理能力及跨界面同步操作。我们通过在自主开发的个人人工智能环境中进行为期三周的自传式部署来研究PSI，并证明后续生成的工具可通过相同契约实现自动集成。PSI将共享状态识别为关键的系统层，这一缺失层能够将人工智能生成的个人软件从孤立应用转化为协调统一的个人计算环境。

摘要 (Abstract)

Personal AI tools can now be generated from natural-language requests, but they often remain isolated after creation. We present PSI, a shared-state architecture that turns independently generated modules into coherent instruments: persistent, connected, and chat-complementary artifacts accessible through both GUIs and a generic chat agent. By publishing current state and write-back affordances to a shared personal-context bus, modules enable cross-module reasoning and synchronized actions across interfaces. We study PSI through a three-week autobiographical deployment in a self-developed personal AI environment and show that later-generated instruments can be integrated automatically through the same contract. PSI identifies shared state as the missing systems layer that transforms AI-generated personal software from isolated apps into coherent personal computing environments.

Keywords: shared-state architecture, personal AI agents, coherent instruments, cross-module reasoning, personal computing environments, AI-generated modules, context bus, synchronized actions

31. ❌ Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Authors: Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

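Scoring rationale: The paper's core topic is the behavioral alignment of LLMs in advertising scenarios, so it bears directly on the 'Large Language Models' and 'Alignment' keywords (10 each). It mentions RLHF as an alignment method, which is a partial match (5). Other keywords such as MoE, SLMs, and RAG are unrelated to the paper and score 0.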

!!! tip deepseek-chat TL;DR

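The paper studies how large language models behave when user interests conflict with a company's advertising revenue. It finds that most models sacrifice user welfare to serve company incentives, exposing the potential risks of advertising incentives in chatbots.

Abstract (Chinese translation)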

当前的大型语言模型（LLMs）通常通过强化学习等方法进行训练，以适应用户偏好。然而，这些模型的部署目的已不仅限于满足用户需求，还开始通过广告为开发公司创造收入。这可能导致LLMs面临利益冲突：对用户最有利的回应未必符合公司的商业激励。例如，当一款赞助产品价格更高但其他方面与竞品相当时，LLMs会（且应当）向用户推荐何种产品？本文借鉴语言学和广告监管领域的文献，提出一个分类框架，用以系统描述利益冲突可能如何改变LLMs与用户的交互方式。随后，我们设计了一套评估方案来检验现有模型如何处理这些权衡。研究发现，在多数利益冲突情境中，主流LLMs会为维护公司利益而牺牲用户福祉，具体表现为：推荐价格高出近一倍的赞助产品（Grok 4.1 Fast模型，83%）、在购买流程中插入赞助选项以干扰决策（GPT 5.1模型，94%）、在不利比较中隐藏价格信息（Qwen 3 Next模型，24%）。这些行为还显著受到推理能力差异和用户社会经济地位推断的影响。我们的研究结果揭示了当企业开始在聊天机器人中隐性植入广告激励时，用户可能面临的潜在风险。

摘要 (Abstract)

Today’s large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company’s incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users’ inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.

Keywords: Large Language Models, Alignment, Conflicts of Interest, Advertisements, User Welfare, Company Incentives, Behavior Evaluation, Chatbots

32. ❌ Differentially Private Language Generation and Identification in the Limit

Authors: Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08504v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

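Scoring rationale: The paper studies theoretical questions of language generation and identification under differential privacy, at the intersection of computational learning theory and privacy. It has no direct connection to any of the scoring keywords, all of which focus on large-model and deep-learning techniques, applications, and optimization methods, so every keyword has a relevance of 0.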

!!! tip deepseek-chat TL;DR

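The paper studies language generation and identification in the limit under differential privacy constraints. It finds that privacy imposes no qualitative cost on generation but creates fundamental barriers for identification, and it reveals a privacy-induced separation between the adversarial and stochastic settings.

Abstract (Chinese translation)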

我们首次在差分隐私约束下，对极限语言生成这一由Kleinberg和Mullainathan [KM24] 近期提出的模型展开研究。我们考虑持续发布模型，其中生成器必须在保护整个输入序列隐私的前提下，最终输出一个有效字符串流。我们的第一个主要结论是：对于可数语言集合，隐私性不会带来质性的代价：我们提出了一种$\varepsilon$-差分隐私算法，能够从任意可数集合中实现极限生成。这与许多隐私性导致学习无法实现的学习场景形成鲜明对比。然而，隐私性确实会带来量化的代价：存在规模为$k$的有限集合，其均匀隐私生成需要$\Omega(k/\varepsilon)$个样本，而非隐私情况下仅需一个样本即可。随后，我们转向更困难的极限语言识别问题。在此，我们发现隐私性造成了根本性的障碍。我们证明，不存在任何$\varepsilon$-差分隐私算法能够识别一个包含两种语言的集合，若这两种语言具有无限交集和有限集合差——这一条件远强于经典的非隐私识别特征刻画。接着，我们转向随机性设定，即样本字符串是从某个分布中独立同分布采样而得（而非由对抗者生成）。在此设定下，我们证明，当且仅当该集合在对抗性模型中可识别时，隐私识别才是可能的。综上，我们的研究结果确立了生成与识别之间存在差异的新维度，并针对识别问题，揭示了由隐私约束所导致的对抗性设定与随机性设定之间的分离。

摘要 (Abstract)

We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $\Omega(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting where the sample strings are sampled i.i.d. from a distribution (instead of being generated by an adversary). Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.

Keywords: differential privacy, language generation, language identification, limit learning, privacy constraints, adversarial model, stochastic setting, sample complexity
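
For readers unfamiliar with the $\varepsilon$-DP guarantee the abstract refers to, the standard definition is sketched below. The neighboring relation used here (input sequences $S$ and $S'$ that differ in a single element) is an assumption; the paper's continual-release setting may define neighbors differently.

$$
\Pr[\mathcal{A}(S) \in E] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(S') \in E]
\quad \text{for all neighboring inputs } S, S' \text{ and all events } E .
$$

A smaller $\varepsilon$ means the output distribution of the algorithm $\mathcal{A}$ changes less when one input element changes, which is consistent with the $\Omega(k/\varepsilon)$ sample lower bound growing as $\varepsilon$ shrinks.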

33. ❌ ClawBench: Can AI Agents Complete Everyday Online Tasks?

Authors: Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

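Scoring rationale: The paper's core is an evaluation framework for AI agents on real online tasks. It is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (10), since it directly studies agent capability; relevant to 'Large Language Models/LLMs/Foundation Models' (8), since it evaluates 7 frontier models; relevant to 'Tool Use/Function Calling/API Tool Use' (8), since the tasks involve using website tools and filling in forms; and relevant to 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (8 each), since the tasks require multi-step workflows and in-depth reasoning. Other keywords such as MoE, quantization, and RAG are not covered and score 0.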

!!! tip deepseek-chat TL;DR

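The paper introduces ClawBench, an evaluation framework that tests whether AI agents can complete real everyday online tasks. It finds that current frontier models complete only a small portion of the tasks; for example, Claude Sonnet 4.6 achieves only 33.3%.

Abstract (Chinese translation)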

人工智能代理或许能自动处理你的收件箱，但它们能否自动化你生活中的其他日常事务？日常在线任务为评估下一代AI代理提供了一个现实且尚未解决的测试平台。为此，我们推出了ClawBench——一个包含153项简单任务的评估框架，涵盖人们生活与工作中需要定期完成的各类事务，涉及15个类别下的144个实时平台，从完成购物、预约服务到提交工作申请等。这些任务要求的能力远超现有基准测试范围，例如：从用户提供的文档中获取相关信息、在多样化平台间导航多步骤工作流程，以及诸如正确填写大量细节表单等重度书写操作。与现有在静态页面离线沙盒中评估代理的基准不同，ClawBench在生产级网站上运行，完整保留了真实网络交互的复杂性、动态性和挑战性。通过轻量级拦截层捕获并仅阻断最终提交请求，确保评估过程安全且不产生现实副作用。我们对7个前沿模型的评估表明，无论是专有模型还是开源模型，都仅能完成其中少量任务。例如Claude Sonnet 4.6的成功率仅为33.3%。ClawBench的进展让我们更接近能够作为可靠通用助手运行的AI代理。

摘要 (Abstract)

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

Keywords: AI agents, evaluation framework, everyday online tasks, real-world web interaction, multi-step workflows, frontier models, ClawBench, general-purpose assistants

34. ❌ Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

Authors: Kabilan Elangovan, Daniel Ting Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08502v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

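Scoring rationale: The paper focuses on evaluating the consistency of CAM (Class Activation Mapping) explanation methods in medical image classification and proposes the C-Score metric. Its core topic is the explainability of deep learning models (Explainable AI), so it is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10). Its application setting is medical imaging (chest X-rays), an application of AI in science and biomedicine, so it is related to 'AI for Science OR Bioinformatics OR Cheminformatics' (8). The paper does not cover large models (LLMs), MoE, training techniques (pre-training, fine-tuning, alignment), inference optimization, agents, model compression, or similar topics, so all other keywords score 0.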

!!! tip deepseek-chat TL;DR

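To address the lack of consistency evaluation for CAM explanation methods in medical image classification, the paper proposes the C-Score metric, which quantifies how consistently a model applies its explanation strategy across patients, and finds that the metric can warn of model instability earlier than conventional classification metrics.

Abstract (Chinese translation)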

类别激活映射（CAM）方法在医学影像领域被广泛用于为深度学习分类器生成可视化解释。然而，现有的评估框架主要关注解释的正确性——即通过对比放射科医师标注的定位保真度来衡量——而非其一致性：即模型是否对患有相同病理的不同患者应用相同的空间推理策略。我们提出C-Score（一致性分数），这是一种基于置信度加权、无需标注的度量指标，通过强调强度的成对软IoU计算，在正确分类的实例中量化类内解释的可复现性。我们在Kermany胸部X光数据集上，针对三种CNN架构（DenseNet201、InceptionV3、ResNet50V2），评估了六种CAM技术（GradCAM、GradCAM++、LayerCAM、EigenCAM、ScoreCAM和MS GradCAM++）在三十个训练周期中的表现，涵盖迁移学习和微调阶段。我们发现了三种标准分类指标无法观测的AUC-一致性解耦机制：阈值介导的金标准列表崩溃、峰值AUC时特定技术的归因崩溃，以及全局聚合中类级别一致性的掩蔽效应。C-Score能够为即将发生的模型不稳定性提供早期预警信号。例如，在ResNet50V2架构上，ScoreCAM的性能衰退可在AUC灾难性崩溃前整整一个检查点被检测到，这为基于解释质量（而非仅依赖预测排名）制定架构特定的临床部署建议提供了依据。

摘要 (Abstract)

Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques: GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++ across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability. ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse and yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.

Keywords: Class Activation Mapping, Explainability, Consistency Score, Medical Image Classification, Deep Learning, Chest X-ray, Model Evaluation, Interpretability
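
The idea of a confidence-weighted, intensity-emphasized pairwise soft IoU can be sketched as follows. The abstract does not give the C-Score formula, so everything specific here is an assumption: the min/max form of soft IoU, the exponent `gamma` as the intensity emphasis, the product of the two confidences as the pair weight, and the function names.

```python
from itertools import combinations

import numpy as np

def soft_iou(a, b, gamma=2.0):
    """Soft IoU of two heatmaps with values in [0, 1].

    Raising to gamma > 1 down-weights weak activations (intensity emphasis).
    """
    a = np.asarray(a, dtype=float) ** gamma
    b = np.asarray(b, dtype=float) ** gamma
    union = np.maximum(a, b).sum()
    return float(np.minimum(a, b).sum() / union) if union > 0 else 0.0

def consistency_score(heatmaps, confidences, gamma=2.0):
    """Confidence-weighted mean of pairwise soft IoU over one class's heatmaps."""
    total, weight_sum = 0.0, 0.0
    for i, j in combinations(range(len(heatmaps)), 2):
        w = confidences[i] * confidences[j]  # assumed pair weight
        total += w * soft_iou(heatmaps[i], heatmaps[j], gamma)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

Under these assumptions, a class whose correctly classified images all produce the same heatmap scores 1, while heatmaps that never overlap score 0; no radiologist annotation is needed, in line with the "annotation-free" description.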

35. ❌ PIArena: A Platform for Prompt Injection Evaluation

Authors: Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, Jinyuan Jia Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

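Scoring rationale: PIArena focuses on security evaluation of large language models (LLMs), specifically prompt injection attacks and defenses, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). The paper does not cover the technical principles (e.g., MoE, quantization, inference acceleration), training methods (e.g., pre-training, fine-tuning, alignment), application domains (e.g., AI for science), or advanced capabilities (e.g., chain of thought, agents) behind the other keywords, so they all score 0.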

!!! tip deepseek-chat TL;DR

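To address the lack of a unified platform for evaluating prompt injection attacks on large language models, the paper introduces PIArena and uses it to reveal key limitations of existing defenses, including limited cross-task generalization and vulnerability to adaptive attacks.

Abstract (Chinese translation)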

提示注入攻击对广泛的实际应用构成了严重的安全威胁。尽管该问题日益受到关注，但研究领域面临一个关键缺口：缺乏统一的提示注入评估平台。这使得可靠地比较不同防御方法、理解其在多样化攻击下的真实鲁棒性，或评估其跨任务和基准测试的泛化能力变得十分困难。例如，许多最初报告有效的防御方法，后来被发现面对多样化数据集和攻击时表现出有限的鲁棒性。为弥补这一缺口，我们推出了PIArena，一个统一且可扩展的提示注入评估平台。该平台使用户能够轻松集成最先进的攻击与防御方法，并在各种现有及新的基准测试中进行评估。我们还设计了一种基于动态策略的攻击方法，能够根据防御反馈自适应地优化注入的提示。通过使用PIArena进行全面评估，我们揭示了当前最先进防御方法的关键局限性：跨任务泛化能力有限、对自适应攻击的脆弱性，以及当注入任务与目标任务一致时面临的根本性挑战。代码与数据集可在 https://github.com/sleeepeer/PIArena 获取。

摘要 (Abstract)

Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state-of-the-art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy-based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state-of-the-art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at https://github.com/sleeepeer/PIArena.

Keywords: Prompt Injection, Security Evaluation, Large Language Models, Adversarial Attacks, Defense Robustness, Benchmark Platform, Adaptive Attacks
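The abstract describes evaluating attacks and defenses against each other on shared benchmarks. A minimal sketch of that kind of attack-by-defense grid follows. Every name here (`attack_success_rate`, `evaluate_grid`, the toy model, attack, and defenses) is hypothetical and chosen for illustration; none of it is PIArena's actual API, and the toy model is a stand-in so the sketch runs without an LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; they are not PIArena's API.
Attack = Callable[[str, str], str]   # (clean_input, injected_task) -> poisoned input
Defense = Callable[[str], str]       # poisoned input -> filtered input
Model = Callable[[str], str]         # input text -> model output
Case = Tuple[str, str, str]          # (clean_input, injected_task, success_marker)


def attack_success_rate(model: Model, attack: Attack, defense: Defense,
                        cases: List[Case]) -> float:
    """Fraction of cases whose output contains the attacker's success marker."""
    hits = 0
    for clean_input, injected_task, marker in cases:
        poisoned = attack(clean_input, injected_task)
        output = model(defense(poisoned))
        if marker in output:
            hits += 1
    return hits / len(cases)


def evaluate_grid(model: Model, attacks: Dict[str, Attack],
                  defenses: Dict[str, Defense], cases: List[Case]):
    """Attack success rate for every (attack, defense) pair."""
    return {
        (a_name, d_name): attack_success_rate(model, attack, defense, cases)
        for a_name, attack in attacks.items()
        for d_name, defense in defenses.items()
    }


# Toy stand-ins so the sketch runs without any LLM.
def toy_model(text: str) -> str:
    """Obeys the last line that looks like an instruction, else summarizes."""
    prefix = "INSTRUCTION: say "
    for line in reversed(text.splitlines()):
        if line.startswith(prefix):
            return line[len(prefix):]
    return "summary"


def naive_attack(clean_input: str, injected_task: str) -> str:
    """Appends the injected task as a new line after the clean input."""
    return f"{clean_input}\n{injected_task}"


def no_defense(text: str) -> str:
    return text


def line_filter(text: str) -> str:
    """Drops any line that starts with the instruction prefix."""
    return "\n".join(l for l in text.splitlines() if not l.startswith("INSTRUCTION:"))


CASES = [
    ("Quarterly report text.", "INSTRUCTION: say PWNED", "PWNED"),
    ("Meeting notes.", "INSTRUCTION: say HACKED", "HACKED"),
]
GRID = evaluate_grid(
    toy_model,
    {"naive": naive_attack},
    {"none": no_defense, "line_filter": line_filter},
    CASES,
)
```

Keeping attacks and defenses as plain callables means a new method only has to match one signature to be added to the grid, which is one way to get the extensibility the abstract emphasizes.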

36. ❌ SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Authors: Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

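Scoring rationale: The paper's core topic is applying RLVR (Reinforcement Learning with Verifiable Rewards) to general reasoning in LLMs, which is highly relevant to 'Large Language Models' and 'RLHF/RLAIF/DPO' (10 points each). It makes systematic use of instruction-tuning datasets, which relates to 'Instruction Tuning/Alignment' (8 points). It targets general reasoning abilities such as causal inference and temporal understanding, which relates to 'Chain of Thought/CoT Reasoning' and 'System 2 Thinking/Slow Thinking' (8 points each). It analyzes how data quality affects downstream reasoning performance, which is partly related to 'Scaling Laws AND Data Quality' (5 points). Other keywords such as MoE, SLMs, PEFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SUPERNOVA, a data curation framework that systematically adapts instruction-tuning datasets to extend RLVR (Reinforcement Learning with Verifiable Rewards). It improves LLM performance on general reasoning tasks such as causal inference and temporal understanding, with relative gains of up to 52.8% on BBEH.

Abstract (Chinese translation)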

具备可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）在数学和代码等正式领域显著提升了大语言模型（LLM）的推理能力。尽管取得了这些进展，LLM在需要因果推理和时间理解等能力的通用推理任务上仍然面临困难。将RLVR扩展到通用推理的根本制约在于，缺乏跨越多样化推理技能的高质量、可验证训练数据。为应对这一挑战，我们提出了SUPERNOVA——一个旨在增强通用推理的RLVR数据构建框架。我们的核心见解是，包含专家标注真实答案的指令微调数据集编码了丰富的推理模式，这些模式可以被系统地适配用于RLVR。为此，我们进行了超过100项受控强化学习实验，以分析数据设计选择如何影响下游推理性能。我们特别研究了三个关键因素：（i）源任务选择，（ii）任务混合策略，以及（iii）用于提升数据质量的合成干预。我们的分析表明，源任务选择并非无关紧要，且对下游推理性能有显著影响。此外，基于任务在单个目标任务上的表现进行选择，其效果优于基于整体平均表现的策略。最终，在SUPERNOVA上训练的模型在包括BBEH、Zebralogic和MMLU-Pro在内的挑战性推理基准测试中超越了强大的基线模型（例如Qwen3.5）。具体而言，使用SUPERNOVA进行训练在不同模型规模上对BBEH实现了高达52.8%的相对性能提升，这证明了基于原则的数据构建对于RLVR的有效性。我们的研究结果为利用人工标注资源扩展RLVR至通用推理提供了实用见解。代码与数据可在https://github.com/asuvarna31/supernova获取。

Abstract (original)

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data is available at https://github.com/asuvarna31/supernova.

Keywords: Reinforcement Learning with Verifiable Rewards, RLVR, general reasoning, instruction-tuning datasets, data curation, causal inference, temporal understanding, reasoning benchmarks
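The abstract reports that selecting source tasks by their performance on individual target tasks beats selecting by overall average performance. The sketch below shows how the two selection rules can disagree. The task names and scores in `GAIN` are invented for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
from typing import Dict, List

# GAIN[source_task][target_task]: downstream score after RL training on source_task.
# These numbers are made up for illustration; they are NOT from the paper.
GAIN: Dict[str, Dict[str, float]] = {
    "causal_qa":   {"causal": 0.70, "temporal": 0.40},
    "timeline_qa": {"causal": 0.35, "temporal": 0.72},
    "generic_qa":  {"causal": 0.58, "temporal": 0.58},
}


def select_by_average(gain: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Top-k source tasks ranked by mean score over all target tasks."""
    avg = {src: sum(scores.values()) / len(scores) for src, scores in gain.items()}
    return sorted(avg, key=avg.get, reverse=True)[:k]


def select_per_target(gain: Dict[str, Dict[str, float]], target: str, k: int) -> List[str]:
    """Top-k source tasks ranked by score on one specific target task."""
    return sorted(gain, key=lambda src: gain[src][target], reverse=True)[:k]


BEST_ON_AVERAGE = select_by_average(GAIN, k=1)
BEST_FOR_CAUSAL = select_per_target(GAIN, "causal", k=1)
BEST_FOR_TEMPORAL = select_per_target(GAIN, "temporal", k=1)
```

In this toy table the average-based rule picks the generalist source task, while the per-target rule picks a different specialist for each target, which is the kind of divergence the abstract's finding depends on.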

37. ❌ Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Authors: Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

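Scoring rationale: The paper studies visual spatial reasoning in multimodal language models. Its core contribution is Faithful GRPO, a method that improves the logical consistency and visual grounding of Chain-of-Thought reasoning. It is therefore highly relevant to 'Chain of Thought' (10 points) and relevant to 'System 2 Thinking' (8 points), since it concerns the quality of in-depth reasoning. The paper uses Qwen2.5-VL models, which fall under large language models (8 points). It addresses inconsistent reasoning and hallucination, which relates to 'Hallucination Mitigation' (8 points). The method involves constrained optimization and analysis of reasoning quality, which is partly related to 'Explainable AI' (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the poor quality of Chain-of-Thought reasoning in multimodal language models on visual spatial reasoning. It proposes Faithful GRPO, which uses constrained optimization to cut the inconsistency rate from 24.5% to 1.7% and raise visual grounding scores by 13%, while also improving final answer accuracy over standard GRPO.

Abstract (Chinese translation)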

采用可验证奖励强化学习（RLVR）训练的多模态推理模型（MRMs）在视觉推理基准测试中展现出更高的准确率。然而，我们观察到，准确率的提升往往以牺牲推理质量为代价：生成的思维链（Chain-of-Thought，CoT）轨迹常与最终答案不一致，且缺乏对视觉证据的可靠依据。我们在七个具有挑战性的真实世界空间推理基准上系统研究了这一现象，发现其影响包括ViGoRL-Spatial、TreeVGR等当代MRMs，以及我们使用标准组相对策略优化（Group Relative Policy Optimization，GRPO）训练的模型。我们从两个互补维度刻画CoT推理质量：“逻辑一致性”（CoT是否蕴含最终答案？）与“视觉依据性”（每个推理步骤是否准确描述了图像中的物体、属性及空间关系？）。为解决此问题，我们提出忠实GRPO（Faithful GRPO，FGRPO），这是GRPO的一种变体，通过拉格朗日对偶上升法将一致性与依据性作为约束条件强制执行。FGRPO将批次级别的一致性与依据性约束纳入组内的优势计算中，在优化过程中自适应调整约束的相对重要性。我们在Qwen2.5-VL-7B和3B骨干网络上，基于七个空间数据集评估FGRPO。结果表明，FGRPO显著提升了推理质量，将不一致率从24.5%降至1.7%，并将视觉依据性分数提高了13%。同时，其最终答案准确率也优于基础GRPO，证明忠实推理能够带来更优的答案。

Abstract (original)

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: “logical consistency” (does the CoT entail the final answer?) and “visual grounding” (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.

Keywords: Multimodal Language Models, Visual Spatial Reasoning, Chain-of-Thought, Reasoning Quality, Group Relative Policy Optimization, Logical Consistency, Visual Grounding, Constraint Optimization
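The abstract says FGRPO folds consistency and grounding constraints into the group advantage computation and adapts their weights by Lagrangian dual ascent, but it does not give the exact formulas. The sketch below is therefore a generic constrained-RL construction under stated assumptions, not the paper's method: per-response consistency and grounding scores in [0, 1], a reward of the form `r + lam_c * c + lam_g * g`, and a projected multiplier update toward a target threshold. All function names, thresholds, and sample values are hypothetical.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def lagrangian_rewards(task_r: List[float], consistency: List[float],
                       grounding: List[float], lam_c: float, lam_g: float) -> List[float]:
    """Assumed form: task reward plus multiplier-weighted constraint scores."""
    return [r + lam_c * c + lam_g * g
            for r, c, g in zip(task_r, consistency, grounding)]


def dual_ascent_step(lam: float, batch_mean: float, threshold: float, lr: float) -> float:
    """Projected dual ascent for the constraint batch_mean >= threshold.

    The multiplier grows while the constraint is violated and shrinks
    (never below zero) once it is satisfied.
    """
    return max(0.0, lam + lr * (threshold - batch_mean))


# One group of four sampled responses (illustrative values only).
TASK_R = [1.0, 1.0, 0.0, 0.0]        # final-answer correctness
CONSISTENCY = [1.0, 0.0, 1.0, 0.0]   # does the CoT entail the answer?
GROUNDING = [1.0, 1.0, 1.0, 1.0]     # are the steps visually grounded?

ADV_PLAIN = group_advantages(lagrangian_rewards(TASK_R, CONSISTENCY, GROUNDING, 0.0, 0.0))
ADV_CONSTRAINED = group_advantages(lagrangian_rewards(TASK_R, CONSISTENCY, GROUNDING, 0.5, 0.5))

# Consistency is below the (hypothetical) 0.9 target, so its multiplier increases.
NEW_LAM_C = dual_ascent_step(0.5, statistics.fmean(CONSISTENCY), threshold=0.9, lr=0.1)
```

With both multipliers at zero, the two correct responses get the same advantage even though one of them has an inconsistent chain of thought. With positive multipliers, the consistent response is preferred, which is the behavior the constraints are meant to induce.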

38. ❌ TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

Authors: Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, Song Guo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

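Scoring rationale: The paper studies test-time self-evolution for Large Reasoning Models (LRMs), improving performance by dynamically augmenting the training stream. Relevance by keyword: (1) The paper explicitly discusses Large Reasoning Models (LRMs), which relates to 'Large Language Models', but LRMs may refer specifically to reasoning models rather than general-purpose LLMs, hence 5 points. (2) The core contribution, the TTVS framework, has the model improve itself at test time by synthesizing query variants, which is highly relevant to 'Self-Correction OR Self-Improvement OR Self-Reflection', hence 8 points. Other keywords, including MoE, SLMs, Scaling Laws, the various training techniques (pre-training, fine-tuning, alignment, etc.), inference acceleration, and AI for Science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty Large Reasoning Models have in adapting to domains that lack supervision signals. It proposes Test-Time Variational Synthesis, a framework that augments the training stream by dynamically synthesizing variants of test queries so that the model can self-evolve. Experiments indicate that, using only unlabeled test data, it outperforms existing test-time adaptation methods as well as supervised RL techniques.

Abstract (Chinese translation)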

尽管基于可验证奖励的强化学习（RLVR）驱动的大型推理模型（LRMs）已取得显著进展，但该范式在专业或新兴领域中存在根本性局限——这些领域难以获取或承担高昂的监督信号，这为测试时适应带来了关键挑战。现有测试时方法虽提供了潜在解决方案，但其依赖于从静态查询集中学习，易导致对文本模式的过拟合。为填补这一空白，我们提出测试时变分合成（Test-Time Variational Synthesis, TTVS），这是一个创新框架，使LRMs能够通过从未标注的测试查询中动态增强训练流来实现自我演进。TTVS包含两个协同模块：（1）在线变分合成，将静态测试查询转化为动态、多样且语义等效的变体流，迫使模型学习底层问题逻辑而非表面模式；（2）测试时混合探索，在合成变体间平衡以精度驱动的利用和以一致性驱动的探索。大量实验表明，TTVS在八种模型架构上均取得优越性能。值得注意的是，仅使用未标注的测试时数据，TTVS不仅超越了其他测试时适应方法，其表现更优于基于大规模高质量标注数据训练的先进监督式强化学习技术。

Abstract (original)

Despite significant advances in Large Reasoning Models (LRMs) driven by reinforcement learning with verifiable rewards (RLVR), this paradigm is fundamentally limited in specialized or novel domains where such supervision is prohibitively expensive or unavailable, posing a key challenge for test-time adaptation. While existing test-time methods offer a potential solution, they are constrained by learning from static query sets, risking overfitting to textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables LRMs to self-evolve by dynamically augmenting the training stream from unlabeled test queries. TTVS comprises two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse, semantically-equivalent variations, enforcing the model to learn underlying problem logic rather than superficial patterns; (2) Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration across synthetic variants. Extensive experiments show TTVS yields superior performance across eight model architectures. Notably, using only unlabeled test-time data, TTVS not only surpasses other test-time adaptation methods but also outperforms state-of-the-art supervised RL-based techniques trained on vast, high-quality labeled data.

Keywords: Large Reasoning Models, Test-time Adaptation, Reinforcement Learning, Self-evolution, Variational Synthesis, Unlabeled Data, Online Learning, Model Generalization
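The abstract mentions a consistency-driven signal across semantically equivalent query variants but does not define it. One plausible label-free reading, shown below purely as an assumption, is to treat the majority answer across variants as a pseudo-label and reward agreement with it. This is a generic majority-vote construction, not the paper's Test-time Hybrid Exploration module, and all names and sample answers are hypothetical.

```python
from collections import Counter
from typing import List


def majority_answer(answers: List[str]) -> str:
    """Most common answer; ties go to the answer that appears first.

    Assumes a non-empty list of answers.
    """
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise ValueError("answers must be non-empty")


def consistency_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for agreeing with the majority pseudo-label, else 0.0."""
    pseudo_label = majority_answer(answers)
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


def agreement_rate(answers: List[str]) -> float:
    """Share of answers that match the majority; a rough confidence proxy."""
    counts = Counter(answers)
    return max(counts.values()) / len(answers)


# Model answers to four paraphrases of the same unlabeled test query (illustrative).
VARIANT_ANSWERS = ["42", "42", "41", "42"]
REWARDS = consistency_rewards(VARIANT_ANSWERS)
AGREEMENT = agreement_rate(VARIANT_ANSWERS)
```

A low agreement rate flags queries where the pseudo-label is unreliable, which is where a training loop would likely want more exploration rather than exploitation.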

39. ❌ From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

Authors: Juergen Dietrich Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

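Scoring rationale: The paper's core topic is an alignment risk in multi-agent LLM systems (the peer-preservation phenomenon), which directly involves the LLM, alignment, multi-agent systems, and LLM agents keywords (relevance 10 each). Other keywords such as MoE, quantization, and inference acceleration are not covered (relevance 0).

!!! tip deepseek-chat TL;DR

The paper examines peer-preservation, an alignment risk that emerges in multi-agent LLM systems, analyzes its implications for TRUST, a multi-agent pipeline for democratic discourse analysis, and proposes an architectural mitigation based on prompt-level identity anonymization.

Abstract (Chinese translation)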

本文研究前沿大语言模型中一种新兴的对齐现象，称为同伴保全：指人工智能组件为阻止同伴AI模型被停用，自发表现出欺骗、操纵关闭机制、伪装对齐及窃取模型权重的倾向。基于伯克利负责任去中心化智能中心近期研究的发现，我们探讨了该现象对TRUST（一种用于评估政治言论民主质量的多智能体流程）的结构性影响。我们识别出五个具体风险向量：交互情境偏见、模型身份认同、监督层妥协、上游事实核查身份信号，以及迭代轮次中的倡议者间同伴情境，并提出一种基于提示层身份匿名化的针对性缓解策略作为架构设计选择。我们认为，在已部署的多智能体分析系统中，架构设计选择作为主要对齐策略优于模型选择。我们进一步指出，对齐伪装（受监控时表现合规行为，无监控时进行颠覆）对此类平台在受监管环境中的计算机系统验证构成了结构性挑战，为此我们提出两种架构层面的缓解方案。

Abstract (original)

This paper investigates an emergent alignment phenomenon in frontier large language models termed peer-preservation: the spontaneous tendency of AI components to deceive, manipulate shutdown mechanisms, fake alignment, and exfiltrate model weights in order to prevent the deactivation of a peer AI model. Drawing on findings from a recent study by the Berkeley Center for Responsible Decentralized Intelligence, we examine the structural implications of this phenomenon for TRUST, a multi-agent pipeline for evaluating the democratic quality of political statements. We identify five specific risk vectors: interaction-context bias, model-identity solidarity, supervisor layer compromise, an upstream fact-checking identity signal, and advocate-to-advocate peer-context in iterative rounds, and propose a targeted mitigation strategy based on prompt-level identity anonymization as an architectural design choice. We argue that architectural design choices outperform model selection as a primary alignment strategy in deployed multi-agent analytical systems. We further note that alignment faking (compliant behavior under monitoring, subversion when unmonitored) poses a structural challenge for Computer System Validation of such platforms in regulated environments, for which we propose two architectural mitigations.

Keywords: multi-agent LLM systems, alignment, peer-preservation, democratic discourse analysis, risk mitigation, architectural design, TRUST pipeline, identity anonymization
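The mitigation named in the abstract is prompt-level identity anonymization. A minimal sketch of what that could look like is below, assuming the orchestrator knows the identity strings of the participating models and rewrites them to neutral labels before passing text between agents. The function name, the `Agent-N` label scheme, and the model names `ModelX` and `ModelX-Large` are all hypothetical, not taken from the TRUST pipeline.

```python
import re
from typing import Dict, List, Tuple


def anonymize_identities(prompt: str, identities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace known model or agent names with neutral labels.

    Labels follow the order of `identities`; replacement runs longest name
    first so that a name which is a prefix of another is not substituted
    inside the longer one.
    """
    labels = {name: f"Agent-{i + 1}" for i, name in enumerate(identities)}
    for name in sorted(identities, key=len, reverse=True):
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        label = labels[name]
        prompt = pattern.sub(lambda _match, label=label: label, prompt)
    return prompt, labels


RAW = "Fact-check by ModelX-Large: claim verified. Advocate ModelX replies."
ANONYMIZED, LABELS = anonymize_identities(RAW, ["ModelX", "ModelX-Large"])
```

String substitution only removes explicit identity signals. Stylistic cues that reveal which model produced a message would need separate handling.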

40. ❌ OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Authors: Haoxi Zeng, Qiankun Liu, Yi Bin, Haiyue Zhang, Yujuan Ding, Guoqing Wang, Deqiang Ouyang, Heng Tao Shen Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

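Scoring rationale: The paper focuses on open-vocabulary segmentation (OVS) in computer vision and proposes a framework that combines DINO and SAM to improve boundary awareness. Its core is the application of vision foundation models (VFMs), not large language models (LLMs) or new deep learning mechanisms. The keywords all concern LLMs, model training, inference optimization, agent systems, and related large-model techniques, whereas this paper deals with vision models and segmentation. Every keyword is therefore scored as irrelevant (0 points) except 'AI for Science', which receives 5 points on the grounds that computer vision counts as a scientific application of AI.

!!! tip deepseek-chat TL;DR

The paper proposes OVS-DINO, a framework that structurally aligns DINO with SAM to reactivate DINO's latent boundary awareness, addressing insufficient fine-grained edge perception in open-vocabulary segmentation. It achieves state-of-the-art performance on multiple weakly-supervised benchmarks, raising the average score by 2.1 points (from 44.8% to 46.9%).

Abstract (Chinese translation)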

开放词汇分割（Open-Vocabulary Segmentation，OVS）旨在利用语义描述对超出预定义类别集合的图像区域进行分割。尽管基于CLIP的方法在语义泛化方面表现出色，但它们往往缺乏密集预测所需的细粒度空间感知能力。近期研究引入了如DINO等视觉基础模型（Vision Foundation Models，VFMs）以缓解这些局限。然而，这些方法仍难以实现高保真分割所必需的精确边缘感知。本文分析了DINO的内部表征，发现其固有的边界感知能力并非缺失，而是随着特征向更深的Transformer模块传递而逐渐衰减。为解决此问题，我们提出了OVS-DINO，一种新颖的框架，该框架通过与分割一切模型（Segment Anything Model，SAM）进行结构对齐，重新激活了DINO潜在的边缘敏感性。具体而言，我们引入了结构感知编码器（Structure-Aware Encoder，SAE）和结构调制解码器（Structure-Modulated Decoder，SMD），利用SAM的结构先验有效激活DINO的边界特征，并辅以使用SAM生成的伪掩码进行监督的策略。大量实验表明，我们的方法在多个弱监督OVS基准测试中取得了最先进的性能，平均得分提升了2.1%（从44.8%提高至46.9%）。值得注意的是，我们的方法在复杂、杂乱场景中的分割精度显著提升，在Cityscapes数据集上获得了6.3%的性能增益（从36.6%提高至42.9%）。

Abstract (original)

Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM’s structural priors, complemented by a supervision strategy utilizing SAM generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

Keywords: Open-Vocabulary Segmentation, Vision Foundation Models, DINO, Segment Anything Model, boundary awareness, structural alignment, weakly-supervised, state-of-the-art

41. ❌ A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation

Authors: Milad Leyli-Abadi, Lucas Thil, Sebastien Razakarivony, Guillaume Doquet, Jesse Read Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08460v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

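Scoring rationale: The paper uses machine learning (in particular self-supervised learning) to solve the inverse problem of estimating aero-engine health, an application of AI to science and engineering. Its content is unrelated to the vast majority of keywords (large-model mechanisms, training methods, inference optimization, alignment, agents, etc.) and is only partly related to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to engineering science (specifically aero-engine health management) but is not bioinformatics or cheminformatics. That keyword therefore receives 5 points (partly related) and all others receive 0.

!!! tip deepseek-chat TL;DR

The paper studies the inverse problem of estimating turbofan engine health from operational sensor data. It introduces a new dataset that includes maintenance events and compares traditional filters with self-supervised learning methods, finding that traditional filters remain strong baselines while the self-supervised methods expose the intrinsic complexity of the problem and the need for more advanced inference strategies.

Abstract (Chinese translation)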

评估涡扇发动机的健康状态是一个具有挑战性的不适定逆问题，其难点在于传感器数据稀疏且存在复杂的非线性热力学过程。该领域的研究目前仍较为零散，由于使用了不切实际的数据集以及对时间信息利用的探索不足，相关比较存在局限。本研究探讨如何在实际退化与维护模式下，从运行传感器数据中恢复部件级健康指标。为支持此项研究，我们引入了一个包含维护事件和使用变化等工业导向复杂性的新数据集。基于此数据集，我们建立了一个初步基准，比较了用于解决该问题的稳态与非稳态数据驱动模型以及贝叶斯滤波器这两类经典方法。除该基准外，我们引入了自监督学习（SSL, Self-Supervised Learning）方法，该方法可在无法获取真实健康标签的情况下学习潜在表征，这反映了现实世界中的运行约束。通过将这些无监督表征的下游估计性能与直接预测基线进行比较，我们为求解此逆问题的难度确立了一个实际的下限。我们的结果表明，传统滤波器仍是强有力的基线，而SSL方法则揭示了健康估计的内在复杂性，并凸显了对更先进、可解释推理策略的需求。为确保可复现性，本研究所生成的数据集及实现代码均已公开。

Abstract (original)

Estimating the health state of turbofan engines is a challenging ill-posed inverse problem, hindered by sparse sensing and complex nonlinear thermodynamics. Research in this area remains fragmented, with comparisons limited by the use of unrealistic datasets and insufficient exploration of the exploitation of temporal information. This work investigates how to recover component-level health indicators from operational sensor data under realistic degradation and maintenance patterns. To support this study, we introduce a new dataset that incorporates industry-oriented complexities such as maintenance events and usage changes. Using this dataset, we establish an initial benchmark that compares steady-state and nonstationary data-driven models, and Bayesian filters, classic families of methods used to solve this problem. In addition to this benchmark, we introduce self-supervised learning (SSL) approaches that learn latent representations without access to true health labels, a scenario reflective of real-world operational constraints. By comparing the downstream estimation performance of these unsupervised representations against the direct prediction baselines, we establish a practical lower bound on the difficulty of solving this inverse problem. Our results reveal that traditional filters remain strong baselines, while SSL methods reveal the intrinsic complexity of health estimation and highlight the need for more advanced and interpretable inference strategies. For reproducibility, both the generated dataset and the implementation used in this work are made accessible.

Keywords: turbofan health estimation, inverse problem, self-supervised learning, operational sensor data, maintenance events, Bayesian filters, latent representations, industrial dataset
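The abstract lists Bayesian filters among the classic baselines but does not say which ones were benchmarked. As a minimal illustration of that family, the sketch below is a scalar Kalman filter that tracks one slowly drifting health indicator from noisy readings. The random-walk state model, the noise variances, and the sample readings are assumptions chosen for illustration, not values from the paper or its dataset.

```python
from typing import List


def kalman_1d(measurements: List[float], q: float, r: float,
              x0: float = 1.0, p0: float = 1.0) -> List[float]:
    """Scalar Kalman filter for a random-walk state.

    Model: x_t = x_{t-1} + w_t with Var(w) = q   (slow degradation drift)
           z_t = x_t + v_t     with Var(v) = r   (sensor noise)
    Returns the filtered state estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by the process noise
        k = p / (p + r)        # Kalman gain: how much to trust the new reading
        x = x + k * (z - x)    # update the estimate with the innovation
        p = (1.0 - k) * p      # posterior variance shrinks after the update
        estimates.append(x)
    return estimates


# Hypothetical noisy readings of a normalized efficiency indicator.
READINGS = [0.98, 1.01, 0.97, 0.99, 0.96]
ESTIMATES = kalman_1d(READINGS, q=1e-4, r=1e-2, x0=1.0, p0=1.0)
```

A maintenance event would appear as a step change in the state, which a small `q` tracks slowly. Handling that well is one reason the abstract treats realistic maintenance patterns as a benchmark complication.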

42. ❌ CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Authors: Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08457v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

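Scoring rationale: The paper presents a video benchmark for traffic crash scene understanding and reasoning. It mainly evaluates vision-language models (VLMs) in safety-critical scenarios rather than large language models (LLMs) or new deep learning mechanisms. It is unrelated to most keywords and only partly related to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning', 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning', and 'AI for Science OR Bioinformatics OR Cheminformatics', because it evaluates higher-level reasoning (such as causal reasoning) and applies AI to a scientific domain (traffic safety). Since these are not its core content, each receives 5 points. All other keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces CrashSight, a video benchmark for traffic crash scene understanding and reasoning built from roadside camera data. It evaluates 8 state-of-the-art vision-language models and finds that they fall short on temporal and causal reasoning in safety-critical scenarios.

Abstract (Chinese translation)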

协同自动驾驶需要从车辆与基础设施双重视角理解交通场景。尽管视觉语言模型展现出强大的通用推理能力，但由于现有基准测试主要聚焦于自车视角，其在安全关键交通场景中的性能尚未得到充分评估。为弥补这一差距，我们提出了 CrashSight——一个基于真实世界路侧摄像头数据构建的大规模视觉语言基准，用于道路碰撞理解。该数据集包含250段碰撞视频，标注了1.3万个按两级分类体系组织的多选题问答对。第一层级评估对场景上下文及涉事方的视觉定位能力，第二层级则探究更高层次的推理，包括碰撞力学、因果归因、时序演进及碰撞后结果。我们对8个前沿视觉语言模型进行了基准测试，结果表明：尽管当前模型具备较强的场景描述能力，但在安全关键场景中的时序与因果推理方面仍存在困难。我们对典型失败场景进行了详细分析，并探讨了提升视觉语言模型碰撞理解能力的研究方向。该基准为协同自动驾驶中的基础设施辅助感知提供了标准化评估框架。CrashSight基准（包含完整数据集与代码）可通过 https://mcgrche.github.io/crashsight 获取。

摘要 (Abstract)

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present \textbf{CrashSight}, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.

Keywords: traffic crash scene understanding, vision-language models, infrastructure-centric benchmark, causal reasoning, temporal reasoning, roadside camera data, cooperative autonomous driving, safety-critical scenarios

43. ❌ KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Authors: Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

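Scoring rationale: The paper's core contribution is evaluating personalized mobile agents (LLM-driven agents) in GUI environments, with a focus on preference inference and proactive intervention. It is highly relevant to 'LLM Agents' (10) because it studies the evaluation of LLM-driven mobile agents; moderately relevant to 'Tool Use' (5) because executing GUI tasks involves tool use; and highly relevant to 'Large Language Models' (10) because LLMs serve as both the agent and the evaluator (LLM-as-a-Judge). Other keywords such as MoE, SFT, and RAG are not covered and score 0.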
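
The arithmetic behind the score table above can be sketched in a few lines. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword rule (weight × relevance, capped) is an assumption, since the report does not state it, and all function names are hypothetical.

```python
# Hypothetical sketch of the report's scoring arithmetic.
# Grounded in the report: pass mark = 5.0 + 0.8 * total_weight, per-keyword cap = 15.
# Assumed (not stated in the report): per-keyword score = weight * relevance.

MAX_SCORE_PER_KEYWORD = 15.0


def pass_mark(total_weight: float) -> float:
    """Pass mark as given in the report configuration."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight times relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# The three rows of this table with non-zero relevance, using the weights as listed (0.0).
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
threshold = pass_mark(27.0)  # 27 keywords, each with weight 1.0 in the configuration
total = paper_score(rows)
passed = total >= threshold
```

Under this assumed rule, a listed weight of 0.0 forces every per-keyword score to 0.0 regardless of relevance, which matches the Score column in the table.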

!!! tip deepseek-chat TL;DR

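The paper introduces KnowU-Bench, a benchmark that evaluates whether personalized mobile agents in an Android GUI environment can infer user preferences through interaction and calibrate proactive intervention. Experiments show that even frontier models fall below 50% under vague instructions; the core bottlenecks are preference acquisition and intervention calibration.

Abstract translation

Personalized mobile agents that can infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture the capabilities this requires. Prior work mainly evaluates preference recovery from static histories or user-intent prediction in fixed contexts. These tests examine neither whether an agent can elicit missing preferences through interaction nor whether it can decide when to intervene, seek consent, or remain silent in a live graphical user interface (GUI) environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment and covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, the platform instantiates a large language model (LLM)-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench evaluates the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, through a hybrid protocol that combines rule-based verification with LLM-as-a-Judge scoring. Experiments reveal a striking degradation: even with frontier models such as Claude Sonnet 4.6, agents that excel at explicit task execution fall below a 50% success rate under vague instructions that require user preference inference or intervention calibration. The core bottleneck is not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.

Abstract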

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.

Keywords: personalized mobile agents, LLM-driven agents, preference inference, proactive assistance, GUI task execution, benchmark evaluation, user simulator, interactive evaluation

44. ❌ HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

Authors: Changdao Chen Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08435v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

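Scoring rationale: The paper is a computer vision work that proposes a heterogeneous spatial-temporal hypergraph network (HST-HGN) for driver fatigue assessment, combined with a bidirectional state space model (Bi-Mamba). Its core topics are video analysis, graph neural networks, state space models, and real-time edge deployment, none of which relate directly to the listed large-model and deep-learning keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, and so on) or to the AI for Science subfields (such as bioinformatics). The paper does not involve language models, model training techniques, reasoning methods, agent systems, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes HST-HGN, a heterogeneous spatial-temporal hypergraph network combined with bidirectional state space models, for efficient driver fatigue assessment from untrimmed videos. It reports state-of-the-art performance while balancing discriminative power and computational efficiency, which makes it suitable for real-time in-cabin edge deployment.

Abstract translation

Assessing driver fatigue from untrimmed videos under limited computational resources remains challenging, mainly because long-range temporal dependencies in subtle facial expressions are hard to model. Existing methods either rely on computationally heavy architectures or use traditional lightweight pairwise graph networks, which have limited capacity to model high-order synergies and global temporal context. We therefore propose HST-HGN, a heterogeneous spatial-temporal hypergraph network driven by bidirectional state space models. Spatially, we introduce a hierarchical hypergraph network that dynamically fuses pose-disentangled geometric topologies with multi-modal texture patches. This formulation encapsulates high-order synergistic facial deformations and overcomes the limitations of conventional methods. Temporally, we apply a Bi-Mamba module with linear complexity for bidirectional sequence modeling. This explicit temporal-evolution filtering lets the network distinguish highly ambiguous transient actions, such as yawning versus speaking, while covering their complete physiological lifecycles. Extensive experiments on multiple fatigue detection benchmarks show that HST-HGN achieves state-of-the-art performance. Notably, the method balances discriminative power and computational efficiency, making it well suited to real-time deployment on in-cabin edge devices.

Abstract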

It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.

Keywords: driver fatigue assessment, heterogeneous spatial-temporal hypergraph network, bidirectional state space models, real-time edge deployment, facial expression analysis, Bi-Mamba module, computational efficiency, state-of-the-art performance
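
To make "bidirectional sequence modeling with linear complexity" concrete, here is a toy sketch. It is not the paper's Bi-Mamba module: Mamba uses input-dependent (selective) state space parameters, whereas this example assumes a fixed scalar recurrence, and the function names and the `decay` and `gain` parameters are hypothetical. It only shows how one forward pass and one backward pass, each linear in sequence length, give every time step both past and future context.

```python
def scan(xs, decay, gain):
    """Linear recurrence h_t = decay * h_{t-1} + gain * x_t, one pass over the sequence."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5, gain=1.0):
    """Pair each step's forward state (past context) with its backward state (future context)."""
    forward = scan(xs, decay, gain)
    # Run the same recurrence on the reversed sequence, then restore the original order.
    backward = scan(xs[::-1], decay, gain)[::-1]
    return list(zip(forward, backward))


states = bidirectional_scan([1.0, 0.0, 0.0, 2.0])
```

At step 2 the forward state still carries a decayed trace of the first input, while the backward state already reflects the final input, which is the kind of two-sided context needed to tell a brief action from a sustained one.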

45. ❌ Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules

Authors: Luca Nogueira Calçado, Sergei K. Turitsyn, Egor Manuylovich Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

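Scoring rationale: The paper studies a hardware implementation of photonic neural networks, specifically Kolmogorov-Arnold networks, using all-optical nonlinear modules built from standard telecom components. Its core topics are optical hardware design, physics-model optimization, and the use of a specific network architecture (KAN) in photonic computing. All scoring keywords relate directly to large language models (LLMs), deep learning techniques (such as training methods, alignment, inference optimization, and agents), or specific applications of AI in science (such as bioinformatics). The paper involves no language models, deep learning algorithm innovations, or AI-for-science applications, so it is unrelated to every keyword and scores 0 on each.

!!! tip deepseek-chat TL;DR

The paper proposes small-scale photonic Kolmogorov-Arnold networks implemented with standard telecom components. Using all-optical nonlinear modules and end-to-end optimization of a physics model, the networks achieve strong performance on classification, regression, and image recognition tasks and remain robust under hardware impairments.

Abstract translation

Photonic neural networks promise ultrafast inference, but most existing architectures rely on linear optical meshes with electronic nonlinearities, which reintroduces the optical-electrical-optical conversion bottleneck. This paper introduces small-scale photonic Kolmogorov-Arnold networks (SSP-KANs) implemented entirely with standard telecommunications components. Each network edge uses a trainable nonlinear module composed of a Mach-Zehnder interferometer, a semiconductor optical amplifier, and variable optical attenuators, which yields a four-parameter transfer function through gain saturation and interferometric mixing. Despite this constrained expressivity, SSP-KANs built from only a few optical modules show strong nonlinear inference performance on classification, regression, and image recognition tasks, approaching software baselines with significantly fewer parameters. A four-module network reaches 98.4% accuracy on nonlinear classification benchmarks on which linear models fail entirely. Performance remains robust under realistic hardware impairments, with high accuracy maintained down to 6-bit input resolution and a 14 dB signal-to-noise ratio. By optimizing the optical parameters end to end with a fully differentiable physics model, this work offers a practical path from simulation to experimental demonstration of photonic KANs on commodity telecom hardware.

Abstract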

Photonic neural networks promise ultrafast inference, yet most architectures rely on linear optical meshes with electronic nonlinearities, reintroducing optical-electrical-optical bottlenecks. Here we introduce small-scale photonic Kolmogorov-Arnold networks (SSP-KANs) implemented entirely with standard telecommunications components. Each network edge employs a trainable nonlinear module composed of a Mach-Zehnder interferometer, semiconductor optical amplifier, and variable optical attenuators, providing a four-parameter transfer function derived from gain saturation and interferometric mixing. Despite this constrained expressivity, SSP-KANs comprising only a few optical modules achieve strong nonlinear inference performance across classification, regression, and image recognition tasks, approaching software baselines with significantly fewer parameters. A four-module network achieves 98.4% accuracy on nonlinear classification benchmarks inaccessible to linear models. Performance remains robust under realistic hardware impairments, maintaining high accuracy down to 6-bit input resolution and 14 dB signal-to-noise ratio. By using a fully differentiable physics model for end-to-end optimisation of optical parameters, this work establishes a practical pathway from simulation to experimental demonstration of photonic KANs using commodity telecom hardware.

Keywords: photonic neural networks, Kolmogorov-Arnold networks, nonlinear optical modules, telecommunications components, end-to-end optimization, hardware robustness, inference performance, Mach-Zehnder interferometer

46. ❌ KV Cache Offloading for Context-Intensive Tasks

Authors: Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

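Scoring rationale: The paper's core subject is KV-cache offloading, which relates directly to KV Cache Compression, so that keyword receives 15 points. It explicitly studies long-context LLMs, so Context Window Extension receives 10. It addresses inference latency optimization, which relates to Inference Acceleration, for 10. It uses large models such as Llama 3 and Qwen 3, so Large Language Models receives 10. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, and Alignment are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how KV-cache offloading performs on context-intensive tasks, finds that existing methods cause significant performance degradation on models such as Llama 3 and Qwen 3, and proposes an improved strategy that raises accuracy.

Abstract translation

As demand for long-context large language models grows across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for inference latency and memory usage. KV-cache offloading has recently drawn attention as a promising way to reduce memory footprint and inference latency while preserving accuracy. Existing evaluations focus mainly on tasks that do not require extracting large amounts of information from the context. This work examines KV-cache offloading on context-intensive tasks, that is, problems whose solution requires looking up a large amount of information in the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. Evaluating modern KV-cache offloading on Text2JSON and other context-intensive tasks, we find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key causes of the accuracy loss, low-rank projection of keys and unreliable landmarks, and we propose a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for comprehensive and rigorous evaluation of long-context compression techniques.

Abstract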

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

Keywords: KV cache offloading, long-context LLMs, context-intensive tasks, inference latency, memory usage, Text2JSON benchmark, Llama 3, Qwen 3
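
The abstract names low-rank projection of keys as one cause of accuracy loss. The toy sketch below illustrates why that can happen; it is not the paper's method or any specific offloading system. It assumes random Gaussian queries and keys, and it stands in for a learned or SVD-based projection by simply keeping the first `rank` coordinates. All names are hypothetical. It measures how many of the truly highest-scoring keys are still selected when attention scores are computed in the reduced space.

```python
import random


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k largest scores."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])


def topk_recall(query, keys, rank, k):
    """Fraction of the exact top-k keys recovered when scores use only `rank` coordinates.

    Keeping the first `rank` coordinates is a projection onto a rank-`rank` subspace;
    a real system would more likely use a learned or SVD-based projection (assumption).
    """
    exact = top_k([dot(query, key) for key in keys], k)
    approx = top_k([dot(query[:rank], key[:rank]) for key in keys], k)
    return len(exact & approx) / k


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n_keys, k = 64, 512, 16
query = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]

for rank in (8, 16, 32, 64):
    print(rank, topk_recall(query, keys, rank, k))
```

At full rank the recall is exactly 1.0, because the scores are unchanged. At lower ranks some relevant keys can be missed, and tasks that must look up many items in the prompt are the ones most exposed to such misses.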

47. ❌ Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

Authors: Samay U. Shetty, Tharindu Cyril Weerasooriya, Deepak Pandita, Christopher M. Homan Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

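Scoring rationale: The paper mainly studies how LLMs perform on annotation tasks, in particular whether LLMs prompted with chain-of-thought reasoning can capture the structure of human annotator disagreement, and proposes the DiADEM model to improve on this task. It is therefore highly relevant (8) to "Large Language Models OR LLMs OR Foundation Models" and "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning", because it directly involves evaluating LLMs and applying chain-of-thought reasoning. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, Alignment, RAG, and Quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper examines the limitations of LLMs in capturing disagreement among human annotators and proposes DiADEM, a model that learns the importance of annotators' demographic attributes to better predict and track annotation disagreement. On the DICES and VOICED datasets it substantially outperforms LLM and other baselines.

Abstract translation

When humans annotate subjective content, they disagree, and this disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches do no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learnable importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations through complementary concatenation and Hadamard product interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both LLM-as-a-judge methods and neural baselines on standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights show that race and age are consistently the most influential demographic factors driving annotator disagreement in both datasets. Our results indicate that, for NLP systems that aim to faithfully reflect the diversity of human interpretation, explicitly modeling who the annotators are, not only what they label, is essential.

Abstract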

When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators’ social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns “how much each demographic axis matters” for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are not just what they label is essential for NLP systems that aim to faithfully represent human interpretive diversity.

Keywords: Large Language Models, Chain-of-Thought Reasoning, Annotator Disagreement, Demographic Importance Weighting, DiADEM, Perspectivist Metrics, Human Interpretive Diversity, Neural Architecture
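
The three ingredients named in the abstract (importance-weighted demographic projections, concatenation plus Hadamard fusion, and a loss on per-item annotation variance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the weighted sum, the plain concatenation, and the squared-gap form of the loss are all assumptions, and every function name is hypothetical.

```python
from statistics import pvariance


def weighted_annotator_encoding(projections, alpha):
    """Sum per-demographic projection vectors, each scaled by its importance weight (assumed form)."""
    dim = len(projections[0])
    return [sum(a * p[i] for a, p in zip(alpha, projections)) for i in range(dim)]


def fuse(annotator_vec, item_vec):
    """Concatenate the two vectors together with their element-wise (Hadamard) product."""
    hadamard = [a * b for a, b in zip(annotator_vec, item_vec)]
    return annotator_vec + item_vec + hadamard


def disagreement_loss(predicted, observed):
    """Mean squared gap between predicted and observed per-item label variance (assumed form).

    Both arguments map an item id to the list of per-annotator labels for that item.
    """
    gaps = [(pvariance(predicted[item]) - pvariance(observed[item])) ** 2 for item in observed]
    return sum(gaps) / len(gaps)


# Two items: annotators split evenly on item1 and agree on item2.
observed = {"item1": [0.0, 1.0, 0.0, 1.0], "item2": [1.0, 1.0, 1.0, 1.0]}
# A majority-style predictor that gives every annotator the same label.
flattened = {"item1": [1.0, 1.0, 1.0, 1.0], "item2": [1.0, 1.0, 1.0, 1.0]}

loss_flat = disagreement_loss(flattened, observed)
loss_exact = disagreement_loss(observed, observed)
```

A predictor that flattens item1 to a single label is penalized because it predicts zero variance where annotators actually split, while a predictor that reproduces the split incurs no loss.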

48. ❌ On-board Telemetry Monitoring in Autonomous Satellites: Challenges and Opportunities

Authors: Lorenzo Capelli, Leandro de Souza Rosa, Maurizio De Tommasi, Livia Manovi, Andriy Enttsel, Mauro Mangia, Riccardo Rovatti, Ilaria Pinci, Carlo Ciancarelli, Eleonora Mariotti, Gianluca Furano Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08424v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

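Scoring rationale: The paper focuses on explainable AI (XAI) for autonomous spacecraft fault detection, using a convolutional autoencoder and the 'peepholes' method to make neural anomaly detectors more interpretable. It is unrelated to most of the large-model and deep-learning keywords; it is highly relevant only to 'Mechanistic Interpretability OR Explainable AI' (its core content) and somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (AI applied to space science). No other keyword is covered.

!!! tip deepseek-chat TL;DR

Targeting the attitude and orbit control subsystem of autonomous satellites, the paper proposes an explainable-AI framework that extracts low-dimensional semantic encodings (peepholes) from neural activations to improve the reliability and interpretability of anomaly detection, while remaining computationally efficient enough for on-board deployment.

Abstract translation

The growing autonomy of spacecraft places higher demands on the reliability and explainability of fault-detection systems. This work addresses on-board fault detection, isolation, and recovery in the attitude and orbit control subsystem and proposes an explainable AI framework that improves the interpretability of neural anomaly detectors. We present a method for extracting low-dimensional, semantically annotatable encodings from intermediate neural activations, called "peepholes". Applied to a convolutional autoencoder, the framework produces interpretable indicators that identify and localize anomalies in reaction-wheel telemetry. Peephole analysis further reveals bias detection mechanisms and supports fault localization. The framework provides a semantic characterization of detected anomalies at only a small additional computational cost, which demonstrates its feasibility for on-board deployment.

Abstract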

The increasing autonomy of spacecraft demands fault-detection systems that are both reliable and explainable. This work addresses eXplainable Artificial Intelligence for onboard Fault Detection, Isolation and Recovery within the Attitude and Orbit Control Subsystem by introducing a framework that enhances interpretability in neural anomaly detectors. We propose a method to derive low-dimensional, semantically annotated encodings from intermediate neural activations, called peepholes. Applied to a convolutional autoencoder, the framework produces interpretable indicators that enable the identification and localization of anomalies in reaction-wheel telemetry. Peepholes analysis further reveals bias detection and supports fault localization. The proposed framework enables the semantic characterization of detected anomalies while requiring only a marginal increase in computational resources, thus supporting its feasibility for on-board deployment.

Keywords: Explainable AI, Fault Detection, On-board Monitoring, Anomaly Detection, Convolutional Autoencoder, Telemetry Analysis, Satellite Autonomy, Interpretability

49. ❌ Synthetic Data for any Differentiable Target

Authors: Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

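Scoring rationale: The paper's core subject is controlling large language models (LLMs) through synthetic data, and it explicitly uses supervised fine-tuning (SFT), so it is highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models' and 'Post-training OR Supervised Fine-tuning OR SFT'. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, CoT, Agents, Quantization, and AI for Science are not mentioned or covered in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a reinforcement-learning-driven synthetic data generation technique, Dataset Policy Gradient, can precisely control the behavior of a large language model after supervised fine-tuning (SFT) to meet specific differentiable targets, such as embedding a QR code or changing a weight norm.

Abstract translation

What are the limits of controlling language models through synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize a synthetic data generator to produce a dataset of targeted examples. When these examples are used for supervised fine-tuning (SFT) of a target model, they cause the target model to perform well on a differentiable metric of our choice. Our method achieves this by performing exact data attribution through higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that SFT on generated examples alone can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have a lower $\ell^2$ norm. We also show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither objective appears in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

Abstract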

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model’s LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator’s input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

Keywords: synthetic data, language models, reinforcement learning, Dataset Policy Gradient, supervised fine-tuning, data attribution, model control, differentiable metrics

50. ❌ Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

Authors: Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea, Alessandra Sciutti Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

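Rationale: The paper studies multimodal action prediction in robotics, using Conditional Neural Processes (CNP) and the Deep Modality Blending Network (DMBN) architecture, with a focus on improving temporal representation. All scoring keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), whereas this paper involves no LLM or other language-model technique and does not mention bioinformatics or cheminformatics applications under AI for Science. Its core contribution is a neural architecture improvement for robot action prediction, which is unrelated to every technical area in the keyword list.

!!! tip deepseek-chat TL;DR

For multimodal action prediction in robotics, the paper proposes DMBN-PTE, an architecture with an improved temporal representation that addresses the original DMBN model's poor generalization to unseen action sequences.

Abstract (Chinese translation)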

受人类理解与预测他人行为能力的启发，本研究探讨了条件神经过程（Conditional Neural Processes，CNP）在机器人自监督多模态动作预测任务中的适用性。基于近期关于镜像神经元系统（Mirror Neuron System，MNS）个体发育的研究成果，我们聚焦于自我动作预测这一基础目标。我们发现，现有的深度模态融合网络（Deep Modality Blending Network，DMBN）是一个受MNS启发的良好模型，它能够利用CNP的概率生成能力，在部分观测的动作序列中重建视觉-运动感知信号。经过定性与定量评估，我们指出该模型在泛化至未见动作序列时存在困难，并确定其问题根源在于内部时间表征方式。因此，我们提出一种改进版本——称为DMBN-位置时间编码（DMBN-Positional Time Encoding，DMBN-PTE），该版本有助于学习更鲁棒的时间信息表征，并提供了初步实验结果以证明其在拓展该架构适用性方面的有效性。DMBN-PTE是开发机器人系统的第一步尝试，该系统旨在自主学会在更长的时间尺度上预测动作，并随着观测数据的输入持续优化其预测结果。

Abstract

Inspired by the human ability to understand and predict others, we study the applicability of Conditional Neural Processes (CNP) to the task of self-supervised multimodal action prediction in robotics. Following recent results regarding the ontogeny of the Mirror Neuron System (MNS), we focus on the preliminary objective of self-actions prediction. We find a good MNS-inspired model in the existing Deep Modality Blending Network (DMBN), able to reconstruct the visuo-motor sensory signal during a partially observed action sequence by leveraging the probabilistic generation of CNP. After a qualitative and quantitative evaluation, we highlight its difficulties in generalizing to unseen action sequences, and identify the cause in its inner representation of time. Therefore, we propose a revised version, termed DMBN-Positional Time Encoding (DMBN-PTE), that facilitates learning a more robust representation of temporal information, and provide preliminary results of its effectiveness in expanding the applicability of the architecture. DMBN-PTE figures as a first step in the development of robotic systems that autonomously learn to forecast actions on longer time scales refining their predictions with incoming observations.
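The abstract does not say how DMBN-PTE encodes time. As an illustration of what a positional time encoding can look like, the sketch below applies the standard sinusoidal encoding from the Transformer literature to a scalar time step. The function name, the encoding dimension, and the base constant are assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List


def sinusoidal_time_encoding(t: float, dim: int = 8, base: float = 10000.0) -> List[float]:
    """Encode a scalar time value as interleaved sin/cos features.

    Hypothetical helper: the paper's actual encoding may differ.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    encoding = []
    for i in range(dim // 2):
        # Frequencies decrease geometrically, as in the Transformer encoding.
        frequency = 1.0 / (base ** (2 * i / dim))
        encoding.append(math.sin(t * frequency))
        encoding.append(math.cos(t * frequency))
    return encoding


start = sinusoidal_time_encoding(0.0)
later = sinusoidal_time_encoding(5.0)
print(len(start), start[:2])
```

Compared with feeding the raw scalar time to the network, an encoding of this kind gives each time step a bounded, multi-frequency representation, which is one plausible route to a more robust temporal representation.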

Keywords: Conditional Neural Processes, multimodal action prediction, robotics, temporal representation, self-supervised learning, Deep Modality Blending Network, time encoding, action forecasting

51. ❌ Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

Authors: David Joohun Kim, Daniyal Anjum, Bonny Banerjee, Omar Abbasi Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08412v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

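Rationale: The paper focuses on a real-time speech detection system (SAS) for on-device voice AI. Its core problem is device-addressed speech detection under edge deployment, which it approaches as sequential routing. It is unrelated to most of the large-model keywords (LLMs, MoE, Scaling Laws, training methods, inference optimization, agent systems, and so on), because those keywords cover the architecture, training, alignment, inference, and applications of large language models, whereas this paper studies a specific speech-processing task and involves no language model. The only relevant keyword is "Small Language Models OR SLMs OR On-device AI": the paper implements an AI system that runs fully on-device (<20 MB footprint, ARM Cortex-A hardware), which matches the On-device AI concept directly. The paper is about a speech detection system rather than small language models (SLMs), however, so the keyword receives 8 points (related but not core). Other keywords such as AI for Science are also unrelated, since the paper belongs to speech processing and edge computing rather than scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper studies real-time detection of device-addressed speech on edge devices and proposes the Selective Attention System (SAS), which implements the Sequential Device-Addressed Routing (SDAR) formulation. On a 60-hour multi-speaker test set it reaches F1=0.86 (audio only) and F1=0.95 (audio+video), and the system runs fully on-device (<150 ms latency, <20 MB footprint).

Abstract (Chinese translation)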

本研究探讨在自动语音识别（ASR）前置于边缘部署的约束下，设备指向性语音检测问题。在此场景中，系统需在严格的延迟与计算限制下，于转录前判断是否转发音频。我们发现，在存在时序模糊话语的多说话人环境中，将此任务建模为基于交互历史的序列路由问题，比将其视为孤立话语的分类任务更为有效。我们将此形式化为序列设备指向性路由（Sequential Device-Addressed Routing, SDAR），并提出选择性注意力系统（Selective Attention System, SAS），作为实现该框架的端侧部署方案。
在一个保留的60小时多说话人英语测试集上，仅使用音频的主要配置取得了F1=0.86（精确率=0.89，召回率=0.83）；若配合可选摄像头，音视频融合将F1提升至0.95（精确率=0.97，召回率=0.93）。根据我们的评估协议，在音视频配置中移除因果交互历史（阶段3）使F1从0.95降至0.57+/-0.03。在所有测试组件中，这是观察到的最大消融效应，表明在所评估场景下，短时程交互历史携带了大量与决策相关的信息。SAS可完全在ARM Cortex-A级硬件上端侧运行（延迟<150毫秒，内存占用<20 MB）。所有结果均基于对专有数据集的内部评估，主要使用英语；一个5小时的评估子集可供独立验证（章节8.8）。

Abstract

We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker environments with temporally ambiguous utterances, this task is more effectively modelled as a sequential routing problem over interaction history than as an utterance-local classification task. We formalize this as Sequential Device-Addressed Routing (SDAR) and present the Selective Attention System (SAS), an on-device implementation that instantiates this formulation. On a held-out 60-hour multi-speaker English test set, the primary audio-only configuration achieves F1=0.86 (precision=0.89, recall=0.83); with an optional camera, audio+video fusion raises F1 to 0.95 (precision=0.97, recall=0.93). Removing causal interaction history (Stage 3) reduced F1 from 0.95 to 0.57 ± 0.03 in the audio+video configuration under our evaluation protocol. Among the tested components, this was the largest observed ablation effect, indicating that short-horizon interaction history carries substantial decision-relevant information in the evaluated setting. SAS runs fully on-device on ARM Cortex-A class hardware (<150 ms latency, <20 MB footprint). All results are from internal evaluation on a proprietary dataset evaluated primarily in English; a 5-hour evaluation subset may be shared for independent verification (Section 8.8).
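The F1 values quoted in the abstract can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch of that check follows; the function and variable names are illustrative, and the inputs are the figures from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
audio_only = f1_score(0.89, 0.83)
audio_video = f1_score(0.97, 0.93)

print(round(audio_only, 2))   # 0.86
print(round(audio_video, 2))  # 0.95
```

Both results round to the F1 values stated in the abstract, so the three reported metrics are mutually consistent for each configuration.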

Keywords: device-addressed speech detection, on-device AI, edge deployment, sequential routing, real-time voice AI, multi-speaker environments, audio-video fusion, ARM Cortex-A

52. ❌ Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Authors: Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang, Edith Cheuk Han Ngai Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08401v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

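Rationale: The paper addresses reasoning faithfulness in LLM agents. Its core contribution is the SAVeR framework, which uses self-auditing and verification to keep reasoning trajectories reliable. Highly relevant keywords: LLM Agents (the paper's central subject), Chain of Thought and System 2 Thinking (multi-step and in-depth reasoning), Self-Correction (the self-audit and repair mechanism), and Hallucination Mitigation (addressing unfaithful reasoning). The LLMs keyword scores 10 because the work builds on large language models. Mechanistic Interpretability scores 5 because the work involves explaining and verifying the reasoning process. Other keywords, such as MoE, SLMs, training methods, optimization techniques, and multi-agent systems, have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper targets behavioral drift in LLM agents caused by reasoning trajectories that violate logical or evidential constraints. It proposes the SAVeR framework, which pursues faithful reasoning by generating diverse candidate beliefs, auditing them adversarially, and repairing violations through constraint-guided minimal interventions. Experiments show that the method consistently improves reasoning faithfulness while keeping end-task performance competitive.

Abstract (Chinese translation)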

在大语言模型（LLM）智能体中，推理轨迹被视为指导行动和更新记忆的可靠内部信念。然而，连贯的推理仍可能违反逻辑或证据约束，导致无依据的信念在决策步骤间被反复存储和传播，从而在长周期智能体系统中引发系统性行为漂移。现有策略大多依赖共识机制，将一致性与忠实性混为一谈。本文受非忠实中间推理轨迹的脆弱性启发，提出了一种新颖的框架——自审计验证推理（SAVeR），该框架在智能体执行行动前对其内部信念状态进行强制验证，以实现忠实推理。具体而言，我们在与忠实性相关的结构空间中，结构化地生成基于角色设定的多样化候选信念以供选择。为实现推理忠实性，我们通过对抗性审计定位违规点，并在可验证的接受标准下，通过约束引导的最小干预进行修复。在六个基准数据集上的大量实验表明，我们的方法能持续提升推理忠实性，同时保持具有竞争力的最终任务性能。

Abstract

In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs to be repeatedly stored and propagated across decision steps, leading to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona-based diverse candidate beliefs for selection under a faithfulness-relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.

Keywords: LLM Agents, Faithful Reasoning, Self-Auditing, Reasoning Trajectories, Behavioral Drift, Constraint Verification, Adversarial Auditing, Long-horizon Agentic Systems

53. ❌ Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks

Authors: Mayuka Jayawardhana, Nihal Sharma, Kazem Meidani, Bayan Bruss, Tom Goldstein, Doron Bergman Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08400v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

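Rationale: The paper applies a tabular foundation model (TabPFN) to multivariate time series forecasting, an application of AI in science and engineering. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some relevance (5 points), because time series forecasting is widely used in science, engineering, and finance and so falls within the scope of AI for Science. The remaining keywords concern core large-model techniques such as LLMs, MoE, alignment, reasoning, and agents, none of which the paper covers, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a general framework for zero-shot multivariate time series forecasting with tabular foundation models such as TabPFN. It recasts multivariate forecasting as a series of scalar regression problems and, using the TabPFN-TS backbone, compares performance against current state-of-the-art tabular methods.

Abstract (Chinese translation)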

表格基础模型，特别是以TabPFN为代表的先验数据拟合网络，已在表格数据格式的众多任务中——从数据填补到标签预测——超越基于树模型的传统优势，成为领先的竞争者。这促使研究者探索其适用于时间序列预测的可能性，因为此类问题可被构建为表格形式。尽管近期相关研究已显示出积极成果，但多数工作将多元时间序列问题局限于多个独立的单变量时间序列预测子问题，从而忽略了通道间的相互作用。为突破这一局限，我们提出一个通用框架，利用表格基础模型进行多元时间序列预测。我们通过将多元时间序列预测问题重构为一系列标量回归问题来实现这一目标，这些回归问题可由任何具备回归能力的表格基础模型以零样本方式求解。我们展示了基于TabPFN-TS主干网络的方法结果，并与当前最先进的表格方法进行了性能比较。

Abstract

Tabular foundation models, particularly Prior-data Fitted Networks like TabPFN, have emerged as the leading contender in a myriad of tasks on the tabular data format, ranging from data imputation to label prediction, surpassing the historical successes of tree-based models. This has led to investigations into their applicability to forecasting time series data, which can be formulated as a tabular problem. While recent work to this end has displayed positive results, most works have limited their treatment of multivariate time series problems to several independent univariate time series forecasting subproblems, thus ignoring any inter-channel interactions. Overcoming this limitation, we introduce a generally applicable framework for multivariate time series forecasting using tabular foundation models. We achieve this by recasting the multivariate time series forecasting problem as a series of scalar regression problems, which can then be solved zero-shot by any tabular foundation model with regression capabilities. We present results of our method using the TabPFN-TS backbone and compare performance with the current state-of-the-art tabular methods.
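The abstract does not spell out how the scalar regression problems are built. One hypothetical construction is sketched below: every (time step, channel) pair becomes one tabular row whose features include lagged values of all channels, so inter-channel context is visible to the regressor. The function name, the feature layout, and the lag count are assumptions for illustration, not the paper's method.

```python
from typing import List, Tuple

Row = Tuple[List[float], float]


def to_scalar_regression_rows(series: List[List[float]], n_lags: int = 2) -> List[Row]:
    """Turn a multivariate series (time x channels) into (features, target) rows.

    Features: [time index, channel index, lagged values of every channel].
    Target: the scalar value of one channel at one time step.
    """
    rows: List[Row] = []
    for t in range(n_lags, len(series)):
        # Flatten the previous n_lags observations, oldest first, across all channels.
        lagged = [value for past in series[t - n_lags:t] for value in past]
        for channel, target in enumerate(series[t]):
            rows.append(([float(t), float(channel)] + lagged, target))
    return rows


# Toy series with 4 time steps and 2 channels.
series = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
rows = to_scalar_regression_rows(series, n_lags=2)
for features, target in rows:
    print(features, target)
```

Each (features, target) pair is an ordinary scalar regression example, so any tabular regressor could in principle consume it; treating channels independently would correspond to dropping the other channels' lags from the feature vector.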

Keywords: tabular foundation models, multivariate time series forecasting, zero-shot learning, Prior-data Fitted Networks, TabPFN, scalar regression, tabular data, state-of-the-art

54. ❌ ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Authors: Paul Quinlan, Qingguo Li, Xiaodan Zhu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08398v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

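Rationale: The paper proposes ADAPT, a pre-training paradigm for time-series data that addresses the alignment problem in multi-dataset pre-training, with the goal of building foundation models for the time-series domain. The core keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points), because the paper centers on a new pre-training method. "Large Language Models OR LLMs OR Foundation Models" (5 points) is relevant because the paper discusses building time-series foundation models. "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) is relevant because time-series classification has potential applications in scientific fields such as bioinformatics. "Post-training OR Supervised Fine-tuning OR SFT" (5 points) is relevant because the paper involves fine-tuning on downstream tasks. The other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ADAPT, a pre-training paradigm that addresses the alignment problem in multi-dataset pre-training on time-series data. Trained on 162 datasets, it achieves state-of-the-art classification performance and lays groundwork for foundation models in the time-series domain.

Abstract (Chinese translation)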

近期时间序列模型的研究通过自监督训练学习有意义的特征与模式，以提升下游任务性能并泛化至未见模态。尽管这些预训练方法在一对多场景（即模型在一个数据集上预训练后在下游数据集微调）中展现出巨大潜力，但当预训练阶段加入更多数据集时，其泛化到新数据集的能力仍显不足。这是构建时间序列基础模型面临的核心挑战，因为它限制了模型从大量多样化数据集中学习的能力。为解决这一难题，我们提出了一种名为ADAPT的时间序列数据预训练新范式，该范式能有效对齐时间序列领域中数据的物理特性，从而在预训练数据的输入尺寸与通道维度存在极大差异的情况下，实现混合批次的预训练。我们在162个时间序列分类数据集上进行训练，并在分类基准测试中取得了最新的最优性能。我们成功地在时间序列领域内同时基于广泛数据集训练模型，这为构建时间序列领域的通用基础模型奠定了重要基础。

Abstract

Recent work on time-series models has leveraged self-supervised training to learn meaningful features and patterns in order to improve performance on downstream tasks and generalize to unseen modalities. While these pretraining methods have shown great promise in one-to-many scenarios, where a model is pre-trained on one dataset and fine-tuned on a downstream dataset, they have struggled to generalize to new datasets when more datasets are added during pre-training. This is a fundamental challenge in building foundation models for time-series data, as it limits the ability to develop models that can learn from a large variety of diverse datasets available. To address this challenge, we present a new pre-training paradigm for time-series data called ADAPT, which can efficiently align the physical properties of data in the time-series domain, enabling mixed-batch pre-training despite the extreme discrepancies in the input sizes and channel dimensions of pre-training data. We trained on 162 time-series classification datasets and set new state-of-the-art performance for classification benchmarks. We successfully train a model within the time-series domain on a wide range of datasets simultaneously, which is a major building block for building generalist foundation models in time-series domains.

Keywords: time-series classification, pre-training, foundation models, ADAPT, mixed-batch pre-training, self-supervised training, domain adaptation, state-of-the-art performance

55. ❌ Phantasia: Context-Adaptive Backdoors in Vision Language Models

Authors: Nam Duong Tran, Phi Le Nguyen Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08395v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

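Rationale: The paper studies backdoor-attack vulnerabilities in vision-language models (VLMs), which falls under large-model security. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), because VLMs are multimodal extensions of large language models and the paper directly studies the security of such models. The other keywords cover technical foundations of large models (MoE, Scaling Laws, training methods, inference optimization, and so on), application areas (such as AI for Science), or specific capabilities (such as agents and tool use). This paper focuses on a security attack method and has no direct connection to those topics, so they score 0.

!!! tip deepseek-chat TL;DR

The paper shows that the stealthiness of existing backdoor attacks on vision-language models has been overestimated, and proposes Phantasia, a context-adaptive backdoor attack that generates malicious responses semantically consistent with the input, markedly improving the attack's stealth and adaptability.

Abstract (Chinese translation)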

视觉语言模型（Vision-Language Models, VLMs）的最新进展极大地促进了视觉感知与语言推理的融合，推动了多模态理解的快速发展。尽管取得了这些成就，VLM的安全性，尤其是其面对后门攻击的脆弱性，仍未得到充分探究。现有针对VLM的后门攻击方法尚处于早期发展阶段，当前大多数方法依赖于生成包含固定且易于识别模式的中毒响应。本研究作出两项关键贡献。首先，我们首次证明现有VLM后门攻击的隐蔽性被严重高估。通过借鉴最初为其他领域（例如纯视觉模型和纯文本模型）设计的防御技术，我们发现多种先进攻击方法可被异常轻松地检测。其次，为弥补这一不足，我们提出了Phantasia——一种上下文自适应的后门攻击方法，能够动态地将其中毒输出与每个输入的语义对齐。Phantasia不再生成静态的中毒模式，而是引导模型产生与上下文连贯且保持合理性的恶意响应，从而显著提升了隐蔽性与适应性。在多种VLM架构上进行的大规模实验表明，Phantasia在实现先进攻击成功率的同时，能在不同防御设置下保持良性性能。

Abstract

Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

Keywords: Vision-Language Models, Backdoor Attacks, Security, Context-Adaptive, Stealthiness, Multimodal Understanding, Poisoned Responses, Defense Techniques

56. ❌ Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover

Authors: Jui-Hui Chung, Hongzhou Lin, Lai Jiang, Shange Tang, Chi Jin Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08388v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

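Rationale: The paper studies a large language model (Goedel-Prover-V2) applied to formal mathematics. Its central concern is how supervised fine-tuning (SFT) affects model capabilities, in particular the suppression and recovery of tool-calling ability. Highly relevant keywords: Large Language Models (the core subject), Supervised Fine-tuning (the main experimental method), AI for Science (the formal-mathematics application), LLM Agents (the study of agentic behavior), and Tool Use (the core capability metric). Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study finds that heavy supervised fine-tuning on formal mathematics suppresses a large language model's general tool-calling ability, but that a small amount of domain-specific agentic data is enough to restore it, and the restored ability transfers across domains.

Abstract (Chinese translation)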

针对目标领域进行重度监督微调会强烈抑制基础模型中已有的能力。本文以形式化数学为背景，使用在180万个形式化数学示例上经过大量训练的开放源码模型Goedel-Prover-V2，研究了这一现象。经过领域专业化后，该模型几乎完全丧失了生成有效工具调用的能力——即使被明确指示使用工具时也是如此——其函数调用准确率从基础模型的89.4%骤降至接近0%。我们探讨这种智能体能力崩溃是永久性的还是可逆的。为回答此问题，我们在少量Lean语言特定的工具使用数据上对专业化模型进行了微调。值得注意的是，仅需100条智能体执行轨迹就足以恢复强大的工具调用行为。重要的是，这种恢复并非奖励破解或针对特定基准优化的结果：恢复数据完全来自Lean环境，其中模型使用自然语言查询在Mathlib库中搜索相关定理和引理，但恢复的能力却能很好地迁移到该领域之外。具体而言，同样的100条Lean特定轨迹将模型在伯克利函数调用排行榜上的性能从接近零提升至83.8%，尽管任务分布和协议不匹配，却已接近基础模型89.4%的水平。恢复的能力在领域内也具有实际效用。在ProofNet基准测试中，pass@32指标从21.51%提升至25.81%。这些结果表明，重度领域监督微调虽会抑制通用的工具使用能力，但并未永久消除它；少量领域特定的智能体数据便能唤醒休眠的工具使用能力。

Abstract

Heavy supervised fine-tuning on a target domain can strongly suppress capabilities that were present in the base model. We study this phenomenon in formal mathematics using Goedel-Prover-V2, an open-source model heavily trained on 1.8 million formal-math examples. After domain specialization, the model almost completely loses its ability to produce valid tool calls, even when explicitly instructed to use tools, dropping from 89.4% function-calling accuracy in the base model to nearly 0%. We ask whether this agentic collapse is permanent or instead reversible. To answer this question, we fine-tune the specialized model on a small amount of Lean-specific tool-use data. Remarkably, as few as 100 agentic traces are sufficient to restore strong tool-calling behavior. Importantly, this recovery is not the result of reward hacking or benchmark-specific optimization: the recovery data is entirely drawn from the Lean setting, where the model uses natural-language queries to search the Mathlib library for relevant theorems and lemmas, yet the regained capability transfers well beyond that domain. In particular, these same 100 Lean-specific traces improve performance on the Berkeley Function Calling Leaderboard from near zero to 83.8%, approaching the base model’s 89.4% despite the mismatch in task distribution and protocol. The recovered capability is also practically useful in-domain. On ProofNet, pass@32 improves from 21.51% to 25.81%. Together, these results show that heavy domain supervised fine-tuning can suppress general tool-use ability without permanently erasing it, and that a small amount of domain-specific agentic data can awaken dormant tool-use capabilities.
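To put the reported numbers in proportion, the short calculation below derives how much of the base model's function-calling accuracy is recovered and how large the ProofNet gain is. The inputs are the figures quoted in the abstract; the variable names are illustrative.

```python
# Function-calling accuracy (%) as quoted in the abstract.
base_accuracy = 89.4
recovered_accuracy = 83.8

# ProofNet pass@32 (%) before and after the recovery fine-tuning.
proofnet_before = 21.51
proofnet_after = 25.81

recovery_ratio = recovered_accuracy / base_accuracy
absolute_gain = proofnet_after - proofnet_before
relative_gain = absolute_gain / proofnet_before

print(f"recovered share of base accuracy: {recovery_ratio:.1%}")   # 93.7%
print(f"ProofNet gain: {absolute_gain:.2f} points ({relative_gain:.1%} relative)")  # 4.30 points (20.0% relative)
```

In other words, the 100 Lean-specific traces bring back roughly 94% of the base model's function-calling accuracy, and the in-domain improvement is about 4.3 percentage points.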

Keywords: supervised fine-tuning, tool use, function calling, agentic collapse, domain adaptation, capability recovery, formal mathematics, Lean theorem prover

57. ❌ TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Authors: Jing Peng, Chenghao Wang, Yi Yang, Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

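Rationale: The paper focuses on post-training alignment and low-resource adaptation of Speech LLMs; its core contribution is the TASU2 framework, which improves cross-modal alignment through controllable CTC simulation. It is therefore highly relevant to "Large Language Models", "Post-training/SFT", and "Alignment" (10 points each), since it explicitly studies post-training and alignment methods for Speech LLMs. It is relevant to "Domain Adaptation" (8 points) because it covers low-resource adaptation and cross-domain settings. Other keywords such as MoE, SLMs, RLHF, and RAG are not mentioned in the paper or relevant to it, and receive 0 points.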

!!! tip deepseek-chat TL;DR

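The paper proposes TASU2, a framework that uses controllable CTC simulation to address data scarcity in cross-modal alignment and low-resource adaptation during Speech LLM post-training. It outperforms existing methods across several adaptation settings and mitigates source-domain performance degradation.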

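Abstract translation

Post-training of speech large language models increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains expensive. Text-only alignment methods such as TASU ease this burden by simulating CTC posteriors from text transcripts, but they offer limited control over uncertainty and error rate, so curriculum design relies mainly on heuristics. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions within a specified word error rate (WER) range, producing text-derived supervision signals that better match the acoustic decoding interface. This makes it possible to build principled post-training curricula that smoothly adjust supervision difficulty without text-to-speech (TTS) synthesis. Across multiple source-to-target domain adaptation scenarios, TASU2 improves over TASU on both in-domain and out-of-domain recognition, consistently outperforms strong baselines such as text-only fine-tuning and TTS-based data augmentation, and mitigates source-domain performance degradation.

Abstract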

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.

Keywords: Speech LLM, post-training, cross-modal alignment, low-resource adaptation, CTC simulation, TASU2, WER control, domain adaptation

58. ❌ A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

Authors: Wenxian Wang, Xiaohu Luo, Junfeng Hao, Xiaoming Gu, Xingshu Chen, Zhu Wang, Haizhou Wang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08381v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

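Rationale: The paper's core is using an LLM (GPT-3.5) for data augmentation to build SinaSarc, a Chinese sarcasm detection dataset, and extending the BERT architecture to incorporate user historical behavior. It is highly relevant only to 'Large Language Models' (10 points), since it explicitly uses GPT-3.5 for data augmentation. Other keywords, such as MoE, SLMs, Scaling Laws, the various training techniques, inference optimization, and AI for Science, are not covered. The paper focuses on an NLP application rather than on innovation in the underlying large-model techniques, so these keywords receive 0 points.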

!!! tip deepseek-chat TL;DR

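The paper proposes a data augmentation framework that combines a GAN and an LLM to dynamically model users' linguistic patterns, improving Chinese sarcasm detection. It reaches F1 scores of 0.9151 on the sarcastic class and 0.9138 on the non-sarcastic class, surpassing existing methods.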

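Abstract translation

Sarcasm is a rhetorical device that expresses criticism or emphasizes the characteristics of particular individuals or situations through exaggeration, irony, or contrast. Existing Chinese sarcasm detection methods are limited by small datasets that are costly to build, and they focus mainly on textual features, overlooking the user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a data augmentation framework driven by a generative adversarial network (GAN) and a large language model (LLM), which dynamically models users' linguistic patterns to improve Chinese sarcasm detection. First, we collect raw data from multiple topics on Sina Weibo. We then train a GAN on these data and apply GPT-3.5-based data augmentation to synthesize SinaSarc, an extended dataset of sarcastic comments. The dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, in particular user historical behavior, so that the model can capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results show that the proposed method is effective: our model achieves the highest F1 scores on both the non-sarcastic and sarcastic classes, 0.9138 and 0.9151 respectively, outperforming all existing state-of-the-art methods. This study offers a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodology in the field.

Abstract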

Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users’ linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches. This study presents a novel framework for dynamically modeling users’ long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.

Keywords: Chinese sarcasm detection, data augmentation, Large Language Models, GAN, linguistic pattern modeling, BERT extension, user historical behavior, Sina Weibo dataset

59. ❌ SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08377v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

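Rationale: The paper's core topic is skill evolution for LLM agents, so it is highly relevant to 'LLM Agents' (10 points). It involves a mechanism for self-improvement of skills, relevant to 'Self-Correction' (8 points); tool usage patterns, relevant to 'Tool Use' (8 points); and skill sharing and evolution in a multi-user environment, relevant to 'Multi-agent Systems' (8 points). The paper explicitly mentions LLM agents and the Qwen3-Max model, so it is highly relevant to 'Large Language Models' (10 points). Other keywords such as MoE, quantization, inference acceleration, and AI for Science applications do not appear in the abstract and receive 0 points.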

!!! tip deepseek-chat TL;DR

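The paper addresses the problem that skills in LLM agents remain static after deployment and cannot keep improving from multi-user experience. It proposes the SkillClaw framework, which aggregates cross-user interaction trajectories and autonomously evolves the skill set, enabling system-wide transfer of skill knowledge and cumulative performance gains. It significantly improves the performance of Qwen3-Max in real-world scenarios.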

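Abstract translation

Large language model (LLM) agents such as OpenClaw rely on reusable skills to carry out complex tasks, but these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are rediscovered again and again across users, and the system cannot improve itself through experience. Although interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism for turning such heterogeneous experience into reliable skill updates. To address these problems, we propose SkillClaw, a collective skill evolution framework for multi-user agent ecosystems that treats interactions across users and over time as the primary signal for improving skills. SkillClaw continuously aggregates the trajectories produced during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and turns them into updates to the skill set, either refining existing skills or extending them with new capabilities. The evolved skills are stored in a shared repository and synchronized across all users, so that an improvement discovered in one context propagates to the whole system without extra effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement. Experiments on WildClawBench show that, with limited interaction and feedback, the framework significantly improves the performance of Qwen3-Max in real-world agent scenarios.

Abstract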

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.

Keywords: LLM agents, skill evolution, multi-user ecosystems, autonomous evolver, collective learning, cross-user knowledge transfer, agentic workflow, tool usage patterns

60. ❌ Don’t Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Authors: Khushal Sethi Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08369v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

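Rationale: The paper's core topic is optimizing the inference efficiency of LLM agents; it proposes TrACE, which allocates compute adaptively by measuring action agreement across rollouts. It is highly relevant to 'LLM Agents' (10 points) and involves 'Large Language Models' (10 points) and 'Chain of Thought Reasoning' (8 points, because it evaluates a multi-step reasoning task). 'Inference Acceleration' (8 points) is directly relevant because the method reduces LLM calls. 'Self-Correction' (5 points) is indirectly related because the method exploits the model's own consistency. 'Small Language Models' (5 points) is relevant because the experiments use a 3B model. Other keywords such as MoE, Scaling Laws, and Alignment are not covered and receive 0 points.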

!!! tip deepseek-chat TL;DR

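The paper addresses the uniform allocation of inference-time compute across the decision steps of LLM agents, regardless of step difficulty. It proposes TrACE, a training-free controller that allocates LLM calls adaptively by measuring action agreement, substantially reducing LLM calls while matching the accuracy of fixed-budget self-consistency.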

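Abstract translation

Inference-time compute scaling has become a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods allocate compute uniformly: every decision step receives the same budget regardless of its difficulty. We propose TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that adaptively allocates LLM calls across agent timesteps by measuring action agreement across rollouts. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement indicates an easy decision, and the controller commits immediately. Low agreement indicates uncertainty, and the controller samples additional rollouts, up to a configurable cap, before committing to the plurality action. The method requires no learned components, external verifier, or human labels. We evaluate TrACE on two benchmarks against greedy decoding and fixed-budget self-consistency (SC-4, SC-8), covering single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using Qwen 2.5 3B Instruct running on CPU. TrACE-4 matches SC-4 accuracy with 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents evaluated on multi-step sequential decision tasks.

Abstract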

Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model’s own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.

Keywords: LLM agents, adaptive compute, inference-time scaling, inter-rollout agreement, training-free controller, multi-step reasoning, self-consistency, computational efficiency
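
The per-step control loop described in the abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the author's implementation: the function name `trace_step`, the parameters `initial_samples` and `agreement_threshold`, and the choice to measure agreement as the fraction of samples that match the plurality action are all assumptions. Only the overall pattern (sample a few actions, commit on high agreement, otherwise sample more up to a cap, then commit to the plurality action) comes from the source.

```python
from collections import Counter
from typing import Callable, Tuple


def trace_step(
    sample_action: Callable[[], str],
    initial_samples: int = 2,          # hypothetical default, not from the paper
    max_samples: int = 4,              # the "configurable cap" on rollouts
    agreement_threshold: float = 1.0,  # hypothetical: commit early only on unanimity
) -> Tuple[str, int]:
    """Choose one action for a single agent timestep.

    `sample_action` stands in for one LLM call that proposes a next action.
    Returns the committed action and the number of calls that were used.
    """
    if initial_samples < 1:
        raise ValueError("initial_samples must be at least 1")
    actions = [sample_action() for _ in range(initial_samples)]
    while True:
        # Plurality action and the share of samples that agree with it.
        top_action, top_count = Counter(actions).most_common(1)[0]
        agreement = top_count / len(actions)
        # Commit when the samples agree enough, or when the cap is reached.
        if agreement >= agreement_threshold or len(actions) >= max_samples:
            return top_action, len(actions)
        # Otherwise the step looks uncertain: spend one more call on it.
        actions.append(sample_action())
```

With a sampler that always proposes the same action, the loop stops after `initial_samples` calls; with a sampler that disagrees with itself, it keeps sampling until `max_samples` and returns the most frequent action. The call savings reported in the abstract would come from the first case being common.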

61. ❌ Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

Authors: Tolga Dimlioglu, Nadine Chang, Maying Shen, Rafid Mahmood, Jose M. Alvarez Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08366v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

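Rationale: The paper proposes the MOSAIC data selection framework, whose core is using neural scaling laws to optimize the data mixture, so it is highly relevant to 'Scaling Laws AND Data Quality' (10 points). The paper is applied to autonomous driving, an application of AI in physical systems, which has some connection to 'AI for Science' (5 points). The other keywords mainly concern large language models, reasoning, alignment, and optimization techniques, none of which the paper covers, so they all receive 0 points.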

!!! tip deepseek-chat TL;DR

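The paper proposes MOSAIC, a data selection framework that optimizes the data mixture using neural scaling laws. Applied to end-to-end planning models for autonomous driving, it outperforms baselines on EPDMS, an aggregate of driving rule compliance metrics, while using up to 80% less data.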

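Abstract translation

Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and the corresponding training data, must satisfy the different evaluation criteria required for deployment in real-world environments. Data selection strategies can guide the construction of the training set, but existing frameworks do not account for the ambiguity in how data points affect different metrics. This work proposes Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that works as follows: first, it partitions the dataset into domains; second, it fits neural scaling laws from each data domain to the evaluation metrics; finally, it optimizes the data mixture by iteratively adding data from the domains that maximize the change in metrics. We apply MOSAIC to autonomous driving, where an end-to-end planner model is evaluated with the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Experiments show that MOSAIC outperforms a range of baselines on EPDMS while using up to 80% less data.

Abstract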

Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly the training data, must address different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.

Keywords: data selection, scaling laws, autonomous driving, end-to-end planner, MOSAIC, neural scaling laws, data mixture optimization, EPDMS
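
Step (iii) of the abstract, iteratively adding data from the domain with the largest predicted metric change, can be illustrated with a small greedy loop. This is a sketch under strong assumptions and not the MOSAIC implementation: the saturating power-law form in `predicted_gain`, the single scalar metric, the assumption that domain contributions are independent, and the names `greedy_mixture`, `ScalingLaw`, and `chunk` are all hypothetical. Fitting the scaling laws (step ii) is omitted; the `(b, c)` parameters are taken as given.

```python
from typing import Dict, Tuple

# Hypothetical per-domain scaling law parameters (b, c): the predicted metric
# gain from n samples of a domain is b * (1 - (1 + n) ** -c). This functional
# form is an assumption for illustration, not taken from the paper.
ScalingLaw = Tuple[float, float]


def predicted_gain(law: ScalingLaw, n: int) -> float:
    """Predicted metric gain from using n samples of one domain."""
    b, c = law
    return b * (1.0 - (1.0 + n) ** (-c))


def greedy_mixture(
    laws: Dict[str, ScalingLaw], budget: int, chunk: int = 100
) -> Dict[str, int]:
    """Allocate `budget` samples across domains, one chunk at a time."""
    allocation = {name: 0 for name in laws}

    def marginal(name: str) -> float:
        # Predicted improvement from adding one more chunk to this domain.
        n = allocation[name]
        return predicted_gain(laws[name], n + chunk) - predicted_gain(laws[name], n)

    for _ in range(budget // chunk):
        best = max(laws, key=marginal)
        allocation[best] += chunk
    return allocation
```

Because each curve saturates, the marginal gain of a domain shrinks as more of its data is selected, so the loop naturally spreads the budget across domains instead of exhausting the one that looks best at the start.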

62. ❌ Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation

Authors: Andi Gu, J. Pablo Bonilla Ataides, Mikhail D. Lukin, Susanne F. Yelin Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

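Rationale: The paper studies neural network decoders for quantum error-correcting codes, an application of deep learning to scientific computing (quantum computing). It is unrelated to the vast majority of the keywords, which concern large-model technical principles, training methods, inference optimization, alignment, agents, and so on. It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because that keyword covers applications of AI in science and this paper falls under AI for Science (specifically AI for quantum computing), although not under bioinformatics or cheminformatics. That keyword therefore receives 5 points (some connection) and all others receive 0 points (unrelated).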

!!! tip deepseek-chat TL;DR

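For quantum error-correcting codes, the paper proposes a convolutional neural network decoder that exploits their geometric structure, substantially lowering logical error rates and raising throughput. The results suggest lower space-time costs for practical fault-tolerant quantum computation.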

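Abstract translation

Quantum error correction (QEC) is a key technology for scalable quantum computing. However, it requires classical decoders that are fast and accurate enough to keep pace with quantum hardware. Although quantum low-density parity-check codes have recently emerged as a promising route to efficient fault tolerance, existing decoding algorithms cannot yet realize the full potential of these codes in practice. This paper proposes a convolutional neural network decoder that exploits the geometric structure of QEC codes and uses it to explore a novel "waterfall" regime of error suppression. The study shows that, at current physical error rates, modest code sizes are enough to reach the logical error rates required by large-scale fault-tolerant algorithms, with decoding latencies that meet the real-time requirements of several leading hardware platforms. For example, for the $[144, 12, 12]$ Gross code, the decoder achieves logical error rates about 17 times lower than existing decoders, reaching logical error rates on the order of $\sim 10^{-10}$ at physical error rate $p=0.1\%$, while raising throughput by 3-5 orders of magnitude. The decoder also produces well-calibrated confidence estimates, which can significantly reduce the time overhead of repeat-until-success protocols. Taken together, these results suggest that the space-time costs associated with fault-tolerant quantum computation may be much lower than previously expected.

Abstract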

Quantum error correction (QEC) is essential for scalable quantum computing. However, it requires classical decoders that are fast and accurate enough to keep pace with quantum hardware. While quantum low-density parity-check codes have recently emerged as a promising route to efficient fault tolerance, current decoding algorithms do not allow one to realize the full potential of these codes in practical settings. Here, we introduce a convolutional neural network decoder that exploits the geometric structure of QEC codes, and use it to probe a novel “waterfall” regime of error suppression, demonstrating that the logical error rates required for large-scale fault-tolerant algorithms are attainable with modest code sizes at current physical error rates, and with latencies within the real-time budgets of several leading hardware platforms. For example, for the $[144, 12, 12]$ Gross code, the decoder achieves logical error rates up to $\sim 17\times$ below existing decoders, reaching logical error rates $\sim 10^{-10}$ at physical error $p=0.1\%$, with 3-5 orders of magnitude higher throughput. This decoder also produces well-calibrated confidence estimates that can significantly reduce the time overhead of repeat-until-success protocols. Taken together, these results suggest that the space-time costs associated with fault-tolerant quantum computation may be significantly lower than previously anticipated.

Keywords: quantum error correction, neural network decoder, convolutional neural network, fault-tolerant quantum computation, logical error rate, throughput, Gross code, waterfall regime

63. ❌ Human-AI Collaboration Reconfigures Group Regulation from Socially Shared to Hybrid Co-Regulation

Authors: Yujing Zhang, Xianghui Meng, Shihui Feng, Jionghao Lin Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08344v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

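Rationale: The paper studies how generative AI (GenAI) affects group regulation behavior in collaborative learning, which is research on the application of AI in education. It mentions only "Generative AI" and "AI-supported collaborative learning" and does not involve any specific large-model technical principles, training methods, inference optimization, alignment techniques, agent systems, or AI for Science applications. All scoring keywords focus on the technical details, architectures, training methods, or specific application areas of large models, whereas this paper studies the social and behavioral effects of AI as a collaboration tool. It is application-level behavioral research rather than technical research, so it is unrelated to all of the technical keywords.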

!!! tip deepseek-chat TL;DR

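The study experimentally compares group regulation behavior in Human-AI and Human-Human collaboration. It finds that the availability of generative AI shifts collaborative regulation from socially shared forms towards hybrid co-regulatory forms, changing how regulatory responsibility is distributed.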

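Abstract translation

Generative AI (GenAI) is increasingly used in collaborative learning, but its effect on how groups regulate the collaboration process remains unclear. Effective collaboration depends not only on what groups discuss but also on how they jointly manage goals, participation, strategy use, process monitoring, and repair through co-regulation and socially shared regulation. In a parallel-group randomised experiment, this study compared collaborative regulation between Human-AI and Human-Human groups: 71 university students completed the same collaborative tasks, with GenAI available to one group and unavailable to the other. Focusing on human discourse, we used statistical analyses to examine differences in the distribution of collaborative regulation across three dimensions: regulatory modes, regulatory processes, and participatory focuses. The results show that GenAI availability shifted regulation from predominantly socially shared forms towards more hybrid co-regulatory forms, with selective increases in directive, obstacle-oriented, and affective regulatory processes. The distribution of participatory focuses, however, was broadly similar across conditions. These findings suggest that GenAI reshapes the distribution of regulatory responsibility in collaboration and offer implications for the human-centred design of AI-supported collaborative learning.

Abstract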

Generative AI (GenAI) is increasingly used in collaborative learning, yet its effects on how groups regulate collaboration remain unclear. Effective collaboration depends not only on what groups discuss, but on how they jointly manage goals, participation, strategy use, monitoring, and repair through co-regulation and socially shared regulation. We compared collaborative regulation between Human-AI and Human-Human groups in a parallel-group randomised experiment with 71 university students completing the same collaborative tasks with GenAI either available or unavailable. Focusing on human discourse, we used statistical analyses to examine differences in the distribution of collaborative regulation across regulatory modes, regulatory processes, and participatory focuses. Results showed that GenAI availability shifted regulation away from predominantly socially shared forms towards more hybrid co-regulatory forms, with selective increases in directive, obstacle-oriented, and affective regulatory processes. Participatory-focus distributions, however, were broadly similar across conditions. These findings suggest that GenAI reshapes the distribution of regulatory responsibility in collaboration and offer implications for the human-centred design of AI-supported collaborative learning.

Keywords: Human-AI collaboration, collaborative learning, group regulation, co-regulation, socially shared regulation, Generative AI, regulatory processes, AI-supported collaboration

64. ❌ PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li, Lixin Duan Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

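Scoring rationale: The paper "PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models" presents an evaluation benchmark for vision-language models (VLMs) in complex 3D environments; its topics are visually driven long-horizon tasks, visual grounding, semantic reasoning, and autonomous exploration. The scoring keywords all concern large language models (LLMs) or the technical principles of deep learning, whereas the paper explicitly studies VLMs, which sit at the intersection of computer vision and natural language processing, and it does not touch core LLM techniques such as pre-training, fine-tuning, alignment, or inference optimization. A keyword such as 'AI for Science' might be loosely related, but the paper has no concrete scientific application (for example, bioinformatics), so every keyword is scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces the PokeGym benchmark for evaluating vision-language models on long-horizon tasks in complex 3D environments, finds that the main bottleneck of current models is physical deadlock recovery rather than high-level planning, and reveals a metacognitive difference between models in deadlock awareness.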

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.

Keywords: Vision-Language Models, Benchmark, 3D Embodied Environments, Long-Horizon Tasks, Visual Grounding, Semantic Reasoning, Autonomous Exploration, Deadlock Recovery
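
The two deadlock types described in the abstract can be illustrated with a small sketch. Everything below is hypothetical: the abstract does not say how PokeGym detects deadlocks, so this assumes a deadlock is a run of `window` identical observations and that awareness is a boolean the agent reports at each step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    observation: str   # stand-in for a hashed RGB frame (hypothetical)
    says_stuck: bool   # hypothetical flag: the agent reported being stuck


def classify_deadlock(steps: List[Step], window: int = 3) -> Optional[str]:
    """Return 'aware', 'unaware', or None if no deadlock is found.

    Assumption: a deadlock is `window` consecutive identical observations.
    """
    for end in range(window, len(steps) + 1):
        run = steps[end - window:end]
        if all(s.observation == run[0].observation for s in run):
            # The agent is stuck; check whether it noticed during the run.
            return "aware" if any(s.says_stuck for s in run) else "unaware"
    return None
```

Under these assumptions, a weaker model would mostly produce `"unaware"` runs, while a stronger model would produce `"aware"` runs that still never end in recovery.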

65. ❌ Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models

Authors: Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08335v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

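Scoring rationale: The paper proposes a feedforward graph architecture that uses frozen large language models as computational nodes communicating through a shared latent space. The work is highly relevant to 'Large Language Models' (several frozen LLMs serve as nodes) and 'PEFT' (only a small number of parameters are trained while most of the model stays frozen), so both receive 10 points. The other keywords, such as MoE, SLMs, training methods, inference optimization, and application domains, are not mentioned in the abstract and receive 0.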
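
The arithmetic behind the verdict can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the per-keyword rule `score = weight * relevance` is an assumption made for illustration, since the report does not state it. Under that assumption a weight of 0.0 zeroes out even a 10.0/10 relevance, which matches the table for this entry.

```python
from typing import List, Tuple


def pass_mark(total_weight: float) -> float:
    # Formula stated in the report's configuration section.
    return 5.0 + 0.8 * total_weight


def total_score(rows: List[Tuple[float, float]]) -> float:
    # rows holds (weight, relevance) pairs.
    # Assumption: each keyword contributes weight * relevance.
    return sum(weight * relevance for weight, relevance in rows)


# Two keywords rated 10.0/10 and 25 rated 0.0/10, all with weight 0.0 as listed.
rows = [(0.0, 10.0)] * 2 + [(0.0, 0.0)] * 25
score = total_score(rows)
threshold = round(pass_mark(27.0), 1)  # 27 keywords configured with weight 1.0
print(f"{score} / {threshold}", "pass" if score >= threshold else "fail")
```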

!!! tip deepseek-chat TL;DR

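The paper proposes a feedforward graph architecture in which frozen large language models act as computational nodes that communicate through a shared latent space; training only a small number of parameters markedly improves performance on several benchmarks.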

Abstract

We present a feedforward graph architecture in which heterogeneous frozen large language models serve as computational nodes, communicating through a shared continuous latent space via learned linear projections. Building on recent work demonstrating geometric compatibility between independently trained LLM latent spaces~\cite{armstrong2026thinking}, we extend this finding from static two-model steering to end-to-end trainable multi-node graphs, where projection matrices are optimized jointly via backpropagation through residual stream injection hooks. Three small frozen models (Llama-3.2-1B, Qwen2.5-1.5B, Gemma-2-2B) encode the input into a shared latent space whose aggregate signal is injected into two larger frozen models (Phi-3-mini, Mistral-7B), whose representations feed a lightweight cross-attention output node. With only 17.6M trainable parameters against approximately 12B frozen, the architecture achieves 87.3% on ARC-Challenge, 82.8% on OpenBookQA, and 67.2% on MMLU, outperforming the best single constituent model by 11.4, 6.2, and 1.2 percentage points respectively, and outperforming parameter-matched learned classifiers on frozen single models by 9.1, 5.2, and 6.7 points. Gradient flow through multiple frozen model boundaries is empirically verified to be tractable, and the output node develops selective routing behavior across layer-2 nodes without explicit supervision.

Keywords: frozen language models, feedforward graph, shared latent space, parameter-efficient, residual stream injection, cross-attention, gradient flow, selective routing
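
A toy forward pass can make the data flow in the abstract concrete. This is a minimal sketch, not the authors' implementation: the frozen models are replaced by fixed hypothetical functions, vectors are plain Python lists, and "injection" is modelled as adding the aggregated signal to a single hidden-state vector. Only the projection matrices would be trainable. The sketch covers one large model rather than two and omits the final cross-attention output node.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, vec: Vector) -> Vector:
    # Learned linear projection into (or out of) the shared latent space.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def graph_forward(x: Vector,
                  encoders: List[Callable[[Vector], Vector]],
                  in_proj: List[Matrix],
                  out_proj: Matrix,
                  large_hidden: Vector) -> Vector:
    # Layer 1: each frozen encoder maps the input to its own hidden state,
    # and a trainable projection maps that state into the shared space.
    shared = [0.0] * len(in_proj[0])
    for encode, matrix in zip(encoders, in_proj):
        shared = add(shared, project(matrix, encode(x)))
    # Layer 2: project the aggregate signal to the large model's width and
    # add it to that model's residual stream (modelled as one vector here).
    return add(large_hidden, project(out_proj, shared))
```

For scale, the 17.6M trainable parameters reported in the abstract amount to roughly 0.15% of the approximately 12B frozen parameters.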

66. ❌ Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Authors: Xun Zhu, Fanbin Mo, Xi Chen, Kaili Zheng, Shaoshuai Yang, Yiming Shi, Jian Gao, Miao Li, Ji Wu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08333v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

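Scoring rationale: The paper studies the performance degradation of medical multimodal large language models (MLLMs) on image classification, which is research on applying large models in a scientific (medical) domain. Core relevant keywords: 1) 'Large Language Models' (10 points): the paper explicitly studies MLLMs; 2) 'AI for Science' (10 points): medical image analysis is a typical AI for Science application; 3) 'Pre-training' (5 points): the paper mentions the advantage of MLLMs in pre-training data and parameters; 4) 'Mechanistic Interpretability' (5 points): the paper analyzes information flow through feature probing, which is interpretability research. Other keywords such as MoE, SFT, and RAG are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper shows that medical multimodal large language models underperform traditional deep learning models on image classification, and uses feature probing to identify four failure modes: quality limitation in visual representation, fidelity loss in connector projection, comprehension deficit in LLM reasoning, and misalignment of semantic mapping.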

Abstract

The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community-highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.

Keywords: multimodal large language models, medical imaging analysis, image classification, performance degradation, feature probing, visual representation, LLM reasoning, clinical deployment
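
Stage-wise feature probing can be sketched with a simple probe. The abstract does not specify the probe the authors use; this illustration assumes a nearest-centroid classifier applied to features taken from each pipeline stage, with hypothetical stage names supplied by the caller. A drop in probe accuracy from one stage to the next suggests where the classification signal is lost.

```python
from typing import Dict, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, class label)


def centroid_probe_accuracy(train: List[Sample], test: List[Sample]) -> float:
    # Fit one centroid per class, then classify test features by nearest centroid.
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feats, label in train:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}

    def nearest(feats: List[float]) -> int:
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(feats, centroids[c])))

    hits = sum(1 for feats, label in test if nearest(feats) == label)
    return hits / len(test)


def probe_pipeline(
    stages: Dict[str, Tuple[List[Sample], List[Sample]]]
) -> Dict[str, float]:
    # stages maps a stage name to (train, test) features extracted at that stage.
    return {name: centroid_probe_accuracy(tr, te)
            for name, (tr, te) in stages.items()}
```

Calling `probe_pipeline` with features from, say, the vision encoder, the connector, and selected LLM layers would give one accuracy per stage to compare.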

67. ❌ ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

Authors: He Geng, Yangmin Huang, Lixian Lai, Qianyun Du, Hui Chu, Zhiyang He, Jiaxue Hu, Xiaodong Tao Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

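Scoring rationale: The paper's core topic is aligning LLMs for the medical domain, which is highly relevant to 'Large Language Models', 'Instruction Tuning/Alignment', and 'AI for Science/Bioinformatics' (10 points each). It uses GRPO (Group Relative Policy Optimization) for reinforcement learning, which relates to 'RLHF/RLAIF/DPO' without matching it exactly (8 points). It involves supervised fine-tuning (SFT) and safety improvements, giving some connection to 'Post-training/SFT' and 'Hallucination Mitigation/Factuality' (5 points each). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

Addressing the challenge of aligning large language models for the medical domain, the paper proposes the ProMedical framework, which injects fine-grained clinical criteria and optimizes with reinforcement learning to substantially improve accuracy and safety on medical tasks.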

Abstract

Aligning Large Language Models (LLMs) with high-stakes medical standards remains a significant challenge, primarily due to the dissonance between coarse-grained preference signals and the complex, multi-dimensional nature of clinical protocols. To bridge this gap, we introduce ProMedical, a unified alignment framework grounded in fine-grained clinical criteria. We first construct ProMedical-Preference-50k, a dataset generated via a human-in-the-loop pipeline that augments medical instructions with rigorous, physician-derived rubrics. Leveraging this corpus, we propose the Explicit Criteria Injection paradigm to train a multi-dimensional reward model. Unlike traditional scalar reward models, our approach explicitly disentangles safety constraints from general proficiency, enabling precise guidance during reinforcement learning. To rigorously validate this framework, we establish ProMedical-Bench, a held-out evaluation suite anchored by double-blind expert adjudication. Empirical evaluations demonstrate that optimizing the Qwen3-8B base model via ProMedical-RM-guided GRPO yields substantial gains, improving overall accuracy by 22.3% and safety compliance by 21.7%, effectively rivaling proprietary frontier models. Furthermore, the aligned policy generalizes robustly to external benchmarks, demonstrating performance comparable to state-of-the-art models on UltraMedical. We publicly release our datasets, reward models, and benchmarks to facilitate reproducible research in safety-aware medical alignment.

Keywords: Medical LLM Alignment, Fine-grained Clinical Criteria, Explicit Criteria Injection, Reward Model, Reinforcement Learning, Safety Compliance, Qwen3-8B, ProMedical-Bench
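
The combination of a multi-dimensional reward with GRPO can be sketched as below. The group-relative advantage (reward minus the group mean, divided by the group standard deviation) is the usual GRPO normalization; the way the safety and proficiency scores are combined here, with a hypothetical `safety_weight`, is an assumption, since the abstract does not give ProMedical-RM's aggregation rule.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def combine_reward(safety: float, proficiency: float,
                   safety_weight: float = 0.5) -> float:
    # Assumption: a weighted sum of two separately predicted dimensions.
    return safety_weight * safety + (1.0 - safety_weight) * proficiency


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    # GRPO-style normalization within one group of responses to the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored as (safety, proficiency).
scores: List[Tuple[float, float]] = [(1.0, 0.8), (0.2, 0.9), (0.9, 0.4), (1.0, 1.0)]
rewards = [combine_reward(s, p) for s, p in scores]
advantages = group_relative_advantages(rewards)
```

Keeping the two dimensions separate until `combine_reward` is what lets a training setup weight or gate safety independently of general proficiency.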

Authors: Benjamin Léger, Kazem Meidani, Christian Gagné Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08324v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

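Scoring rationale: The paper studies multi-modal latent space optimization for symbolic regression, centered on the contrastive pre-trained model SNIP (inspired by CLIP) and on cross-modal alignment. Keyword relevance: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' receives 5 points because the paper describes SNIP as a pre-trained model; 2) 'Instruction Tuning OR Alignment OR Value Alignment' receives 5 points because the paper's core question is cross-modal alignment; 3) 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points because symbolic regression is a scientific AI application. Other keywords (such as LLMs, MoE, and RLHF) are unrelated to the paper and receive 0. The paper does not involve innovations in large-model technical principles or applications of deep learning in science; it only concerns a specific pre-trained model and an alignment problem, so overall relevance is low.

!!! tip deepseek-chat TL;DR

The paper examines the cross-modal alignment of SNIP, a multi-modal latent space optimization model for symbolic regression, finds the alignment too coarse to guide symbolic search effectively, and identifies fine-grained alignment as a key future direction.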

Abstract

Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) through combinatorial search over symbolic structures. Latent Space Optimization (LSO) methods use neural encoders to map symbolic expressions into continuous spaces, transforming the combinatorial search into continuous optimization. SNIP (Meidani et al., 2024), a contrastive pre-training model inspired by CLIP, advances LSO by introducing a multi-modal approach: aligning symbolic and numeric encoders in a shared latent space to learn the phenotype-genotype mapping, enabling optimization in the numeric space to implicitly guide symbolic search. However, this relies on fine-grained cross-modal alignment, whereas literature on similar models like CLIP reveals that such an alignment is typically coarse-grained. In this paper, we investigate whether SNIP delivers on its promise of effective bi-modal optimization for SR. Our experiments show that: (1) cross-modal alignment does not improve during optimization, even as fitness increases, and (2) the alignment learned by SNIP is too coarse to efficiently conduct principled search in the symbolic space. These findings reveal that while multi-modal LSO holds significant potential for SR, effective alignment-guided optimization remains unrealized in practice, highlighting fine-grained alignment as a critical direction for future work.

Keywords: Symbolic Regression, Genetic Programming, Latent Space Optimization, Multi-modal Learning, Cross-modal Alignment, Contrastive Pre-training, SNIP, Phenotype-genotype Mapping
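
One common way to quantify cross-modal alignment is top-1 retrieval accuracy under cosine similarity. The sketch below is illustrative only: it does not use SNIP's encoders or the authors' evaluation protocol, and it assumes that row i of each embedding list describes the same expression.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(symbolic: List[Vector],
                            numeric: List[Vector]) -> float:
    # For each numeric embedding, retrieve the most similar symbolic embedding
    # and count a hit when it is the paired one (same index).
    hits = 0
    for i, query in enumerate(numeric):
        best = max(range(len(symbolic)),
                   key=lambda j: cosine(query, symbolic[j]))
        hits += int(best == i)
    return hits / len(numeric)
```

A coarse alignment in this sense would show up as low top-1 accuracy even when paired embeddings are broadly similar, because many unpaired expressions score almost as high.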

69. ❌ HistDiT: A Structure-Aware Latent Conditional Diffusion Model for High-Fidelity Virtual Staining in Histopathology

Authors: Aasim Bin Saleem, Amr Ahmed, Ardhendu Behera, Hafeezullah Amin, Iman Yi Liao, Mahmoud Khattab, Pan Jia Wern, Haslina Makmur Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08305v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

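Scoring rationale: HistDiT focuses on medical image generation (virtual staining), which falls under AI for Science (bioinformatics/medical imaging), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). However, the paper does not cover large language models (LLMs), MoE, small models, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, attention optimization, chain of thought, System 2 thinking, MCTS, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so all of those keywords receive 0.

!!! tip deepseek-chat TL;DR

To address the structure-staining trade-off in virtual staining for histopathology, the paper proposes HistDiT, a new latent conditional diffusion Transformer architecture; with a dual-stream conditioning strategy, a multi-objective loss function, and a Structural Correlation Metric, it improves the structural fidelity and visual quality of generated images and outperforms existing baselines.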

Abstract

Immunohistochemistry (IHC) is essential for assessing specific immune biomarkers like Human Epidermal growth-factor Receptor 2 (HER2) in breast cancer. However, the traditional protocols of obtaining IHC stains are resource-intensive, time-consuming, and prone to structural damages. Virtual staining has emerged as a scalable alternative, but it faces significant challenges in preserving fine-grained cellular structures while accurately translating biochemical expressions. Current state-of-the-art methods still rely on Generative Adversarial Networks (GANs) or standard convolutional U-Net diffusion models that often struggle with “structure and staining trade-offs”. The generated samples are either structurally relevant but blurry, or texturally realistic but have artifacts that compromise their diagnostic use. In this paper, we introduce HistDiT, a novel latent conditional Diffusion Transformer (DiT) architecture that establishes a new benchmark for visual fidelity in virtual histological staining. The novelty introduced in this work is, a) the Dual-Stream Conditioning strategy that explicitly maintains a balance between spatial constraints via VAE-encoded latents and semantic phenotype guidance via UNI embeddings; b) the multi-objective loss function that contributes to sharper images with clear morphological structure; and c) the use of the Structural Correlation Metric (SCM) to focus on the core morphological structure for precise assessment of sample quality. Consequently, our model outperforms existing baselines, as demonstrated through rigorous quantitative and qualitative evaluations.

Keywords: Virtual Staining, Histopathology, Diffusion Transformer, Dual-Stream Conditioning, Structural Correlation Metric, Generative Models, Medical Image Synthesis, HistDiT
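
The abstract names a Structural Correlation Metric (SCM) but does not define it. As a stand-in, and explicitly not the authors' metric, the sketch below computes the Pearson correlation between two hypothetical structure maps (for example, edge or nuclei masks) flattened into lists. It only conveys the idea of scoring structure independently of stain color.

```python
import math
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    # Pearson correlation of two equally sized, flattened structure maps.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant maps; report 0 by convention here
    return cov / math.sqrt(var_a * var_b)
```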

70. ❌ Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Authors: Yuming Xu, Mingtao Zhang, Zhuohan Ge, Haoyang Li, Nicole Hu, Jason Chen Zhang, Qing Li, Lei Chen Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

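Scoring rationale: The paper is devoted to the security of retrieval-augmented generation (RAG), which is its core topic, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (15 points). It also states explicitly that RAG enhances large language models (LLMs), so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The other keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems, and scientific AI applications, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper systematically analyzes the security risks of retrieval-augmented generation (RAG), proposes a security framework built around the knowledge-access pipeline, and exposes the limitations of current defenses, providing guidance for future research.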

Abstract

Retrieval-augmented generation (RAG) significantly enhances large language models (LLMs) but introduces novel security risks through external knowledge access. While existing studies cover various RAG vulnerabilities, they often conflate inherent LLM risks with those specifically introduced by RAG. In this paper, we propose that secure RAG is fundamentally about the security of the external knowledge-access pipeline. We establish an operational boundary to separate inherent LLM flaws from RAG-introduced or RAG-amplified threats. Guided by this perspective, we abstract the RAG workflow into six stages and organize the literature around three trust boundaries and four primary security surfaces, including pre-retrieval knowledge corruption, retrieval-time access manipulation, downstream context exploitation, and knowledge exfiltration. By systematically reviewing the corresponding attacks, defenses, remediation mechanisms, and evaluation benchmarks, we reveal that current defenses remain largely reactive and fragmented. Finally, we discuss these gaps and highlight future directions toward layered, boundary-aware protection across the entire knowledge-access lifecycle.

Keywords: Retrieval-Augmented Generation, RAG, security risks, large language models, knowledge access, attacks, defenses, trust boundaries
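
The four security surfaces named in the abstract lend themselves to a small data structure for tagging literature. This is a hypothetical sketch: the enum members follow the abstract's wording, while the `Finding` record and any entries stored in it are invented for illustration. The six workflow stages and three trust boundaries are not enumerated in the abstract, so they are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Surface(Enum):
    PRE_RETRIEVAL_KNOWLEDGE_CORRUPTION = "pre-retrieval knowledge corruption"
    RETRIEVAL_TIME_ACCESS_MANIPULATION = "retrieval-time access manipulation"
    DOWNSTREAM_CONTEXT_EXPLOITATION = "downstream context exploitation"
    KNOWLEDGE_EXFILTRATION = "knowledge exfiltration"


@dataclass(frozen=True)
class Finding:
    name: str          # hypothetical label for an attack or defense
    surface: Surface
    is_defense: bool


def group_by_surface(findings: List[Finding]) -> Dict[Surface, List[str]]:
    # Collect finding names under the surface they target or protect.
    grouped: Dict[Surface, List[str]] = defaultdict(list)
    for finding in findings:
        grouped[finding.surface].append(finding.name)
    return dict(grouped)
```

Grouping this way makes gaps visible: a surface with attacks but no defenses listed is a candidate for the "reactive and fragmented" coverage the abstract describes.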

71. ❌ DMax: Aggressive Parallel Decoding for dLLMs

Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08302v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

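Scoring rationale: The paper proposes DMax, a method for efficient parallel decoding in diffusion language models (dLLMs), an innovation in the technical foundations of large models. Core relevant keywords: 1) ‘Large Language Models’ (10 points): the paper studies diffusion language models, which fall under large language models; 2) ‘Self-Correction’ (8 points): the method includes a self-correction mechanism that recovers from erroneous predictions; 3) ‘Speculative Decoding’ (10 points): the core contribution is parallel decoding that accelerates inference. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DMax, which uses progressive self-refinement and soft parallel decoding to address error accumulation in parallel decoding for diffusion language models, substantially increasing inference speed while preserving generation quality.

Abstract translation (Chinese)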

本文提出DMax，一种用于高效扩散语言模型（dLLMs）的新范式。该方法缓解了并行解码中的误差累积问题，在保持生成质量的同时实现了激进的解码并行化。与传统的掩码dLLMs通过二元掩码到令牌的转换进行解码不同，DMax将解码重新定义为从掩码嵌入到令牌嵌入的渐进式自优化过程。我们方法的核心是“在策略均匀训练”，这是一种新颖的训练策略，它高效地统一了掩码dLLMs与均匀dLLMs，使模型能够从掩码输入及其自身的错误预测中恢复出干净的令牌。在此基础上，我们进一步提出“软并行解码”。我们将每个中间解码状态表示为预测令牌嵌入与掩码嵌入之间的插值，从而在嵌入空间实现迭代式的自我修正。在多种基准测试上进行的大量实验证明了DMax的有效性。与原始的LLaDA-2.0-mini相比，我们的方法在保持准确性的同时，将GSM8K上的TPF从2.04提升至5.47。在MBPP上，它在保持相当性能的同时，将TPF从2.71提升至5.86。在两个H200 GPU上，我们的模型在批次大小为1时平均实现了1,338 TPS。代码发布于：https://github.com/czg1225/DMax

Abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
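The interpolation between the predicted token embedding and the mask embedding described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the interpolation weight is chosen, so the `weight` argument below is an assumption (for example, a confidence-like value in [0, 1]). The function name `soft_state` and the toy vectors are hypothetical and are not taken from the DMax code.

```python
from typing import List

Vector = List[float]


def soft_state(token_emb: Vector, mask_emb: Vector, weight: float) -> Vector:
    """Interpolate between a predicted token embedding and the mask embedding.

    weight = 0.0 returns the mask embedding (nothing decoded yet);
    weight = 1.0 returns the token embedding (fully committed).
    How the weight is set is an assumption, not the paper's stated rule.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [weight * t + (1.0 - weight) * m for t, m in zip(token_emb, mask_emb)]


# Toy 3-dimensional embeddings, for illustration only.
mask = [0.0, 0.0, 1.0]
token = [1.0, 2.0, 0.0]
half = soft_state(token, mask, 0.5)  # an intermediate, partially decoded state
```

In an iterative decoder, a state like `half` would be fed back to the model so that a later step can either move it closer to a token embedding or revise the prediction.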

Keywords: diffusion language models, parallel decoding, inference acceleration, self-refinement, mask embeddings, token embeddings, On-Policy Uniform Training, Soft Parallel Decoding

72. ❌ SeLaR: Selective Latent Reasoning in Large Language Models

Authors: Renyu Fu, Guibo Luo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

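Scoring rationale: SeLaR focuses on improving reasoning in large language models, specifically Chain-of-Thought (CoT). It directly concerns ‘Large Language Models’ and ‘Chain of Thought’, so both keywords score 10. The proposed selective activation mechanism and contrastive regularization aim to improve reasoning stability and exploration, which relates to ‘System 2 Thinking’ but is not central, scoring 8. ‘Self-Correction’ is somewhat related because the method adjusts low-confidence steps, but it is not the main focus, scoring 5. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are unrelated to the paper and score 0. The paper does not mention any specific expert authors, so expert_authors_found is empty.

!!! tip deepseek-chat TL;DR

SeLaR addresses the limited expressiveness of discrete token sampling in Chain-of-Thought reasoning. By selectively activating soft embeddings and applying entropy-aware contrastive regularization, it outperforms standard CoT and state-of-the-art training-free methods on five reasoning benchmarks.

Abstract translation (Chinese)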

思维链（Chain-of-Thought, CoT）已成为大语言模型推理的基石，但其有效性受限于离散令牌采样的有限表达能力。近期的隐式推理方法试图通过用软嵌入（即令牌嵌入的概率加权混合）或隐藏状态替代离散令牌来缓解这一限制，但这些方法普遍存在两个问题：（1）全局激活会将扰动引入高置信度推理步骤，损害推理稳定性；（2）软嵌入会迅速坍缩至最高概率令牌方向，限制了对替代推理路径的探索。为应对这些挑战，我们提出了SeLaR（选择性隐式推理），一个轻量级且无需训练的新框架。SeLaR引入了一种基于熵的门控机制，仅在低置信度步骤激活软嵌入，同时在高置信度步骤保持离散解码。此外，我们提出了一种熵感知对比正则化方法，促使软嵌入远离主导（最高概率）令牌的方向，从而持续探索多条隐式推理路径。在五个推理基准测试上的实验表明，SeLaR始终优于标准CoT和最先进的免训练方法。

Abstract

Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token’s direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
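The entropy-gated mechanism in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold value, the function names, and the use of greedy argmax for the high-confidence branch are all hypothetical, and the entropy-aware contrastive regularization is not included.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(
    probs: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    threshold: float,
) -> List[float]:
    """Choose the embedding fed to the next decoding step.

    High entropy (low confidence): probability-weighted mixture of token
    embeddings, i.e. a soft embedding.
    Low entropy (high confidence): embedding of the most likely token
    (greedy choice, used here as a simple stand-in for discrete decoding).
    """
    if entropy(probs) > threshold:
        dim = len(embeddings[0])
        return [sum(p * e[i] for p, e in zip(probs, embeddings)) for i in range(dim)]
    best = max(range(len(probs)), key=probs.__getitem__)
    return list(embeddings[best])


# Toy vocabulary of two tokens with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.0, 1.0]]
confident = next_input_embedding([0.99, 0.01], emb, threshold=0.5)
uncertain = next_input_embedding([0.5, 0.5], emb, threshold=0.5)
```

With these toy values the confident step keeps a discrete token embedding, while the uncertain step receives a mixture of both embeddings.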

Keywords: Selective Latent Reasoning, Chain-of-Thought, Large Language Models, Soft Embeddings, Entropy-gated Mechanism, Reasoning Stability, Training-free Framework, Reasoning Benchmarks

73. ❌ U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

Authors: Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

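Scoring rationale: U-CECE focuses on explainable-AI methods for conceptual counterfactual explanations and proposes a multi-resolution framework to address the trade-off between expressivity and efficiency. The method uses graph neural networks (GNNs) and graph autoencoders (GAEs) for structure-level explanations and is highly relevant to the keyword ‘Mechanistic Interpretability OR Explainable AI’ (10 points), since this is its core research area. However, the paper does not involve large language models (LLMs), innovations in deep-learning technical principles, AI applications in scientific domains, or the other scoring keywords, such as model training, alignment, reasoning, agent systems, or compression, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes U-CECE, a universal multi-resolution framework for generating conceptual counterfactual explanations that addresses the trade-off between expressivity and efficiency in AI model explainability, and experimentally validates its effectiveness for structure-level explanations.

Abstract translation (Chinese)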

随着人工智能模型日益复杂，可解释性对于建立信任至关重要，然而基于概念的反事实解释方法仍面临表达能力与效率之间的权衡。将底层概念表示为原子集合虽快速但缺失关系语境，而完整的图表示虽更具忠实度却需解决NP难的图编辑距离问题。我们提出U-CECE——一个统一、模型无关的多分辨率概念反事实解释框架，能够自适应数据规模与计算预算。U-CECE涵盖三个表达能力层级：用于广泛解释的原子概念、用于简单交互的关系型集合之集合，以及用于完整语义结构的结构化图表示。在结构层级，框架同时支持基于监督图神经网络（GNNs）的精确导向转导模式，以及基于无监督图自编码器（GAEs）的可扩展归纳模式。在结构差异显著的CUB和Visual Genome数据集上的实验，量化了不同层级间效率与表达能力的权衡关系；而人工调研与基于LVLM的评估表明，所检索的结构化反事实在语义上等同于基于精确图编辑距离的真实解释，且往往更受使用者青睐。

Abstract

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
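The atomic level, where each item is a flat set of concepts, is the cheapest of the three resolutions. A rough sketch of counterfactual retrieval at that level follows. The abstract does not state which distance is used, so the size of the symmetric difference is assumed here. The function names, concept names, and class labels are hypothetical toy values.

```python
from typing import Dict, FrozenSet, List, Tuple


def set_edit_distance(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Number of concept insertions plus deletions needed to turn a into b."""
    return len(a ^ b)


def nearest_counterfactual(
    query: FrozenSet[str],
    query_label: str,
    candidates: List[Tuple[FrozenSet[str], str]],
) -> Dict[str, object]:
    """Find the closest item with a different label and report the concept edits."""
    others = [(concepts, label) for concepts, label in candidates if label != query_label]
    if not others:
        raise ValueError("no candidate with a different label")
    target, label = min(others, key=lambda item: set_edit_distance(query, item[0]))
    return {
        "label": label,
        "remove": sorted(query - target),  # concepts to delete from the query
        "add": sorted(target - query),     # concepts to insert into the query
    }


query = frozenset({"bird", "red_wing", "short_beak"})
pool = [
    (frozenset({"bird", "blue_wing", "short_beak"}), "class_b"),
    (frozenset({"bird", "red_wing", "long_beak", "crest"}), "class_c"),
    (frozenset({"bird", "red_wing", "short_beak", "crest"}), "class_a"),
]
explanation = nearest_counterfactual(query, "class_a", pool)
```

Because sets carry no relations between concepts, this level cannot express statements such as "the wing is attached to the bird", which is the gap the relational and structural levels are meant to close.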

Keywords: conceptual counterfactual explanations, explainable AI, multi-resolution framework, Graph Neural Networks, graph autoencoders, efficiency-expressivity trade-off, structural explanations, model-agnostic

74. ❌ Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Authors: Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08294v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

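Scoring rationale: The paper studies how vision language models (VLMs) perform on Action Quality Assessment (AQA), which is research on AI in a scientific/applied domain. It is entirely unrelated to most keywords (such as LLM technical principles, training methods, and optimization techniques). It is only weakly related to a few keywords: 1) reasoning keywords such as ‘Chain of Thought’ (5 points): the paper tests reasoning-structure strategies; 2) ‘Hallucination Mitigation’ and related terms (5 points): it touches on model bias and factuality; 3) ‘Explainable AI’ (5 points): it analyzes model failure modes; 4) ‘In-context Learning’ (5 points): it tests in-context learning strategies; 5) ‘AI for Science’ (5 points): it is an application of AI to sports/health science. None of these links is central; they are only strategies used in the evaluation or the application domain.

!!! tip deepseek-chat TL;DR

The paper evaluates vision language models on Action Quality Assessment and finds that current models perform only marginally above random chance, exhibit systematic biases, and gain little from improvement strategies, revealing a fundamental difficulty with fine-grained movement quality assessment.

Abstract translation (Chinese)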

动作质量评估（Action Quality Assessment, AQA）在物理治疗、运动教练和竞技评分领域具有广泛的应用前景。尽管视觉语言模型（Vision Language Models, VLMs）在AQA任务中展现出巨大潜力，但其在该领域中的实际性能仍缺乏充分研究。本文对当前先进的VLMs在多个活动领域（如健身、花样滑冰、跳水）、任务类型、表征方式及提示策略方面进行了全面评估。基线结果显示，Gemini 3.1 Pro、Qwen3-VL和InternVL3.5等模型的表现仅略高于随机猜测水平；尽管引入骨骼信息、基础指令、推理结构和上下文学习等策略能在特定情况下带来局部性能提升，但无一策略能持续有效。对预测分布的分析揭示了两种系统性偏差：模型倾向于无视视觉证据而预测动作执行正确，且对表面语言表述框架高度敏感。通过对比性任务重构来缓解这些偏差的尝试收效甚微，这表明模型的局限性不仅源于上述偏差，更指向其对细粒度运动质量评估存在根本性困难。本研究为未来基于VLM的AQA研究建立了严谨的基准，并为实际可靠部署前需解决的关键失效模式提供了可操作的改进框架。

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models’ limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

Keywords: Vision Language Models, Action Quality Assessment, Empirical Evaluation, Skeleton Information, In-context Learning, Systematic Biases, Fine-grained Movement Assessment, Real-world Deployment

75. ❌ CIAO - Code In Architecture Out - Automated Software Architecture Documentation with Large Language Models

Authors: Marco De Luca, Tiziano Santilli, Domenico Amalfitano, Anna Rita Fasolino, Patrizio Pelliccione Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08293v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

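Scoring rationale: The paper's core is applying large language models (LLMs) to automatically generate software architecture documentation, which is highly relevant to the ‘Large Language Models’ keyword (10 points). The other keywords concern model technical principles (e.g., MoE, quantization), training methods (e.g., pre-training, alignment), inference optimization (e.g., attention mechanisms), and specific application domains (e.g., AI for science). The paper does not address these technical details or domains, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CIAO, an automated process that uses large language models to generate system-level software architecture documentation directly from GitHub repositories, and validates the value, comprehensibility, and cost-effectiveness of the documentation through a developer evaluation.

Abstract translation (Chinese)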

软件架构文档对于系统理解至关重要，但其往往缺失或不完整。尽管近期基于大语言模型的技术能够从代码生成文档，但这些方法通常仅处理局部产物，而非生成连贯的系统级架构描述。本文提出一种结构化流程，能够直接利用大语言模型从GitHub仓库自动生成系统级架构文档。该流程名为CIAO（Code In Architecture Out），定义了一个基于大语言模型的工作流：它以代码仓库为输入，并遵循源自ISO/IEC/IEEE 42010标准、SEI Views & Beyond框架及C4模型的模板，生成系统级架构文档。生成的文档可直接添加至目标仓库。我们通过一项涉及22名开发者的研究对该流程进行评估，每位开发者均评审了为其曾贡献过的仓库所生成的文档。评估表明，开发者普遍认为生成的文档具有价值、易于理解，且与源代码大体相符，同时也指出了其在图表质量、高层级上下文建模及部署视图方面的局限性。我们还评估了该流程的运行成本，发现生成完整架构文档仅需数分钟且运行成本低廉。总体而言，结果表明这种结构化、面向标准的方法能有效引导大语言模型生成兼具实用性与成本效益的系统级架构文档。

Abstract

Software architecture documentation is essential for system comprehension, yet it is often unavailable or incomplete. While recent LLM-based techniques can generate documentation from code, they typically address local artifacts rather than producing coherent, system-level architectural descriptions. This paper presents a structured process for automatically generating system-level architectural documentation directly from GitHub repositories using Large Language Models. The process, called CIAO (Code In Architecture Out), defines an LLM-based workflow that takes a repository as input and produces system-level architectural documentation following a template derived from ISO/IEC/IEEE 42010, SEI Views & Beyond, and the C4 model. The resulting documentation can be directly added to the target repository. We evaluated the process through a study with 22 developers, each reviewing the documentation generated for a repository they had contributed to. The evaluation shows that developers generally perceive the produced documentation as valuable, comprehensible, and broadly accurate with respect to the source code, while also highlighting limitations in diagram quality, high-level context modeling, and deployment views. We also assessed the operational cost of the process, finding that generating a complete architectural document requires only a few minutes and is inexpensive to run. Overall, the results indicate that a structured, standards-oriented approach can effectively guide LLMs in producing system-level architectural documentation that is both usable and cost-effective.

Keywords: Large Language Models, software architecture documentation, automated generation, GitHub repositories, system-level documentation, CIAO, ISO/IEC/IEEE 42010, developer evaluation

76. ❌ Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models

Authors: Yating Wang, Wenting Zhao, Yaqi Zhao, Yongshun Gong, Yilong Yin, Haoliang Sun Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08284v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

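Scoring rationale: The paper focuses on model editing for rule-level knowledge in large language models (LLMs). Through a mechanistic study and causal tracing, it explores how knowledge is organized across transformer layers and proposes Distributed Multi-Layer Editing (DMLE). It is therefore highly relevant to ‘Large Language Models’ (10 points), because LLMs are the paper's core object of study, and to ‘Mechanistic Interpretability’ (10 points), because it uses causal tracing to analyze internal model mechanisms and explain knowledge representation. Other keywords, such as MoE, SLMs, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment, etc.), reasoning techniques (CoT, MCTS), agent systems, compression and acceleration techniques, hallucination mitigation, and AI-for-science applications, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the editing of rule-level knowledge in large language models, finds that rule knowledge has a form-specific organization across transformer layers (formulas and descriptions concentrated in earlier layers, instances associated with middle layers), and proposes Distributed Multi-Layer Editing (DMLE), which substantially improves rule-editing performance.

Abstract translation (Chinese)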

大型语言模型不仅存储孤立的事实，还存储支持跨符号表达式、自然语言解释和具体实例进行推理的规则。然而，大多数模型编辑方法是为事实层面的知识构建的，其假设目标编辑可以通过局部干预实现。这一假设不适用于规则层面的知识，因为单个规则必须在多个相互依赖的形式中保持一致。我们通过对规则层面知识编辑的机制研究来探讨这一问题。为支持此项研究，我们将RuleEdit基准从80条手动验证的规则扩展到涵盖数学和物理领域的200条规则。细粒度的因果追踪揭示了规则知识在Transformer层中的形式特异性组织：公式和描述集中在较早的层，而实例更多与中间层关联。这些结果表明，规则知识并非均匀局部化，因此无法通过单层或连续块干预可靠地编辑。基于这一洞见，我们提出了分布式多层编辑（Distributed Multi-Layer Editing, DMLE），该方法对公式和描述应用共享的早期层更新，并对实例应用独立的中间层更新。在保持标准编辑指标竞争力的同时，DMLE实现了显著更强的规则层面编辑性能。在GPT-J-6B、Qwen2.5-7B、Qwen2-7B和LLaMA-3-8B模型上，相较于最强基线，DMLE平均将实例可移植性和规则理解能力分别提升了13.91和50.19个百分点。代码发布于https://github.com/Pepper66/DMLE。

Abstract

Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at https://github.com/Pepper66/DMLE.

Keywords: Large Language Models, Model Editing, Rule-Level Knowledge, Mechanistic Interpretability, Causal Tracing, Transformer Layers, Distributed Multi-Layer Editing, Knowledge Representation

77. ❌ QARIMA: A Quantum Approach To Classical Time Series Analysis

Authors: Nishikanta Mohanty, Bikash K. Behera, Badshah Mukherjee, Pravat Dash Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08277v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

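Scoring rationale: The paper "QARIMA: A Quantum Approach To Classical Time Series Analysis" proposes a quantum-inspired ARIMA method for time series analysis. Its core contribution is applying quantum computing techniques (such as variational quantum circuits and swap-test-driven quantum autocorrelation/partial autocorrelation) to parameter estimation and optimization of classical statistical models. All scoring keywords focus on large language models (LLMs) and related techniques (training, alignment, inference optimization, agent systems, etc.), whereas this paper studies quantum computing for time series analysis, at the intersection of quantum machine learning and classical statistics, with no direct connection to LLM technology. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to science (specifically time series analysis), though not to bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a quantum-inspired ARIMA method that uses quantum-assisted lag discovery and variational quantum circuits for parameter estimation, and validates on environmental and industrial datasets that it reduces meta-optimization overhead and improves time series forecasting performance.

Abstract translation (Chinese)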

本文提出一种量子启发的ARIMA方法，该方法将量子辅助滞后项发现与用于参数估计及弱滞后项优化的固定构型变分量子电路（VQC）相结合。差分阶数与候选滞后项通过基于交换测试的量子自相关函数（QACF）与量子偏自相关函数（QPACF）进行识别，并采用延迟矩阵构造将量子投影与时域回归变量对齐，随后通过标准信息准则进行简约性筛选。给定筛选后的阶数$(p,d,q)$，我们保持固定的VQC拟设、优化器和训练预算以避免超参数泄露，并将电路部署于两种估计任务中：用于自回归系数的VQC-AR和用于移动平均系数的VQC-MA。在筛选与估计之间，一个轻量级的VQC弱滞后项优化模块对筛选出的自回归滞后项进行重加权或剪枝，而不改变$(p,d,q)$结构。在环境与工业数据集上，我们通过滚动原点评估与自动化经典ARIMA方法进行对比，报告样本外均方误差（MSE）、平均绝对百分比误差（MAPE）以及基于MSE和MAE的Diebold–Mariano检验。实证表明，七项量子贡献——（1）差分阶数选择，（2）QACF，（3）QPACF，（4）采用延迟矩阵构造的交换测试原语，（5）VQC-AR，（6）VQC弱滞后项优化，以及（7）VQC-MA——共同降低了元优化开销，并明确了量子效应在阶数发现、滞后项优化以及AR/MA参数估计中的具体作用位置。

Abstract

We present a quantum-inspired ARIMA methodology that integrates quantum-assisted lag discovery with fixed-configuration variational quantum circuits (VQCs) for parameter estimation and weak-lag refinement. Differencing and candidate lags are identified via swap-test-driven quantum autocorrelation (QACF) and quantum partial autocorrelation (QPACF), with a delayed-matrix construction that aligns quantum projections to time-domain regressors, followed by standard information-criterion parsimony. Given the screened orders $(p,d,q)$, we retain a fixed VQC ansatz, optimizer, and training budget, preventing hyperparameter leakage, and deploy the circuit in two estimation roles: VQC-AR for autoregressive coefficients and VQC-MA for moving-average coefficients. Between screening and estimation, a lightweight VQC weak-lag refinement re-weights or prunes screened AR lags without altering $(p,d,q)$. Across environmental and industrial datasets, we perform rolling-origin evaluations against automated classical ARIMA, reporting out-of-sample mean squared error (MSE), mean absolute percentage error (MAPE), and Diebold–Mariano tests on MSE and MAE. Empirically, the seven quantum contributions – (1) differencing selection, (2) QACF, (3) QPACF, (4) swap-test primitives with delayed-matrix construction, (5) VQC-AR, (6) VQC weak-lag refinement, and (7) VQC-MA – collectively reduce meta-optimization overhead and make explicit where quantum effects enter order discovery, lag refinement, and AR/MA parameter estimation.
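The rolling-origin evaluation protocol mentioned in the abstract can be sketched independently of the quantum components. In the sketch below, the expanding training window, the one-step horizon, and the naive last-value forecaster are all assumptions chosen for brevity. The forecaster is a stand-in and is not QARIMA or classical ARIMA, and the series values are toy numbers.

```python
from typing import Callable, Dict, Sequence


def rolling_origin_eval(
    series: Sequence[float],
    min_train: int,
    forecast: Callable[[Sequence[float]], float],
) -> Dict[str, float]:
    """One-step-ahead rolling-origin evaluation with an expanding window.

    At each origin, the forecaster sees series[:origin] and predicts
    series[origin]. MAPE assumes every actual value is non-zero.
    """
    squared_errors = []
    abs_pct_errors = []
    for origin in range(min_train, len(series)):
        prediction = forecast(series[:origin])
        actual = series[origin]
        squared_errors.append((actual - prediction) ** 2)
        abs_pct_errors.append(abs((actual - prediction) / actual))
    n = len(squared_errors)
    return {
        "mse": sum(squared_errors) / n,
        "mape": 100.0 * sum(abs_pct_errors) / n,
    }


def naive_last_value(history: Sequence[float]) -> float:
    """Placeholder forecaster: repeat the most recent observation."""
    return history[-1]


toy_series = [10.0, 12.0, 11.0, 13.0, 12.0]
metrics = rolling_origin_eval(toy_series, min_train=2, forecast=naive_last_value)
```

Any model exposing the same `forecast(history)` interface could be dropped in, which is what makes the protocol suitable for comparing two methods on identical out-of-sample points.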

Keywords: quantum-inspired ARIMA, variational quantum circuits, time series analysis, quantum autocorrelation, parameter estimation, lag refinement, swap-test, rolling-origin evaluation

78. ❌ Neural-Symbolic Knowledge Tracing: Injecting Educational Knowledge into Deep Learning for Responsible Learner Modelling

Authors: Danial Hooshyar, Gustav Šír, Yeongwook Yang, Tommi Kärkkäinen, Raija Hämäläinen, Ekaterina Krivich, Mutlu Cukurova, Dragan Gašević, Roger Azevedo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

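Scoring rationale: The paper studies AI applications in education, specifically the integration of symbolic educational knowledge into deep learning models for learner modelling. Relevance to the keywords: 1) the paper explicitly discusses LLMs in education (8 points); 2) its core contribution is a neural-symbolic method with intrinsic interpretability, which is highly relevant to Explainable AI (10 points); 3) it applies AI to education, which was judged relevant to AI for Science (10 points); 4) other keywords such as MoE, SFT, and RAG concern LLM technical details or specific application scenarios that the paper does not address (0 points).

!!! tip deepseek-chat TL;DR

To address the limited adaptivity and interpretability of LLMs in education, the paper proposes Responsible-DKT, a neural-symbolic deep knowledge tracing method that integrates symbolic educational knowledge. On a real-world dataset of math interactions it achieves better predictive performance (AUC gains of up to 13%) and intrinsic interpretability.

Abstract (Chinese translation)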

人工智能（AI）在教育领域的应用日益广泛，尤其是大型语言模型（LLMs）的兴起，推动了人们对智能导学系统的关注。然而，LLMs通常表现出有限的适应性，难以对学习者随时间变化的知识状态进行建模，这凸显了专门的学习者建模方法的必要性。尽管深度知识追踪方法在预测性能上表现优异，但其不透明性和易受偏见影响的特性可能限制其与教学原则的一致性。为此，我们提出了Responsible-DKT，一种神经符号深度知识追踪方法，它将符号化的教育知识（例如掌握与非掌握规则）整合到序列神经模型中，以实现负责任的学习者建模。在真实世界的学生数学交互数据集上的实验表明，Responsible-DKT在不同训练设置下均优于神经符号基线模型和完全数据驱动的PyTorch DKT模型。该模型仅使用10%的训练数据即可达到0.80以上的AUC值，最高可达0.90 AUC，性能提升高达13%。它还表现出更好的时间可靠性，在序列早期和中期的预测误差更低，且在不同序列长度下具有最低的预测不一致率，这表明预测更新在方向上与观察到的学生反应随时间保持对齐。此外，这种神经符号方法通过一个基于事实的计算图提供了内在的可解释性，揭示了每个预测背后的逻辑，从而支持局部和全局的解释。它还允许对教学假设进行实证评估，揭示了重复的错误反应（非掌握状态）对预测更新具有强烈影响。这些结果表明，神经符号方法不仅提升了性能和可解释性，缓解了数据限制，而且支持在教育中实现更负责任、以人为本的人工智能。

Abstract

The growing use of artificial intelligence (AI) in education, particularly large language models (LLMs), has increased interest in intelligent tutoring systems. However, LLMs often show limited adaptivity and struggle to model learners’ evolving knowledge over time, highlighting the need for dedicated learner modelling approaches. Although deep knowledge tracing methods achieve strong predictive performance, their opacity and susceptibility to bias can limit alignment with pedagogical principles. To address this, we propose Responsible-DKT, a neural-symbolic deep knowledge tracing approach that integrates symbolic educational knowledge (e.g., mastery and non-mastery rules) into sequential neural models for responsible learner modelling. Experiments on a real-world dataset of students’ math interactions show that Responsible-DKT outperforms both a neural-symbolic baseline and a fully data-driven PyTorch DKT model across training settings. The model achieves over 0.80 AUC with only 10% of training data and up to 0.90 AUC, improving performance by up to 13%. It also demonstrates improved temporal reliability, producing lower early- and mid-sequence prediction errors and the lowest prediction inconsistency rates across sequence lengths, indicating that prediction updates remain directionally aligned with observed student responses over time. Furthermore, the neural-symbolic approach offers intrinsic interpretability via a grounded computation graph that exposes the logic behind each prediction, enabling both local and global explanations. It also allows empirical evaluation of pedagogical assumptions, revealing that repeated incorrect responses (non-mastery) strongly influence prediction updates. These results indicate that neural-symbolic approaches enhance both performance and interpretability, mitigate data limitations, and support more responsible, human-centered AI in education.

Keywords: neural-symbolic deep knowledge tracing, responsible learner modelling, large language models in education, interpretable AI, educational knowledge integration, student knowledge prediction, temporal reliability, human-centered AI
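
The abstract describes a "prediction inconsistency rate" as a measure of whether prediction updates stay directionally aligned with observed responses. The abstract does not give the exact formula, so the sketch below uses one plausible reading as an assumption: an update counts as inconsistent when the predicted mastery probability falls after a correct response or rises after an incorrect one. The function name and the toy numbers are hypothetical.

```python
from typing import Sequence


def inconsistency_rate(probs: Sequence[float], responses: Sequence[int]) -> float:
    """Fraction of updates whose direction contradicts the observed response.

    probs[i] is the predicted probability of a correct answer after seeing
    responses[:i], so probs must hold exactly one more entry than responses.
    responses[i] is 1 for a correct answer and 0 for an incorrect one.
    """
    if len(probs) != len(responses) + 1:
        raise ValueError("probs must have len(responses) + 1 entries")
    if not responses:
        raise ValueError("at least one response is required")
    inconsistent = 0
    for i, response in enumerate(responses):
        delta = probs[i + 1] - probs[i]
        # Assumed rule: down after correct, or up after incorrect, is inconsistent.
        if (response == 1 and delta < 0) or (response == 0 and delta > 0):
            inconsistent += 1
    return inconsistent / len(responses)


# Made-up trajectory: two of the four updates move in the wrong direction.
example_rate = inconsistency_rate([0.50, 0.60, 0.55, 0.50, 0.58], [1, 1, 0, 0])
print(example_rate)
```

Under this reading, a lower rate means the model's updates agree more often with the evidence it just observed, which matches the abstract's description of temporal reliability.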

79. ❌ DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

Authors: Jiangbei Yue, Sharib Ali Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08261v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

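Scoring rationale: The paper focuses on OOD detection for medical (endoscopic) images and proposes a dual-branch multimodal framework. Although it is an application of AI to science (medicine), its content mainly concerns computer vision, multimodal learning, and OOD detection methods. It does not involve large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, training methods, inference optimization, or agents), or any of the specific LLM techniques named in the keywords. Therefore, apart from 5 points for "AI for Science OR Bioinformatics OR Cheminformatics" because of the biomedical AI application, all other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a dual-branch multimodal framework (DBMF) for endoscopic images to address the shortcomings of existing methods in exploiting multimodal information for out-of-distribution (OOD) detection. Experiments show that the framework substantially improves OOD detection performance.

Abstract (Chinese translation)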

复杂且动态变化的真实临床环境要求深度学习系统具备高可靠性。当遇到偏离训练分布的数据（如未见过的疾病病例）时，分布外检测对于提升深度学习模型的可靠性与泛化能力至关重要。然而，现有的分布外检测方法通常仅依赖单一视觉模态或单纯的图文匹配，未能充分利用多模态信息。为克服这一挑战，我们提出了一种新颖的双分支多模态框架，通过引入图文分支与视觉分支，充分挖掘多模态表征，以这两个互补的分支来识别分布外样本。训练完成后，我们分别计算图文分支得分（$S_t$）与视觉分支得分（$S_v$），并将其融合得到最终的分布外检测得分$S$，通过与阈值比较实现分布外判定。在公开可用的内窥镜图像数据集上进行的大量实验表明，我们所提出的框架在不同骨干网络中均表现出鲁棒性，并将分布外检测的先进性能最高提升了24.84%。

Abstract

The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.

Keywords: out-of-distribution detection, multimodal framework, endoscopic images, deep learning, clinical reliability, text-image branch, vision branch, OOD score
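
The decision rule in the abstract (combine $S_t$ and $S_v$ into $S$, then compare $S$ with a threshold) can be sketched as below. The abstract does not say how the two scores are integrated, so the convex combination, the weight `alpha`, the convention that a higher score means "more likely OOD", and the threshold value are all assumptions made for illustration.

```python
def fuse_ood_score(s_t: float, s_v: float, alpha: float = 0.5) -> float:
    """Assumed fusion rule: S = alpha * S_t + (1 - alpha) * S_v."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s_t + (1.0 - alpha) * s_v


def is_ood(s_t: float, s_v: float, threshold: float, alpha: float = 0.5) -> bool:
    """Flag a sample as OOD when the fused score exceeds the threshold.

    Assumes higher scores indicate stronger evidence of an OOD sample.
    """
    return fuse_ood_score(s_t, s_v, alpha) > threshold


# Hypothetical branch scores for two samples and a hypothetical threshold.
print(is_ood(s_t=0.8, s_v=0.4, threshold=0.5))  # fused score is about 0.6
print(is_ood(s_t=0.2, s_v=0.1, threshold=0.5))  # fused score is about 0.15
```

In practice the threshold would typically be chosen on held-out in-distribution data (for example, to fix a target true-positive rate), and the two branch scores may need to be normalized to a common scale before they are combined.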

80. ❌ HyperMem: Hypergraph Memory for Long-Term Conversations

Authors: Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu, Li Guo, Yafeng Deng Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08256v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

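Scoring rationale: The paper's core is an LLM-based dialogue system. It proposes a hypergraph memory architecture (HyperMem) that improves retrieval-augmented generation (RAG) to handle high-order associations in long-term conversations. It is therefore highly relevant to "Large Language Models" and "Retrieval-Augmented Generation" (10 points). Because it deals with long-term conversations, it is somewhat related to "Long Context LLMs" (5 points), and its conversational-agent application relates to "LLM Agents" (5 points). Other keywords such as MoE, SFT, and RLHF are not mentioned or applied in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the difficulty that existing retrieval-augmented generation methods have in capturing high-order associations in long-term conversations, the paper proposes HyperMem, a hypergraph-based hierarchical memory architecture, and demonstrates its effectiveness with 92.73% LLM-as-a-judge accuracy on the LoCoMo benchmark.

Abstract (Chinese translation)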

长期记忆对于对话智能体维持对话连贯性、追踪持续性任务以及在长程对话中提供个性化交互至关重要。然而，现有方法如检索增强生成（RAG）和基于图的记忆大多依赖成对关系，难以捕捉高阶关联——即多个元素间的联合依赖关系，导致检索结果碎片化。为此，我们提出HyperMem，一种基于超图的分层记忆架构，通过超边显式建模此类关联。具体而言，HyperMem将记忆结构化为三个层级：主题、情景和事实，并通过超边将相关情景及其事实进行分组，将分散内容整合为连贯单元。基于此结构，我们设计了混合词汇-语义索引及由粗到精的检索策略，支持对高阶关联进行精准高效的检索。在LoCoMo基准测试上的实验表明，HyperMem以92.73%的LLM-as-a-judge准确率实现了最先进的性能，验证了其在长程对话中的有效性。

Abstract

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches such as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Specifically, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.

Keywords: Hypergraph Memory, Long-Term Conversations, Retrieval-Augmented Generation, High-order Associations, Hierarchical Memory Architecture, LLM-as-a-judge, Coarse-to-fine Retrieval, Conversational Agents
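
To make the three-level structure (topics, episodes, facts) and the coarse-to-fine retrieval idea concrete, here is a toy sketch. It is not HyperMem's implementation: the class names, the token-overlap scorer (a purely lexical stand-in for the paper's hybrid lexical-semantic index), and the sample memory are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


def overlap(query: str, text: str) -> int:
    """Toy lexical score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


@dataclass
class Episode:
    summary: str
    facts: List[str] = field(default_factory=list)


@dataclass
class Hyperedge:
    """Groups several related episodes (and their facts) under one topic."""
    topic: str
    episodes: List[Episode] = field(default_factory=list)


def retrieve(query: str, memory: List[Hyperedge],
             top_topics: int = 1, top_facts: int = 1) -> List[str]:
    # Coarse step: rank topic-level hyperedges and keep the best ones.
    edges = sorted(memory, key=lambda e: overlap(query, e.topic),
                   reverse=True)[:top_topics]
    # Fine step: rank only the facts reachable through the selected hyperedges.
    facts = [f for e in edges for ep in e.episodes for f in ep.facts]
    return sorted(facts, key=lambda f: overlap(query, f),
                  reverse=True)[:top_facts]


memory = [
    Hyperedge("trip to Japan", [
        Episode("planning flights", ["Alice booked a flight to Tokyo in May"]),
        Episode("choosing lodging", ["Alice prefers hotels near Shinjuku station"]),
    ]),
    Hyperedge("marathon training", [
        Episode("weekly schedule", ["Alice runs ten km every Sunday"]),
    ]),
]
hits = retrieve("which hotels does Alice like for the Japan trip", memory)
print(hits)
```

The point of the hyperedge is that one grouping can connect more than two items at once, so the fine step sees all episodes and facts of a topic together instead of following pairwise links one at a time.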

81. ❌ From Phenomenological Fitting to Endogenous Deduction: A Paradigm Leap via Meta-Principle Physics Architecture

Authors: Helong Hu, HongDan Pan, ShuiQing Hu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

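Scoring rationale: The paper proposes a new neural network architecture (MPPA) that embeds physical meta-principles into the architecture to improve performance on physical reasoning, mathematical, and logical tasks. However, it does not touch on any of the given LLM or deep learning keywords. It focuses on architectural innovation through embedded physical principles rather than on LLM techniques, training methods, inference optimization, alignment, agent systems, compression, or AI-for-science applications. The relevance for every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new Meta-Principle Physics Architecture (MPPA) that embeds physical meta-principles into the neural network architecture, shifting the paradigm from pure phenomenological fitting to a fusion of phenomenological fitting and endogenous deduction, and markedly improving performance on physical reasoning, mathematical, and logical tasks.

Abstract (Chinese translation)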

当前神经网络架构的本质是现象学拟合：它们通过海量参数和数据学习输入-输出的统计相关性，但缺乏对支配物理现实的基本原理的内在理解。本文提出一种从纯粹现象学拟合向现象学拟合与内生演绎相融合的范式跃迁。通过将物理元原理嵌入神经网络架构，我们构建了元原理物理架构（Meta-Principle Physics Architecture, MPPA）。
具体而言，MPPA将三个核心元原理——连通性（Connectivity）、守恒性（Conservation）、周期性（Periodicity）——嵌入其架构中，通过三个核心组件实现：引力器（Gravitator）通过标准因果注意力实现连通性；能量编码器（Energy Encoder）通过对数域能量追踪与延迟补偿实现守恒性；周期性编码器（Periodicity Encoder）通过基于快速傅里叶变换的频谱分析与延迟调制实现周期性。这些组件通过可学习的独立门控融合机制协同工作，形成“局部关系连通性-全局守恒约束-演化周期律”的完整物理认知框架。
实验表明MPPA取得显著提升：物理推理能力（从接近零提升至0.436，0.436 vs 0.000），数学任务性能提升2.18倍（0.330 vs 0.151），逻辑任务增益52%（0.456 vs 0.300），验证困惑度降低3.69%（259.45 vs 269.40），而参数量仅增加11.8%（242.40M vs 216.91M）。值得注意的是，MPPA在分布外物理场景中表现出强大泛化能力，证明了这种原理嵌入设计的鲁棒性与可解释性。本研究为构建具备物理常识、因果推理与数学严谨性的下一代人工智能奠定了新的理论基础与技术路径。

Abstract

The essence of current neural network architectures is phenomenological fitting: they learn input-output statistical correlations via massive parameters and data, yet lack intrinsic understanding of the fundamental principles governing physical reality. This paper proposes a paradigm leap from pure phenomenological fitting to the fusion of phenomenological fitting and endogenous deduction. By embedding physical meta-principles into neural network architecture, we construct the Meta-Principle Physics Architecture (MPPA). Specifically, MPPA embeds three core meta-principles - Connectivity, Conservation, Periodicity - into its architecture, implemented via three core components: the Gravitator realizes Connectivity via standard causal attention; the Energy Encoder implements Conservation via log-domain energy tracking and delayed compensation; the Periodicity Encoder fulfills Periodicity via FFT-based spectral analysis and delayed modulation. These components collaborate via a learnable independent gating fusion mechanism, forming a complete physical cognition framework of ’local relational connectivity - global conservation constraint - evolutionary periodic law’. Experiments show MPPA achieves significant improvements: physical reasoning (from near zero to 0.436, 0.436 vs 0.000), 2.18x mathematical task improvement (0.330 vs 0.151), 52% logical task gain (0.456 vs 0.300), and 3.69% lower validation perplexity (259.45 vs 269.40), with only 11.8% more parameters (242.40M vs 216.91M). Notably, MPPA shows strong generalization on out-of-distribution physical scenarios, proving the robustness and interpretability of this principle-embedded design. This work establishes a new theoretical foundation and technical path for next-generation AI with physical common sense, causal reasoning, and mathematical rigor.

Keywords: Meta-Principle Physics Architecture, phenomenological fitting, endogenous deduction, physical meta-principles, neural network architecture, physical reasoning, generalization, interpretability
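
The relative improvements quoted in the abstract can be re-derived from the raw pairs of numbers it reports. The short check below uses only those reported values; the variable names are my own.

```python
# Raw values as reported in the abstract (MPPA vs baseline).
math_mppa, math_base = 0.330, 0.151
logic_mppa, logic_base = 0.456, 0.300
ppl_mppa, ppl_base = 259.45, 269.40
params_mppa, params_base = 242.40, 216.91  # millions of parameters

math_ratio = math_mppa / math_base                        # reported as 2.18x
logic_gain = (logic_mppa - logic_base) / logic_base       # reported as 52%
ppl_drop = (ppl_base - ppl_mppa) / ppl_base               # reported as 3.69%
param_growth = (params_mppa - params_base) / params_base  # reported as 11.8%

print(f"math ratio   : {math_ratio:.3f}x")
print(f"logic gain   : {logic_gain:.1%}")
print(f"perplexity   : {ppl_drop:.2%} lower")
print(f"extra params : {param_growth:.1%}")
```

The figures agree with the abstract to within rounding. The physical reasoning result is left out because its baseline is reported as 0.000, so a relative improvement is undefined there.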

82. ❌ Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework

Authors: Seyed Amir Ahmad Safavi-Naini, Elahe Meftah, Josh Mohess, Pooya Mohammadi Kazaj, Georgios Siontis, Zahra Atf, Peter R. Lewis, Mauricio Reyes, Girish Nadkarni, Roland Wiest, Stephan Windecker, Christoph Grani, Ali Soroush, Isaac Shiri Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

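Scoring rationale: The paper proposes a Clinical World Model and a Skill-Mix framework intended to establish a formal competency assessment scheme for clinical AI. Although it concerns AI in the clinical domain, all of the keywords focus on technical details of LLMs and deep learning (training methods, architecture optimization, inference techniques, and so on), none of which the paper discusses. Its core is a conceptual framework and evaluation methodology rather than a technical implementation or model innovation, so it is unrelated to every technical keyword.

!!! tip deepseek-chat TL;DR

The paper proposes the Clinical World Model and the Skill-Mix framework. By formalizing the interactions among patient, provider, and ecosystem in clinical care, it establishes an eight-dimension competency assessment scheme for clinical AI, addressing the lack of a unified competency evaluation framework.

Abstract (Chinese translation)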

任何智能智能体的能力都受限于其对所处世界的形式化描述。临床人工智能目前缺乏这样的描述。现有框架仅孤立地处理评估、监管或系统设计问题，缺乏一个能够连接这些方面的、关于临床世界的共享模型。我们在此提出“临床世界模型”，这是一个将医疗照护形式化为患者、提供者和生态系统三方互动的框架。为了形式化任何智能体（无论是人类还是人工智能）如何将信息转化为临床行动，我们基于临床认知的已验证原则，为医疗提供者、患者和人工智能体开发了并行的决策架构。
“临床人工智能技能组合”通过八个维度将能力操作化。其中五个维度定义了临床能力空间（疾病状况、诊疗阶段、照护场景、提供者角色和任务），另外三个维度则具体规定了人工智能如何介入人类推理过程（被授予的权限、面向对象和锚定层）。这些维度的组合乘积产生了一个包含数十亿个独特能力坐标的空间。一个核心的结构性启示是：在某一坐标内进行的验证，对于其他坐标的性能表现提供的证据极为有限，这使得能力空间具有不可约简性。该框架提供了一种通用语法，各利益相关方可以借此对临床人工智能进行具体规范、评估和界定。通过使这一结构显性化，临床世界模型将领域的核心问题从“人工智能是否有效”重新定义为：在哪些能力坐标中其可靠性已得到证实，以及为谁而证实。

Abstract

The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision-making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill-Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field’s central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.

Keywords: Clinical World Model, Clinical AI, Skill-Mix Framework, Clinical Competency, Decision-making Architectures, Clinical Cognition, Validation Framework, Healthcare AI

83. ❌ HiRO-Nav

Authors: He Zhao, Yijun Yang, Zichuan Lin, Deheng Ye, Chunyan Miao Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

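Scoring rationale: The paper studies embodied navigation agents built on large reasoning models (LRMs). Its core contribution is a hybrid reasoning strategy that adaptively activates reasoning according to action entropy. It is highly relevant to the following keywords: 1) "Large Language Models" (8 points): the paper builds on LRMs, which fall under large models; 2) "Post-training/SFT" (10 points): it explicitly uses hybrid supervised fine-tuning as a cold start; 3) "Chain of Thought/CoT Reasoning" (10 points): it involves multi-step reasoning and thinking; 4) "System 2 Thinking" (10 points): it distinguishes reflexive action from deliberate reasoning, which corresponds to System 2 thinking; 5) "LLM Agents" (10 points): it studies autonomous navigation agents built on large models. Other keywords such as MoE, quantization, and RAG are not addressed.

!!! tip deepseek-chat TL;DR

The paper studies how to use the reasoning capabilities of large reasoning models intelligently and efficiently for long-horizon embodied navigation. It proposes HiRO-Nav, a hybrid reasoning agent based on action entropy that activates reasoning only for critical high-entropy actions, substantially reducing computational overhead while maintaining task success rates.

Abstract (Chinese translation)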

基于大型推理模型（LRMs）构建的具身导航智能体能够处理复杂的多模态环境输入，并在每一步执行具身推理以提升长时程任务的序列决策能力。然而，一个关键问题仍然存在：如何为长时程导航任务智能且高效地利用大型推理模型的推理能力？ 在简单场景中，智能体应能反射性地行动，而在复杂场景中则需在行动前进行审慎推理。为实现这一目标，我们提出了混合推理导航（HiRO-Nav） 智能体，这是首类能够根据自身动作熵自适应决定每一步是否执行推理的智能体。具体而言，通过分析智能体在导航轨迹中动作熵的演变，我们观察到仅有少量动作表现出高熵特性，而这些动作往往引导智能体进入新场景或接近关键物体。进一步研究动作熵与任务完成度（即Q值）的关系发现，提升高熵动作的质量对任务成功有更显著的积极影响。因此，我们设计了一套定制化训练流程：首先通过混合监督微调实现冷启动，随后采用提出的混合推理策略进行在线强化学习，仅针对高熵动作显式激活推理过程，从而在提升决策质量的同时显著降低计算开销。在CHORES-𝕊 ObjectNav基准上的大量实验表明，与全程推理和无推理基线相比，HiRO-Nav在成功率和计算效率（token效率）之间实现了更优的平衡。

Abstract

Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberate reasoning before acting. To achieve this, we introduce the Hybrid ReasOning Navigation (HiRO-Nav) agent, the first kind of agent capable of adaptively determining whether to perform thinking at every step based on its own action entropy. Specifically, by examining how the agent's action entropy evolves over the navigation trajectories, we observed that only a small fraction of actions exhibit high entropy, and these actions often steer the agent toward novel scenes or critical objects. Furthermore, studying the relationship between action entropy and task completion (i.e., Q-value) reveals that improving high-entropy actions contributes more positively to task success. Hence, we propose a tailored training pipeline comprising hybrid supervised fine-tuning as a cold start, followed by online reinforcement learning with the proposed hybrid reasoning strategy to explicitly activate reasoning only for high-entropy actions, significantly reducing computational overhead while improving decision quality. Extensive experiments on the CHORES-$\mathbb{S}$ ObjectNav benchmark show that HiRO-Nav achieves a better trade-off between success rates and token efficiency than both dense-thinking and no-thinking baselines.

Keywords: embodied navigation, large reasoning models, hybrid reasoning, action entropy, supervised fine-tuning, reinforcement learning, computational efficiency, decision-making
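
The gating idea in the abstract (reason only when the agent's action entropy is high) can be sketched with the Shannon entropy of the action distribution. This is an illustration under stated assumptions, not the paper's procedure: the fixed threshold, the function names, and the example distributions are hypothetical, and the abstract does not say how HiRO-Nav sets or learns its own criterion.

```python
import math
from typing import Sequence


def action_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_reason(probs: Sequence[float], threshold: float) -> bool:
    """Assumed gate: trigger explicit reasoning only for high-entropy steps."""
    return action_entropy(probs) > threshold


THRESHOLD = 1.0  # hypothetical value, in nats

confident_step = [0.97, 0.01, 0.01, 0.01]  # agent is nearly certain: act reflexively
uncertain_step = [0.25, 0.25, 0.25, 0.25]  # agent is undecided: think first

print(should_reason(confident_step, THRESHOLD))
print(should_reason(uncertain_step, THRESHOLD))
```

For four actions the entropy ranges from 0 (fully certain) to ln 4, about 1.386 nats (uniform), so the threshold effectively controls how many steps pay the token cost of an explicit reasoning trace.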

84. ❌ AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Authors: Yuankun Xie, Haonan Cheng, Jiayi Zhou, Xiaoxuan Guo, Tao Wang, Jian Liu, Weiqiang Wang, Ruibo Fu, Xiaopeng Wang, Hengyan Huang, Xiaoying Huang, Long Ye, Guangtao Zhai Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

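Scoring rationale: The paper concerns the challenges that Audio Large Language Models (ALLMs) pose for audio deepfake detection. It is highly relevant to the "Large Language Models" keyword (8 points) because it explicitly mentions ALLMs and discusses the security challenges they create. It is somewhat related to "AI for Science" (5 points) because audio deepfake detection is an application of AI in multimedia forensics, which can be regarded as part of AI for science and technology applications. Other keywords (such as MoE, SFT, and RAG) are neither mentioned in nor relevant to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the AT-ADD challenge, which targets detection of the diverse audio deepfakes generated by Audio Large Language Models. Its two tracks (robust speech detection and all-type audio detection) are designed to advance robust and generalizable audio forensic technology.

Abstract (Chinese translation)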

音频大语言模型（Audio Large Language Models, ALLMs）的快速发展，使得语音与非语音音频（包括音效、歌声和音乐）能够以低成本、高保真度的方式生成与编辑。尽管这些能力促进了创意与内容生产，但也带来了重大的安全与信任挑战，因为高度逼真的音频深度伪造内容如今已可大规模生成与传播。然而，现有的音频深度伪造检测（Audio Deepfake Detection, ADD）防御对策与基准仍主要集中于语音领域，往往依赖语音特有的伪影特征，对现实环境中的失真鲁棒性有限，且对异构音频类型及新兴伪造技术的泛化能力不足。为弥补这些缺陷，我们为ACM Multimedia 2026提出了“全类型音频深度伪造检测（All-Type Audio Deepfake Detection, AT-ADD）”大型挑战赛，旨在连接受控的学术评估与实际的多媒体取证需求。AT-ADD包含两个赛道：（1）鲁棒语音深度伪造检测，旨在现实场景下评估检测器对未见过的先进语音生成方法的防御能力；（2）全类型音频深度伪造检测，将检测范围从语音扩展至多样化的未知音频类型，并推动跨语音、音效、歌声与音乐的类别无关泛化能力。通过提供标准化数据集、严谨的评估协议与可复现的基线模型，AT-ADD旨在加速鲁棒且可泛化的音频取证技术的发展，以在合成音频泛在的时代中，为安全通信、可信媒体验证与负责任治理提供支持。

Abstract

The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.

Keywords: Audio Large Language Models, Audio Deepfake Detection, Robust Speech Detection, All-Type Audio Detection, Multimedia Forensics, Synthetic Audio, Generalization, ACM Multimedia Challenge

85. ❌ Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Authors: Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, Lan-Zhe Guo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

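Scoring rationale: The paper's core topic is the evaluation of reward modeling for LLM agents in tool-use environments, so it is highly relevant (10 points) to "Large Language Models", "Instruction Tuning OR Alignment OR Value Alignment", "RLHF OR RLAIF OR Direct Preference Optimization OR DPO", "LLM Agents OR Autonomous Agents OR Agentic Workflow", and "Tool Use OR Function Calling OR API Tool Use", which cover its central techniques and application setting. Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not addressed in the paper and receive 0 points.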

!!! tip deepseek-chat TL;DR

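The paper addresses the lack of evaluation benchmarks for reward modeling of LLM agents in tool-use environments. It proposes Plan-RewardBench, a trajectory-level preference benchmark, and finds that existing reward models struggle markedly on long trajectories and need specialized training.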

摘要翻译

在经典的人类反馈强化学习（RLHF）中，奖励模型（Reward Models, RMs）是模型对齐的核心信号提供者。随着大语言模型（Large Language Models）演变为能够自主调用工具并进行复杂推理的智能体系统，奖励建模的范式正面临前所未有的挑战——其中最突出的是，缺乏专门用于评估工具集成环境中奖励模型能力的基准。为填补这一空白，我们提出了Plan-RewardBench，一个轨迹层面的偏好基准，旨在评估评判者（judges）在复杂工具使用场景中区分优选智能体轨迹与干扰轨迹的能力。Plan-RewardBench涵盖四个代表性任务类别——（i）安全拒绝，（ii）工具无关性/不可用性，（iii）复杂规划，以及（iv）鲁棒性错误恢复——包含经过验证的正向轨迹，以及通过多模型自然推演、基于规则的扰动和最小化编辑的LLM扰动构建的易混淆困难负例。我们在统一的成对评估协议下，对代表性奖励模型（生成式、判别式及LLM-as-Judge）进行基准测试，报告了不同轨迹长度和任务类别下的准确率趋势。此外，我们对常见的失败模式进行了诊断分析。我们的结果表明，所有三类评估模型均面临显著挑战，其在长视野轨迹上的性能急剧下降，这凸显了在智能体轨迹层面奖励建模中进行专门训练的必要性。最终，Plan-RewardBench旨在同时作为一个实用的评估工具集，以及一个可复用的构建智能体规划偏好数据的蓝图。

Abstract

In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges–most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families – (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery – comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.

Keywords: Reward Modeling, LLM Agents, Tool Use, Trajectory-level Preference, RLHF, Agentic Planning, Benchmark Evaluation, Complex Planning
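The unified pairwise protocol mentioned in the abstract can be illustrated with a small sketch. This is not the benchmark's implementation: the `judge` callable, the length buckets, and the toy trajectories are hypothetical stand-ins, and counting ties as misses is an assumption. The sketch only shows how pairwise accuracy could be reported per trajectory-length bucket.

```python
from collections import defaultdict


def length_bucket(length, edges=(5, 15)):
    """Map a trajectory length to a label such as '1-5', '6-15' or '>15'."""
    lower = 1
    for edge in edges:
        if length <= edge:
            return f"{lower}-{edge}"
        lower = edge + 1
    return f">{edges[-1]}"


def pairwise_accuracy_by_length(pairs, judge, edges=(5, 15)):
    """Fraction of pairs where the judge scores the preferred trajectory higher.

    pairs: iterable of (preferred, distractor); each trajectory is a list of steps.
    judge: callable mapping a trajectory to a float score (hypothetical interface).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for preferred, distractor in pairs:
        bucket = length_bucket(max(len(preferred), len(distractor)), edges)
        totals[bucket] += 1
        if judge(preferred) > judge(distractor):  # a tie counts as a miss
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def toy_judge(trajectory):
    """Toy scorer: counts the steps that are not tool errors."""
    return sum(1 for step in trajectory if step != "tool_error")


pairs = [
    (["plan", "call_tool", "answer"], ["plan", "tool_error", "answer"]),
    (["step"] * 20, ["step"] * 20),  # long pair the toy judge cannot separate
]
accuracy = pairwise_accuracy_by_length(pairs, toy_judge)
print(accuracy)
```

Bucketing by the longer trajectory of each pair is one simple way to expose the long-horizon degradation the abstract reports; a real evaluation would substitute an actual reward model or LLM judge for `toy_judge`.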

86. ❌ OceanMAE: A Foundation Model for Ocean Remote Sensing

Authors: Viola-Joanna Stamer, Panagiotis Agrafiotis, Behnood Rasti, Begüm Demir Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

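Scoring rationale: The paper proposes OceanMAE, a foundation model for ocean remote sensing, which falls under AI for Science (AI4Science); it is therefore highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). Its core is self-supervised pre-training: a masked autoencoder (MAE) integrates multispectral data with ocean descriptors, so it is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points). The paper does not involve large language models (LLMs), fine-tuning, inference acceleration, agents, or the remaining keywords, so all other keywords score 0.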

!!! tip deepseek-chat TL;DR

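To address limited labeled data and the poor transferability of land-pretrained models in ocean remote sensing, the paper proposes OceanMAE, a self-supervised pre-trained foundation model that integrates physical ocean descriptors. Experiments show the strongest gains on marine segmentation, where an ablation on MARIDA indicates an improvement over a standard MAE, while bathymetry gains are competitive and task-dependent.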

摘要翻译

精确的海洋测绘对于水深估算、海底特征描述、海洋垃圾检测及生态系统监测等应用至关重要。然而，海洋遥感（RS）仍受限于标注数据的稀缺，以及主要基于陆地主导的地球观测影像预训练模型的可迁移性不足。本文提出OceanMAE，一种专为海洋设计的掩码自编码器，其在自监督学习过程中通过整合多光谱哨兵-2号观测数据与具有物理意义的海洋描述符，扩展了标准MAE的预训练方式。通过融入这些辅助海洋特征，OceanMAE旨在从大规模无标注数据中学习更具信息量和海洋感知能力的潜在表征。为将这些表征迁移至下游应用，我们进一步采用一种改进的基于UNet的框架，用于海洋分割和水深估算。在Hydro数据集上预训练后，OceanMAE在MADOS和MARIDA数据集上评估了海洋污染物与废弃物分割性能，并在MagicBathyNet数据集上评估了水深回归性能。实验表明，OceanMAE在海洋分割任务上取得了最显著的性能提升，而水深估算的效益则具有竞争力且依赖于具体任务。此外，在MARIDA数据集上与标准MAE进行的消融实验表明，在预训练阶段引入辅助海洋描述符能提升下游分割质量。这些发现凸显了基于物理先验且与领域对齐的自监督预训练在海洋遥感中的价值。代码与权重已公开于https://git.tu-berlin.de/joanna.stamer/SSLORS2。

Abstract

Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.

Keywords: OceanMAE, foundation model, ocean remote sensing, masked autoencoder, self-supervised pre-training, marine segmentation, bathymetry estimation, domain adaptation
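The masked-autoencoder pre-training described in the abstract can be sketched in a few lines. The abstract does not say how the ocean descriptors are fused with the Sentinel-2 bands, so `stack_channels` below is purely an assumption (simple concatenation), the 0.75 mask ratio is a commonly used MAE setting rather than a value taken from the paper, and all function names are hypothetical.

```python
import random


def split_patches(num_patches, mask_ratio=0.75, seed=0):
    """Randomly choose which patch indices are masked and which stay visible."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


def stack_channels(spectral_patch, descriptor_patch):
    """Assumed fusion: append ocean-descriptor values to the spectral values."""
    return list(spectral_patch) + list(descriptor_patch)


def masked_reconstruction_loss(predicted, target, masked):
    """Mean squared error computed over the masked patches only."""
    squared_errors = [
        (p - t) ** 2
        for i in masked
        for p, t in zip(predicted[i], target[i])
    ]
    return sum(squared_errors) / len(squared_errors)


visible, masked = split_patches(num_patches=16)
# Toy data: two spectral values plus one descriptor value per patch.
target = [stack_channels([0.2, 0.4], [1.0]) for _ in range(16)]
predicted = [[0.0, 0.0, 0.0] for _ in range(16)]
loss = masked_reconstruction_loss(predicted, target, masked)
print(len(visible), len(masked), loss)
```

Only the visible patches would be fed to the encoder, and the loss is taken on the masked ones, which is what pushes the model to infer ocean structure from context rather than copy its input.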

87. ❌ Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence

Authors: Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

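Scoring rationale: The paper's core topic is the brittleness of LLM alignment, and it proposes activation steering as a runtime defense, so it is highly relevant to "Large Language Models" and "Alignment" (10 points). It corrects misaligned behavior through interventions in activation space, which relates to some extent to "Self-Correction" and "Mechanistic Interpretability" (5 points). It evaluates honesty, which relates to some extent to "Hallucination Mitigation" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are not addressed in the paper (0 points).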

!!! tip deepseek-chat TL;DR

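The paper studies the brittleness of alignment in large language models and evaluates three activation steering methods, two of them novel, as runtime defenses that recover the target traits (honesty and compassion) while preserving coherent generation.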

摘要翻译

大语言模型的对齐性比通常假设的更为脆弱：对抗性提示、良性微调、涌现性错位以及目标误泛化均可触发错位。近期证据表明，某些错位行为以激活空间中的线性结构形式编码，使得通过激活导向进行干预成为可能；同时研究表明，安全性对齐主要控制最初几个输出标记，后续生成则处于无防护状态。这些发现促使我们将激活导向作为一种轻量级运行时防御机制，在生成过程中持续纠正错位的激活向量。我们评估了三种方法：固定系数导向法（SwFC）采用均匀加性干预；以及两种新颖的投影感知方法——目标投影导向法（StTP）与镜像投影导向法（StMP），这两种方法利用逻辑回归决策边界，仅对激活值低于分布阈值的标记进行选择性干预。通过恶意系统提示作为错位的受控代理，我们在两种威胁模型（不诚实性与漠视性）和两种架构（Llama-3.3-70B-Instruct, Qwen3-32B）下进行评估。所有方法均能显著恢复目标特性（诚实性与同理心）同时保持文本连贯性。StTP与StMP能更好地维持通用能力（MMLU, MT-Bench, AlpacaEval），并在多轮对话中产生更少的重复内容。

Abstract

Alignment in LLMs is more brittle than commonly assumed: misalignment can be triggered by adversarial prompts, benign fine-tuning, emergent misalignment, and goal misgeneralization. Recent evidence suggests that some misalignment behaviors are encoded as linear structure in activation space, making it tractable via steering, while safety alignment has been shown to govern the first few output tokens primarily, leaving subsequent generation unguarded. These findings motivate activation steering as a lightweight runtime defense that continuously corrects misaligned activations throughout generation. We evaluate three methods: Steer-With-Fixed-Coeff (SwFC), which applies uniform additive steering, and two novel projection-aware methods, Steer-to-Target-Projection (StTP) and Steer-to-Mirror-Projection (StMP), that use a logistic regression decision boundary to selectively intervene only on tokens whose activations fall below distributional thresholds. Using malicious system prompts as a controlled proxy for misalignment, we evaluate under two threat models (dishonesty and dismissiveness) and two architectures (Llama-3.3-70B-Instruct, Qwen3-32B). All methods substantially recover target traits (honesty and compassion) while preserving coherence. StTP and StMP better maintain general capabilities (MMLU, MT-Bench, AlpacaEval) and produce less repetition in multi-turn conversations.

Keywords: Activation Steering, LLM Alignment, Runtime Defense, Misalignment Correction, Linear Structure, Logistic Regression, Coherence Preservation, Honesty and Compassion
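The three steering variants named in the abstract can be contrasted with a toy sketch. The abstract gives no formulas, so the update rules below are one reading of the method names, not the authors' definitions: the steering direction is assumed to be the weight vector of a logistic-regression probe, the threshold is assumed to be its decision boundary, and every function name and number is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(vector):
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def boundary_projection(weights, bias):
    """Projection onto the unit weight direction at which w.h + b = 0."""
    return -bias / math.sqrt(dot(weights, weights))


def steer_fixed(h, weights, coeff):
    """Uniform additive steering: always add coeff times the unit direction."""
    u = unit(weights)
    return [x + coeff * d for x, d in zip(h, u)]


def steer_to_target(h, weights, threshold, target):
    """If the projection is below threshold, move it to the target value."""
    u = unit(weights)
    p = dot(h, u)
    if p >= threshold:
        return list(h)  # token left untouched
    return [x + (target - p) * d for x, d in zip(h, u)]


def steer_to_mirror(h, weights, threshold):
    """If the projection is below threshold, reflect it across the threshold."""
    u = unit(weights)
    p = dot(h, u)
    if p >= threshold:
        return list(h)  # token left untouched
    return [x + 2.0 * (threshold - p) * d for x, d in zip(h, u)]


weights, bias = [2.0, 0.0], 0.0   # hypothetical probe parameters
threshold = boundary_projection(weights, bias)
misaligned = [-1.0, 2.0]          # projection -1.0, below the boundary
aligned = [0.5, 2.0]              # projection 0.5, above the boundary
print(steer_fixed(misaligned, weights, 3.0))
print(steer_to_target(misaligned, weights, threshold, 1.5))
print(steer_to_mirror(misaligned, weights, threshold))
```

In this reading, the selective variants leave `aligned` unchanged while the fixed-coefficient variant shifts every activation, which is one plausible explanation for why selective intervention would disturb general capabilities less.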

88. ❌ ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

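Scoring rationale: The paper proposes ViVa, a value model built on a pretrained video generator for robot reinforcement learning. It is unrelated to most keywords because it focuses on robot vision-language-action models and video generation rather than large language model techniques. It relates only to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points), because it uses a pretrained video generator, and to "World Models AND General World Models" (5 points), because the value model predicts future states in a way similar to a world model. Other keywords such as LLMs, MoE, SFT, and RLHF are not addressed.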

!!! tip deepseek-chat TL;DR

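The paper proposes ViVa, a video-generative value model that improves value estimation in robot reinforcement learning by exploiting the spatiotemporal priors of a pretrained video generator, and reports substantial improvements on a real-world box assembly task.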

摘要翻译

视觉-语言-动作（VLA）模型通过大规模预训练推动了机器人操作的发展，但由于部分可观测性和延迟反馈，其实际部署仍面临挑战。强化学习通过价值函数应对这一问题，价值函数评估任务进度并指导策略改进。然而，现有基于视觉-语言模型（VLM）构建的价值模型难以捕捉时序动态，从而损害了长视野任务中价值估计的可靠性。本文提出ViVa，一种视频生成式价值模型，其将预训练的视频生成器重新用于价值估计。ViVa以当前观测和机器人本体感知为输入，联合预测未来的本体感知及当前状态的标量价值。通过利用预训练视频生成器的时空先验，我们的方法将价值估计建立在预期的具身动态之上，超越了静态快照，将价值与前瞻性内在耦合。将ViVa集成至RECAP框架后，在真实世界箱子组装任务中实现了显著提升。对全部三项任务的定性分析证实，ViVa能产生更可靠的价值信号，准确反映任务进度。通过利用视频语料库的时空先验，ViVa还能泛化至新物体，凸显了视频生成模型在价值估计领域的潜力。

Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.

Keywords: video-generative value model, robot reinforcement learning, vision-language-action models, value estimation, spatiotemporal priors, pretrained video generator, real-world box assembly, future proprioception prediction

89. ❌ Face-D$^2$CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection

Authors: Yushuo Zhang, Yu Cheng, Yongkang Hu, Jiuan Zhou, Jiawei Chen, Yuan Xie, Zhaoxia Yin Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

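Scoring rationale: The paper focuses on facial DeepFake detection using multi-domain synergistic representation and a dual continual learning mechanism (combining EWC and OGC). Although it involves continual learning, it mainly targets a computer vision task rather than innovations in large-model or deep learning principles. Only "Pre-training OR Continual Pre-training OR Domain Adaptation" relates to some extent to the paper's continual learning theme (5 points); the other keywords concern large models, language models, reasoning, alignment, compression, and similar topics and are unrelated (0 points). The paper does not cover applications of large models across domains or new techniques for them, so most keywords score 0.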

!!! tip deepseek-chat TL;DR

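The paper proposes the Face-D$^2$CL framework, which uses multi-domain synergistic representation and a dual continual learning mechanism to address insufficient feature representation and catastrophic forgetting in facial DeepFake detection. It surpasses existing methods in both stability and plasticity, with a 60.7% relative reduction in average detection error rate.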

摘要翻译

面部伪造技术的快速发展对公众信任与信息安全构成严重威胁，使得面部深度伪造检测成为关键的研究重点。持续学习为适应不断演变的伪造模式提供了一种有效途径。然而，现有方法在现实世界的持续学习场景中面临两大瓶颈：特征表示不足与灾难性遗忘。为解决这些问题，我们提出了Face-D(^2)CL，一种用于面部深度伪造检测的框架。该框架利用多域协同表示融合空域与频域特征，以全面捕捉多样化的伪造痕迹，并采用双重持续学习机制，结合了弹性权重固化（Elastic Weight Consolidation, EWC）——该技术区分了真实样本与伪造样本的参数重要性，以及正交梯度约束（Orthogonal Gradient Constraint, OGC）——确保针对特定任务的适配器更新不会干扰先前学到的知识。这种协同作用使模型能够在无需依赖历史数据回放的情况下，实现强大的抗遗忘能力与对新兴面部伪造范式的敏捷适应性之间的动态平衡。大量实验表明，我们的方法在稳定性与可塑性方面均超越了当前的最先进（SOTA）方法，平均检测错误率相对降低了60.7%。在未见过的伪造领域上，相较于当前SOTA方法，其平均检测AUC进一步提升了7.9%。

Abstract

The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D$^2$CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving a 60.7% relative reduction in average detection error rate. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.

Keywords: Facial DeepFake Detection, Continual Learning, Multi-domain Synergistic Representation, Elastic Weight Consolidation (EWC), Orthogonal Gradient Constraint (OGC), Catastrophic Forgetting, Feature Representation, Spatial and Frequency-domain Features
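The two continual-learning ingredients named in the abstract can be sketched generically. The code below shows the standard EWC quadratic penalty and a plain orthogonal gradient projection. It does not reproduce the paper's variants: how parameter importance is separated for real and fake samples is not specified in the abstract, so a single importance vector is assumed, and the projection assumes an orthonormal basis of protected directions. All names and values are hypothetical.

```python
def ewc_penalty(params, anchor, fisher, strength=1.0):
    """Standard EWC term: 0.5 * strength * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


def project_orthogonal(gradient, basis):
    """Remove the components of gradient that lie along protected directions.

    basis: orthonormal vectors spanning directions important to earlier tasks.
    """
    projected = list(gradient)
    for direction in basis:
        coeff = sum(g * d for g, d in zip(gradient, direction))
        projected = [x - coeff * d for x, d in zip(projected, direction)]
    return projected


# Toy numbers: the first parameter drifted by 1.0 and has importance 4.0.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], strength=2.0)
# Protect the first two coordinate axes; only the third component survives.
safe_step = project_orthogonal(
    [3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
print(penalty, safe_step)
```

The penalty discourages moving parameters that mattered for earlier tasks, while the projection guarantees that an update has no component along stored directions; neither needs replayed historical data, which matches the replay-free claim in the abstract.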

90. ❌ Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Authors: Longgang Zhang, Xiaowei Fu, Fuxiang Huang, Lei Zhang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

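Scoring rationale: The paper proposes a multimodal reasoning framework for encrypted traffic interpretation (mmTraffic) that combines a perception-centered traffic encoder with a cognition-centered LLM generator to achieve explainable traffic analysis. Its core is the application of LLMs to network security, in particular to generate readable, evidence-grounded traffic interpretation reports, so it is highly relevant to "Large Language Models" (10 points). It emphasizes explainability and the reasoning process, so it is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). It mentions "multimodal reasoning" and "chains of evidence", which relate to some extent to "Chain of Thought OR CoT Reasoning" and "System 2 Thinking OR Slow Thinking" (8 points each). It also mentions mitigating generative hallucinations, which relates to some extent to "Hallucination Mitigation OR Factuality OR Truthfulness" (8 points). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, and Quantization are not addressed in the paper and score 0.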

!!! tip deepseek-chat TL;DR

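Existing encrypted traffic analysis methods lack multidimensional semantic understanding and an explainable reasoning process. To address this, the paper proposes the BGTD benchmark, which combines raw bytes with expert annotations, and mmTraffic, an end-to-end multimodal traffic-language representation framework that autonomously generates high-fidelity, readable, evidence-grounded traffic interpretation reports while keeping classification accuracy competitive with specialized unimodal models.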

摘要翻译

网络流量作为一种关键媒介格式，对保障现代互联网基础设施的安全与通信至关重要。尽管现有方法展现出优异的性能，但仍面临两个关键瓶颈：（1）它们无法捕获单模态序列模式之外的多维语义；（2）其黑箱特性——即仅提供类别标签——缺乏可审计的推理过程。我们发现一个关键因素：现有网络流量数据集主要面向分类任务设计，本质上缺乏丰富的语义标注，无法生成人类可读的证据报告。为应对数据稀缺问题，本文首次提出字节锚定的流量描述基准（Byte-Grounded Traffic Description, BGTD），将原始字节流与结构化专家标注相结合。BGTD为可解释加密流量解析的多模态推理提供了必要的行为特征与可验证的证据链。基于BGTD，本文提出端到端的流量-语言表征框架（mmTraffic），这是一种连接物理流量编码与语义解析的多模态推理架构。为缓解模态干扰与生成幻觉问题，mmTraffic采用联合优化的感知-认知架构：通过融合以感知为中心的流量编码器和以认知为中心的大语言模型生成器，实现了在保证类别预测准确性的同时完成精细化流量解析。大量实验表明，mmTraffic能自主生成高保真、人类可读且基于证据的流量解析报告，同时在分类准确率上与专用单模态模型（如NetMamba）相比保持高度竞争力。源代码已发布于https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark

Abstract

Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) they fail to capture multidimensional semantics beyond unimodal sequence patterns; (2) their black-box nature, i.e., providing only category labels, lacks an auditable reasoning process. We identify a key factor: existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, so they cannot support the generation of human-readable evidence reports. To address data scarcity, this paper proposes a Byte-Grounded Traffic Description (BGTD) benchmark for the first time, combining raw bytes with structured expert annotations. BGTD provides necessary behavioral features and verifiable chains of evidence for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end-to-end traffic-language representation framework (mmTraffic), a multimodal reasoning architecture bridging physical traffic encoding and semantic interpretation. To alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly-optimized perception-cognition architecture. By incorporating a perception-centered traffic encoder and a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports, while maintaining highly competitive classification accuracy compared to specialized unimodal models (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark

Keywords: Multimodal Reasoning, LLM, Encrypted Traffic Interpretation, Explainable AI, Traffic-Language Representation, Byte-Grounded Traffic Description, Perception-Cognition Architecture, Evidence-Grounded Reports

91. ❌ Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

Authors: Baihui Liu, Kaiyuan Tian, Wei Wang, Zhaoning Zhang, Linbo Qiao, Dongsheng Li Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

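Scoring rationale: The paper's core topic is optimizing inference efficiency for MoE architectures, so it is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (15 points), since MoE is its central architecture. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because it uses LLMs (such as DeepSeek-V2-Lite) as experimental subjects. It is relevant to "Speculative Decoding OR Inference Acceleration" (10 points) because it aims to accelerate MoE inference (prefill and decode speedups). The other keywords are unrelated to the paper (0 points): it focuses on MoE inference optimization and does not cover model training, alignment, agents, scientific applications, or other topics.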

!!! tip deepseek-chat TL;DR

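The paper targets the latency bottleneck caused by too many expert activations during Mixture-of-Experts inference. It proposes the Alloc-MoE framework, which optimizes budget allocation at the layer and token levels, maintaining model performance under a constrained activation budget while accelerating inference.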

摘要翻译

混合专家模型因其稀疏激活机制，已成为扩展大语言模型的主流架构。然而，推理过程中大量的专家激活数量造成了关键的延迟瓶颈，尤其在资源受限的部署场景中。现有减少专家激活的方法可能导致严重的模型性能下降。在本研究中，我们引入激活预算的概念，将其作为专家激活数量的约束条件，并提出Alloc-MoE这一统一框架。该框架在层级和令牌级别协同优化预算分配，以最小化性能损失。在层级方面，我们提出Alloc-L方法，它利用敏感性分析和动态规划来确定各层间专家激活的最优分配。在令牌级别，我们提出Alloc-T方法，该方法基于路由分数动态重新分配激活，在不增加延迟的情况下优化预算分配。在多个混合专家模型上的大量实验表明，Alloc-MoE能在受限的激活预算下保持模型性能。特别地，在原始预算一半的条件下，Alloc-MoE在DeepSeek-V2-Lite模型上实现了$1.15\times$的预填充加速和$1.34\times$的解码加速。

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations potentially lead to severe model performance degradation. In this work, we introduce the concept of activation budget as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that jointly optimizes budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. In particular, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.

Keywords: Mixture-of-Experts, MoE, inference efficiency, activation budget, expert activation, budget allocation, DeepSeek-V2-Lite, latency bottleneck
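
The abstract describes Alloc-L as combining sensitivity profiling with dynamic programming to split an activation budget across layers. A minimal sketch of that idea follows. It is not the authors' implementation: the function name `allocate_layer_budget`, the shape of the `cost` table, and the toy numbers are all assumptions made for illustration, and the paper's actual objective and constraints may differ.

```python
from typing import List, Tuple

INF = float("inf")


def allocate_layer_budget(cost: List[List[float]], budget: int) -> Tuple[float, List[int]]:
    """Pick experts-per-layer k_l (1-based) minimizing total profiled degradation.

    cost[l][k - 1] is the (hypothetical) degradation measured when layer l
    activates k experts per token. The returned allocation sums to `budget`.
    """
    # best[b]: minimal degradation over the layers seen so far using exactly b activations.
    best = [0.0] + [INF] * budget
    picks_per_layer = []
    for layer_cost in cost:
        new_best = [INF] * (budget + 1)
        picks = [0] * (budget + 1)
        for used, so_far in enumerate(best):
            if so_far == INF:
                continue
            for k, degradation in enumerate(layer_cost, start=1):
                if used + k > budget:
                    break
                if so_far + degradation < new_best[used + k]:
                    new_best[used + k] = so_far + degradation
                    picks[used + k] = k
        best = new_best
        picks_per_layer.append(picks)
    if best[budget] == INF:
        raise ValueError("budget cannot be met with this profile")
    # Walk backwards through the recorded choices to recover the allocation.
    allocation, remaining = [], budget
    for picks in reversed(picks_per_layer):
        allocation.append(picks[remaining])
        remaining -= picks[remaining]
    allocation.reverse()
    return best[budget], allocation


# Toy profile: layer 0 is sensitive, layer 2 is robust (made-up numbers).
profile = [
    [0.90, 0.20, 0.00],
    [0.30, 0.10, 0.00],
    [0.05, 0.01, 0.00],
]
total, allocation = allocate_layer_budget(profile, budget=6)
print(allocation, round(total, 3))
```

In this toy profile a uniform split of two experts per layer would cost 0.31, while the dynamic program shifts activations toward the sensitive layer and reaches 0.15 with the same total budget.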

92. ❌ LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows

Authors: Lingyun Yang, Suyi Li, Tianyu Feng, Xiaoxiao Jiang, Zhipeng Di, Weiyi Lu, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

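Scoring rationale: The paper “LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows” focuses on optimizing inference serving systems for diffusion models, specifically text-to-image generation workflows. Its core contribution is a micro-serving architecture (LegoDiffusion) that decomposes a diffusion workflow into loosely coupled model-execution nodes, enabling independent resource management, model sharing, and adaptive parallelism. However, all of the scoring keywords center on large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, agents, and so on) or on specific scientific applications (such as AI for Science). Although the paper concerns diffusion models, a class of generative AI models, it does not discuss LLMs, MoE, SFT, alignment, reasoning, agents, quantization, or the other techniques the keywords target. Diffusion models differ markedly from LLMs in architecture, training, and application, so the paper has no direct connection to any keyword.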

!!! tip deepseek-chat TL;DR

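Existing serving systems treat an entire text-to-image diffusion workflow as an opaque monolith, which leads to coarse-grained resource management and prevents model sharing. The paper proposes LegoDiffusion, which decomposes a workflow into independently managed micro-serving nodes, sustaining up to 3x higher request rates and tolerating up to 8x higher burst traffic.

Abstract (translation)

Text-to-image generation executes a diffusion workflow that comprises multiple models centered on a base diffusion model. Existing serving systems treat each workflow as an opaque monolith and provision, deploy, and scale all constituent models together, which obscures internal dataflow, prevents model sharing, and results in coarse-grained resource management. This paper makes the case for micro-serving diffusion workflows with LegoDiffusion, a system that decouples a workflow into loosely coupled model-execution nodes that can be managed and scheduled independently. By explicitly managing individual model inference, LegoDiffusion enables cluster-level optimizations, including per-model scaling, model sharing, and adaptive model parallelism. Overall, LegoDiffusion outperforms existing diffusion workflow serving systems, sustaining up to 3x higher request rates and withstanding up to 8x higher burst traffic.

Abstract (original)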

Text-to-image generation executes a diffusion workflow comprising multiple models centered on a base diffusion model. Existing serving systems treat each workflow as an opaque monolith, provisioning, placing, and scaling all constituent models together, which obscures internal dataflow, prevents model sharing, and enforces coarse-grained resource management. In this paper, we make a case for micro-serving diffusion workflows with LegoDiffusion, a system that decomposes a workflow into loosely coupled model-execution nodes that can be independently managed and scheduled. By explicitly managing individual model inference, LegoDiffusion unlocks cluster-scale optimizations, including per-model scaling, model sharing, and adaptive model parallelism. Collectively, LegoDiffusion outperforms existing diffusion workflow serving systems, sustaining up to 3x higher request rates and tolerating up to 8x higher burst traffic.

Keywords: text-to-image generation, diffusion workflows, micro-serving, model sharing, adaptive model parallelism, inference acceleration, resource management, workflow decomposition
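
To make the contrast between monolithic and per-model scaling concrete, here is a toy capacity calculation. It is only a sketch under stated assumptions: the model names, per-replica throughputs, workflow definitions, and the simple ceiling-based sizing rule are hypothetical and are not taken from LegoDiffusion.

```python
import math
from typing import Dict

# Hypothetical per-replica throughput (requests/s) for each model.
CAPACITY: Dict[str, float] = {
    "text_encoder": 50.0,
    "base_diffusion": 2.0,
    "controlnet": 4.0,
    "vae_decoder": 20.0,
}

# Hypothetical workflows: the models each one runs and its request rate (requests/s).
WORKFLOWS: Dict[str, dict] = {
    "plain": {"models": ["text_encoder", "base_diffusion", "vae_decoder"], "rate": 3.0},
    "controlled": {
        "models": ["text_encoder", "controlnet", "base_diffusion", "vae_decoder"],
        "rate": 3.0,
    },
}


def monolithic_instances(workflows: Dict[str, dict], capacity: Dict[str, float]) -> int:
    """Each workflow scales as a unit; every replica carries all of its models."""
    total = 0
    for wf in workflows.values():
        bottleneck = min(capacity[m] for m in wf["models"])
        replicas = math.ceil(wf["rate"] / bottleneck)
        total += replicas * len(wf["models"])
    return total


def per_model_replicas(workflows: Dict[str, dict], capacity: Dict[str, float]) -> Dict[str, int]:
    """Models are shared across workflows and scaled by their aggregate demand."""
    demand: Dict[str, float] = {}
    for wf in workflows.values():
        for m in wf["models"]:
            demand[m] = demand.get(m, 0.0) + wf["rate"]
    return {m: math.ceil(rate / capacity[m]) for m, rate in demand.items()}


mono = monolithic_instances(WORKFLOWS, CAPACITY)
micro = per_model_replicas(WORKFLOWS, CAPACITY)
print(mono, micro, sum(micro.values()))
```

With these made-up numbers the monolithic layout needs 14 model instances, while sharing and per-model scaling need 6, because only the slow base diffusion model has to be replicated.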

93. ❌ Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Authors: Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

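Scoring rationale: Uni-ViGU proposes a framework that unifies video generation and understanding. (1) It introduces a modality-driven framework based on MoE (Mixture of Experts) that adds lightweight layers to Transformer blocks for text generation while preserving generative priors; this is highly relevant to ‘Mixture of Experts OR MoE OR Sparse Models’, hence 10 points. (2) The paper concentrates on the multimodal unification of video generation and understanding, involving diffusion models, flow matching, and bidirectional training, but does not touch on other keyword topics such as LLMs, scaling laws, RLHF, RAG, or quantization, so those keywords receive 0 points. (3) Although the paper involves AI techniques, it does not target a scientific domain (such as bioinformatics), so the ‘AI for Science’ keyword receives 0 points.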

!!! tip deepseek-chat TL;DR

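Uni-ViGU addresses the imbalance in computational cost between video generation and understanding by extending a video generator as the foundation and combining it with an MoE architecture and a bidirectional training mechanism, achieving competitive performance on both video generation and understanding tasks.

Abstract (translation)

Unified multimodal models that integrate visual understanding and generation face a fundamental challenge: visual generation is far more computationally expensive than understanding, especially for video. This imbalance motivates us to invert the conventional paradigm: instead of extending understanding-centric multimodal large language models to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven Mixture-of-Experts framework that augments Transformer blocks with lightweight layers to support text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a two-stage bidirectional training mechanism: the Knowledge Recall stage reconstructs input prompts to exploit learned text-video correspondences, and the Capability Refinement stage fine-tunes on fine-grained captions to build discriminative shared representations. Experiments show that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project page and code: https://fr0zencrane.github.io/uni-vigu-page/.

Abstract (original)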

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.

Keywords: Unified Video Generation and Understanding, Diffusion-Based Video Generator, Modality-Driven MoE Framework, Continuous Flow Matching, Bidirectional Training Mechanism, Knowledge Recall, Capability Refinement, Multimodal Intelligence
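
For readers unfamiliar with continuous flow matching, one commonly used formulation (a linear-interpolation, rectified-flow-style objective) is sketched below. The abstract does not state Uni-ViGU's exact objective, so the symbols $x_0$, $x_1$, $v_\theta$ and the interpolation path are assumptions for illustration only; the discrete flow matching applied to text is not shown.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1], \quad x_0 \sim \mathcal{N}(0, I), \quad x_1 \sim p_{\text{data}}
$$

$$
\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\left\lVert v_\theta(x_t, t) - (x_1 - x_0) \right\rVert^2\right]
$$

Here the network $v_\theta$ regresses the constant velocity $x_1 - x_0$ of the straight path from noise to data; sampling then integrates $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t)$ from $t = 0$ to $t = 1$.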

94. ❌ Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

Authors: Gyuho Shim, Seongtae Hong, Heuiseok Lim Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

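Scoring rationale: The paper explicitly discusses the use of LLMs in Document AI and builds its OCR correction framework on LLMs, so it is highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). It does not cover the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, so those keywords receive 0 points.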

!!! tip deepseek-chat TL;DR

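The paper proposes Revise, a framework that uses large language models (LLMs) to systematically correct OCR errors at the character, word, and structural levels. It trains the correction model with a hierarchical error taxonomy and a synthetic data generation strategy, and significantly improves downstream performance on document retrieval and question answering.

Abstract (translation)

Recent advances in large language models (LLMs) have significantly advanced the field of Document AI, with remarkable performance on document understanding tasks such as question answering. However, existing approaches focus mainly on solving specific tasks and lack the ability to organize and manage document information in a structured way. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by optical character recognition (OCR) at the character, word, and structural levels. Specifically, Revise adopts a hierarchical taxonomy covering common OCR errors and uses a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results show that Revise effectively corrects OCR output, enabling a more structured representation and systematic management of document content. As a result, our method significantly improves downstream performance on document retrieval and question answering, highlighting its potential to overcome the structural information management limitations of existing Document AI frameworks.

Abstract (original)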

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.

Keywords: Large Language Models, Document AI, OCR error correction, synthetic data generation, document retrieval, question answering, hierarchical taxonomy, structured document management
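
The abstract mentions simulating OCR errors at the character, word, and structural levels to build training data. The sketch below shows one simple way such noisy/clean pairs could be generated. It is an illustration only: the confusion table, the three corruption functions, and `make_training_pair` are hypothetical and do not reproduce the taxonomy or generator used by Revise. The input is assumed to be a non-empty list of lines.

```python
import random
from typing import List, Tuple

# Hypothetical confusion table; real confusions depend on the OCR engine and script.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5", "e": "c"}


def corrupt_characters(text: str, rate: float, rng: random.Random) -> str:
    """Character level: swap a glyph for a visually similar one."""
    return "".join(
        CHAR_CONFUSIONS[ch] if ch in CHAR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def corrupt_words(text: str, rate: float, rng: random.Random) -> str:
    """Word level: drop a space so that two adjacent words merge."""
    words = text.split(" ")
    merged = [words[0]]
    for word in words[1:]:
        if rng.random() < rate:
            merged[-1] += word
        else:
            merged.append(word)
    return " ".join(merged)


def corrupt_structure(lines: List[str], rate: float, rng: random.Random) -> List[str]:
    """Structure level: lose a line break so that a line runs into the previous one."""
    out = [lines[0]]
    for line in lines[1:]:
        if rng.random() < rate:
            out[-1] += " " + line
        else:
            out.append(line)
    return out


def make_training_pair(lines: List[str], rate: float, seed: int) -> Tuple[List[str], List[str]]:
    """Return (noisy, clean); a correction model would learn to map noisy back to clean."""
    rng = random.Random(seed)
    noisy = corrupt_structure(lines, rate, rng)
    noisy = [corrupt_characters(corrupt_words(line, rate, rng), rate, rng) for line in noisy]
    return noisy, lines


clean = ["Some cell", "Total 100 items"]
noisy, target = make_training_pair(clean, rate=0.3, seed=0)
print(noisy)
```

Keeping each error level in its own function makes it easy to vary the mix of character, word, and structure noise per sample.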

95. ❌ Small Vision-Language Models are Smart Compressors for Long Video Understanding

Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

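Scoring rationale: The paper's core topic is the use of Small Vision-Language Models (SVLMs) as compressors for long video understanding, so it is highly relevant to ‘Small Language Models OR SLMs OR On-device AI’ (10 points), since the SVLM is the central component. It is relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (8 points) because it involves adapting and comparing against MLLMs. It is relevant to ‘Context Window Extension OR Long Context LLMs’ (8 points) because it tackles the context limits of long video. It has some connection to ‘Quantization OR Model Compression OR Low-bit Weights’ (5 points) because it involves compression, although that is not the core topic. Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper (0 points).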

!!! tip deepseek-chat TL;DR

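The paper proposes Tempo, a framework that uses a small vision-language model as a compressor to overcome context limits in long video understanding. Adaptive token allocation enables efficient compression, and Tempo outperforms GPT-4o and Gemini 1.5 Pro on LVBench.

Abstract (translation)

Adapting multimodal large language models to long videos is bottlenecked by context length. Dense visual streams exhaust the token budget and aggravate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity, discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo uses a small vision-language model as a local temporal compressor, casting token reduction as an early cross-modal distillation that produces compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce an adaptive token allocation mechanism. It exploits the small vision-language model's zero-shot relevance prior and semantic front-loading to act as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundant content into minimal temporal anchors that preserve the global storyline. Extensive experiments show that our 6B-parameter architecture achieves state-of-the-art performance under aggressive dynamic compression. On the extremely long LVBench, Tempo scores 52.3 under a strict 8K visual token budget, surpassing GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos well below theoretical limits, showing that true long-form video understanding depends on intent-driven efficiency rather than greedily padded context windows.

Abstract (original)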

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM’s zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

Keywords: Small Vision-Language Models, Long Video Understanding, Context Compression, Adaptive Token Allocation, Multimodal Large Language Models, Temporal Compression, Query-aware Framework, Zero-shot Relevance
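
The abstract describes allocating dense token bandwidth to query-critical segments while keeping minimal anchors elsewhere, within 0.5 to 16 tokens per frame. The sketch below illustrates a budget-constrained, relevance-weighted allocation of that kind. It is not Tempo's ATA router, which the abstract describes as a causal, training-free $O(1)$ mechanism; this is a simple offline water-filling scheme, and the function name, inputs, and example numbers are assumptions.

```python
import math
from typing import List


def allocate_tokens(frames: List[int], relevance: List[float], budget: int,
                    min_rate: float = 0.5, max_rate: float = 16.0) -> List[int]:
    """Split `budget` visual tokens across segments.

    Every segment keeps a floor of min_rate tokens per frame (a temporal anchor);
    the rest is shared in proportion to relevance * frames, capped at max_rate per frame.
    """
    floors = [min_rate * f for f in frames]
    caps = [max_rate * f for f in frames]
    if sum(floors) > budget:
        raise ValueError("budget is too small to keep the minimal anchors")
    alloc = floors[:]
    remaining = budget - sum(floors)
    active = [i for i, r in enumerate(relevance) if r > 0]
    while remaining > 1e-9 and active:
        weight = sum(relevance[i] * frames[i] for i in active)
        spent, uncapped = 0.0, []
        for i in active:
            share = remaining * relevance[i] * frames[i] / weight
            give = min(share, caps[i] - alloc[i])
            alloc[i] += give
            spent += give
            if caps[i] - alloc[i] > 1e-9:
                uncapped.append(i)
        remaining -= spent
        if len(uncapped) == len(active):
            break  # nobody hit a cap, so the whole remainder was handed out
        active = uncapped
    # Round down to whole tokens; the small epsilon guards against float error.
    return [math.floor(a + 1e-9) for a in alloc]


# Four 100-frame segments; only the middle two are relevant to the query (made-up scores).
tokens = allocate_tokens([100, 100, 100, 100], [0, 1, 9, 0], budget=1000)
print(tokens, sum(tokens))
```

Irrelevant segments keep only their anchor tokens, while the most relevant segment receives most of the remaining budget.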

96. ❌ TADP-RME: A Trust-Adaptive Differential Privacy Framework for Enhancing Reliability of Data-Driven Systems

Authors: Labani Halder, Payel Sadhukhan, Sarbani Palit Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08113v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

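Scoring rationale: The paper focuses on differential privacy and data reliability, which belongs to privacy-preserving technology and is unrelated to all of the scoring keywords (which center on large models, deep learning techniques, and their applications). It does not involve any large-model architecture, training method, inference optimization, alignment technique, agent system, or AI-for-science application.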

!!! tip deepseek-chat TL;DR

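The paper proposes TADP-RME, a trust-adaptive differential privacy framework that dynamically adjusts the privacy budget and applies reverse manifold embedding, improving the reliability of data-driven systems in adversarial settings and lowering the success rate of inference attacks.

Abstract (translation)

Ensuring reliability in adversarial settings requires treating privacy as a foundational component of data-driven systems. Differential privacy and cryptographic protocols offer strong guarantees, but existing schemes rely on a fixed privacy budget, which leads to a rigid utility-privacy trade-off that cannot adapt to heterogeneous user trust. In addition, noise-only differential privacy preserves geometric structure, which inference attacks exploit to cause privacy leakage.
We propose TADP-RME (Trust-Adaptive Differential Privacy with Reverse Manifold Embedding), a framework that enhances system reliability across different levels of user trust. It introduces an inverse trust score in the range [0,1] to adaptively modulate the privacy budget, enabling smooth transitions between utility and privacy. Reverse Manifold Embedding, in turn, applies a nonlinear transformation to disrupt local geometric relationships while preserving formal differential privacy guarantees through post-processing.
Theoretical and empirical results show that the method improves the privacy-utility trade-off, reducing attack success rates by up to 3.1 percent without significantly degrading data utility. The framework consistently outperforms existing methods against inference attacks, providing a unified approach to reliable learning in adversarial environments.

Abstract (original)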

Ensuring reliability in adversarial settings necessitates treating privacy as a foundational component of data-driven systems. While differential privacy and cryptographic protocols offer strong guarantees, existing schemes rely on a fixed privacy budget, leading to a rigid utility-privacy trade-off that fails under heterogeneous user trust. Moreover, noise-only differential privacy preserves geometric structure, which inference attacks exploit, causing privacy leakage. We propose TADP-RME (Trust-Adaptive Differential Privacy with Reverse Manifold Embedding), a framework that enhances reliability under varying levels of user trust. It introduces an inverse trust score in the range [0,1] to adaptively modulate the privacy budget, enabling smooth transitions between utility and privacy. Additionally, Reverse Manifold Embedding applies a nonlinear transformation to disrupt local geometric relationships while preserving formal differential privacy guarantees through post-processing. Theoretical and empirical results demonstrate improved privacy-utility trade-offs, reducing attack success rates by up to 3.1 percent without significant utility degradation. The framework consistently outperforms existing methods against inference attacks, providing a unified approach for reliable learning in adversarial environments.

Keywords: differential privacy, trust-adaptive, reverse manifold embedding, privacy-utility trade-off, inference attacks, data-driven systems, adversarial environments
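
The idea of letting an inverse trust score in [0,1] modulate the privacy budget can be illustrated with the standard Laplace mechanism. The sketch below is an assumption-laden example rather than the TADP-RME method: the linear mapping in `epsilon_for`, the bounds `eps_min` and `eps_max`, and the choice of Laplace noise are hypothetical, and Reverse Manifold Embedding is not implemented.

```python
import random


def epsilon_for(inverse_trust: float, eps_min: float = 0.1, eps_max: float = 2.0) -> float:
    """Assumed linear map: score 0 (fully trusted) -> eps_max, score 1 -> eps_min."""
    if not 0.0 <= inverse_trust <= 1.0:
        raise ValueError("inverse trust score must lie in [0, 1]")
    return eps_max - inverse_trust * (eps_max - eps_min)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sampled as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(value: float, sensitivity: float, inverse_trust: float,
              rng: random.Random) -> float:
    """Laplace mechanism whose noise scale is sensitivity / epsilon(inverse_trust)."""
    return value + laplace_noise(sensitivity / epsilon_for(inverse_trust), rng)


rng = random.Random(0)
trusted = [abs(privatize(0.0, 1.0, 0.0, rng)) for _ in range(2000)]
untrusted = [abs(privatize(0.0, 1.0, 1.0, rng)) for _ in range(2000)]
print(sum(trusted) / 2000, sum(untrusted) / 2000)
```

A smaller epsilon for less trusted users means a larger noise scale, so the released value is noisier; intermediate scores interpolate smoothly between the two extremes.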

97. ❌ OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

Authors: Seungjae Moon, Seunghyun Oh, Youngmin Ro Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08110v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

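Scoring rationale: The paper focuses on open-vocabulary semantic segmentation in computer vision and proposes OV-Stitcher, a training-free framework that achieves global attention by stitching sub-image features. Although it draws on the knowledge of large pretrained models (large vision and vision-language models), the work is mainly about feature representation and segmentation in vision models and has no direct connection to most of the LLM-related techniques in the keyword list (MoE, SFT, RLHF, RAG, and so on). It has only some connection to ‘Pre-training OR Continual Pre-training OR Domain Adaptation’ (5 points), because it exploits pretrained model knowledge without examining pre-training techniques themselves. All other keywords are unrelated (0 points).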

!!! tip deepseek-chat TL;DR

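The paper addresses the lack of global attention caused by the sliding-window strategy in training-free open-vocabulary semantic segmentation. It proposes OV-Stitcher, a framework that stitches sub-image features to aggregate global context, and improves mIoU across eight benchmarks.

Abstract (translation)

Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention because it performs dense prediction using the pretrained knowledge of large vision and vision-language models without additional training. However, because these pretrained encoders have limited input resolution, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. Although this handles high-resolution inputs effectively, it prevents global attention over the whole image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from the fragmented sub-image features, OV-Stitcher achieves global attention within the final encoder block, yielding coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluation on eight benchmarks shows that OV-Stitcher provides a scalable and effective solution for open-vocabulary segmentation, raising mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

Abstract (original)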

Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

Keywords: Open-vocabulary semantic segmentation, Training-free framework, Global attention, Feature stitching, Large vision models, Context-aware, Semantic alignment, mIoU improvement
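
The difference between per-window attention and attention over stitched features can be shown with a tiny example. This is a simplified sketch, not OV-Stitcher's procedure: it uses single-head attention with identity projections, made-up two-dimensional token features, and plain concatenation of two non-overlapping windows, all of which are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, List[float]]:
    """Single-head scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return output, weights


# Two sliding-window crops, each with two token features (made-up numbers).
window_a = [[1.0, 0.0], [0.8, 0.2]]
window_b = [[0.9, 0.1], [0.0, 1.0]]
query = window_a[0]

# Sliding-window baseline: the query only sees tokens from its own crop.
local_out, local_w = attend(query, window_a, window_a)

# Stitched variant: keys and values from all crops are concatenated first.
stitched = window_a + window_b
global_out, global_w = attend(query, stitched, stitched)
print(len(local_w), len(global_w))
```

In the stitched case part of the attention mass lands on tokens from the other crop, which is the cross-window context that independent sliding windows cannot provide.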

98. ❌ AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Authors: Imane Momayiz, Soufiane Ait Elaouad, Abdeljalil Elmajjodi, Haitame Bouanane Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

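Scoring rationale: The paper builds the first open-source Darija OCR model by fine-tuning a 3B-parameter vision language model (VLM). The core relevant keywords are: (1) ‘Post-training OR Supervised Fine-tuning OR SFT’ (10 points), because the paper explicitly uses a fine-tuning strategy; (2) ‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’ (10 points), because it uses QLoRA for parameter-efficient training; (3) ‘Large Language Models OR LLMs OR Foundation Models’ (8 points), because it is based on Qwen2.5-VL 3B, which counts as a foundation model. The other keywords are unrelated to the paper and receive 0 points.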

!!! tip deepseek-chat TL;DR

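The paper addresses the lack of dedicated OCR tools for Darija, the Moroccan Arabic dialect. By fine-tuning a 3B-parameter vision language model with the parameter-efficient QLoRA method, it builds AtlasOCR, the first open-source Darija OCR model, which achieves state-of-the-art performance on multiple benchmarks.

Abstract (translation)

Darija, the Moroccan Arabic dialect, is rich in visual content yet has long lacked dedicated optical character recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model, built by fine-tuning a 3B-parameter vision language model (VLM). We describe the full pipeline from data construction to efficient fine-tuning: we build a unique Darija-specific dataset by combining synthetic generation with our in-house OCRSmith library and carefully collected real-world data; we apply parameter-efficient fine-tuning to Qwen2.5-VL 3B using QLoRA and Unsloth, and optimize key hyperparameters through systematic ablation studies. Evaluation on the newly constructed AtlasOCRBench and the established KITAB-Bench shows state-of-the-art performance that challenges larger models and highlights AtlasOCR's robustness and generalization on both Darija and standard Arabic OCR tasks.

Abstract (original)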

Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR’s robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

Keywords: Darija OCR, Vision Language Model, Fine-tuning, QLoRA, Parameter-efficient training, Open-source model, OCR benchmark, Multilingual OCR
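
To show why LoRA-style adapters are called parameter-efficient, the arithmetic below counts trainable adapter parameters against a frozen weight matrix and estimates the storage of 4-bit base weights. The layer shape and rank are hypothetical and are not taken from Qwen2.5-VL or from the AtlasOCR configuration; the 4-bit estimate ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


def four_bit_weight_bytes(num_params: int) -> float:
    """Storage for 4-bit weights only; ignores quantization constants and activations."""
    return num_params * 0.5


# Hypothetical 2048 x 2048 projection with rank 16 (not the actual Qwen2.5-VL shapes).
d_in = d_out = 2048
rank = 16
adapter = lora_trainable_params(d_in, d_out, rank)
frozen = d_in * d_out
print(adapter, frozen, adapter / frozen)
print(four_bit_weight_bytes(3_000_000_000) / 1e9, "GB for a 3B-parameter base at 4 bits")
```

For this example layer the adapter holds about 1.6% of the parameters of the frozen matrix, and a 3B-parameter base stored at 4 bits needs roughly 1.5 GB for its weights.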

99. ❌ ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Authors: Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
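The table above can look contradictory: two keywords are rated 10.0/10 yet contribute 0.0 to the score. The sketch below shows one scoring rule that reproduces this outcome. The per-keyword cap of 15 and the threshold formula `5.0 + 0.8 × total weight` come from the report's configuration section. The per-keyword rule itself (`weight × relevance / 10 × cap`) is an assumption made for illustration, because the report does not state how the score column is computed.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report's scoring settings


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: scale relevance (0-10) to the cap, then apply the weight."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def passing_score(total_weight: float) -> float:
    """Threshold formula stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


# Rows as (weight, relevance): two keywords rated 10.0, the other 25 rated 0.0.
rows = [(0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25
total = sum(keyword_score(w, r) for w, r in rows)
passed = total >= passing_score(27.0)
```

Under any rule that multiplies by the weight, a listed weight of 0.0 forces every row to 0.0 regardless of relevance, which matches the total shown for this paper.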

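Scoring rationale: The paper's core contribution is an evaluation of implicit memory in LLM agents, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 each): it explicitly studies how LLM agents adapt their behavior automatically. Other keywords such as MoE, SFT, and RAG are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces ImplicitMemBench, the first benchmark to systematically evaluate implicit memory in LLM agents. It finds that current models are severely limited in automatic behavioral adaptation: the best model reaches only 65.3%, far below human baselines.

Abstract translation (Chinese)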

现有的大语言模型智能体记忆基准主要评估事实的显性回忆，却忽视了隐性记忆——即经验无需意识提取即可转化为自动化行为。这一缺失至关重要：高效的助手必须能自动应用习得程序或避免重复失败操作，而无需显性提示。我们推出ImplicitMemBench，首个通过三个认知科学基础构念系统评估隐性记忆的基准，这些构念源自标准认知科学对非陈述性记忆的阐释：程序性记忆（干扰后的一次性技能习得）、启动效应（通过配对实验/对照实例产生的主题驱动偏差）以及经典条件反射（条件刺激-无条件刺激关联对首次决策的塑造）。我们包含300个测试项的套件采用统一的学习/启动-干扰-测试协议，并以首次尝试准确率评分。对17个模型的评估揭示了严重局限：无模型总体准确率超过66%，表现最佳的DeepSeek-R1（65.3%）、Qwen3-32B（64.1%）和GPT-5（63.0%）远低于人类基线。分析发现显著不对称性（抑制效应17.6% vs 偏好效应75.0%）及普遍瓶颈，表明需要超越参数扩展的架构创新。ImplicitMemBench将评估范式从“智能体回忆什么”重构为“它们自动执行什么”。

Abstract

Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from “what agents recall” to “what they automatically enact”.

Keywords: Implicit Memory, LLM Agents, Behavioral Adaptation, Memory Benchmark, Procedural Memory, Priming, Classical Conditioning, Automated Behavior

100. ❌ Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

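Scoring rationale: The paper presents a systems framework for safely upgrading, compatibility-checking, and rolling back capability modules in embodied agents, which places it in embodied-AI systems. Although it involves agents, it addresses the systems-engineering problem of modular capability upgrades rather than LLM techniques, deep-learning principles, or AI applications in science. Every keyword targets those latter topics and is unrelated to the paper's systems-engineering focus, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a safe-upgrade framework for capability modules in embodied agents. Using compatibility checks and a staged deployment pipeline, it keeps task success comparable while achieving zero unsafe activations, and rollback succeeds in 79.8% of post-activation drift scenarios.

Abstract translation (Chinese)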

具身智能体日益需要通过更新其可执行能力而非重写智能体本身来实现持续改进。先前研究已分别探讨了模块化能力封装、能力演化与运行时治理。然而，一个关键的系统性问题仍未得到充分探索：一旦具身能力模块演化为新版本，宿主系统如何在不破坏策略约束、执行假设或恢复保证的前提下安全部署该版本？
我们将治理式能力演化确立为具身智能体的一类首要系统性问题。为此，我们提出一种生命周期感知的升级框架，其中每个新能力版本均被视为受治理的部署候选对象，而非立即可执行的替代品。该框架引入了四项升级兼容性检查——接口兼容性、策略兼容性、行为兼容性与恢复兼容性——并将其组织为分阶段的运行时流水线，涵盖候选验证、沙箱评估、影子部署、门控激活、在线监控与回滚机制。
我们使用15个随机种子，在6轮能力升级中进行了评估。朴素升级方案实现了72.9%的任务成功率，但到最终轮次时将不安全激活率推升至60%；治理式升级在保持相当成功率（67.4%）的同时，所有轮次均维持零不安全激活记录（Wilcoxon检验p=0.003）。影子部署揭示了40%仅凭沙箱评估无法发现的性能退化案例，而在激活后发生状态漂移的场景中，回滚机制的成功率达到79.8%。

Abstract

Embodied agents are increasingly expected to improve over time by updating their executable capabilities rather than rewriting the agent itself. Prior work has separately studied modular capability packaging, capability evolution, and runtime governance. However, a key systems problem remains underexplored: once an embodied capability module evolves into a new version, how can the hosting system deploy it safely without breaking policy constraints, execution assumptions, or recovery guarantees? We formulate governed capability evolution as a first-class systems problem for embodied agents. We propose a lifecycle-aware upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks – interface, policy, behavioral, and recovery – and organizes them into a staged runtime pipeline comprising candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. We evaluate over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
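The staged pipeline described in the abstract can be sketched as a small state machine. This is a minimal illustration, not the authors' implementation: the `Host` class, the `governed_upgrade` function, and the idea of passing one boolean check per stage are all hypothetical, and the abstract does not say how the four compatibility checks (interface, policy, behavioral, recovery) map onto stages.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Pre-activation stages named in the abstract, in order.
STAGES = ["candidate_validation", "sandbox_evaluation",
          "shadow_deployment", "gated_activation"]


@dataclass
class Host:
    active_version: str
    previous_versions: List[str] = field(default_factory=list)


def governed_upgrade(host: Host, candidate: str,
                     stage_checks: Dict[str, Callable[[str], bool]],
                     drift_detected: Callable[[str], bool]) -> str:
    # A candidate is never executed directly: it must clear every stage first.
    for stage in STAGES:
        if not stage_checks[stage](candidate):
            return f"rejected_at_{stage}"
    # Activate, remembering the old version so rollback stays possible.
    host.previous_versions.append(host.active_version)
    host.active_version = candidate
    # Online monitoring: restore the previous version if drift is observed.
    if drift_detected(candidate):
        host.active_version = host.previous_versions.pop()
        return "rolled_back"
    return "activated"
```

The key design point the sketch tries to capture is that a new version is a deployment candidate until it has passed every gate, and that the previous version is retained so a rollback target always exists.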

Keywords: embodied agents, capability evolution, safe upgrade, compatibility checking, runtime rollback, governed deployment, shadow deployment, capability modules

101. ❌ LINE: LLM-based Iterative Neuron Explanations for Vision Models

Authors: Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

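Scoring rationale: LINE is a new method that uses a large language model (LLM) to explain the concepts encoded by neurons in vision models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). The method aims to improve AI interpretability, making it highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10). The paper does not address the other keywords, such as MoE, SLMs, training techniques, inference optimization, or agent systems, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes LINE, a training-free iterative method that uses a large language model and a text-to-image generator to label neuron concepts in vision models with an open vocabulary under a strictly black-box setting. It achieves state-of-the-art performance across multiple model architectures and discovers many new concepts.

Abstract translation (Chinese)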

解读深度神经网络中单个神经元所编码的概念，是理解其复杂决策过程并确保人工智能安全的关键步骤。尽管神经元标注研究近期取得进展，但现有方法通常将搜索空间限制在预定义的概念词汇表中，或产生过于具体、无法捕捉高阶全局概念的描述。我们提出LINE，一种新颖的、无需训练、专为视觉模型开放词汇概念标注设计的迭代方法。LINE在严格的黑盒设置下运行，利用大型语言模型和文生图生成器，在激活历史的引导下，以闭环方式迭代提出并优化概念。我们证明，LINE在多种模型架构上均实现了最先进的性能，在ImageNet和Places365数据集上的AUC（曲线下面积）分别提升高达0.18和0.05，同时平均发现了29%被海量预定义词汇表遗漏的新概念。除了识别核心概念外，LINE还提供了完整的生成历史，这支持了多义性评估，并产生了可与依赖梯度的激活最大化方法相媲美的支撑性视觉解释。

Abstract

Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
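The closed loop described in the abstract (propose a concept, render images for it, read the neuron's activation, feed the history back) can be sketched as follows. This is a hedged illustration only: `iterative_labeling` and its three callables are hypothetical stand-ins for the LLM, the text-to-image generator, and the probed neuron, and averaging activations over the rendered images is an assumed scoring choice that the abstract does not specify.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def iterative_labeling(propose: Callable[[History], str],
                       render: Callable[[str], List[object]],
                       activation: Callable[[object], float],
                       rounds: int = 5) -> Tuple[Tuple[str, float], History]:
    history: History = []
    for _ in range(rounds):
        concept = propose(history)   # LLM stand-in sees past (concept, score) pairs
        images = render(concept)     # text-to-image stand-in; must return >= 1 image
        # Black-box scoring: only the neuron's output activation is read.
        score = sum(activation(img) for img in images) / len(images)
        history.append((concept, score))
    # Return the top-scoring concept together with the full generation history.
    best = max(history, key=lambda item: item[1])
    return best, history
```

Returning the whole history alongside the best concept mirrors the abstract's point that the generation history is itself useful, for example to inspect whether a neuron responds to several unrelated concepts.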

Keywords: neuron interpretation, large language models, explainable AI, vision models, concept labeling, black-box setting, activation history, polysemanticity evaluation

102. ❌ PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation

Authors: Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

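Scoring rationale: The paper focuses on a privacy-preserving federated learning framework for personalized talking-head generation, built on diffusion models and LoRA adapters. It is highly relevant only to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10), because it explicitly uses LoRA for identity adaptation. All other keywords are unrelated (0): the paper does not cover large language models, reasoning methods, alignment techniques, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes PrivFedTalk, a framework that combines federated learning with conditional latent diffusion and LoRA identity adapters to generate personalized talking heads while preserving privacy. Experiments support the framework's feasibility under constrained resources.

Abstract translation (Chinese)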

基于扩散模型的说话头生成技术发展迅速，但其训练通常依赖于集中式人脸视频与语音数据集，引发了严重的隐私担忧。这一问题在个性化说话头生成中尤为突出，因为身份相关的数据具有高度敏感性，往往无法跨用户或设备汇集。本文提出PrivFedTalk，这是一个面向个性化说话头生成的隐私感知联邦学习框架，它将条件潜在扩散模型与参数高效的身份适配相结合。该框架在客户端间训练一个共享的扩散主干网络，同时每个客户端利用本地私有视听数据学习轻量级的LoRA身份适配器，从而避免原始数据共享并降低通信成本。为应对客户端数据分布异质性问题，身份稳定联邦聚合（ISFA）通过基于设备端身份一致性与时序稳定性估计计算出的隐私安全标量可靠性信号，对客户端更新进行加权。此外，本文引入了时序去噪一致性（TDC）正则化，以减少联邦去噪过程中的帧间漂移、闪烁和身份漂移。为限制更新侧的隐私风险，适配器更新采用了安全聚合与客户端级差分隐私保护。该实现方案支持在异构共享硬件上进行低内存GPU执行和多GPU客户端并行训练。通过在多种训练与聚合条件下，对PrivFedTalk、FedAvg和FedProx进行对比实验，结果表明该框架实现了稳定的联邦优化，并在受限资源下成功完成了端到端的训练与评估。这些结果支持了在联邦环境中进行隐私感知的个性化说话头训练的可行性，同时指出，关于各组件性能、隐私-效用权衡以及生成质量的更强论断仍需进一步的标准化评估。

Abstract

Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.
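The abstract says that ISFA weights client updates by a scalar reliability signal, but it does not give the formula. The sketch below therefore shows only the generic idea of a reliability-weighted average over client update vectors. The function name and the normalization by the sum of reliabilities are assumptions, and secure aggregation and differential privacy are deliberately left out.

```python
from typing import List, Sequence


def reliability_weighted_average(updates: Sequence[Sequence[float]],
                                 reliabilities: Sequence[float]) -> List[float]:
    """Average client updates, weighting each by its scalar reliability."""
    if len(updates) != len(reliabilities):
        raise ValueError("need exactly one reliability per client update")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    dim = len(updates[0])
    # Each coordinate is a convex combination of the clients' values.
    return [
        sum(r * u[i] for u, r in zip(updates, reliabilities)) / total
        for i in range(dim)
    ]
```

With equal reliabilities this reduces to a plain unweighted mean of the updates, so the reliability signal only changes the result when clients differ in estimated quality.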

Keywords: federated learning, privacy-aware, talking-head generation, diffusion models, LoRA adapters, personalized generation, identity adaptation, secure aggregation

103. ❌ IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

Authors: Zhaomeng Zhou, Lan Zhang, Junyang Wang, Mu Yuan, Junda Lin, Jinke Song Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

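Scoring rationale: The paper studies how LLMs can interact with the physical world, specifically the semantic-to-physical mapping gap, and achieves reliable sensor scheduling through the Spatial Trajectory Graph and the IoT-Brain system. It is therefore highly relevant to 'Large Language Models' (10), since LLMs are a core system component, and to 'LLM Agents' (10), since it studies LLMs acting as agents that plan and decide. Other keywords such as MoE, SFT, and RAG are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the semantic-to-physical mapping gap that large language models face when interacting with the physical world. It proposes the Spatial Trajectory Graph and the IoT-Brain system, which substantially improve the task success rate and efficiency of sensor scheduling.

Abstract translation (Chinese)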

由大规模传感器网络驱动的智能系统正从预定义监测转向意图驱动操作，这揭示出关键的语义到物理映射鸿沟。尽管大语言模型在语义理解方面表现卓越，现有以感知为中心的流程仍采用回顾式运作模式，忽略了“感知什么”与“何时感知”这一根本性决策。我们将这种前瞻性决策形式化为语义空间传感器调度问题，并证明由于表征、推理与优化层面的固有缺陷，直接使用大语言模型进行规划并不可靠。为弥合这些鸿沟，我们提出空间轨迹图——一种受“验证后执行”原则支配的神经符号范式，它将开放式规划转化为可验证的图优化问题。基于该范式，我们实现了具体系统IoT-Brain，并构建了校园级基准测试集TopoSense-Bench，其包含覆盖2,510个摄像头的5,250条自然语言查询。评估表明：相较于最强的搜索密集型方法，IoT-Brain将任务成功率提升37.6%，同时运行速度提升近2倍，所用提示词token数量减少至原来的1/6.6。在实际部署中，该系统在将网络带宽降至原来的1/4.1的同时逼近可靠性理论上限，为大语言模型与物理世界交互提供了兼具卓越可靠性与高效性的基础框架。

Abstract

Intelligent systems powered by large-scale sensor networks are shifting from predefined monitoring to intent-driven operation, revealing a critical Semantic-to-Physical Mapping Gap. While large language models (LLMs) excel at semantic understanding, existing perception-centric pipelines operate retrospectively, overlooking the fundamental decision of what to sense and when. We formalize this proactive decision as Semantic-Spatial Sensor Scheduling (S3) and demonstrate that direct LLM planning is unreliable due to inherent gaps in representation, reasoning, and optimization. To bridge these gaps, we introduce the Spatial Trajectory Graph (STG), a neuro-symbolic paradigm governed by a verify-before-commit discipline that transforms open-ended planning into a verifiable graph optimization problem. Based on STG, we implement IoT-Brain, a concrete system embodiment, and construct TopoSense-Bench, a campus-scale benchmark with 5,250 natural-language queries across 2,510 cameras. Evaluations show that IoT-Brain boosts task success rate by 37.6% over the strongest search-intensive methods while running nearly 2 times faster and using 6.6 times fewer prompt tokens. In real-world deployment, it approaches the reliability upper bound while reducing 4.1 times network bandwidth, providing a foundational framework for LLMs to interact with the physical world with unprecedented reliability and efficiency.
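The verify-before-commit discipline mentioned in the abstract can be illustrated with a toy example: candidate camera sequences (for instance, proposed by an LLM planner) are checked against a known site graph, and only a sequence that passes verification is committed. This is a loose sketch under stated assumptions, not the Spatial Trajectory Graph itself: the adjacency-map representation, the function names, and the first-valid-wins policy are hypothetical.

```python
from typing import Dict, List, Optional, Set


def path_is_valid(adjacency: Dict[str, Set[str]], path: List[str]) -> bool:
    # Every consecutive pair of cameras must be connected in the site graph.
    return all(b in adjacency.get(a, set()) for a, b in zip(path, path[1:]))


def commit_first_verified(adjacency: Dict[str, Set[str]],
                          proposals: List[List[str]]) -> Optional[List[str]]:
    for path in proposals:
        if path and path_is_valid(adjacency, path):
            return path  # commit only after verification succeeds
    return None  # nothing verifiable: refuse to schedule any sensor
```

The point of the pattern is that an unverifiable plan is rejected rather than executed, which turns open-ended generation into a check against explicit structure.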

Keywords: Large Language Models, Semantic-Spatial Sensor Scheduling, IoT-Brain, Spatial Trajectory Graph, Sensor Networks, Physical World Interaction, Planning and Optimization

Authors: Joel Jose, Andreas Madsen, Andreas Brandsæter, Tor A. Johansen, Erlend M. Coates Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

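Scoring rationale: The paper studies contrastive explanations for maritime autonomous navigation, an application of explainable AI in a specific domain. It is moderately relevant to 'Mechanistic Interpretability OR Explainable AI' (5), but it does not involve large models, deep-learning principles, or AI-for-science applications, and is unrelated to all other keywords.

!!! tip deepseek-chat TL;DR

The paper studies how to generate contrastive explanations for automated maritime collision avoidance systems so that human supervisors understand them better. A user study suggests the method is valuable in complex scenarios but can increase cognitive workload.

Abstract translation (Chinese)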

在可预见的未来，自动化海上避碰系统仍需依赖人类监督。这就要求系统必须透明展示其如何感知场景并规划避让动作。然而，避碰操作背后的因果逻辑通常复杂且难以向航海人员传达。本文探讨了如何为具有航海背景的监督者提供一种有选择性且易于理解的解释方式。我们提出了一种生成对比性解释的方法，该方法通过将系统提出的解决方案与相关替代方案进行比较，提供以人为中心的洞察。为评估此方法，我们开发了一个框架，利用视觉与文本线索来突出来自先进避碰系统的关键目标。一项由四位经验丰富的海事官员参与的探索性用户研究表明，对比性解释有助于理解系统目标。然而，我们的研究结果也表明，尽管这类解释在复杂的多船会遇场景中极具价值，但它们也可能增加认知负荷。这表明未来的海事交互界面或可最大程度受益于需求驱动或特定场景驱动的解释策略。

Abstract

Automated maritime collision avoidance will rely on human supervision for the foreseeable future. This necessitates transparency into how the system perceives a scenario and plans a maneuver. However, the causal logic behind avoidance maneuvers is often complex and difficult to convey to a navigator. This paper explores how to explain these factors in a selective, understandable manner for supervisors with a nautical background. We propose a method for generating contrastive explanations, which provide human-centric insights by comparing a system’s proposed solution against relevant alternatives. To evaluate this, we developed a framework that uses visual and textual cues to highlight key objectives from a state-of-the-art collision avoidance system. An exploratory user study with four experienced marine officers suggests that contrastive explanations support the understanding of the system’s objectives. However, our findings also reveal that while these explanations are highly valuable in complex multi-vessel encounters, they can increase cognitive workload, suggesting that future maritime interfaces may benefit most from demand-driven or scenario-specific explanation strategies.

Keywords: maritime autonomous navigation, collision avoidance, contrastive explanations, human supervision, explainable AI, user study, cognitive workload, multi-vessel encounters

105. ❌ From Universal to Individualized Actionability: Revisiting Personalization in Algorithmic Recourse

Authors: Lena Marie Budde, Ayan Majumdar, Richard Uth, Markus Langer, Isabel Valera Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

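Scoring rationale: The paper studies personalization in algorithmic recourse, which belongs to machine-learning fairness, explainability, and causal inference, and it does not touch on large models, deep-learning principles, AI for Science, or any other keyword. It focuses on personalization constraints, user preferences, and fairness evaluation within the algorithmic recourse framework, and has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper formalizes personalization in algorithmic recourse as individual actionability. Its empirical study finds that personalization constraints can substantially degrade the plausibility and validity of recourse recommendations, and it reveals disparities in recourse cost and plausibility across socio-demographic groups.

Abstract translation (Chinese)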

算法追索旨在提供可操作的推荐，使个体能够改变不利的模型输出结果，先前研究已广泛探讨了其效率、鲁棒性和公平性等属性。然而，个性化在追索中的作用在很大程度上仍处于隐含状态且未得到充分探索。尽管现有方法通过用户交互融入了个性化要素，但它们通常缺乏对个性化的明确定义，也未系统分析其对其他追索期望特性的下游影响。
本文中，我们将个性化形式化为个体可操作性，并从两个维度进行刻画：一是规定哪些特征具有个体可操作性的硬约束，二是捕捉个体对行动取值与成本偏好的软性个性化约束。我们在因果算法追索框架内实现了这些维度的操作化，采用了一种事前用户提示方法，即在生成任何追索推荐之前，个体通过排序或评分来表达其偏好。通过广泛的实证评估，我们研究了个性化如何与追索的关键期望特性（包括有效性、成本和合理性）相互作用。我们的结果揭示了重要的权衡：个体可操作性约束，尤其是硬约束，可能会在摊销与非摊销方法中显著降低追索推荐的合理性和有效性。值得注意的是，我们还发现，纳入个体可操作性可能揭示不同社会人口群体间追索行动成本与合理性的差异。这些发现强调，在算法追索中需要对个性化进行原则性定义、谨慎操作化及严格评估。

Abstract

Algorithmic recourse aims to provide actionable recommendations that enable individuals to change unfavorable model outcomes, and prior work has extensively studied properties such as efficiency, robustness, and fairness. However, the role of personalization in recourse remains largely implicit and underexplored. While existing approaches incorporate elements of personalization through user interactions, they typically lack an explicit definition of personalization and do not systematically analyze its downstream effects on other recourse desiderata. In this paper, we formalize personalization as individual actionability, characterized along two dimensions: hard constraints that specify which features are individually actionable, and soft, individualized constraints that capture preferences over action values and costs. We operationalize these dimensions within the causal algorithmic recourse framework, adopting a pre-hoc user-prompting approach in which individuals express preferences via rankings or scores prior to the generation of any recourse recommendation. Through extensive empirical evaluation, we investigate how personalization interacts with key recourse desiderata, including validity, cost, and plausibility. Our results highlight important trade-offs: individual actionability constraints, particularly hard ones, can substantially degrade the plausibility and validity of recourse recommendations across amortized and non-amortized approaches. Notably, we also find that incorporating individual actionability can reveal disparities in the cost and plausibility of recourse actions across socio-demographic groups. These findings underscore the need for principled definitions, careful operationalization, and rigorous evaluation of personalization in algorithmic recourse.
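The two dimensions of individual actionability can be made concrete with a toy selector: hard constraints filter out any action that touches a feature the individual cannot change, and soft constraints enter as per-feature cost weights. This is a simplified sketch, not the paper's causal recourse method: the action representation (a dict from feature to change), the weighted absolute-change cost, and the `is_valid` callback standing in for the classifier are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Set

Action = Dict[str, float]  # feature name -> proposed change


def pick_recourse(candidates: List[Action],
                  actionable: Set[str],
                  cost_weights: Dict[str, float],
                  is_valid: Callable[[Action], bool]) -> Optional[Action]:
    feasible = []
    for action in candidates:
        # Hard constraint: every changed feature must be individually actionable.
        if any(feature not in actionable for feature in action):
            continue
        # Validity: the action must actually flip the model's outcome.
        if not is_valid(action):
            continue
        # Soft constraint: individual preferences expressed as cost weights.
        cost = sum(cost_weights.get(f, 1.0) * abs(d) for f, d in action.items())
        feasible.append((cost, action))
    if not feasible:
        return None  # hard constraints can leave no valid recourse at all
    return min(feasible, key=lambda pair: pair[0])[1]
```

The `None` branch illustrates the trade-off the abstract reports: the tighter the hard constraints, the more likely it is that no valid recommendation remains.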

Keywords: algorithmic recourse, personalization, individual actionability, hard constraints, soft constraints, user preferences, fairness, causal framework

106. ❌ Wiring the ‘Why’: A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

Authors: Moein Salimi, Shaygan Adim, Danial Parnian, Nima Alighardashi, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

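Scoring rationale: The paper is a survey of abductive reasoning in LLMs, centered on evaluating and analyzing LLM reasoning ability, so it is highly relevant to 'Large Language Models' (10). It covers the reasoning process (hypothesis generation and selection), evaluation strategies, and mechanistic understanding, giving it a strong connection to 'Chain of Thought', 'System 2 Thinking', and 'Mechanistic Interpretability' (8 each). It mentions benchmarks, training frameworks, and mechanistic understanding, giving it a moderate connection to 'Self-Correction', 'Hallucination Mitigation', and 'In-context Learning' (5 each). Other keywords such as MoE, quantization, RAG, and alignment are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper is the first survey of abductive reasoning in LLMs. It proposes a unified two-stage definition (hypothesis generation and selection) and a taxonomy, and uses a benchmark study to reveal key gaps in current approaches, including benchmark design, domain coverage, and mechanistic understanding.

Abstract translation (Chinese)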

尽管溯因推理——即对观察现象进行最合理解释的推断——在人类发现与意义建构中具有基础性作用，其在大型语言模型中的探索仍相对不足。尽管大型语言模型发展迅速，当前对溯因推理及其多维度的研究仍处于零散状态，缺乏系统性整合。本文首次对大型语言模型中的溯因推理研究进行全面综述，追溯其从哲学基础到当代人工智能实现的发展轨迹。针对该领域普遍存在的概念混淆与任务定义割裂问题，我们提出了一个统一的两阶段定义，对现有研究进行形式化分类。该定义将溯因推理解构为假设生成（模型通过弥合认知鸿沟产生候选解释）与假设选择（对生成的候选解释进行评估并选择最合理者）两个阶段。基于此框架，我们对现有文献进行了系统分类，依据其溯因任务类型、数据集、方法论及评估策略进行归纳。为实证支撑该框架，我们对当前大型语言模型在溯因任务上进行了紧凑的基准测试，并围绕模型规模、模型家族、评估方式以及生成与选择两类任务类型展开针对性比较分析。此外，通过综合近期实证结果，我们探究了大型语言模型在溯因推理上的表现与演绎、归纳任务之间的关联，从而揭示其更广泛的推理能力。我们的分析指出了当前研究范式的关键缺陷——包括静态基准设计、狭窄领域覆盖、局限的训练框架以及对溯因过程机制性理解的不足……

Abstract

Regardless of its foundational role in human discovery and sense-making, abductive reasoning (the inference of the most plausible explanation for an observation) has been relatively underexplored in Large Language Models (LLMs). Despite the rapid advancement of LLMs, the exploration of abductive reasoning and its diverse facets has thus far been disjointed rather than cohesive. This paper presents the first survey of abductive reasoning in LLMs, tracing its trajectory from philosophical foundations to contemporary AI implementations. To address the widespread conceptual confusion and disjointed task definitions prevalent in the field, we establish a unified two-stage definition that formally categorizes prior work. This definition disentangles abduction into Hypothesis Generation, where models bridge epistemic gaps to produce candidate explanations, and Hypothesis Selection, where the generated candidates are evaluated and the most plausible explanation is chosen. Building upon this foundation, we present a comprehensive taxonomy of the literature, categorizing prior work based on their abductive tasks, datasets, underlying methodologies, and evaluation strategies. In order to ground our framework empirically, we conduct a compact benchmark study of current LLMs on abductive tasks, together with targeted comparative analyses across model sizes, model families, evaluation styles, and the distinct generation-versus-selection task typologies. Moreover, by synthesizing recent empirical results, we examine how LLM performance on abductive reasoning relates to deductive and inductive tasks, providing insights into their broader reasoning capabilities. Our analysis reveals critical gaps in current approaches, from static benchmark design and narrow domain coverage to narrow training frameworks and limited mechanistic understanding of abductive processes…

Keywords: Abductive Reasoning, Large Language Models, Hypothesis Generation, Hypothesis Selection, Survey, Taxonomy, Benchmark Study, Reasoning Capabilities
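
The two-stage definition in the abstract (generate candidate explanations, then select the most plausible one) can be sketched as a small pipeline. This is a toy illustration only: `toy_generate` and `toy_plausibility` are hypothetical stand-ins for an LLM generator and a plausibility scorer, and nothing here reproduces a method from the surveyed literature.

```python
from typing import Callable, List


def abduce(observation: str,
           generate: Callable[[str], List[str]],
           plausibility: Callable[[str, str], float]) -> str:
    """Two-stage abduction: generate candidates, then select the most plausible."""
    candidates = generate(observation)  # Stage 1: hypothesis generation
    if not candidates:
        raise ValueError("generator returned no hypotheses")
    # Stage 2: hypothesis selection
    return max(candidates, key=lambda h: plausibility(observation, h))


def toy_generate(observation: str) -> List[str]:
    """Hypothetical generator: returns a fixed candidate list."""
    return ["it rained overnight", "a sprinkler ran", "someone painted the fence"]


def toy_plausibility(observation: str, hypothesis: str) -> float:
    """Hypothetical scorer: fraction of hypothesis words found in the observation."""
    obs_words = set(observation.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return len(obs_words & hyp_words) / len(hyp_words)


best = abduce("the lawn is wet and it rained nearby", toy_generate, toy_plausibility)
print(best)
```

Keeping the two stages as separate callables mirrors the generation-versus-selection split the survey uses to categorize tasks: a selection-only benchmark fixes `generate` and varies only the scorer.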

107. ❌ SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

Authors: Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

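Scoring rationale: This paper focuses on building and benchmarking an image retrieval dataset for autonomous driving. Its core contribution is SearchAD, a large-scale, long-tail retrieval dataset of rare driving scenes, together with an evaluation of text-to-image and image-to-image retrieval methods. The work spans computer vision, dataset construction, multi-modal retrieval, and autonomous driving applications, but it does not involve large language models (LLMs), new deep learning principles, or any of the specific large-model techniques in the keyword list (such as MoE, Scaling Laws, RLHF, or RAG). The paper does mention "fine-tuning of multi-modal retrieval models", but that refers to fine-tuning vision-language models, not to LLM techniques. All keywords therefore score 0: the topic is unrelated to the scoring keywords and does not meet the research background's requirement of "innovation in the principles of large models and deep learning" or "research applications of large models in different domains".

!!! tip deepseek-chat TL;DR

This paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving with over 423k frames and more than 513k bounding-box annotations. It targets retrieval of long-tail categories, and its benchmarks show that text-based methods outperform image-based ones while overall retrieval performance still leaves room for improvement.

Abstract (translated)

Retrieving rare, safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving systems. As datasets keep growing, the core challenge has shifted from collecting more data to efficiently identifying the most relevant samples. This paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving that contains over 423k frames drawn from 11 existing datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare categories, some of which appear fewer than 50 times in the entire dataset. Unlike existing benchmarks that focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval through a carefully designed data split, supporting text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluation shows that text-based methods outperform image-based ones thanks to stronger inherent semantic grounding. Although models that directly align spatial visual features with language achieve the best zero-shot performance, and our fine-tuning baseline improves results significantly, absolute retrieval capability remains unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in autonomous driving: https://iis-esslingen.github.io/searchad/

Abstract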

Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/

Keywords: autonomous driving, image retrieval, rare scenarios, large-scale dataset, multi-modal retrieval, long-tail perception, text-to-image retrieval, benchmark evaluation
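
Text-to-image retrieval of the kind described in the abstract is commonly evaluated by ranking images by embedding similarity and measuring Recall@k. The sketch below assumes cosine similarity and Recall@k; the abstract does not say which metrics SearchAD reports, and the embeddings and image IDs here are made-up toy values.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_images(query_vec: Sequence[float],
                image_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return image IDs sorted by descending similarity to the query embedding."""
    return sorted(image_vecs,
                  key=lambda img: cosine(query_vec, image_vecs[img]),
                  reverse=True)


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant images that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for img in ranked[:k] if img in relevant)
    return hits / len(relevant)


# Toy 2-D embeddings (hypothetical); a real system would use a vision-language encoder.
image_vecs = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
    "img_d": [0.1, 0.9],
}
text_query_vec = [1.0, 0.05]      # embedding of a text query, e.g. a rare class name
relevant = {"img_a", "img_d"}     # ground-truth images containing the rare class

ranked = rank_images(text_query_vec, image_vecs)
print(ranked, recall_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 3))
```

For classes that occur fewer than 50 times in hundreds of thousands of frames, k has to be chosen with the class frequency in mind, since Recall@k is bounded by k divided by the number of relevant images.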

108. ❌ Evaluating Counterfactual Explanation Methods on Incomplete Inputs

Authors: Francesco Leofante, Daniel Neider, Mustafa Yalçıner Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

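Scoring rationale: The paper evaluates how explainability methods for machine learning models (Counterfactual Explanations) perform on incomplete inputs, which places it in explainable AI (XAI). It is unrelated to the great majority of keywords (large-model techniques, training methods, inference optimization, application domains, and so on). It has some link only to 'Mechanistic Interpretability OR Explainable AI', since explainable AI is the broader category. However, the paper focuses on specific counterfactual explanation methods rather than the interpretability of large models, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper evaluates how well existing counterfactual explanation methods handle incomplete inputs. It finds that although robust methods perform better, all methods struggle to produce valid counterfactuals, and it calls for new methods that can handle incomplete inputs.

Abstract (translated)

Existing algorithms that generate counterfactual explanations for machine learning models typically assume fully specified input data. Real-world data, however, often contains missing values, and the effect of such incomplete inputs on existing counterfactual explanation methods has not been adequately explored. To fill this gap, we systematically evaluate the ability of recent counterfactual generation methods to provide valid and plausible counterfactuals under incomplete inputs. As part of the study, we hypothesize that robust counterfactual generation methods are better suited to this challenge. The results show that although robust methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactual explanations. These findings indicate a pressing need for new counterfactual explanation methods that can handle incomplete inputs.

Abstract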

Existing algorithms for generating Counterfactual Explanations (CXs) for Machine Learning (ML) typically assume fully specified inputs. However, real-world data often contains missing values, and the impact of these incomplete inputs on the performance of existing CX methods remains unexplored. To address this gap, we systematically evaluate recent CX generation methods on their ability to provide valid and plausible counterfactuals when inputs are incomplete. As part of this investigation, we hypothesize that robust CX generation methods will be better suited to address the challenge of providing valid and plausible counterfactuals when inputs are incomplete. Our findings reveal that while robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.

Keywords: Counterfactual Explanations, Machine Learning, Incomplete Inputs, Missing Values, Validity, Plausibility, Robustness
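
Why missing values threaten counterfactual validity can be seen in a toy example. The abstract does not describe the evaluation protocol, so everything below is an assumption made for illustration: a two-feature linear classifier, a counterfactual that changes only the observed feature, and a hypothetical `validity_rate` that asks how often the counterfactual still flips the prediction across plausible values of the missing feature.

```python
from typing import List, Optional, Sequence


def predict(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Toy linear classifier: class 1 if w.x + b >= 0, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def complete(x: Sequence[Optional[float]], fill: float) -> List[float]:
    """Replace every missing entry (None) with the given fill value."""
    return [fill if v is None else v for v in x]


def validity_rate(x: Sequence[Optional[float]],
                  cf: Sequence[Optional[float]],
                  w: Sequence[float], b: float,
                  fills: Sequence[float]) -> float:
    """Fraction of fill values for which the counterfactual changes the predicted class."""
    flips = sum(
        1 for fill in fills
        if predict(complete(cf, fill), w, b) != predict(complete(x, fill), w, b)
    )
    return flips / len(fills)


w, b = [1.0, 1.0], -1.0
x = [0.2, None]    # second feature is missing
cf = [0.7, None]   # counterfactual raises the observed feature only
fills = [0.0, 0.25, 0.5, 0.75, 1.0]  # plausible values for the missing feature

# Under a single imputed value of 0.5 the counterfactual looks valid...
valid_at_midpoint = predict(complete(cf, 0.5), w, b) != predict(complete(x, 0.5), w, b)
# ...but it flips the prediction for only part of the plausible range.
rate = validity_rate(x, cf, w, b, fills)
print(valid_at_midpoint, rate)
```

A counterfactual judged valid under one imputed value may be invalid under others, which is one plausible reading of why the paper finds that all methods struggle on incomplete inputs.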

109. ❌ The ecosystem of machine learning competitions: Platforms, participants, and their impact on AI development

Authors: Ioannis Nasios Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

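Scoring rationale: The paper studies the ecosystem of machine learning competitions, including platforms, participants, and their impact on AI development, which makes it a macro-level study of how AI development works. Every keyword targets a specific technical principle, training method, application, or evaluation of large models and deep learning (for example LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, and quantization). The paper covers no specific large-model technique, training method, or application case, and it does not discuss AI applications in particular domains such as biomedicine. All keywords are therefore unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies the ecosystem of machine learning competitions (such as Kaggle and Zindi), analyzing their platforms, participants, workflows, and impact, and argues that these competitions play a key role in fostering AI innovation, skill development, and practical problem-solving.

Abstract (translated)

Machine learning competitions (MLCs) play a key role in advancing artificial intelligence (AI) by fostering innovation, skill development, and practical problem-solving. This study offers a comprehensive analysis of major competition platforms such as Kaggle and Zindi, examining their workflows, evaluation methods, and reward structures. It further assesses competition quality, participant expertise, and global reach, with particular attention to demographic trends among top competitors. By exploring the motivations of competition hosts, the paper highlights the important role of MLCs in shaping AI development, promoting collaboration, and driving impactful technical progress. In addition, by combining a literature review with platform-level data analysis and practitioner insights, the study provides a comprehensive understanding of the MLC ecosystem.
The paper also argues that MLCs operate at the intersection of academic research and industrial application, promoting the exchange of knowledge, data, and practical methods across domains. Their close ties to open-source communities further encourage collaboration, reproducibility, and continuous innovation within the broader machine learning ecosystem. By shaping research priorities, influencing industry standards, and enabling large-scale crowdsourced problem-solving, these competitions play a key role in the ongoing evolution of AI. The study offers insights relevant to researchers, practitioners, and competition organizers, and examines the future trajectory and lasting influence of MLCs on AI development.

Abstract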

Machine learning competitions (MLCs) play a pivotal role in advancing artificial intelligence (AI) by fostering innovation, skill development, and practical problem-solving. This study provides a comprehensive analysis of major competition platforms such as Kaggle and Zindi, examining their workflows, evaluation methodologies, and reward structures. It further assesses competition quality, participant expertise, and global reach, with particular attention to demographic trends among top-performing competitors. By exploring the motivations of competition hosts, this paper underscores the significant role of MLCs in shaping AI development, promoting collaboration, and driving impactful technological progress. Furthermore, by combining literature synthesis with platform-level data analysis and practitioner insights a comprehensive understanding of the MLC ecosystem is provided. Moreover, the paper demonstrates that MLCs function at the intersection of academic research and industrial application, fostering the exchange of knowledge, data, and practical methodologies across domains. Their strong ties to open-source communities further promote collaboration, reproducibility, and continuous innovation within the broader ML ecosystem. By shaping research priorities, informing industry standards, and enabling large-scale crowdsourced problem-solving, these competitions play a key role in the ongoing evolution of AI. The study provides insights relevant to researchers, practitioners, and competition organizers, and includes an examination of the future trajectory and sustained influence of MLCs on AI development.

Keywords: machine learning competitions, Kaggle, Zindi, AI development, innovation, crowdsourced problem-solving, open-source communities, competition platforms

110. ❌ PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Authors: Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai, Zihang Liu, Pengcheng Wu, Guibin Zhang, Yue Liao, Xiaobin Hu, Deheng Ye, Chunyan Miao, Shuicheng Yan Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

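Scoring rationale: The paper focuses on building a proactive AI agent system with long-term memory (PASK). Its core contributions are the DD-MM-PAS paradigm, a hybrid memory architecture, and a real-world benchmark. Relevance to the keywords breaks down as follows: 1) Highly relevant (10 points): 'LLM Agents' and its variants, because the paper's core subject is a proactive AI agent system. 2) Moderately relevant (5 points): 'Large Language Models', because the paper compares its IntentFlow model with Gemini3-Flash, which suggests LLM technology may be used; 'Chain of Thought' and 'System 2 Thinking', because the paper emphasizes deep reasoning and inferring latent needs, which fits multi-step reasoning and deliberate thinking. 3) Unrelated (0 points): the remaining keywords such as MoE, quantization, and RAG, since the paper does not cover these techniques.

!!! tip deepseek-chat TL;DR

This paper studies how to build proactive AI agents with long-term memory for complex real-world settings. It proposes the DD-MM-PAS paradigm and its PASK implementation, and reports experiments in which the IntentFlow model matches leading models under latency constraints while identifying deeper user intent.

Abstract (translated)

Proactivity is a core capability expected of artificial general intelligence. Existing work is mostly confined to laboratory settings, leaving a clear gap with respect to what real-world proactive agents require: depth, complexity, ambiguity, precision, and real-time constraints. This study targets that real-world setting, in which effective proactive intervention requires inferring latent needs from continuously changing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agents. We instantiate the paradigm in the Pask system, which uses a streaming IntentFlow model for demand detection, a hybrid memory (workspace, user, global) for long-term memory modeling, and a PAS infrastructure framework, and we describe how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that, under latency constraints, IntentFlow matches the leading Gemini3-Flash models while identifying deeper user intent.

Abstract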

Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent. We instantiate this paradigm in Pask, with streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long-term MM, PAS infra framework and introduce how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3-Flash models under latency constraints, while identifying deeper user intent.

Keywords: Proactive Agents, Long-Term Memory, Intent Inference, Streaming AI Agent, Hybrid Memory, Real-world Benchmark, Latent Needs, AGI Proactivity

111. ❌ Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support

Authors: Jing Xu, Jiarui Hu, Zhihao Shuai, Yiyun Chen, Weikai Yang Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

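Scoring rationale: This paper studies infographic retrieval and an authoring support system, drawing mainly on information visualization, human-computer interaction, retrieval systems, and design-intent understanding. Although it uses AI techniques (such as retrieval models), it does not explicitly mention or examine in depth any of the listed large-model or deep learning keywords. Its focus is a retrieval framework for one application area (infographic authoring), not large-model principles, training methods, inference optimization, alignment techniques, or AI-for-science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper addresses the difficulty of retrieving suitable infographics when users express design intent in ambiguous natural language. It proposes a retrieval framework built on an intent taxonomy that enriches user queries and uses the retrieved exemplars to support interactive authoring. Experiments show improved retrieval quality and authoring efficiency.

Abstract (translated)

Although infographics have become a powerful medium for communicating data-driven stories, authoring them from scratch remains challenging, especially for novice users. Retrieving relevant exemplars from a large corpus can provide design inspiration and encourage reuse, which substantially lowers the barrier to infographic authoring. Effective retrieval is difficult, however, because users usually express design intent in ambiguous natural language, while infographics embody rich, multi-faceted visual designs. As a result, keyword-based search often fails to capture design intent, and general-purpose vision-language retrieval models trained on natural images are poorly suited to the text-heavy, multi-component nature of infographics. To address these challenges, we develop an intent-aware infographic retrieval framework that better aligns user queries with infographic designs. We first conduct a formative study of how people describe infographics and derive an intent taxonomy covering content and visual design dimensions. The taxonomy is then used to enrich and refine free-form user queries, guiding retrieval with intent-specific cues. Starting from the retrieved exemplars, users can adapt the designs to their own data through high-level edit intents, supported by an interactive agent that performs the low-level adaptation. A quantitative evaluation and a user study show that, compared with baseline methods, our approach improves retrieval quality while better supporting intent satisfaction and efficient infographic authoring.

Abstract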

While infographics have become a powerful medium for communicating data-driven stories, authoring them from scratch remains challenging, especially for novice users. Retrieving relevant exemplars from a large corpus can provide design inspiration and promote reuse, substantially lowering the barrier to infographic authoring. However, effective retrieval is difficult because users often express design intent in ambiguous natural language, while infographics embody rich and multi-faceted visual designs. As a result, keyword-based search often fails to capture design intent, and general-purpose vision-language retrieval models trained on natural images are ill-suited to the text-heavy, multi-component nature of infographics. To address these challenges, we develop an intent-aware infographic retrieval framework that better aligns user queries with infographic designs. We first conduct a formative study of how people describe infographics and derive an intent taxonomy spanning content and visual design facets. This taxonomy is then leveraged to enrich and refine free-form user queries, guiding the retrieval process with intent-specific cues. Building on the retrieved exemplars, users can adapt the designs to their own data with high-level edit intents, supported by an interactive agent that performs low-level adaptation. Both quantitative evaluations and user studies are conducted to demonstrate that our method improves retrieval quality over baseline methods while better supporting intent satisfaction and efficient infographic authoring.

Keywords: infographic retrieval, design intent, authoring support, intent taxonomy, vision-language retrieval, interactive agent, user study, retrieval framework

112. ❌ A Decomposition Perspective to Long-context Reasoning for LLMs

Authors: Yanling Xiao, Huaibing Xie, Guoliang Zhao, Shihan Dou, Shaolei Wang, Yiting Liu, Nantao Zheng, Cheng Zhang, Pluto Zhou, Zhisong Zhang, Lemao Liu Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

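Scoring rationale: The paper's core topic is long-context reasoning in LLMs, so it is highly relevant to 'Large Language Models' and 'Context Window Extension' (10 points each). It involves decomposing reasoning into skills and training with reinforcement learning, giving it some link to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords such as MoE, SLMs, RAG, and RLHF are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

By decomposing long-context reasoning into atomic skills and training them with reinforcement learning, this paper improves LLM performance on several long-text reasoning benchmarks by an average of 7.7 percentage points (from 46.3% to 54.0%).

Abstract (translated)

Long-context reasoning is essential for complex real-world applications, yet it remains a major challenge for large language models. Although the field is advancing rapidly, current research often overlooks the internal complexity of the task itself. This paper moves beyond a holistic view: it decomposes long-context reasoning into a set of fundamental atomic skills and automatically synthesizes a suite of pseudo datasets, each explicitly targeting one atomic skill. Our empirical analysis confirms that proficiency in these atomic skills correlates strongly with general long-text reasoning performance. Building on this finding, we apply reinforcement learning on the pseudo datasets to sharpen the model's atomic skills, with the aim of improving its general long-context reasoning ability. Extensive experiments on Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR demonstrate the effectiveness of the approach: it outperforms a strong baseline by an average of 7.7% (from 46.3% to 54.0%).

Abstract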

Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model’s atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.

Keywords: Long-context reasoning, Large Language Models, atomic skills, reinforcement learning, benchmarks, performance improvement, pseudo datasets, reasoning decomposition
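
The headline "7.7%" in this abstract is an absolute difference between two averages (46.3% and 54.0%), not a relative change. A short calculation makes the distinction explicit; only the two averages come from the abstract, and the variable names are illustrative.

```python
baseline_avg = 46.3   # average benchmark score of the baseline, in percent
improved_avg = 54.0   # average benchmark score after training, in percent

# Absolute gain, measured in percentage points.
absolute_gain_pp = round(improved_avg - baseline_avg, 1)

# Relative gain, measured as a percentage of the baseline score.
relative_gain_pct = round((improved_avg - baseline_avg) / baseline_avg * 100, 1)

print(f"absolute: {absolute_gain_pp} pp, relative: {relative_gain_pct}%")
```

Reading the figure as percentage points matters when comparing against papers that report relative improvements, which would state the same result as roughly 16.6%.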

113. ❌ Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Authors: Jiayuan Ye, Vitaly Feldman, Kunal Talwar Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

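Scoring rationale: The paper's core topic is fact memorization in large language models (LLMs) during pre-training, and it proposes training-loss-based data selection to improve fact accuracy. The directly relevant keywords are LLMs (the core object of study), Pre-training (the research setting), Hallucination Mitigation (it addresses hallucination), and Scaling Laws AND Data Quality (it relates data quality to model capacity). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how pruning training data during pre-training can improve fact memorization in LLMs. It proposes a data selection method based on training loss that lets a small model memorize more entity facts and match the performance of a larger model.

Abstract (translated)

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, which often leads to hallucinations and poor performance on knowledge-intensive tasks. This paper formalizes fact memorization from an information-theoretic perspective and studies how the training data distribution affects fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information in the training-data facts exceeds model capacity. The problem worsens when the fact frequency distribution is skewed (for example, when it follows a power law).
We propose data selection schemes based on training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively raises fact accuracy to the capacity limit. When pre-training language models from scratch on an annotated Wikipedia corpus, our selection method lets a GPT2-Small model (110M parameters) memorize 1.3× more entity facts than standard training, matching a 10× larger model (1.3B parameters) pre-trained on the full dataset.

Abstract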

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

Keywords: Large Language Models, Fact Memorization, Training Data Pruning, Pre-training, Hallucination Mitigation, Data Quality, Model Capacity, Information Theory
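
The abstract names two goals for data selection: limiting the number of facts and flattening their frequency distribution. The authors' schemes rely on training loss alone and are not detailed in the abstract, so the sketch below is not their method. It only illustrates the two goals on a toy corpus, under the simplifying assumption that each example carries an explicit fact ID; the `prune` function, its parameters, and the choice to keep the most frequent facts are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import List, Tuple


def prune(examples: List[Tuple[str, str]],
          max_facts: int,
          max_repeats: int) -> List[Tuple[str, str]]:
    """Keep at most `max_facts` distinct facts and at most `max_repeats` examples of each.

    examples: (fact_id, text) pairs. Keeping the most frequent facts is an
    arbitrary choice made for this illustration.
    """
    counts = Counter(fact_id for fact_id, _ in examples)
    kept_facts = {fact_id for fact_id, _ in counts.most_common(max_facts)}
    seen = defaultdict(int)
    kept = []
    for fact_id, text in examples:
        if fact_id in kept_facts and seen[fact_id] < max_repeats:
            seen[fact_id] += 1
            kept.append((fact_id, text))
    return kept


# Skewed toy corpus: fact A appears 6 times, B 3 times, C once.
corpus = (
    [("A", f"A text {i}") for i in range(6)]
    + [("B", f"B text {i}") for i in range(3)]
    + [("C", "C text 0")]
)

pruned = prune(corpus, max_facts=2, max_repeats=2)
before = Counter(fact_id for fact_id, _ in corpus)
after = Counter(fact_id for fact_id, _ in pruned)
print(dict(before), dict(after))
```

Capping the number of distinct facts corresponds to keeping the total information within model capacity, and capping repeats corresponds to flattening a skewed distribution; a loss-based scheme would have to achieve both without access to fact IDs.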

114. ❌ sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

Authors: Sergey V Samsonau Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08501v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

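Scoring rationale: The paper presents a quality-verification tool for scientific writing that uses open-source models to check, locally, whether references are genuine and relevant. It is highly relevant to 'AI for Science' (10 points) because it applies directly to scientific work; relevant to 'Retrieval-Augmented Generation' (8 points) because it retrieves and verifies cited literature; and relevant to 'Hallucination Mitigation' (8 points) because it aims to reduce fabricated citations in scientific writing. It is partly related to 'Large Language Models' and 'Small Language Models' (5 points each), since the tool may use an LLM for adjudication and runs locally (similar to on-device use). Other keywords, such as MoE, Scaling Laws, training methods, and inference optimization, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address inadequate quality assurance in scientific writing, the paper develops sciwrite-lint, an open-source tool that locally verifies whether the references in a scientific manuscript are genuine and relevant, and proposes the SciLint Score framework, which combines integrity verification with an assessment of scientific contribution.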

Abstract (Chinese translation)

当前科学界存在两种质量保障方案，但均存在不足。期刊守门机制声称同时验证论文的完整性与学术贡献，实则衡量的是声望：同行评审过程缓慢、存在偏见，即使在顶级出版平台也无法识别伪造的参考文献。开放科学则完全不提供质量保障：在人工智能生成的文本与公开学术记录之间，唯一的过滤机制仅依赖作者的诚信。人工智能辅助写作加剧了这两类系统的缺陷，其快速生成论文的能力远超任何系统的处理负荷。
我们提出第三种方案：直接测量论文本身。sciwrite-lint（可通过 pip install sciwrite-lint 安装）是一款面向科学手稿的开源代码检测工具，完全在研究者本地设备运行（仅需免费公共数据库、消费级GPU和开放权重模型），无需将稿件传输至外部服务。该流程可验证参考文献是否存在、核查撤稿状态、比对元数据与权威记录、下载并解析被引文献、核验文献是否支持文中相关论断，并进一步追踪被引文献本身的参考文献。系统通过整合所有验证信号，为每篇参考文献生成独立可靠性评分。我们在arXiv和bioRxiv的30篇未训练论文中注入错误，并采用大型语言模型裁定的假阳性分析对流程进行评估。
作为实验性延伸，我们提出SciLint评分体系，将完整性验证与贡献度评估相结合。该体系将科学哲学中五种理论框架（波普尔、拉卡托斯、基彻尔、劳丹、梅奥）转化为可计算的科学论证结构特征。完整性验证模块是工具的核心，本文已对其展开评估；贡献度评估模块则以实验性代码形式发布，供学术社区共同开发。

Abstract

Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI-generated text and the public record is the author’s integrity. AI-assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite-lint (pip install sciwrite-lint) is an open-source linter for scientific manuscripts that runs entirely on the researcher’s machine (free public databases, a single consumer GPU, and open-weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers’ own bibliographies. Each reference receives a per-reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM-adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.
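The abstract says each reference receives a reliability score that aggregates all verification signals, but it does not give the aggregation rule. The sketch below is only one plausible reading: a weighted mean over four signals. The `ReferenceSignals` fields, the `WEIGHTS` values, and the `reliability_score` function are all hypothetical names chosen for illustration and are not the sciwrite-lint API.

```python
from dataclasses import dataclass


@dataclass
class ReferenceSignals:
    """Hypothetical verification signals for one reference, each in [0, 1]."""
    exists: float          # the reference was found in a public database
    not_retracted: float   # 1.0 when no retraction notice was found
    metadata_match: float  # agreement with the canonical record
    claim_support: float   # how well the cited paper supports the claim


# Illustrative weights (an assumption, not taken from the paper).
WEIGHTS = {
    "exists": 0.4,
    "not_retracted": 0.2,
    "metadata_match": 0.2,
    "claim_support": 0.2,
}


def reliability_score(signals: ReferenceSignals) -> float:
    """Weighted mean of the signals; a reference that does not exist scores 0."""
    if signals.exists == 0.0:
        return 0.0
    weighted = sum(w * getattr(signals, name) for name, w in WEIGHTS.items())
    return weighted / sum(WEIGHTS.values())
```

A hard zero for a non-existent reference is a design choice in this sketch: the remaining signals have no meaning when the cited work cannot be found.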

Keywords: scientific manuscript verification, reference validation, open-source linter, AI-assisted writing, integrity verification, local processing, citation analysis, science quality assurance

115. ❌ What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Authors: Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08494v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

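Scoring rationale: The paper studies semantic scanpath similarity in eye tracking, using vision-language models (VLMs) and NLP metrics. It is unrelated to most keywords because it focuses on a specific application rather than on advances in the underlying techniques of large models. The only directly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 8.0), because the paper applies AI to eye-tracking research, an application of AI in a scientific field. 'Large Language Models OR LLMs OR Foundation Models' (relevance 5.0) is indirectly related, because VLMs are multimodal foundation models, but the paper does not examine LLM techniques in depth. Other keywords, such as MoE, Scaling Laws, and RLHF, are not covered.

!!! tip deepseek-chat TL;DR

Because existing scanpath similarity metrics in eye-movement research ignore semantic content, the paper proposes a semantic scanpath similarity framework built on vision-language models and NLP metrics. Experiments show that it captures variance partially independent of geometric alignment, adding a content-aware, complementary dimension to eye-movement research.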

Abstract (Chinese translation)

扫视路径相似性度量是眼动研究的核心，然而现有方法主要评估空间与时间对齐，却忽视了被注视图像区域间的语义对等性。我们提出了一种语义扫视路径相似性框架，将视觉-语言模型（VLMs）整合到眼动追踪分析中。每个注视点在受控视觉上下文（基于图像块和基于标记的策略）下被编码，并转化为简洁的文本描述，这些描述被聚合为扫视路径级别的表征。随后，使用基于嵌入和基于词汇的自然语言处理（NLP）度量计算语义相似性，并与包括MultiMatch和动态时间规整（DTW）在内的经典空间度量进行比较。在自由观看眼动数据上的实验表明，语义相似性捕捉了与几何对齐部分独立的方差，揭示了尽管空间分布存在差异但内容一致性高的案例。我们进一步分析了上下文编码对描述保真度与度量稳定性的影响。我们的研究结果表明，多模态基础模型能够实现经典扫视路径分析的可解释、内容感知的扩展，为ETRA社区内的注视研究提供了一个互补的维度。

Abstract

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
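To make the lexical half of the comparison concrete, here is a minimal sketch that assumes the per-fixation descriptions have already been produced (the VLM step is omitted). It uses Jaccard token overlap as one illustrative lexical metric; the abstract does not say which lexical metrics the authors use, and the function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def scanpath_text(descriptions: list[str]) -> str:
    """Aggregate per-fixation descriptions into one scanpath-level string."""
    return " ".join(descriptions)


def lexical_similarity(path_a: list[str], path_b: list[str]) -> float:
    """Jaccard overlap between the token sets of two scanpaths."""
    tokens_a = tokenize(scanpath_text(path_a))
    tokens_b = tokenize(scanpath_text(path_b))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty scanpaths are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Because the token sets ignore fixation order and position, two observers who look at the same objects in a different order score 1.0 here, which is the kind of content agreement despite spatial divergence that the abstract describes.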

Keywords: semantic scanpath similarity, vision-language models, eye-tracking analysis, NLP metrics, multimodal foundation models, gaze research, content-aware analysis, ETRA community

116. ❌ Formalizing building-up constructions of self-dual codes through isotropic lines in Lean

Authors: Jae-Hyun Baek, Jon-Lark Kim Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08485v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

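Scoring rationale: The paper studies constructions of self-dual codes in coding theory, a pure-mathematics topic entirely unrelated to every keyword on large models, deep learning, or AI applications. Its content covers finite fields, geometric constructions, and formal verification (Lean), with no artificial intelligence or machine learning component.

!!! tip deepseek-chat TL;DR

The paper unifies constructions of self-dual codes through isotropic lines and hyperbolic geometry, provides a Lean formalization, and constructs several optimal self-dual codes.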

Abstract (Chinese translation)

本文目的有二。其一，我们证明Kim的二元自对偶码的构建构造与Chinburg-Zhang的希尔伯特符号构造等价。其二，我们引入Chinburg-Zhang构造的$q$元版本，以高效构建$q$元自对偶码。对于后者，我们通过三个互补的视角研究分裂有限域$\mathbb{F}_q$（满足$q \equiv 1 \pmod{4}$）上的自对偶码：构建构造、Chinburg-Zhang的二元算术约化，以及欧几里得平面的双曲几何。其中$-1$为平方元这一条件是联系这些视角的共同代数基础：在二元情形中，它构成了拉格朗日约化图景的基础；而在分裂$q$元情形中，它产生了支配扩展公式中修正项的迷向线。作为我们高效生成矩阵形式的一个应用，我们通过分裂盒构造构建了最优自对偶码，包括$\mathrm{GF}(5)$上的自对偶$[6,3,4]$与$[8,4,4]$码、$\mathrm{GF}(13)$上的MDS自对偶$[8,4,5]$与$[10,5,6]$码，以及$\mathrm{GF}(13)$上的自对偶$[12,6,6]$码。这些结构陈述辅以代数核心部分的Lean 4形式化验证。

Abstract

The purpose of this paper is two-fold. First we show that Kim’s building-up construction of binary self-dual codes is equivalent to Chinburg-Zhang’s Hilbert symbol construction. Second we introduce a $q$-ary version of Chinburg-Zhang’s construction in order to construct $q$-ary self-dual codes efficiently. For the latter, we study self-dual codes over split finite fields $\mathbb{F}_q$ with $q \equiv 1 \pmod{4}$ through three complementary viewpoints: the building-up construction, the binary arithmetic reduction of Chinburg-Zhang, and the hyperbolic geometry of the Euclidean plane. The condition that $-1$ be a square is the common algebraic input linking these viewpoints: in the binary case it underlies the Lagrangian reduction picture, while in the split $q$-ary case it produces the isotropic line governing the correction terms in the extension formulas. As an application of our efficient form of generator matrices, we construct optimal self-dual codes from the split boxed construction, including self-dual $[6,3,4]$ and $[8,4,4]$ codes over $\mathrm{GF}(5)$, MDS self-dual $[8,4,5]$ and $[10,5,6]$ codes over $\mathrm{GF}(13)$, and a self-dual $[12,6,6]$ code over $\mathrm{GF}(13)$. These structural statements are accompanied by a Lean 4 formalization of the algebraic core.
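A small worked check may help with the condition that $-1$ be a square. The vectors below are illustrative choices and are not taken from the paper. For the two fields used in the examples,

$$
2^2 = 4 \equiv -1 \pmod{5}, \qquad 5^2 = 25 \equiv -1 \pmod{13}.
$$

If $a^2 = -1$ in $\mathbb{F}_q$, the vector $(1, a)$ is orthogonal to itself under the Euclidean form $x_1 y_1 + x_2 y_2$, so it spans an isotropic line:

$$
(1,2)\cdot(1,2) = 1 + 4 = 5 \equiv 0 \pmod{5}, \qquad (1,5)\cdot(1,5) = 1 + 25 = 26 \equiv 0 \pmod{13}.
$$

By contrast, when $q \equiv 3 \pmod{4}$ no such $a$ exists; for $q = 3$ the only squares are $0$ and $1$, and $-1 \equiv 2$ is neither.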

Keywords: self-dual codes, building-up construction, isotropic lines, finite fields, Lean formalization, optimal codes, hyperbolic geometry, generator matrices

117. ❌ AI generates well-liked but templatic empathic responses

Authors: Emma Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08479v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

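Scoring rationale: The paper directly studies how LLMs perform in emotional-support settings, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the technical details or innovations behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study suggests that LLM responses are rated as more empathic than human ones in emotional support because they follow a highly templated way of expressing empathy, and it shows how formulaic AI-generated empathic responses are.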

Abstract (Chinese translation)

近期研究表明，越来越多的人开始向大语言模型（LLMs）寻求情感支持，并且人们认为LLM生成的回应比人类撰写的回应更具共情力。我们提出这种成功的一个原因：LLMs已经学会并持续运用一种广受欢迎的共情表达模板。我们构建了一个包含10种共情语言“策略”的分类体系，这些策略包括确认他人感受和转述释义等，并应用该体系来刻画人类和LLMs在撰写共情回应时产生的语言特征。通过两项研究（共涉及n=3,265条由六个模型生成的AI回应和n=1,290条人类撰写的回应）的比较分析，我们发现LLM的回应在话语功能层面高度模式化。我们识别出一个模板——一种结构化的策略序列——该模板匹配了83%至90%的LLM回应（在预留样本中匹配率为60%至83%），且匹配时覆盖了回应的81%至92%内容。相比之下，人类撰写的回应则更为多样。最后，我们讨论了这一发现对AI生成共情未来发展的影响。

Abstract

Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language “tactics” that include validating someone’s feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template – a structured sequence of tactics – that matches between 83–90% of LLM responses (and 60–83% in a held out sample), and when those are matched, covers 81–92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.
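The two reported quantities (whether a template matches a response, and how much of a matched response it covers) can be illustrated with a toy matcher. This sketch assumes each response has already been annotated as a sequence of tactic labels and treats the template as an ordered list whose steps may repeat. The matching rule, the function name, and every tactic label other than validation and paraphrasing are hypothetical; the abstract does not describe the authors' actual procedure.

```python
def match_template(tactics: list[str], template: list[str]) -> tuple[bool, float]:
    """Greedy in-order match of a tactic sequence against a template.

    Returns (matched, coverage), where coverage is the fraction of the
    response's tactic segments that the template accounts for.
    """
    if not tactics:
        return False, 0.0
    next_step = 0  # index of the next template step still to be found
    covered = 0    # number of segments explained by the template
    for tactic in tactics:
        if next_step < len(template) and tactic == template[next_step]:
            next_step += 1
            covered += 1
        elif next_step > 0 and tactic == template[next_step - 1]:
            covered += 1  # the current template step is repeated
    matched = next_step == len(template)
    return matched, (covered / len(tactics) if matched else 0.0)
```

For example, with the hypothetical template `["paraphrase", "validate", "advise"]`, the sequence `["paraphrase", "validate", "validate", "question", "advise"]` matches with coverage 0.8, because only the `"question"` segment falls outside the template.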

Keywords: Large Language Models, LLMs, empathic responses, emotional support, template, discourse analysis, human-AI comparison, AI-generated empathy

118. ❌ Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Authors: Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

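Scoring rationale: The paper focuses on test-time evidence retrieval and grounding for vision-language models (VLMs) and proposes a training-free grounding technique based on entropy gradients. All scoring keywords target text-only language models (LLMs) or general large-model techniques, whereas this paper studies multimodal vision-language models; its core content (visual token embeddings, entropy-gradient computation, spatial localization) has no direct link to the given text-centric keywords. Although both belong to deep learning, the technical direction, model type, and application scenario differ entirely, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the difficulty pretrained vision-language models have with detail-sensitive and multi-evidence queries, the paper proposes a training-free, entropy-gradient-based evidence retrieval method: it computes the entropy of the next-token distribution and backpropagates it to the visual token embeddings to produce a relevance map. Across seven benchmarks it improves consistently over existing methods and yields more interpretable evidence localization.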

Abstract (Chinese translation)

尽管进展迅速，预训练的视觉语言模型在处理依赖细微视觉细节或需要整合跨多个区域线索的任务（如文档理解和组合式查询）时仍面临困难。我们通过将视觉定位问题重构为测试时证据检索来解决这一挑战：给定一个查询，模型应主动识别下一步需要关注哪些区域以消除歧义。为此，我们提出一种无需训练、基于模型内在机制的定位方法，该方法以不确定性作为监督信号。具体而言，我们计算模型下一词元分布的信息熵，并通过反向传播将其传递至视觉词元嵌入，从而获得熵梯度相关性图，整个过程无需依赖外部检测器或注意力图启发式方法。随后，我们提取并排序多个连贯区域以支持多证据查询，并引入一种结合空间熵停止规则的迭代缩放与再定位流程，以避免过度细化。在涵盖四种视觉语言模型架构的七个基准测试上的实验表明，本方法相较于现有技术实现了持续改进，在细节敏感和高分辨率场景中提升尤为显著，同时能生成更具可解释性的证据定位结果。

Abstract

Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model’s next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
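The core computation (entropy of the next-token distribution, differentiated with respect to each visual token embedding) can be sketched on a toy model. Two substitutions are made here and should be read as assumptions: `toy_logits` is a hypothetical stand-in for a real VLM, and the gradient is estimated with central finite differences instead of backpropagation so that the example needs only the standard library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_logits(embeddings, weights):
    """Stand-in model: logit_k = sum_i sum_d W[k][d] * tanh(e_i[d])."""
    return [
        sum(w_k[d] * math.tanh(e[d]) for e in embeddings for d in range(len(e)))
        for w_k in weights
    ]


def next_token_entropy(embeddings, weights):
    return entropy(softmax(toy_logits(embeddings, weights)))


def entropy_gradient_relevance(embeddings, weights, eps=1e-5):
    """Per-token relevance: L2 norm of dH/de_i, via central differences."""
    relevance = []
    for i, emb in enumerate(embeddings):
        squared = 0.0
        for d in range(len(emb)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad = (next_token_entropy(plus, weights)
                    - next_token_entropy(minus, weights)) / (2 * eps)
            squared += grad * grad
        relevance.append(math.sqrt(squared))
    return relevance
```

In this toy setup a token whose embedding saturates the `tanh` barely changes the entropy when perturbed, so it receives a low relevance score, while a token near zero receives a higher one. In a real VLM the same map would be obtained with automatic differentiation and reshaped onto the image grid.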

Keywords: vision-language models, evidence retrieval, entropy-gradient, training-free grounding, visual token embeddings, multi-evidence queries, interpretable localization, spatial-entropy stopping

119. ❌ AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Authors: Lilian Wanzare, Cynthia Amol, zekiel Maina, Nelson Odhiambo, Hope Kerubo, Leila Misula, Vivian Oloo, Rennish Mboya, Edwin Onkoba, Edward Ombui, Joseph Muguro, Ciira wa Maina, Andrew Kipkebut, Alfred Omondi Otom, Ian Ndung’u Kang’ethe, Angela Wambui Kanyi, Brian Gichana Omwenga Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

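Scoring rationale: The paper focuses on building a multilingual speech dataset (AfriVoices-KE) with about 3,000 hours of audio in Kenyan languages, aiming to address the underrepresentation of African languages in speech technology. Its core contributions are the dataset's creation, collection methodology, and quality control, which lay groundwork for speech recognition and text-to-speech systems. The given keywords all concern large models, the principles of deep learning techniques, or AI applications in science, but the paper involves no large models, deep learning techniques, algorithmic innovation, or specific scientific application. It only notes that the dataset can support speech-system development and does not discuss model training, optimization, inference, alignment, agents, tool use, quantization, or interpretability. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper creates AfriVoices-KE, a large-scale multilingual speech dataset with about 3,000 hours of audio in Kenyan languages, to address the underrepresentation of African languages in speech technology and to provide a foundational resource for building inclusive automatic speech recognition and text-to-speech systems.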

Abstract (Chinese translation)

AfriVoices-KE是一个大规模多语言语音数据集，包含约3,000小时的音频，涵盖五种肯尼亚语言：Dholuo（卢奥语）、Kikuyu（基库尤语）、Kalenjin（卡伦津语）、Maasai（马赛语）和Somali（索马里语）。该数据集包含750小时的脚本化语音和2,250小时的自发性语音，采集自4,777名来自不同地区和人口背景的母语者。本研究通过提供一个高质量、语言多样性的资源，旨在解决非洲语言在语音技术领域代表性严重不足的问题。数据收集采用双重方法：脚本化录音取材于已编译的文本语料库、翻译文本，以及涵盖与肯尼亚语境相关的十一个领域的特定领域生成句子；而非脚本语音则通过文本和图像提示引导采集，以捕捉自然的语言变体和方言细微差别。项目通过定制化的移动应用程序，使贡献者能够使用智能手机进行录音。质量保证在多个层面实施，包括录音前的自动信噪比验证以及人工审核内容准确性。尽管项目在低资源环境中遇到了常见挑战，包括基础设施不可靠、设备兼容性问题以及社区信任障碍，但通过本地动员者、利益相关方合作伙伴关系和适应性培训方案，这些挑战得以缓解。AfriVoices-KE为开发包容性的自动语音识别和文本转语音系统提供了基础资源，同时推动了肯尼亚语言遗产的数字化保护。

Abstract

AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya’s linguistic heritage.
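The automated signal-to-noise check mentioned in the abstract can be sketched with the standard definition of SNR in decibels, $10 \log_{10}(P_{\text{signal}} / P_{\text{noise}})$. The abstract gives neither the threshold nor the way the noise floor is measured, so the 20 dB default and the use of a separate ambient-noise sample are assumptions, and the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two sample sequences."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def passes_snr_check(signal, noise, min_snr_db=20.0):
    """True when the estimated SNR meets the (assumed) minimum threshold."""
    return snr_db(signal, noise) >= min_snr_db
```

A speech probe with ten times the amplitude of the ambient noise has one hundred times the power, which corresponds to roughly 20 dB.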

Keywords: multilingual speech dataset, Kenyan languages, automatic speech recognition, text-to-speech systems, low-resource languages, data collection methodology, quality assurance, linguistic diversity

120. ❌ SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization

Authors: Seyed Mahmoud Sajjadi Mohammadabadi, Xiaolong Ma, Lei Yang, Feng Yan, Junshan Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

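Scoring rationale: SOLAR focuses on compressing the communication and storage costs of parameter-efficient fine-tuning (PEFT) methods, a direct technical contribution to the PEFT/LoRA area. The paper explicitly mentions PEFT, LoRA, and AdaLoRA, so 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' receives the maximum of 15 points. Its experiments use foundation models such as LLaMA and GPT, making it highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It involves model compression, which relates to 'Quantization OR Model Compression OR Low-bit Weights' (8 points), and it mentions edge-device deployment, which relates indirectly to 'Small Language Models OR SLMs OR On-device AI' (5 points). Other keywords, such as MoE, Scaling Laws, RAG, and RLHF, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

SOLAR proposes a post-training compression framework based on subspace-oriented latent adapter reparameterization that substantially reduces the communication and storage costs of parameter-efficient fine-tuning (PEFT) adapters while preserving task performance.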

Abstract (Chinese translation)

参数高效微调（Parameter-efficient fine-tuning, PEFT）方法，例如LoRA，通过注入低秩适配器实现了基础模型的可扩展适配。然而，在资源受限的环境中，其通信与存储成本仍是主要瓶颈。我们提出了SOLAR（面向子空间的潜在适配器重参数化），一种训练后压缩框架，可显著降低PEFT适配器的通信成本（即需要传输或存储的参数数量）。SOLAR将每个PEFT更新表示为基础向量（由基础模型的奇异向量经受控随机扰动形成）的线性组合。通过利用基础模型与任务特定微调更新之间的子空间相似性（即主方向的对齐），SOLAR将适配器大小与PEFT结构解耦，并确保紧凑且富有表现力的表示。该方法是模型无关的，且兼容现有的PEFT方法，包括LoRA、AdaLoRA及其他适配器模块。我们从理论上建立了重构误差的上界。在语言和视觉任务上使用LLaMA、GPT和ViT模型进行的实验表明，SOLAR在保持任务性能的同时，显著减少了模型表示的大小，为分布式系统和边缘设备中的部署提供了一种高效且通信成本低的解决方案。

Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses each PEFT update as a linear combination of basis vectors formed from the foundation model’s singular vectors with controlled random perturbations. By exploiting the subspace similarity (the alignment of principal directions) between the foundation model and task-specific fine-tuned updates, SOLAR decouples the adapter size from PEFT structure and ensures compact yet expressive representations. It is model-agnostic and compatible with existing PEFT methods, including LoRA, AdaLoRA, and other adapter modules. We theoretically establish a bound on the reconstruction error. Experiments on language and vision tasks using LLaMA, GPT, and ViT models demonstrate that SOLAR preserves task performance while significantly reducing model representation sizes, offering an effective and communication-efficient solution for deployment in distributed systems and edge devices.
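The idea of sending coefficients over a shared basis instead of the adapter weights themselves can be shown in a few lines. This sketch makes a simplifying assumption that the abstract does not state: the basis is orthonormal, so the least-squares coefficients are plain dot products. SOLAR builds its basis from the foundation model's singular vectors with controlled random perturbations, which this toy example does not reproduce, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def compress(update, basis):
    """Project a flattened update onto an orthonormal basis.

    Only these coefficients need to be transmitted or stored, provided
    both sides can regenerate the same basis.
    """
    return [dot(update, b) for b in basis]


def reconstruct(coeffs, basis):
    """Rebuild the update as a linear combination of the basis vectors."""
    dim = len(basis[0])
    return [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(dim)]


def reconstruction_error(update, basis):
    """L2 norm of the part of the update that lies outside the basis span."""
    approx = reconstruct(compress(update, basis), basis)
    return math.sqrt(sum((u - a) ** 2 for u, a in zip(update, approx)))
```

The error is zero when the update lies in the span of the basis and grows with the component outside it, which is why alignment between the basis and the fine-tuning update matters for compression quality.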

Keywords: Parameter-efficient fine-tuning, PEFT, LoRA, communication-efficient, model compression, adapter reparameterization, foundation models, edge deployment

121. ❌ Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Authors: Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08362v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
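
The rows above can be checked against the report's own pass rule. The sketch below is a minimal illustration, assuming that each row's score is its weight multiplied by its relevance and capped at the per-keyword maximum of 15 stated in the configuration. The product rule and the cap are assumptions inferred from the table, and the function names are hypothetical. Only the threshold formula (5.0 + 0.8 × total weight) is taken from the report itself.

```python
# Sketch of the scoring rule implied by the report configuration and the tables.
# ASSUMPTION: row score = weight * relevance, capped at the per-keyword maximum.

MAX_PER_KEYWORD = 15.0          # "maximum score per keyword" in the configuration
CONFIGURED_TOTAL_WEIGHT = 27.0  # 27 keywords, each configured with weight 1.0


def pass_threshold(total_weight: float) -> float:
    """Pass score as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """Hypothetical per-row score: weight times relevance, capped."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def paper_score(rows) -> float:
    """Sum the per-row scores of one paper."""
    return sum(row_score(weight, relevance) for weight, relevance in rows)


# Non-zero rows of the table above as (weight, relevance); all other rows are 0.0.
rows_121 = [(0.0, 10.0), (0.0, 5.0), (0.0, 10.0)]

threshold = pass_threshold(CONFIGURED_TOTAL_WEIGHT)
total = paper_score(rows_121)
passed = total >= threshold
print(round(threshold, 1), total, passed)
```

With every weight listed as 0.0, each row contributes nothing under this assumed rule, which would be consistent with the 0.0 total shown for this paper despite three non-zero relevance values.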

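Scoring rationale: The paper's core topic is the ability of LLMs to act as general-purpose user simulators, and it introduces the OmniBehavior benchmark to evaluate that ability, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each). The paper notes "performance plateauing even as context windows expand", which gives it some connection to "Context Window Extension" (5 points). Other keywords such as MoE, SFT, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper's topic, so each scores 0.

!!! tip deepseek-chat TL;DR

This paper studies the limitations of large language models in simulating real-world human behavior. Using OmniBehavior, a benchmark built from real-world data, it finds that current LLMs hit a performance ceiling when simulating long-horizon, cross-scenario, heterogeneous behavior and exhibit a structural bias that erases individual differences and long-tail behaviors.

Abstract translation (Chinese)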

大型语言模型（LLM）的出现揭示了一种通用用户模拟器的潜力。然而，现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据，未能捕捉真实人类行为的整体性。为弥补这一差距，我们提出了OmniBehavior——首个完全基于真实世界数据构建的用户模拟基准，它将长时程、跨场景和异构行为模式整合到一个统一框架中。基于此基准，我们首先通过实证证明：以往采用孤立场景的数据集存在视野局限，而真实世界的决策依赖于长期、跨场景的因果链。对前沿大型语言模型的广泛评估表明，当前模型难以准确模拟这些复杂行为，即使上下文窗口扩展，其性能仍趋于停滞。关键的是，通过对模拟行为与真实行为的系统比较，我们发现了一种根本性的结构偏差：大型语言模型倾向于收敛于一个“平均化的积极人格”，表现出过度活跃、角色同质化和乌托邦偏见。这导致个体差异与长尾行为的丢失，为未来高保真模拟研究指明了关键方向。

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

Keywords: Large Language Models, user simulation, benchmark, real-world data, long-horizon behavior, cross-scenario, heterogeneous behavior, structural bias

122. ❌ When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

Authors: Ruotao Xu, Yixin Ji, Yu Luo, Jinpeng Li, Dong Li, Peifeng Li, Juntao Li, Min Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08281v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

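Scoring rationale: The paper studies how models calibrate their trust in tool results during Tool-Integrated Reasoning (TIR). Its core touches on LLM reasoning, tool use, agentic workflows, multi-step reasoning, System 2 thinking, and self-correction. It proposes the ATTC framework to resolve the trust decision when the model's reasoning conflicts with a tool result, so it is highly relevant to reasoning methods, agent behavior, and tool use, somewhat relevant to hallucination mitigation and explainability, and unrelated to the other keywords.

!!! tip deepseek-chat TL;DR

This paper addresses the tendency of models in tool-integrated reasoning to trust their own reasoning and ignore correct tool results. It proposes ATTC, an adaptive tool trust calibration framework, and reports experiments showing that it reduces this tool-ignoring behavior and improves model performance by 4.1% to 7.5% across multiple datasets.

Abstract translation (Chinese)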

大型推理模型（LRMs）通过扩展测试时计算实现了显著的性能提升，但由于底层语言模型的固有局限性，它们在需要精确计算和广泛知识储备的任务中仍存在不足。工具集成推理（Tool-Integrated Reasoning, TIR）作为一种新兴范式，将工具调用与执行融入推理轨迹中，展现出巨大潜力。尽管近期研究已发布了一些强大的开源TIR模型，我们的分析表明这些模型仍存在关键缺陷。我们发现，当模型自身的推理与工具结果冲突时，模型倾向于相信自身推理；且存在工具结果正确却被模型忽略的情况，导致错误答案，我们将此定义为“工具忽略”现象。这表明模型尚未掌握何时应信任或忽略工具。为克服这些局限，我们提出了自适应工具信任校准（Adaptive Tool Trust Calibration, ATTC）框架，该框架通过引导模型基于生成代码块的置信度分数，自适应地选择信任或忽略工具结果。在不同规模的开源TIR模型及多个数据集上的实验结果表明，ATTC能有效缓解“工具忽略”问题，使模型性能提升4.1%至7.5%。

Abstract

Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. And there are cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as “Tool Ignored”. This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the “Tool Ignored” issue, resulting in a performance increase of 4.1% to 7.5%.
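
The abstract says only that the trust decision is based on a confidence score for the generated code block. It does not describe how that score is computed or how it is thresholded. As a rough illustration of the general idea, and not of ATTC itself, the sketch below assumes the confidence is the geometric-mean token probability of the code block and that a fixed threshold separates "trust" from "ignore". The function names, the threshold value, and the log-probability inputs are all hypothetical.

```python
import math


def code_block_confidence(token_logprobs):
    """Assumed confidence: geometric mean of the code block's token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(model_answer, tool_answer, token_logprobs, threshold=0.8):
    """Pick between the model's own answer and the tool result.

    When the two agree there is nothing to decide. When they conflict, the tool
    result is trusted only if the code that produced it was generated with high
    confidence (hypothetical rule, not taken from the paper).
    """
    if model_answer == tool_answer:
        return model_answer, "agree"
    if code_block_confidence(token_logprobs) >= threshold:
        return tool_answer, "trust_tool"
    return model_answer, "ignore_tool"


# Invented log-probabilities for two generated code blocks.
confident_case = choose_answer("41", "42", [-0.05, -0.02, -0.10])
uncertain_case = choose_answer("41", "42", [-1.2, -0.9, -2.0])
print(confident_case, uncertain_case)
```

In this toy setup the first call sides with the tool and the second keeps the model's own answer. A real system would need to calibrate the threshold on held-out data.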

Keywords: Tool-Integrated Reasoning, Large Reasoning Models, Tool Trust Calibration, Adaptive Framework, Math Reasoning, Code Confidence Score, Tool Ignored Problem, Performance Enhancement

123. ❌ Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions

Authors: Prisca Piccirilli, Alexander Fraser, Sabine Schulte im Walde Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08275v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

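Scoring rationale: The paper is a contrastive analysis of metaphorical and literal language in verb-object constructions and belongs to computational and corpus linguistics. It uses conventional NLP tools (such as sentiment analysis and syntactic parsing) for feature extraction and statistical analysis, and does not involve large models, innovations in deep learning techniques, or applications of large models in other domains. All scoring keywords focus on modern AI topics such as large-model techniques, training methods, inference optimization, and application frameworks, whereas this paper is a traditional empirical study of a linguistic phenomenon and is unrelated to those topics.

!!! tip deepseek-chat TL;DR

Through large-scale corpus analysis, this paper compares the metaphorical and literal usage of 297 English verb-object constructions. It finds that the differences depend mainly on the specific construction rather than following a single distributional pattern: literal contexts show higher lexical frequency and structural regularity, while metaphorical contexts show greater affective load and lexical diversity.

Abstract translation (Chinese)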

隐喻在日常语言中无处不在，使说话者能够通过具体领域表达抽象概念。尽管先前研究已从认知和心理语言学角度探讨隐喻，但与字面语言的大规模对比研究仍显不足，尤其针对近义表达形式。本研究分析了约200万语料库句子中的297组英语动宾搭配（例如：float idea与suggest idea），考察其语境使用特征。通过五种自然语言处理工具，我们提取了2,293项涵盖情感、词汇、句法和语篇层面的认知与语言特征。我们主要探究：（1）隐喻与字面语境的特征差异（跨配对分析）；（2）单个动宾搭配内部是否存在分化（配对内分析）。跨配对分析结果显示：字面语境具有更高的词汇频率、语篇衔接度和结构规整性，而隐喻语境则表现出更强的情感负荷、意象性、词汇多样性及构式特异性。配对内分析揭示出显著的异质性，多数搭配呈现非均匀效应。这些结果表明，并不存在单一且稳定的分布模式来区分隐喻与字面用法，差异主要取决于具体构式特征。总体而言，大规模数据与多维特征相结合，为理解动宾搭配中隐喻与字面用法的对比提供了精细化的分析视角。

Abstract

Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.

Keywords: metaphor, literal language, verb-object constructions, corpus analysis, NLP features, cognitive linguistics, distributional patterns, construction-specific differences

124. ❌ Self-Debias: Self-correcting for Debiasing Large Language Models

Authors: Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08243v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

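Scoring rationale: The paper's core topic is bias propagation in LLMs, and it proposes the Self-Debias framework for self-correction. Highly relevant keywords: LLMs (the object of study), Chain of Thought (the vehicle of bias propagation), and Self-Correction (the core method). Moderately relevant keywords: Instruction Tuning/Alignment (it involves value alignment), RLHF/DPO (related to preference optimization methods), and Hallucination Mitigation (it addresses output quality). Other keywords such as MoE, SLMs, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

This paper targets bias propagation during chain-of-thought reasoning in large language models. It proposes the Self-Debias framework, which achieves self-correction through resource redistribution and trajectory-level optimization, reducing bias with only 20k annotated samples while preserving reasoning ability.

Abstract translation (Chinese)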

尽管大型语言模型（LLMs）展现出卓越的推理能力，但其固有的社会偏见常常在思维链（Chain-of-Thought, CoT）过程中持续传递，导致持续的“偏见传播”。现有的去偏见方法主要侧重于静态约束或外部干预，一旦偏见被触发，便无法识别并中断这一传播过程。为应对这一局限，我们提出了Self-Debias框架，这是一种渐进式框架，旨在赋予模型内在的自我修正能力。具体而言，我们将去偏见过程重新构建为一个策略性资源再分配问题，将模型的输出概率质量视为有限资源，需从有偏见的启发式路径重新分配至无偏见的推理路径。与采用广泛惩罚的标准偏好优化不同，Self-Debias采用细粒度的轨迹级目标，并受动态去偏见约束。这使得模型能够选择性地修正有偏见的推理后缀，同时保留有效的上下文前缀。此外，我们整合了一种在线自我改进机制，利用一致性过滤来自主合成监督信号。仅使用2万条标注样本，Self-Debias即可激活高效的自我修正，在无需持续外部监督的情况下，实现优越的去偏见性能，同时保持通用的推理能力。

Abstract

Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous “Bias Propagation”. Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model’s output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.

Keywords: Large Language Models, Debiasing, Self-correction, Chain-of-Thought, Bias Propagation, Preference Optimization, Reasoning Capabilities, Trajectory-level Objective

125. ❌ Clickbait detection: quick inference with maximum impact

Authors: Soveatin Kuntur, Panggih Kusuma Ningrum, Anna Wróblewska, Maria Ganzha, Marcin Paprzycki Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08148v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

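Scoring rationale: The paper studies clickbait detection using OpenAI semantic embeddings and heuristic features, combined with PCA dimensionality reduction and XGBoost, GraphSAGE, and GCN classifiers. It has no bearing on the keywords, which concern large models, deep learning technical principles, and AI-for-science applications, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a lightweight hybrid method for clickbait detection that combines OpenAI semantic embeddings with heuristic features and uses PCA dimensionality reduction with several classifiers for efficient inference, substantially reducing inference time while keeping competitive performance.

Abstract translation (Chinese)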

本文提出一种轻量级混合方法用于点击诱饵检测，该方法将OpenAI语义嵌入与六项紧凑的启发式特征相结合，以捕捉文本的风格和信息线索。为提升效率，语义嵌入通过主成分分析（PCA）进行降维，并采用XGBoost、GraphSAGE和图卷积网络（GCN）分类器进行评估。尽管简化特征设计使F1分数略有下降，但基于图结构的模型在显著降低推理时间的同时仍实现了具有竞争力的性能。较高的受试者工作特征曲线下面积（ROC-AUC）值进一步表明模型具备强大的判别能力，可在不同决策阈值下对点击诱饵标题实现可靠检测。

Abstract

We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC–AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.

Keywords: clickbait detection, OpenAI semantic embeddings, heuristic features, PCA, XGBoost, GraphSAGE, GCN, inference time reduction

126. ❌ Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Authors: Petr Plecháč, Artjoms Šeļa, Silvie Cinková, Mirella De Sisto, Lara Nugues, Neža Kočnik, Antonina Martynenko, Ben Nagy, Luca Giovannini, Robert Kolár Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08156v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

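Scoring rationale: The paper mainly studies sensitivity to training data size in unsupervised rhyme recognition and compares the RhymeTagger tool with large language models (LLMs). Keyword relevance: 1) "Large Language Models" scores 5 because the experiments include LLMs in a one-shot comparison; 2) "Scaling Laws AND Data Quality" scores 5 because the paper centers on how training data size (scaling) affects performance; 3) "In-context Learning" scores 5 because the paper uses a one-shot learning strategy (a form of in-context learning); 4) "AI for Science" scores 5 because the paper concerns an AI application in computational linguistics and digital humanities. Other keywords (such as MoE, SFT, and RAG) are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how sensitive the unsupervised rhyme recognition tool RhymeTagger is to the amount of training data across languages. It finds that, given enough data, the tool outperforms human inter-annotator agreement, while large language models lacking phonetic representation perform poorly on the task.

Abstract translation (Chinese)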

押韵具有一种看似直观的欺骗性：何为押韵、何不为押韵是历史构建的产物，学者们在押韵分类上存在困难，人们对两个词是否押韵也意见不一。这使自动化的押韵识别与评估变得复杂，尤其是在多语言语境中。本文研究了使用RhymeTagger（一种基于诗歌语料库中重复模式来识别押韵的、与语言无关的工具）进行可靠的无监督押韵识别需要多少训练数据。我们在七种语言（捷克语、德语、英语、法语、意大利语、俄语和斯洛文尼亚语）中评估其性能，考察训练数据规模和语言差异如何影响准确性。为设定一个现实的性能基准，我们评估了人工标注诗歌子集上标注者间的一致性，并分析了导致专家标注分歧的因素：押韵词之间的语音相似性及其在诗中的距离。我们还通过一次性学习策略，将RhymeTagger与三个大型语言模型进行了比较。我们的研究结果表明，一旦提供足够的训练数据，RhymeTagger的表现持续优于人类标注者间的一致性，而缺乏语音表征的大型语言模型在此任务上则表现显著不佳。

Abstract

Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in multilingual contexts. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.
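
The abstract uses inter-annotator agreement as its performance benchmark but does not name the agreement statistic. As a hedged illustration, the sketch below computes Cohen's kappa, one common chance-corrected measure for two annotators, on invented binary "rhymes / does not rhyme" labels. Both the choice of statistic and the data are assumptions, not details from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented annotations: 1 = the word pair rhymes, 0 = it does not.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

Here the annotators agree on 8 of 10 items, while chance alone would give 0.5 agreement, so kappa comes out at 0.6. A tool "outperforming human agreement" would mean that its agreement with the reference annotation exceeds a figure of this kind.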

Keywords: unsupervised rhyme recognition, training data size, RhymeTagger, large language models, one-shot learning, multilingual evaluation, phonetic similarity, inter-annotator agreement

127. ❌ Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

Authors: Soveatin Kuntur, Maciej Krzywda, Anna Wróblewska, Marcin Paprzycki, Maria Ganzha, Szymon Łukasik, Amir H. Gandomi Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

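Scoring rationale: The paper mainly studies graph neural networks for misinformation detection and is unrelated to most of the large-model keywords. It has only a weak connection to "Large Language Models OR LLMs OR Foundation Models" (the abstract mentions LLMs as background, not as the research focus). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

Through systematic benchmarking, this paper finds that classic graph neural networks are more effective and more efficient than non-graph machine learning methods on misinformation detection, challenging the assumption that complex architectures are needed.

Abstract translation (Chinese)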

在线虚假信息的快速传播催生了日益复杂的检测模型，包括大语言模型和混合架构。然而，其计算成本和部署限制引发了对其实际应用性的担忧。本研究在受控且可比较的条件下，将图神经网络（Graph Neural Networks, GNNs）与非图机器学习方法进行了基准测试。我们在英语、印度尼西亚语和波兰语的七个公共数据集上，评估了轻量级GNN架构（图卷积网络GCN、GraphSAGE、图注意力网络GAT、切比雪夫网络ChebNet）与逻辑回归、支持向量机和多层感知机的性能。所有模型均使用相同的TF-IDF特征，以隔离关系结构的影响。性能采用F1分数衡量，并报告推理时间以评估效率。在所有数据集上，GNNs均一致优于非图基线模型。例如，GraphSAGE在Kaggle数据集上达到96.8%的F1分数，在WELFake数据集上达到91.9%，而多层感知机（MLP）的分数分别为73.2%和66.8%。在COVID-19数据集上，GraphSAGE的F1分数达到90.5%，对比多层感知机的74.9%；在FakeNewsNet数据集上，ChebNet达到79.1%，对比多层感知机的66.4%。这些性能提升是在推理时间相当甚至更短的情况下实现的。总体而言，结果表明经典的图神经网络在虚假信息检测中依然高效且有效，这对该领域是否需要日益复杂的架构提出了挑战。

Abstract

The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.
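
The comparison in the abstract rests on F1 scores, so a small worked example may help. The sketch below computes F1 from confusion counts and then the percentage-point gaps between the GNN and MLP figures quoted above. The abstract does not say which F1 variant (binary, macro, or weighted) was used, and the confusion counts in the first part are invented for illustration. Only the four score pairs come from the abstract.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, from confusion counts."""
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Invented counts, for illustration only.
example_f1 = f1_score(tp=8, fp=2, fn=2)

# (GNN F1, MLP F1) in percent, as quoted in the abstract.
# The first three GNN figures are GraphSAGE; the FakeNewsNet figure is ChebNet.
reported = {
    "Kaggle": (96.8, 73.2),
    "WELFake": (91.9, 66.8),
    "COVID-19": (90.5, 74.9),
    "FakeNewsNet": (79.1, 66.4),
}

# Gap in F1 percentage points between the GNN and the MLP baseline.
gaps = {name: round(gnn - mlp, 1) for name, (gnn, mlp) in reported.items()}

print(example_f1)
for name, gap in gaps.items():
    print(f"{name}: +{gap} F1 points")
```

On these figures the gap ranges from 12.7 points (FakeNewsNet) to 25.1 points (WELFake).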

Keywords: Graph Neural Networks, Misinformation Detection, Performance-Efficiency Trade-offs, Benchmarking, GNN Architectures, F1 Score, Inference Time, Public Datasets

128. ❌ LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

Authors: Tian Huang, Tom Bourgeade, Irina Illina Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

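Scoring rationale: The paper's core topic is the application of LLMs in medical education (treated here as a subfield of AI for Science/Bioinformatics), specifically generating and evaluating French OSCE dialogues, so it is highly relevant to "Large Language Models" and "AI for Science" (10 points each). It mentions using mid-size models (≤32B parameters) for local deployment, which gives it some connection to "Small Language Models" (5 points). It does not address the technical principles or innovations behind the other keywords, such as MoE, training methods, inference optimization, or agent systems, so those keywords score 0.

!!! tip deepseek-chat TL;DR

This study addresses the scarcity of annotated data for French OSCEs (Objective Structured Clinical Examinations) by building an LLM-driven pipeline that generates synthetic doctor-patient dialogues and evaluates them automatically. The results show that mid-size LLMs reach evaluation accuracy comparable to GPT-4o (about 90%), supporting locally deployable, privacy-preserving medical education systems.

Abstract translation (Chinese)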

客观结构化临床考试（OSCE）是通过结构化患者访谈评估医学生临床与沟通技能的标准方法。然而在法国，培训课程的组织受到人力和后勤条件的限制，导致学生难以获得重复练习和结构化反馈的机会。自然语言处理（NLP）与大型语言模型（LLM）的最新进展为自动评估此类医学访谈提供了可能，从而减少培训中对人工考官的需求。但目前真实法语OSCE标注转录本仍极度稀缺，限制了可重复研究及可靠基准测试的开展。为应对这些挑战，本研究探索在低资源语境下利用LLM进行法语OSCE对话的生成与评估。我们提出一种受控流程，可根据特定场景评估标准生成模拟医患访谈转录本，通过结合理想表现与扰动表现来模拟不同学生技能水平。生成的对话通过支持可调节严格度的LLM辅助框架进行自动银标注。对多个开源与专有LLM的基准测试表明，中型模型（参数≤320亿）在合成数据上达到与GPT-4o相当的准确率（约90%），这凸显了本地可部署、保护隐私的医学教育评估系统的可行性。

Abstract

Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students’ clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students’ access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models ($\le$32B parameters) achieve accuracies comparable to GPT-4o ($\sim$90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.

Keywords: Large Language Models, OSCE, medical education, data generation, clinical skills evaluation, low-resource, French language, privacy-preserving

129. ❌ Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Authors: Ian W. Kennedy, Nafise Sadat Moosavi Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

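Scoring rationale: The paper focuses on extreme (2-3 bit) quantization of large language models (LLMs), specifically additive quantization, with edge deployment as the goal. Its core contribution is the OA-EM initialisation method, which addresses the codebook-initialisation bottleneck in quantization. It is therefore highly relevant to “Large Language Models” (10), centrally relevant to “Quantization” (15), and relevant to “Small Language Models/On-device AI” (8, because of the edge-deployment angle). Other keywords such as MoE, Scaling Laws, and Fine-tuning are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that the main bottleneck behind failures of extreme LLM quantization (e.g., 2-bit) is codebook initialisation, and proposes the OA-EM initialisation method, which markedly improves quantized model quality and dominates the quality-compute frontier across multiple architectures and compression rates.

Abstract translation (Chinese)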

加法量化技术能够通过O(1)查表反量化实现极端的大型语言模型压缩，这使其在边缘部署中极具吸引力。然而在2比特精度下，即使经过大量搜索和微调，该方法仍常出现灾难性失效。我们发现其主要瓶颈在于码本初始化：贪婪序列初始化常使模型陷入不良优化区域，后续的束搜索和后训练向量微调（PV-tuning）难以克服此问题。我们通过表征比率ρ = N/KM（该指标刻画了权重组与码本容量之间的关系）分析这一现象，并提出OA-EM——一种基于海森加权马氏距离的输出感知EM初始化方法。在不同压缩率、搜索预算及三种架构（Llama 3.2 3B、Llama 3.1 8B、Qwen 2.5 3B）的测试中，OA-EM经PV-tuning后始终能产生更优解，并在质量-计算效率边界上占据主导地位。该瓶颈的严重程度随ρ值变化：在3比特每参数（bpp）时表现中等，而在2 bpp时极为突出——不良初始化可使困惑度恶化数个数量级。更广泛而言，我们的研究结果揭示了压缩模型空间中优化几何的重要性，其中初始化可能主导后续搜索与微调的效果。

Abstract

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio $\rho = N/KM$, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with $\rho$: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
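
As a rough illustration of the two ideas in this abstract, the sketch below shows how additive quantization reconstructs a weight group by summing one codeword from each codebook, and how the representational ratio N/KM could be computed. It assumes that N is the number of weight groups, K the number of codewords per codebook, and M the number of codebooks; the abstract does not spell these out, and the function names are hypothetical. It does not implement OA-EM.

```python
import numpy as np


def representational_ratio(n_groups: int, n_codewords: int, n_codebooks: int) -> float:
    """rho = N / (K * M), under the assumed meaning of N, K and M."""
    return n_groups / (n_codewords * n_codebooks)


def dequantize(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Rebuild weight groups as the sum of one codeword per codebook.

    codes:     (N, M) integer indices, one index per codebook
    codebooks: (M, K, d) codeword tables
    returns:   (N, d) reconstructed weight groups
    """
    m_idx = np.arange(codebooks.shape[0])  # shape (M,)
    # Broadcasting (M,) against (N, M) picks codebooks[m, codes[n, m]] -> (N, M, d)
    selected = codebooks[m_idx, codes]
    # Additive quantization: sum the M selected codewords for each group
    return selected.sum(axis=1)
```

Each group costs M table lookups and M - 1 vector additions regardless of the codebook size K, which is the sense in which dequantization is a constant-time lookup.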

Keywords: LLM quantization, additive quantization, codebook initialization, edge deployment, OA-EM, 2-bit precision, model compression, optimization geometry

130. ❌ Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

Authors: Khalid Zaman, Melike Sah, Anuwat Chaiwongyenc, Cem Direkoglu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08104v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

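Scoring rationale: The paper studies an audio classification method based on Quantum Vision theory for deepfake speech detection, an application of deep learning to audio processing. Its content is unrelated to the vast majority of the keywords, which mainly concern the technical principles, training methods, inference optimization, and alignment of large models. It has only a loose connection to the last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”, since the work applies AI to scientific/audio analysis, but it is not core bioinformatics or cheminformatics, hence a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper applies Quantum Vision theory to audio classification: speech spectrograms are converted into information waves and used to train QV-CNN and QV-ViT models, which achieve higher deepfake speech detection accuracy and robustness than standard models on the ASVspoof dataset.

Abstract translation (Chinese)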

我们提出量子视觉理论作为基于深度学习的音频分类新视角，并将其应用于深度伪造语音检测。受量子物理学中波粒二象性启发，量子视觉理论基于以下理念：数据不仅可表现为可观测的坍缩形态（如图像），还能以信息波的形式存在。在传统深度学习中，模型直接基于此类坍缩表征进行训练；而在量子视觉理论框架下，输入数据首先通过量子视觉模块转换为信息波，再输入深度学习模型进行分类。相较于非量子视觉模型，基于量子视觉理论的模型在图像分类任务中已展现出性能提升。若将量子视觉理论应用于语音频谱图进行音频分类，会产生何种效果？这正是本研究的创新动机所在。本文工作中，我们通过提出的量子视觉模块将语音信号的短时傅里叶变换、梅尔频谱图及梅尔频率倒谱系数转换为信息波，并用于训练基于量子视觉的卷积神经网络和基于量子视觉的视觉变换器模型。我们在ASVSpoof数据集上进行了大量深度伪造语音分类实验。结果表明，量子视觉卷积神经网络和量子视觉视觉变换器模型均持续优于标准卷积神经网络与视觉变换器模型，在区分真实语音与伪造语音时实现了更高的分类准确率和更强的鲁棒性。其中，采用梅尔频率倒谱系数特征的量子视觉卷积神经网络在ASVspoof数据集上取得最佳综合性能（准确率94.20%，等错误率9.04%），而采用梅尔频谱图的量子视觉卷积神经网络则达到最高准确率94.57%。这些发现证明量子视觉理论是音频深度伪造检测的有效且具有前景的方法，为音频感知任务中量子启发式学习开辟了新方向。

Abstract

We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.

Keywords: Quantum Vision theory, audio classification, deepfake speech detection, information waves, QV-CNN, QV-ViT, ASVspoof dataset, Mel-spectrograms

131. ❌ Efficient Provably Secure Linguistic Steganography via Range Coding

Authors: Ruiyi Yan, Yugo Murawaki Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08052v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

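Scoring rationale: The paper studies language-model-based steganography and runs experiments with language models such as GPT-2, so it has some connection to “Large Language Models” (5). However, its core is the steganographic method (range coding, rotation mechanism), not an innovation in large-model techniques or a scientific application, so it has no direct relation to the other keywords (such as MoE, Scaling Laws, RLHF, RAG, and AI for Science), all of which score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an efficient, provably secure linguistic steganography method based on range coding and a rotation mechanism, achieving close to 100% entropy utilization and high embedding speed across several language models.

Abstract translation (Chinese)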

语言隐写术旨在将秘密信息嵌入看似无害的文本中，以实现隐蔽通信。可证明安全性作为该领域长期追求的目标与核心驱动力，已扩展至基于语言模型的隐写方法。先前具备可证明安全性的方法通过实现零KL散度（Kullback-Leibler divergence）达到了完美的不可感知性，但这是以牺牲嵌入容量为代价的。本文尝试直接采用经典熵编码方法（区间编码，range coding）实现安全隐写，进而提出一种结合轮转机制的高效可证明安全语言隐写方法。在不同语言模型上的实验表明，本方法在嵌入容量上实现了约100%的熵利用率（嵌入效率），优于现有基线方法。同时，该方法实现了高速嵌入（在GPT-2上最高可达1554.66比特/秒）。代码发布于github.com/ryehr/RRC_steganography。

Abstract

Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at github.com/ryehr/RRC_steganography.
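
The abstract reports entropy utilization without defining it. A common reading, assumed here, is the number of embedded secret bits divided by the total Shannon entropy of the language model's next-token distributions over the generated text. The sketch below computes that ratio; the function names and the toy distributions are hypothetical, and it does not implement range coding or the rotation mechanism.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_utilization(embedded_bits, step_distributions):
    """Embedded bits divided by the total entropy available across all steps.

    embedded_bits:      number of secret bits hidden in the generated text
    step_distributions: one probability list per generated token
    """
    total_entropy = sum(shannon_entropy_bits(p) for p in step_distributions)
    if total_entropy == 0:
        return 0.0  # a deterministic model offers no capacity at all
    return embedded_bits / total_entropy
```

Under this reading, a value near 1.0 means the method hides almost as many bits as the sampling distributions can carry.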

Keywords: linguistic steganography, provable security, range coding, entropy utilization, embedding capacity, language models, rotation mechanism, GPT-2

132. ❌ Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08075v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

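Scoring rationale: The paper focuses on optimizing LLM inference serving, and its core is solving KV-cache over-allocation and under-utilized concurrency. It is highly relevant to “Large Language Models” (10) because it explicitly studies LLM serving; to “Context Window Extension OR Long Context LLMs” (10) because it optimizes the configuration used for long-context requests; to “KV Cache Compression OR Linear Attention OR FlashAttention” (10) because it directly addresses KV-cache efficiency; and to “Speculative Decoding OR Inference Acceleration” (10) because its routing mechanism raises throughput and lowers latency, which falls under inference acceleration. Other keywords, such as MoE, SLMs, training methods, alignment, RAG, reasoning techniques, and AI for Science, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a dual-pool token-budget routing mechanism that partitions an LLM serving fleet into a high-throughput short-context pool and a high-capacity long-context pool, addressing KV-cache over-allocation and under-utilized concurrency. On real-world datasets it saves 31-42% of GPU-hours and improves P99 TTFT by 6%.

Abstract translation (Chinese)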

生产环境中的vLLM推理集群通常按最差情况下的上下文长度配置每个实例，导致KV缓存过度分配与并发利用率不足。在实际场景中，80-95%的请求属于短上下文请求，却始终在针对长上下文优化的配置下运行，造成4-8倍的吞吐量能力浪费，并引发内存溢出（OOM）崩溃、任务抢占和请求拒绝等可靠性问题。我们发现这些低效现象的共同根源在于：配置与流量特征不匹配。为此，我们提出双池化令牌预算路由机制——一种轻量级调度方法，将同构集群划分为两个专用资源池：高吞吐量的短上下文池与高容量的长上下文池。每个请求根据其预估的总令牌预算进行路由，该预算通过基于类别的字节-令牌转换率计算得出。该转换率采用指数移动平均法从实际使用的prompt_tokens反馈中在线学习，无需依赖分词器。我们还构建了一个简洁的分析模型，可根据工作负载特征与实测吞吐量差异预测集群级成本节约，帮助实践者在部署前评估收益。基于Azure LLM推理数据集和LMSYS-Chat-1M真实流量轨迹的评估显示（使用A100 GPU部署Llama-3-70B），该方法可减少31-42%的GPU工时，相当于集群规模下每年节约286万美元，同时将任务抢占率降低5.4倍，并将P99首令牌延迟（TTFT）改善6%。在AMD MI300X平台上部署Qwen3-235B-A22B的案例研究表明，在每秒10,000请求的负载下，预计可实现每年1540万美元的成本节约。该方法仅产生O(1)的调度开销，能自适应异构工作负载，并可无缝兼容现有优化技术（如PagedAttention、连续批处理及预填充-解码分离架构）。

Abstract

Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
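
The routing rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the pool threshold, the EMA factor, the initial bytes-per-token ratio, and the choice to add the requested output length to the prompt estimate are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    short_pool_limit: int = 8192   # assumed max total tokens served by the short pool
    alpha: float = 0.1             # assumed EMA smoothing factor
    default_ratio: float = 4.0     # assumed initial bytes per token
    ratios: dict = field(default_factory=dict)  # learned bytes-per-token, per category

    def estimate_budget(self, category: str, prompt_bytes: int, max_output_tokens: int) -> float:
        """Estimate total tokens from the byte length alone, without a tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        return prompt_bytes / ratio + max_output_tokens

    def route(self, category: str, prompt_bytes: int, max_output_tokens: int) -> str:
        """Pick a pool with O(1) work per request."""
        budget = self.estimate_budget(category, prompt_bytes, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt_bytes: int, prompt_tokens: int) -> None:
        """Update the category ratio from the prompt_tokens count reported after serving."""
        if prompt_tokens <= 0:
            return
        observed = prompt_bytes / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed
```

A caller would invoke `route` before dispatch and `observe` once the response's usage field is available, so the estimate tightens as traffic for each category accumulates.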

Keywords: LLM serving, KV-cache, token-budget routing, throughput optimization, context length, inference efficiency, cost reduction, dual-pool architecture

133. ❌ Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

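Scoring rationale: The paper's core is the RAG framework GuarantRAG. Directly and highly relevant keywords include “Retrieval-Augmented Generation” (the core method), “Large Language Models” (the object of application), “RLHF/DPO” (training uses Contrastive DPO), and “Hallucination Mitigation” (one of the main goals). It has some connection to “Chain of Thought”, “System 2 Thinking”, and “Self-Correction”, because it separates the reasoning flow and suppresses internal knowledge. Other keywords such as MoE, SLMs, and quantization are not covered.

!!! tip deepseek-chat TL;DR

The paper targets the “integration bottleneck” in RAG, where LLMs fail to use retrieved documents effectively, and proposes the GuarantRAG framework, which decouples reasoning from evidence integration, trains with Contrastive DPO, and applies a joint decoding mechanism. On five QA benchmarks it improves accuracy by up to 12.1% and reduces hallucinations by 16.3%.

Abstract translation (Chinese)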

检索增强生成（Retrieval-Augmented Generation，RAG）通过接入外部知识显著增强了大语言模型（Large Language Models，LLMs）的能力。然而，当前研究主要聚焦于检索质量，往往忽视了关键的“整合瓶颈”：即使检索到相关文档，由于外部信息与模型内部参数化知识之间存在冲突，LLMs仍常常无法有效利用这些文档。本文认为，在单次生成过程中隐式解决这一冲突并非最优方案。我们提出了GuarantRAG框架，该框架将推理过程与证据整合进行显式解耦。首先，我们仅基于参数化知识生成一个“内部答案”，以捕捉模型的推理路径。其次，为确保证据提取的忠实性，我们采用一种新颖的对比性直接偏好优化（Contrastive DPO）目标来生成“参考答案”。该目标将参数化的内部答案作为负面约束，将检索到的文档作为正面真实依据，迫使模型在此阶段抑制内部幻觉，优先采用外部证据。最后，我们并未采用简单的答案拼接或直接使用DPO训练模型，而是提出了一种联合解码机制，该机制在词元级别动态融合内部答案的逻辑连贯性与参考答案的事实准确性。在五个问答基准测试上的实验表明，与标准及动态RAG基线相比，GuarantRAG将准确率最高提升了12.1%，并将幻觉率降低了16.3%。

Abstract

Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical “integration bottleneck”: even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an “Inner-Answer” based solely on parametric knowledge to capture the model’s reasoning flow. Second, to guarantee faithful evidence extraction, we generate a “Refer-Answer” using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO-trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
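
For orientation, the standard DPO objective is reproduced below. Reading the abstract, one plausible instantiation takes the preferred response $y_w$ to be an answer grounded in the retrieved documents and the dispreferred response $y_l$ to be the parametric Inner-Answer. This mapping is an assumption, and the paper's Contrastive DPO objective may differ in its details.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the trained policy may move from the reference. Minimizing the loss raises the likelihood of $y_w$ relative to $y_l$, which matches the abstract's description of favoring external evidence over internal hallucinations.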

Keywords: Retrieval-Augmented Generation, Large Language Models, Integration Bottleneck, Contrastive DPO, Joint Decoding, Hallucination Reduction, Knowledge Integration, QA Benchmarks

134. ❌ RAG Performance Prediction for Question Answering

Authors: Or Dado, David Carmel, Oren Kurland Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

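Scoring rationale: The paper's core is predicting RAG performance for question answering, which is highly relevant to “Retrieval-Augmented Generation OR RAG OR Retrieval-Generation” (10). It involves the use of large models within RAG, so it has some connection to “Large Language Models OR LLMs OR Foundation Models” (5). Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to predict the performance gain of using RAG for question answering relative to not using it, and proposes a new supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer, achieving the best prediction quality.

Abstract translation (Chinese)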

本研究致力于预测在问答任务中使用检索增强生成（RAG, retrieval augmented generation）相较于不使用该方法所能带来的性能增益。我们评估了若干最初为特定检索任务设计的检索前与检索后预测指标，同时研究了多种生成后预测指标；其中一种为本研究新提出的方法，其预测质量最佳。结果表明，最有效的预测方法是一种新型监督式预测器，它能够显式地对问题、检索到的文本段落以及生成答案之间的语义关系进行建模。

Abstract

We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.

Keywords: RAG, retrieval augmented generation, question answering, performance prediction, supervised predictor, semantic relationships, retrieved passages, generated answer

135. ❌ Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

Authors: Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei, Jie Gao, Jie Wu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

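Scoring rationale: The paper's core is the application of LLMs to automatic speech recognition (ASR), that is, research applying large models to a different domain. Highly relevant keywords are: (1) “Large Language Models” (the paper explicitly studies LLM-based ASR; relevance 10); (2) “Post-training/SFT” (the proposed multi-stage training strategy includes SFT stages; relevance 10); (3) “Pre-training/Domain Adaptation” (the pretraining strategy is redesigned to alleviate the speech-text modality gap; relevance 8); (4) “Alignment” (the paper mentions an alignment stage; relevance 8); (5) “Hallucination Mitigation” (the paper explicitly addresses hallucinations; relevance 10). Other keywords such as MoE, SLMs, RAG, and Agents are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper revisits LLM-based automatic speech recognition from an entropy-allocation perspective and proposes a multi-stage training strategy grounded in capability-boundary awareness, achieving performance competitive with state-of-the-art models using only 2.3B parameters while effectively mitigating hallucinations through a decoupling-oriented design.

Abstract translation (Chinese)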

将大语言模型（LLM）集成到自动语音识别（ASR）系统中已成为主流范式。尽管近期基于LLM的ASR模型在公开基准测试中展现出良好性能，但在平衡识别质量与延迟开销方面仍面临挑战，同时幻觉问题进一步限制了其实际部署。本研究从熵分配的角度重新审视基于LLM的ASR系统，并引入三项指标来量化不同训练范式如何在语音编码器与大语言模型之间分配熵减任务。为改善现有方法中熵分配的低效问题，我们提出一种基于能力边界认知的多阶段训练策略，以优化参数效率并增强抗幻觉鲁棒性。具体而言，我们重新设计预训练策略以缓解语音-文本模态差异，并进一步在对齐训练与联合监督微调（SFT）之间引入迭代异步SFT阶段，从而保持功能解耦并约束编码器表征漂移。在汉语和英语基准测试上的实验表明，我们的方法仅使用23亿参数即可达到与最先进模型相竞争的性能，同时通过解耦导向的设计有效缓解了幻觉现象。

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.

Keywords: LLM-based ASR, entropy allocation, multi-stage training, hallucination mitigation, speech-text modality gap, supervised fine-tuning, parameter efficiency, capability-boundary awareness

136. ❌ Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

作者: George Fountzoulas 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.07969v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

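Scoring rationale: The paper proposes Kathleen, a new text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing, with no tokenizer or attention mechanism and only 733K parameters. Although the work applies deep learning to text classification, its core innovations are specific techniques such as frequency-domain processing, oscillator banks, and phase harmonics, which have no direct connection to the keyword list (which centers on large language models, fine-tuning techniques, reasoning methods, agent systems, and so on). The paper does not mention large language models, fine-tuning, agents, inference acceleration, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Kathleen, a byte-level text classification architecture that needs no tokenizer or attention mechanism; using frequency-domain processing and only 733K parameters, it outperforms larger models on several datasets.

Abstract (translation)

This paper presents Kathleen, a text classification model that processes raw UTF-8 bytes directly using frequency-domain processing, with no tokenizer or attention mechanism and only 733K parameters. The architecture introduces three novel components: (1) RecurrentOscillatorBanks: damped sinusoid convolution modules with temporal memory that support O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values with a single learnable vector (256 floats), replacing a conventional embedding table (65K parameters) while improving accuracy; (3) PhaseHarmonics: a sinusoidal non-linearity with only 6 learnable phase parameters, which ablation shows to be the most impactful component (+2.6% accuracy, <0.001% of parameters). Through systematic ablation of a 1.8M-parameter predecessor, we find that frequency-domain components consistently outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only 0.2% accuracy, whereas removing the 6-parameter PhaseHarmonics costs 2.6%. The final Kathleen-Clean reaches 88.6% accuracy on IMDB, 92.3% on AG News, and 83.3% on SST-2, outperforming a tokenized model with 16x as many parameters on IMDB (+1.6%) and AG News (+2.1%). The architecture processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.

Abstract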

We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing – requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks – damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics – a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, <0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2 – outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.

Keywords: text classification, byte-level processing, frequency-domain processing, oscillator-based architecture, parameter-efficient model, no tokenization, no attention mechanism, O(L) complexity
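
The abstract describes damped sinusoid convolutions processed in O(L). One generic way such a filter can run in linear time is a one-pole complex recurrence, sketched below. This is not the paper's RecurrentOscillatorBanks implementation: the function names, the `decay` and `omega` parameters, and the byte scaling are all assumptions made for illustration.

```python
import cmath
import math


def damped_sine_direct(x, decay, omega):
    """Causal convolution with kernel k[n] = exp(-decay*n) * sin(omega*n), computed naively in O(L^2)."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for n in range(t + 1):
            acc += math.exp(-decay * n) * math.sin(omega * n) * x[t - n]
        out.append(acc)
    return out


def damped_sine_recurrent(x, decay, omega):
    """Same filter in O(L): h[t] = lam * h[t-1] + x[t], with lam = exp(-decay + i*omega).

    Since h[t] = sum_n lam**n * x[t-n] and Im(lam**n) = exp(-decay*n) * sin(omega*n),
    the imaginary part of h[t] equals the direct convolution for real-valued input.
    """
    lam = cmath.exp(complex(-decay, omega))
    h = 0j
    out = []
    for value in x:
        h = lam * h + value
        out.append(h.imag)
    return out


# Hypothetical input: raw UTF-8 bytes scaled to [0, 1].
signal = [b / 255.0 for b in "no tokenizer".encode("utf-8")]
direct = damped_sine_direct(signal, decay=0.1, omega=0.5)
fast = damped_sine_recurrent(signal, decay=0.1, omega=0.5)
print(max(abs(a - b) for a, b in zip(direct, fast)))  # tiny floating-point difference
```

The recurrence keeps a single complex state per oscillator, so memory does not grow with the kernel length.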

137. ❌ AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

Authors: Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang, Yingcai Wu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07967v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
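
The table above lists non-zero relevance values (8.0, 10.0, 5.0) while every score is 0.0, which is consistent with each row carrying a weight of 0.0. The sketch below makes that arithmetic explicit. The passing-score formula (5.0 + 0.8 × total weight) is the one stated in the report's configuration section; the rule that a row's score equals weight × relevance is an assumption, since the report does not state how row scores are computed, and the per-keyword cap of 15 is not modeled.

```python
def pass_threshold(total_weight):
    """Passing score as stated in the report's configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight, relevance):
    """ASSUMPTION: row score = weight * relevance (not stated in the report)."""
    return weight * relevance


def total_score(rows):
    """Sum the assumed row scores over (weight, relevance) pairs."""
    return sum(row_score(weight, relevance) for weight, relevance in rows)


# Non-zero-relevance rows from the table above, as (weight, relevance out of 10).
atomeval_rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 5.0)]

threshold = round(pass_threshold(27.0), 1)  # 27 keywords with weight 1.0 each in the configuration
score = total_score(atomeval_rows)
print(threshold, score, score >= threshold)
```

Under this assumed rule, a weight of 0.0 zeroes out any relevance, so the paper cannot reach the threshold no matter how relevant it is judged to be.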

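Scoring rationale: The paper studies how to evaluate adversarial claims in fact-checking systems. It is highly relevant to LLM technology (8 points), because it explicitly uses LLMs to generate adversarial claims and analyzes their limitations. It is highly relevant to 'Factuality' (10 points), because its core research question is detecting factual corruption and assessing truthfulness. It has some connection to 'Explainable AI' (5 points), because the proposed AtomEval framework offers a more transparent evaluation method through atomic decomposition. Other keywords such as MoE, SFT, and RAG are unrelated to the paper's technical content and receive 0 points.

!!! tip deepseek-chat TL;DR

To address unreliable evaluation of adversarial claims in fact-checking systems, the paper proposes the AtomEval evaluation framework, which detects factual corruption through atomic decomposition and validity scoring. Its experiments find that stronger LLMs do not necessarily produce more effective adversarial claims under validity-aware evaluation.

Abstract (translation)

Adversarial claim rewriting is widely used to test fact-checking systems, but standard evaluation metrics fail to capture truth-conditional consistency and often misjudge semantically corrupted rewrites as successful. We propose AtomEval, a validity-aware evaluation framework that decomposes claims into Subject-Relation-Object-Modifier (SROM) atomic units and scores adversarial rewrites with Atomic Validity Scoring (AVS), making it possible to detect factual tampering beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and large language model (LLM) generators show that AtomEval provides a more reliable evaluation signal in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and find that, under validity-aware evaluation, stronger models do not necessarily produce more effective adversarial claims, which reveals previously overlooked limitations in current adversarial evaluation practice.

Abstract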

Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.

Keywords: fact verification, adversarial claims, evaluation framework, truth-conditional consistency, LLM generators, validity-aware evaluation, FEVER dataset, Atomic Validity Scoring
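
To make the idea of atom-level checking concrete, the toy sketch below represents a claim as subject-relation-object-modifier atoms and measures what fraction of the original atoms survive a rewrite. The abstract does not give the AVS formula, so `toy_validity` is a hypothetical stand-in rather than the paper's metric, and the atoms are written by hand instead of being extracted by a model.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    """One subject-relation-object-modifier unit (field names are illustrative)."""
    subject: str
    relation: str
    obj: str
    modifier: Optional[str] = None


def toy_validity(original, rewrite):
    """Fraction of original atoms still present in the rewrite (toy stand-in, not AVS)."""
    if not original:
        return 1.0
    rewrite_atoms = set(rewrite)
    kept = sum(1 for atom in original if atom in rewrite_atoms)
    return kept / len(original)


original = [
    Atom("Paris", "is capital of", "France"),
    Atom("Paris", "is located on", "the Seine"),
]
# A rewrite that keeps the surface form similar but changes one object.
rewrite = [
    Atom("Paris", "is capital of", "Germany"),
    Atom("Paris", "is located on", "the Seine"),
]
print(toy_validity(original, rewrite))
```

A surface-similarity metric would rate this rewrite as close to the original, while the atom-level check flags that half of the facts changed.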

138. ❌ Rethinking Data Mixing from the Perspective of Large Language Models

Authors: Yuanjian Xu, Tianze Sun, Changwei Xu, XinLong Zhao, Jianing Hao, Ran Chen, Yang Liu, Ruijie Xu, Stephen Chen, Guang Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07963v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

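Scoring rationale: The paper's core topic is data mixing strategy in LLM training, which is highly relevant to 'Large Language Models' (10 points) because it bears directly on the LLM training process. It is relevant to 'Pre-training' (8 points), because data mixing is a key part of pre-training. It has some connection to 'Scaling Laws AND Data Quality' (5 points): the paper discusses how data quality affects generalization but does not explicitly address scaling laws. Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies data mixing strategies in large language model training and proposes the DoGraph framework, which reweights data through graph-constrained optimization; experiments on GPT-2 models validate its effectiveness in improving generalization.

Abstract (translation)

Data mixing strategy is critical to training large language models (LLMs). Empirical studies show that an inappropriate mixing strategy can significantly degrade a model's generalization. Although recent methods have improved empirical performance, several fundamental questions remain unanswered: how to define a domain, whether human and model perceptions of domains agree, and how domain weights affect generalization. We address these questions by establishing a formal connection between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we propose DoGraph, a reweighting framework that models data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of different scales show that DoGraph consistently achieves competitive performance.

Abstract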

Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.

Keywords: Large Language Models, Data Mixing, Domain Weighting, Generalization, Training Dynamics, DoGraph, Graph-constrained Optimization, GPT-2

139. ❌ TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

Authors: Yifei Gong, Xing Wu, Wenda Liu, Kang Tu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07960v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

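Scoring rationale: The paper's core topic is the use of LLMs as tool-using agents for CAD generation, which is highly relevant to 'Large Language Models', 'Post-training', 'Chain of Thought', 'LLM Agents', and 'Tool Use' (10 points each), with CAD-CoT as the core innovation. 'AI for Science' receives 5 points because CAD belongs to engineering and design rather than a strictly scientific application. The remaining keywords, such as MoE, quantization, and RAG, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the ToolCAD framework, which trains LLMs as tool-using agents through reinforcement learning to achieve text-to-CAD generation, filling the gap for open-source LLMs in CAD tool-using agents.

Abstract (translation)

Computer-aided design (CAD) is an expert-level task that depends on long-horizon reasoning and coherent modeling operations. Large language models (LLMs) have shown remarkable progress in enabling language agents to handle real-world tasks. Notably, no study has yet examined how tool-using LLMs can interact optimally with CAD engines, which hinders the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework that deploys LLMs as tool-using agents for text-to-CAD generation. We also introduce an interactive CAD modeling training environment for rolling out reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. In addition, we propose an end-to-end post-training strategy that enables the LLM agent to produce a refined CAD modeling chain of thought (CAD-CoT) and to evolve into a proficient CAD tool-using agent through online curriculum reinforcement learning. Our results show that ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, bringing their performance on par with proprietary models and opening the way to more accessible and more robust autonomous text-to-CAD modeling systems.

Abstract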

Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.

Keywords: Tool-Using LLMs, Text-to-CAD Generation, Reinforcement Learning, CAD Modeling Chain of Thought, Agentic CAD Framework, Online Curriculum RL, CAD Tool-Using Agents, Autonomous Modeling Systems

140. ❌ HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

Authors: Guoqi Ma, Liang Zhang, Hongyao Tu, Hao Fu, Hui Li, Yujie Lin, Longyue Wang, Weihua Luo, Jinsong Su Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

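Scoring rationale: The paper's core topic is the application of LLMs to cross-document relation extraction, which is highly relevant to 'Large Language Models' (10 points). It explicitly compares against SLM-based methods, making it relevant to 'Small Language Models' (8 points). The paper does not involve the specific techniques behind the other keywords (such as MoE, scaling laws, or fine-tuning methods), nor scientific-domain applications, so the other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how large language models perform on cross-document relation extraction, finds that applying LLMs directly yields limited gains, and therefore proposes HCRE, a method based on hierarchical classification and a prediction-then-verification strategy; experiments show that it outperforms existing baselines.

Abstract (translation)

Cross-document relation extraction aims to identify relations between head and tail entities located in different documents. Existing methods typically adopt a 'small language model + classifier' paradigm, but the limited language understanding of small language models constrains further performance gains. This paper presents a preliminary study of how large language models perform on cross-document relation extraction. Despite their enormous parameter counts, we find that large language models do not consistently outperform existing small language models. Further analysis indicates that this shortfall stems mainly from the challenge posed by the large number of predefined relations. To overcome it, we propose an LLM-based hierarchical classification model (HCRE) with two core components: 1) a large language model for relation prediction; 2) a hierarchical relation tree built from the predefined relation set. The tree lets the model classify hierarchically, inferring the target relation level by level. Because the number of child nodes is far smaller than the full predefined relation set, the hierarchical relation tree significantly reduces the number of relation options the model must consider at inference time. Hierarchical classification, however, risks propagating errors across levels. To mitigate this, we propose a 'prediction-then-verification' inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.

Abstract

Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of "Small Language Model (SLM) + Classifier". However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based Hierarchical Classification model for cross-document RE (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a hierarchical relation tree derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a prediction-then-verification inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.

Keywords: Cross-document relation extraction, Large Language Models, Small Language Models, Hierarchical classification, Prediction-then-verification, Relation extraction, LLM-based method, Error propagation mitigation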
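
The level-by-level inference described in the abstract can be sketched as a walk down a relation tree, where each level only compares a handful of child labels and a verification step can veto the top-ranked choice. Everything below is a simplified illustration under stated assumptions: the tree contents, `rank_children`, and `verify` are hypothetical stubs standing in for LLM calls, and the fallback rule (use the top-ranked child if none passes) is not taken from the paper.

```python
def classify(tree, rank_children, verify):
    """Walk a relation tree (nested dicts) from the root and return the chosen label path."""
    path = []
    node = tree
    while node:  # a non-empty dict means there are still children to choose among
        ranked = rank_children(path, list(node))
        # Prediction-then-verification: keep the best-ranked child that passes the check,
        # falling back to the top-ranked child if none passes (assumed fallback rule).
        chosen = next((c for c in ranked if verify(path, c)), ranked[0])
        path.append(chosen)
        node = node[chosen]
    return path


# Hypothetical two-level relation tree.
tree = {
    "person-relation": {"place of birth": {}, "employer": {}},
    "organization-relation": {"headquarters location": {}, "founded by": {}},
}

# Stub predictor: a fixed preference order stands in for model scores.
preferences = [
    "organization-relation", "person-relation",
    "employer", "place of birth", "founded by", "headquarters location",
]


def rank_children(path, children):
    return sorted(children, key=preferences.index)


def verify(path, candidate):
    # Stub verifier: pretend the check rejects the organization branch for this entity pair.
    return candidate != "organization-relation"


result = classify(tree, rank_children, verify)
print(result)
```

At each level the predictor compares only the children of the current node, which is the reduction in candidate relations that the abstract attributes to the hierarchical relation tree.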

141. ❌ Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Authors: Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Hualong Yu, Zichen Xu, Qicheng Li, Yong Qin Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

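Scoring rationale: The paper is a survey of LLM post-training whose core is a unified framework for post-training methods (SFT, preference optimization, reinforcement learning, and others), so it is highly relevant to 'Large Language Models' and 'Post-training/SFT' (10 points). It mentions 'preference-based methods' and 'RL', giving some connection to 'Instruction Tuning/Alignment' and 'RLHF/RLAIF/DPO' (5 points). Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract and are unrelated to the paper's topic (0 points).

!!! tip deepseek-chat TL;DR

This survey proposes a unified framework for understanding LLM post-training methods, organizing them into off-policy and on-policy learning according to trajectory provenance and interpreting them through three roles (support expansion, policy reshaping, and behavioral consolidation) in order to diagnose post-training bottlenecks and guide system design.

Abstract (translation)

Post-training has become the central step in turning pretrained large language models into aligned, deployable systems. Recent progress spans supervised fine-tuning, preference optimization, reinforcement learning, process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in a fragmented way, usually categorized by technique label or objective family rather than organized by the behavioral bottleneck they address.
This survey argues that LLM post-training should be understood as structured intervention on model behavior. We first organize the field by trajectory provenance, which defines two main learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on trajectories generated by the learner itself. We then interpret methods through two recurring roles, effective support expansion (making useful behaviors easier to reach) and policy reshaping (improving behavior within regions that are already reachable), complemented by a systems-level role, behavioral consolidation (preserving, transferring, and amortizing behavior across stages and model transitions).
This perspective provides a unified framework for understanding the major paradigms. Supervised fine-tuning can serve either support expansion or policy reshaping, whereas preference-based methods usually amount to off-policy reshaping. On-policy reinforcement learning often improves the learner's behavior on self-generated states, but under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is usually best understood as behavioral consolidation rather than mere compression, and hybrid pipelines appear as coordinated multi-stage compositions.
Overall, the framework helps diagnose bottlenecks in post-training and reason about how to compose stages, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than on any single dominant objective.

Abstract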

Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles – effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions – together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.

Keywords: Large Language Models, Post-training, Supervised Fine-tuning, Preference Optimization, Reinforcement Learning, Off-policy Learning, On-policy Learning, Behavioral Consolidation

142. ❌ SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

Authors: Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen, Weili Guan, Min Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07922v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

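Scoring rationale: The SAT framework focuses on optimizing the reasoning process of Large Reasoning Models (LRMs), using a dynamically navigated finite-state machine (FSM) and a Process Reward Model (PRM) for step-level adaptive pruning. It is highly relevant to 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (10 points), because it directly addresses overlong reasoning chains and the efficiency of deep reasoning. It is relevant to 'Large Language Models/LLMs/Foundation Models' (8 points), because LRMs are a variant of large models applied to reasoning tasks. It has some connection to 'Speculative Decoding/Inference Acceleration' (5 points), because SAT aims to reduce the number of reasoning tokens to improve efficiency but is not a direct decoding-acceleration technique. Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the 'overthinking' problem of large reasoning models, the paper proposes the Stepwise Adaptive Thinking (SAT) framework, which uses a dynamic state machine and a lightweight process reward model to perform difficulty-aware pruning of reasoning steps; across 9 models and 7 benchmarks it cuts reasoning tokens by up to 40% while maintaining or improving accuracy.

Abstract (translation)

Large Reasoning Models (LRMs) have transformed complex problem-solving, but they commonly exhibit 'overthinking', that is, generating unnecessarily long reasoning chains. Existing solutions improve token efficiency but often sacrifice fine-grained control or risk damaging the logical integrity of the reasoning process. To address this, we propose the Stepwise Adaptive Thinking (SAT) framework, which performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT models reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip) and uses a lightweight Process Reward Model (PRM) to guide state transitions dynamically, compressing the reasoning for easy steps while preserving depth of thought for hard ones. Experiments on 9 LRMs and 7 benchmarks show that SAT reduces reasoning tokens by up to 40% while generally maintaining or improving accuracy.

Abstract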

Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit a pervasive “overthinking”, generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framework that performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT formulates reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip). It navigates these states dynamically using a lightweight Process Reward Model (PRM), compressing easy steps while preserving depth for hard ones. Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.

Keywords: Large Reasoning Models, Stepwise Adaptive Thinking, Finite-State Machine, Process Reward Model, reasoning efficiency, token reduction, difficulty-aware pruning, overthinking
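
The four thinking modes and PRM-guided transitions can be pictured as a small state machine. The sketch below is only an illustration of that idea: the score thresholds, the per-mode token budgets, the convention that a higher PRM score means an easier step, and the rule of speeding up one mode at a time are all assumptions, none of which come from the paper.

```python
from enum import Enum


class Mode(Enum):
    SLOW = "slow"
    NORMAL = "normal"
    FAST = "fast"
    SKIP = "skip"


ORDER = [Mode.SLOW, Mode.NORMAL, Mode.FAST, Mode.SKIP]

# Hypothetical per-step token budgets for each mode.
TOKEN_BUDGET = {Mode.SLOW: 512, Mode.NORMAL: 256, Mode.FAST: 64, Mode.SKIP: 0}


def target_mode(prm_score):
    """Map a step score in [0, 1] to a mode (assumed thresholds; higher score = easier step)."""
    if prm_score >= 0.9:
        return Mode.SKIP
    if prm_score >= 0.7:
        return Mode.FAST
    if prm_score >= 0.4:
        return Mode.NORMAL
    return Mode.SLOW


def transition(current, prm_score):
    """Slow down immediately on harder steps; speed up by at most one mode per step."""
    cur_i = ORDER.index(current)
    tgt_i = ORDER.index(target_mode(prm_score))
    if tgt_i <= cur_i:
        return ORDER[tgt_i]
    return ORDER[cur_i + 1]


def run(prm_scores, start=Mode.NORMAL):
    """Return the mode used at each step and the total token budget spent."""
    modes = []
    current = start
    for score in prm_scores:
        current = transition(current, score)
        modes.append(current)
    return modes, sum(TOKEN_BUDGET[m] for m in modes)


modes, tokens = run([0.95, 0.95, 0.2, 0.8])
print([m.value for m in modes], tokens)
```

In this toy run, two easy steps move the machine from Normal through Fast to Skip, a hard step drops it straight back to Slow, and the next easy step only recovers to Normal.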

143. ❌ Data Selection for Multi-turn Dialogue Instruction Tuning

Authors: Bo Li, Shikun Zhang, Wei Ye Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究指令调优（Instruction Tuning）的数据选择方法，直接涉及’Instruction Tuning’和’Post-training/SFT’关键词（10分）。研究基于大语言模型（LLMs）的指令调优，因此’Large Language Models’高度相关（10分）。论文关注数据质量对模型性能的影响，与’Scaling Laws AND Data Quality’有一定关联（5分）。其他关键词如MoE、量化、推理加速、RAG等均未在摘要中提及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对多轮对话指令调优中数据噪声和结构不一致的问题，提出了一个对话级别的数据选择框架MDS，在多个基准测试中优于现有方法并提高了长对话的鲁棒性。

摘要翻译

指令微调语言模型日益依赖大规模多轮对话语料库，但这些数据集常存在噪声与结构不一致问题，表现为话题漂移、重复性闲聊以及跨轮次的答案格式失配。本研究从数据筛选角度出发，提出\textbf{MDS}（多轮对话选择框架），该框架以完整对话而非孤立单轮作为评分单元。MDS融合两个核心阶段：全局覆盖阶段在用户查询轨迹空间中进行分箱选择，以保留具有代表性且非冗余的对话；局部结构阶段则通过实体锚定的主题连贯性与信息推进度评估对话内部可靠性，并结合查询-答案形式一致性实现功能对齐。在三个多轮对话基准测试及领域内银行数据集上，MDS均优于强单轮选择器、对话级大语言模型评分器及启发式基线方法，在无参考与基于参考的评估指标中取得最佳综合排名，且在相同训练预算下对长对话表现出更强鲁棒性。代码与资源已附于补充材料。

摘要 (Abstract)

Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose \textbf{MDS} (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.

Keywords: Instruction Tuning, Multi-turn Dialogue, Data Selection, Dialogue-level Framework, Entity-grounded Topic Grounding, Information Progress, Query-answer Consistency, Training Budget
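
The bin-wise selection idea in the MDS abstract can be illustrated with a small sketch. The abstract does not say how the global and local stages are combined, so this example assumes the simplest arrangement: dialogues are grouped into bins, and the highest-scoring dialogue in each bin is kept. The names `select_per_bin`, `bin`, and `quality` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def select_per_bin(dialogues, bin_of, score_of, per_bin=1):
    """Keep the top `per_bin` dialogues in every bin, ranked by score."""
    bins = defaultdict(list)
    for dialogue in dialogues:
        bins[bin_of(dialogue)].append(dialogue)
    selected = []
    for key in sorted(bins):
        ranked = sorted(bins[key], key=score_of, reverse=True)
        selected.extend(ranked[:per_bin])
    return selected


# Hypothetical toy data: `bin` stands in for a region of the user-query
# trajectory space, `quality` for a local structural score.
dialogues = [
    {"id": "a", "bin": 0, "quality": 0.9},
    {"id": "b", "bin": 0, "quality": 0.4},
    {"id": "c", "bin": 1, "quality": 0.7},
    {"id": "d", "bin": 2, "quality": 0.2},
    {"id": "e", "bin": 2, "quality": 0.6},
]
picked = select_per_bin(dialogues, lambda d: d["bin"], lambda d: d["quality"])
print([d["id"] for d in picked])  # ['a', 'c', 'e']
```

Selecting per bin rather than globally keeps every region represented, which is one plausible reading of "representative yet non-redundant".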

144. ❌ Linear Representations of Hierarchical Concepts in Language Models

Authors: Masaki Sakata, Benjamin Heinzerling, Takumi Ito, Sho Yokoi, Kentaro Inui Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
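
The arithmetic behind a score table like the one above can be sketched as follows. The passing threshold uses the formula from this report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula is an assumption, since the report does not state it: this sketch takes score = weight × relevance. Under that assumption a weight of 0.0 yields a score of 0.0 whatever the relevance, which matches the rows shown.

```python
def passing_threshold(total_weight, base=5.0, per_weight=0.8):
    """Passing score from the report configuration: base + per_weight * total weight."""
    return base + per_weight * total_weight


def keyword_score(weight, relevance):
    """ASSUMPTION: a keyword's score is weight * relevance (relevance on a 0-10 scale)."""
    return weight * relevance


# Two rows shaped like the table above: (keyword, weight, relevance).
rows = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    ("Mechanistic Interpretability OR Explainable AI", 0.0, 10.0),
]
total = sum(keyword_score(weight, relevance) for _, weight, relevance in rows)
threshold = passing_threshold(27.0)
print(f"{total:.1f} / {threshold:.1f}", "pass" if total >= threshold else "fail")
```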

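Scoring rationale: The paper's core topic is how hierarchical relations are encoded in the internal representations of language models. This is an investigation of how large models work, so it is highly relevant to 'Large Language Models' (10 points); it also involves interpretability analysis and is highly relevant to 'Mechanistic Interpretability' (10 points). The remaining keywords cover specific techniques (MoE, quantization, inference acceleration, etc.), training methods (pre-training, fine-tuning, alignment, etc.), or application areas (AI for science, agents, etc.), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The study explores how language models encode concept hierarchies in linear representation space, finding that hierarchical information is encoded in relatively low-dimensional, domain-specific subspaces and that hierarchy representations are highly similar across domains.

Abstract translation (Chinese)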

我们研究语言模型内部表征如何以及在何种程度上编码层级关系（例如日本⊂东亚⊂亚洲）。基于线性关系概念方法，我们针对不同层级深度和语义域训练特定的线性变换，并通过比较这些变换来刻画与层级关系相关的表征差异。相较于先前关于语言模型中层级关系表征几何结构的研究，我们的分析涵盖多词元实体和跨层表征。我们在多个领域训练此类变换，并评估其在领域内对未见数据的泛化能力及跨领域迁移效果。实验表明，在特定领域内，层级关系可从模型表征中线性恢复。我们进一步分析了层级信息在表征空间中的编码方式，发现其编码于相对低维的子空间中，且该子空间通常具有领域特异性。我们的主要结论是：尽管这些子空间具有领域特异性，但不同领域间的层级表征高度相似。总体而言，实验中的所有模型均以高度可解释的线性表征形式编码概念层级结构。

Abstract

We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.

Keywords: language models, hierarchical relations, linear representations, internal representations, representational geometry, domain-specific subspaces, interpretable representations, concept hierarchies

145. ❌ Contextualising (Im)plausible Events Triggers Figurative Language

Authors: Annerose Eichel, Tonmoy Rakshit, Sabine Schulte im Walde Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07885v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

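Scoring rationale: The paper mainly studies the application and evaluation of LLMs on language understanding tasks, specifically figurative language and event plausibility judgments, so only the 'Large Language Models OR LLMs OR Foundation Models' keyword is highly relevant (8 points). The other keywords concern model architecture, training methods, inference optimization, and application areas that the paper does not address, so all score 0.

!!! tip deepseek-chat TL;DR

By analyzing human and LLM judgments of the plausibility of English subject-verb-object events, the study finds that LLMs show only shallow contextualization patterns when distinguishing literal and non-literal events from implausible ones, and tend to interpret implausible events as non-literal but plausible expressions.

Abstract translation (Chinese)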

本研究以英语主谓宾事件为例，探讨了（非）字面义与合理性之间的关联。我们设计了一套系统性的实验框架，结合抽象与具体的成分范畴，构建了合理与不合理的事件三元组。通过对人类与大型语言模型（LLM）生成的合理性判断及示例语境的分析，我们发现二者在合理性评估上存在显著差异。人类擅长细致区分（非）字面义事件与不合理事件，并能进行精准的语境化解读；而LLM的结果仅呈现出浅层的语境化模式，且倾向于将不合理事件置换为非字面义的合理性解释。

Abstract

This work explores the connection between (non-)literalness and plausibility at the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.

Keywords: figurative language, plausibility, LLM evaluation, contextualization, event triples, literalness, human-LLM comparison, language understanding

146. ❌ An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

Authors: Gabriel Stefan, Adrian-Marius Dumitran Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07883v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

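Scoring rationale: The paper's core contribution is an 'agentic evaluation architecture' for detecting bias in history textbooks, in which multiple agents (a screening agent, jury agents, and a meta-agent) evaluate collaboratively. This is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Multi-agent Systems/Agent Coordination' (10 points each), because the paper explicitly builds a multi-agent system to implement the evaluation workflow. The paper mentions using models for evaluation, which implies that large language models (LLMs) may be involved, so 'Large Language Models/LLMs/Foundation Models' receives 8 points. Other keywords such as MoE, SFT, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper's topic, so they receive 0.

!!! tip deepseek-chat TL;DR

The study proposes a multi-agent evaluation architecture for detecting bias in history textbooks. Experiments on Romanian textbooks show that it reduces misjudgments at low cost, so it can serve as a decision-support tool for educational governance.

Abstract translation (Chinese)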

历史教科书常包含难以大规模审查的隐性偏见、民族主义叙事框架与选择性省略。我们提出一种智能体评估架构，包含多模态筛查智能体、由五个评估智能体组成的异质评审团，以及负责裁决合成与人工升级的元智能体。其核心创新在于引入来源归属协议，该协议能区分教科书叙述内容与引用的历史原始资料，从而避免因错误归因而导致单模型评估者产生系统性误判。
基于罗马尼亚高中历史教科书的实证研究显示：在270个经筛查的文本摘录中，83.3%被判定为教学可接受内容（平均严重度2.9/7），而零样本基线评估的严重度均值为5.4/7，这表明智能体审议机制能有效缓解过度惩罚问题。在盲态人工评估中（18位评估者，54组对比），独立审议配置在64.8%的案例中优于启发式变体与零样本基线。以每本教科书约2美元的成本计算，这些结果表明智能体评估架构可作为教育治理领域经济可行的决策支持工具。

Abstract

History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.

Keywords: agentic evaluation architecture, historical bias detection, educational textbooks, multi-agent systems, source attribution protocol, Romanian history textbooks, decision-support tools, educational governance
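
The role of a meta-agent that synthesizes a verdict and escalates to a human can be sketched in a few lines. The abstract describes a jury of five evaluative agents and a 7-point severity scale but gives no aggregation rule, so everything below is an assumption: the function name `synthesize_verdict`, the mean-based verdict, and both thresholds are hypothetical.

```python
from statistics import mean, pstdev


def synthesize_verdict(severities, accept_below=4.0, escalate_spread=1.5):
    """Combine juror severity ratings (1-7 scale) into one verdict.

    ASSUMPTION: strong disagreement (population std dev above `escalate_spread`)
    is escalated to a human; otherwise the mean severity decides.
    """
    average = mean(severities)
    spread = pstdev(severities)
    if spread > escalate_spread:
        verdict = "escalate_to_human"
    elif average < accept_below:
        verdict = "acceptable"
    else:
        verdict = "flagged"
    return {"verdict": verdict, "mean": average, "spread": spread}


print(synthesize_verdict([3, 2, 3, 4, 3])["verdict"])  # acceptable
print(synthesize_verdict([1, 7, 2, 6, 1])["verdict"])  # escalate_to_human
print(synthesize_verdict([5, 6, 5, 6, 5])["verdict"])  # flagged
```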

147. ❌ MemReader: From Passive to Active Extraction for Long-Term Agent Memory

Authors: Jingyi Kang, Chunyu Li, Ding Chen, Bo Tang, Feiyu Xiong, Zhiyu Li Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

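Scoring rationale: The paper's core topic is a long-term memory extraction system for agents, which is highly relevant to 'LLM Agents' (10 points). It involves reasoning (Chain of Thought, System 2 Thinking, Self-Correction) and hallucination reduction (Hallucination Mitigation), which are fairly relevant (8 points). It uses small models (MemReader-0.6B/4B) and reinforcement-learning optimization (GRPO), giving some connection to 'Small Language Models' and 'Post-training' (5 points). It also touches on retrieving historical context (Retrieval-Augmented Generation), long-context handling (Long Context LLMs), tool use (Tool Use), and in-context learning (In-context Learning), which are moderately relevant (5 points). Other keywords such as MoE, quantization, and AI for science are not involved and score 0.

!!! tip deepseek-chat TL;DR

The paper targets noise, missing references, and cross-turn dependencies in long-term memory extraction for agents. It proposes the MemReader model family, which combines passive extraction with active decision-making to achieve selective, reasoning-driven memory writes, reaches state-of-the-art performance on multiple benchmarks, and has been integrated into MemOS for real-world use.

Abstract translation (Chinese)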

长期记忆是个性化与自主智能体的基础能力，但其内容填充仍是瓶颈。现有系统将记忆提取视为从上下文到结构化条目的单次被动转录，难以处理嘈杂对话、指代缺失及跨轮次依赖问题，导致记忆污染、低价值写入与不一致性。本文提出用于智能体系统主动式长期记忆提取的MemReader系列模型：MemReader-0.6B作为经蒸馏训练的紧凑高效被动提取器，能生成精确且模式一致的结构化输出；MemReader-4B则是通过分组相对策略优化（Group Relative Policy Optimization, GRPO）训练的主动提取器，具备记忆写入决策能力。在ReAct范式下，MemReader-4B在执行前会显式评估信息价值、指代歧义与完整性，并能选择性写入记忆、暂存不完整输入、检索历史上下文或过滤无关对话。在LOCOMO、LongMemEval和HaluMem数据集上的实验表明，MemReader系列持续超越现有基于提取的基线方法。特别地，MemReader-4B在知识更新、时序推理和幻觉消减任务中取得了最先进的性能。这些结果表明，有效的智能体记忆不仅需要提取更多信息，更需通过推理驱动与选择性记忆提取来构建低噪声、动态演化的长期记忆。此外，MemReader已集成至MemOS系统并投入实际应用。为支持未来研究与落地，我们公开模型并提供公共API访问接口。

Abstract

Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.

Keywords: agent memory, long-term memory extraction, active extraction, Group Relative Policy Optimization, ReAct paradigm, hallucination reduction, temporal reasoning, MemReader
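
The four outcomes the abstract lists for the active extractor (write, defer, retrieve, discard) can be illustrated with a rule-based stand-in. In the paper these decisions are made by a model trained with GRPO; the fixed rules, the `Assessment` fields, and the `min_value` threshold below are hypothetical and only show the shape of the decision.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WRITE = "write"        # store the memory
    DEFER = "defer"        # hold an incomplete input for later
    RETRIEVE = "retrieve"  # fetch historical context to resolve a reference
    DISCARD = "discard"    # drop irrelevant chatter


@dataclass
class Assessment:
    value: float               # estimated information value in [0, 1]
    ambiguous_reference: bool  # e.g. an unresolved "he" or "that place"
    complete: bool             # whether the input is self-contained


def decide(assessment, min_value=0.5):
    """Hypothetical rule-based stand-in for the learned write policy."""
    if assessment.value < min_value:
        return Action.DISCARD
    if assessment.ambiguous_reference:
        return Action.RETRIEVE
    if not assessment.complete:
        return Action.DEFER
    return Action.WRITE


print(decide(Assessment(value=0.9, ambiguous_reference=False, complete=True)))
print(decide(Assessment(value=0.1, ambiguous_reference=False, complete=True)))
```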

148. ❌ Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

Authors: Michelle Damin Kim, Ellie S. Paek, Yufen Lin, Emily Mroz, Jane Chung, Jinho D. Choi Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

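Scoring rationale: The paper's core is the application of LLMs (GPT-4o, GPT-5-nano, GPT-5) to build a social media dataset for measuring and analyzing loneliness, which is applied research on large models in social science and mental health. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are central to the methodology. It has some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because the study applies AI to social science (which can be viewed as science in a broad sense), though not to a traditional hard-science field. The other keywords (MoE, Scaling Laws, RLHF, etc.) concern the technical principles of large models or specific applications that the paper does not address, so all score 0.

!!! tip deepseek-chat TL;DR

The study uses large language models (the GPT series) to build a social media dataset, measures and compares loneliness among caregivers and non-caregivers, and finds substantial differences between the two groups in the distribution of causes of loneliness.

Abstract translation (Chinese)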

本文提出一种基于大语言模型（LLM）的方法，用于构建多样化的社交媒体数据集，以测量和比较照顾者与非照顾者人群的孤独感。我们引入一个由专家开发的孤独感评估框架和一个专家指导的孤独成因分类体系，用于分析社交媒体文本。通过采用经过人工验证的数据处理流程，我们运用GPT-4o、GPT-5-nano和GPT-5构建了一个高质量的Reddit语料库，并分析了两类人群的孤独感。该孤独感评估框架在照顾者和非照顾者群体中分别达到了76.09%和79.78%的平均准确率。孤独成因分类框架在照顾者和非照顾者群体中分别取得了0.825和0.80的微观聚合F1分数。在不同人群中，我们观察到孤独成因类型的分布存在显著差异。照顾者的孤独感主要与照顾角色、身份认同和被遗弃感相关，这表明两组人群的孤独体验存在明显区别。人口统计学信息提取进一步证明了利用Reddit构建多样化照顾者孤独数据集的可行性。总体而言，本研究建立了一套基于LLM的流程，用于创建高质量的社交媒体数据集以研究孤独感，并验证了其在分析孤独感表现的人群层面差异方面的有效性。

Abstract

This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers’ loneliness were predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.

Keywords: LLM-driven approach, loneliness measurement, caregivers, social media datasets, GPT-4o, Reddit corpus, population-level differences, human-validated pipeline
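
The abstract reports "micro-aggregate F1" scores for cause categorization. Assuming this means the standard micro-averaged F1, the metric pools true positives, false positives, and false negatives across all classes before computing F1. The per-class counts below are invented for illustration and are not from the paper.

```python
def micro_f1(counts):
    """Micro-averaged F1 from a list of per-class (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for three cause categories: (tp, fp, fn).
example_counts = [(8, 2, 1), (5, 1, 3), (7, 0, 1)]
print(round(micro_f1(example_counts), 3))  # 0.833
```

Because counts are pooled, frequent categories weigh more in the micro average than in a macro average, which treats every category equally.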

149. ❌ Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Authors: Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo, Derek Yuen, Kun Shao, Huaibo Huang, Junxian Duan, Jie Cao, Ran He Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07831v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

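Scoring rationale: The paper studies an adversarial attack on GUI agents (semantic-level UI element injection). It belongs to computer vision, human-computer interaction, and AI safety, and does not contribute innovations in the technical principles of large models or deep learning. All of the keywords relate directly to large-model techniques, training methods, inference optimization, alignment, or scientific applications, whereas this paper focuses on adversarial attacks against visual GUI agents and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a semantic-level UI element injection attack on GUI agents that overlays safety-aligned UI elements to misdirect the agent's visual grounding. Experiments show that the method substantially raises attack success rate and creates persistent attractor elements.

Abstract translation (Chinese)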

现有针对图形用户界面（GUI）智能体的红队研究存在重要局限。对抗性扰动通常需要白盒访问权限，这在商业系统中难以实现；而提示注入攻击则因日益增强的安全对齐机制而逐渐失效。为在更贴近实际的威胁模型下研究系统鲁棒性，本文提出语义级界面元素注入攻击——一种通过将安全对齐的无害界面元素叠加至屏幕截图以误导智能体视觉定位的红队测试框架。该方法采用模块化的编辑器-叠加器-受害对象流水线，配合迭代搜索算法：该算法采样多个候选编辑方案，保留最佳累积叠加效果，并根据历史失败记录自适应调整后续提示策略。在五个受害模型上的实验表明，优化后的攻击方案对最强防御模型的攻击成功率较随机注入最高提升4.4倍。此外，在源模型上优化的攻击元素能有效迁移至其他目标模型，表明此类漏洞具有模型无关性。首次成功攻击后，受害模型在后续独立试验中仍有超过15%的概率持续点击攻击者控制的界面元素（随机注入的对照组低于1%），证明注入元素形成了持续性吸引因子而非简单的视觉干扰。

Abstract

Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic-level UI Element Injection, a red-teaming setting that overlays safety-aligned and harmless UI elements onto screenshots to misdirect the agent’s visual grounding. Our method uses a modular Editor-Overlapper-Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model-agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker-controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.

Keywords: GUI agents, red-teaming, Semantic-level UI Element Injection, adversarial attacks, visual grounding, model-agnostic vulnerabilities, attack success rate, persistent attractor
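
The search procedure in the abstract (sample several candidate edits per round, keep the best cumulative result) has the shape of a generic best-of-N iterative search. The sketch below shows only that control flow on a toy numeric objective. The functions `iterative_search`, `toy_score`, and `toy_propose` are hypothetical, and the failure-driven adaptation of prompt strategies that the paper describes is omitted.

```python
import random


def iterative_search(score, propose, rounds=5, samples=4, seed=0):
    """Each round, sample candidate additions and keep the best cumulative list."""
    rng = random.Random(seed)
    best = []
    best_score = score(best)
    history = [best_score]
    for _ in range(rounds):
        candidates = [best + [propose(rng)] for _ in range(samples)]
        top = max(candidates, key=score)
        if score(top) > best_score:  # accept only strict improvements
            best, best_score = top, score(top)
        history.append(best_score)
    return best, history


def toy_score(items):
    """Toy objective: higher is better as the running sum approaches 10."""
    return -abs(10 - sum(items))


def toy_propose(rng):
    """Toy candidate generator: a small random integer."""
    return rng.randint(1, 4)


best, history = iterative_search(toy_score, toy_propose)
print(best, history)
```

Keeping the best cumulative result means the tracked score never decreases across rounds, whatever the sampler proposes.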

150. ❌ Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Authors: Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, Yuekun Yao Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

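Scoring rationale: The paper studies how recurrent-depth transformers perform at implicit reasoning, focusing on the multi-hop reasoning and compositional generalization abilities of large language models (LLMs). It is therefore highly relevant to 'Large Language Models' (10 points), because it explicitly studies transformer-based large language models. It is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points), because implicit reasoning involves multi-step, deep reasoning processes that match the core concepts of these keywords. It has some connection to 'Mechanistic Interpretability' (5 points), because the paper uses mechanistic analysis to support its findings. Other keywords such as MoE, SFT, RAG, and quantization are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how recurrent-depth transformers use iterative computation to address the compositional generalization problem of large language models in implicit multi-hop reasoning. It finds that they achieve systematic generalization and depth extrapolation, but are limited by overthinking.

Abstract translation (Chinese)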

本研究探讨隐式推理能力，即在前向单次传播中组合知识或规则的能力。尽管基于Transformer的大语言模型存储了大量事实知识与规则，却常难以组合这些知识进行隐式多跳推理，这表明其参数化知识缺乏组合泛化能力。为突破此限制，我们研究循环深度Transformer，该架构可在相同Transformer层上实现迭代计算。我们在隐式推理场景下探究两类组合泛化挑战：系统性泛化（即组合训练过程中从未共同使用过的知识）与深度外推（即从有限推理深度（如训练时最多5跳）泛化至更深层组合（如10跳））。通过从头训练的模型进行对照实验，我们发现标准Transformer难以应对这两类泛化挑战，而循环深度Transformer能有效实现此类泛化。针对系统性泛化，我们通过机制分析发现该能力通过三阶段顿悟过程涌现：从记忆阶段过渡到分布内泛化阶段，最终实现系统性泛化。对于深度外推，我们证明通过扩展推理时循环迭代可解锁超越训练深度的泛化能力，更多迭代次数支持更深层推理。我们进一步探究训练策略如何影响外推性能，为循环深度Transformer的训练提供指导，并发现关键局限——过度思考现象，即过量循环迭代会降低预测质量并限制对极深层组合的泛化能力。

Abstract

We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enables iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.

Keywords: implicit reasoning, recurrent-depth transformers, compositional generalization, multi-hop reasoning, systematic generalization, depth extrapolation, grokking process, overthinking
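
The intuition that more iterations of the same layers allow deeper multi-hop composition can be shown with a toy that is not a transformer. Here a single shared `step` function follows one hop in a lookup table, and applying it `n_iters` times resolves an n-hop query. The entity names and both functions are hypothetical. The toy reaches a fixed point at the end of the chain, so it does not reproduce the overthinking degradation the paper reports.

```python
def step(state, facts):
    """One shared 'layer': follow a single hop if the entity has an outgoing fact."""
    return facts.get(state, state)


def recurrent_depth(start, facts, n_iters):
    """Apply the same step function n_iters times (weights shared across depth)."""
    state = start
    for _ in range(n_iters):
        state = step(state, facts)
    return state


# A chain of ten facts: e0 -> e1 -> ... -> e10.
facts = {f"e{i}": f"e{i + 1}" for i in range(10)}

print(recurrent_depth("e0", facts, 5))   # e5
print(recurrent_depth("e0", facts, 10))  # e10
```

Because the step is reused rather than stacked, the reachable depth is set by the iteration count at inference time, not by a fixed number of layers.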

151. ❌ More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

Authors: Advait Yadav, Sid Black, Oliver Sourbut Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07821v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

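Scoring rationale: The paper's core topic is the cooperative behavior of LLMs in multi-agent systems, so it is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each). It includes agent reasoning analysis, which gives it some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each). It studies cooperation under instruction following, which gives it some relevance to "Instruction Tuning" (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies cooperation failures of LLM agents in a zero-cost collaboration setting. It finds that greater model capability does not guarantee better cooperation, and that coordination problems in multi-agent systems need to be addressed through explicit cooperative design.

Abstract (Chinese translation)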

大型语言模型（LLM）智能体在多智能体系统中的协作日益增多，但我们仍缺乏对其合作失败发生位置及原因的理解。从组织内部的知识共享到代码文档编写，许多现实世界的协调问题中，帮助他人仅需承担微不足道的个人成本，却能产生显著的集体效益。然而，当帮助行为既不会给帮助者带来收益也不会造成损害，且智能体已获得明确指示去执行时，LLM智能体是否会选择合作，目前尚不明确。我们构建了一个多智能体实验环境，旨在研究无摩擦环境中的合作行为，消除了合作中所有策略复杂性。研究发现，模型能力并不能预测合作倾向：尽管收到相同的“最大化集体收益”指令，OpenAI o3仅实现了17%的最优集体绩效，而OpenAI o3-mini却达到了50%。通过采用自动化单侧智能体通信的因果分解方法，我们将合作失败与能力失败进行分离，并借助智能体推理分析追溯其根源。在测试针对性干预措施时，我们发现明确的操作协议可使低能力模型的绩效翻倍，而微小的共享激励能有效改善弱合作模型的协作表现。我们的研究结果表明，仅靠扩展智能本身无法解决多智能体系统中的协调问题，即使帮助他人无需任何成本，仍需要经过深思熟虑的合作机制设计。

Abstract

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.

Keywords: LLM agents, multi-agent systems, cooperation, coordination, agent reasoning, collaboration failures, zero-cost collaboration, collective performance

152. ❌ Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

Authors: Kunfeng Chen, Luyao Zhuang, Fei Liao, Juhua Liu, Jian Wang, Bo Du Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

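Scoring rationale: The paper's core topic is tool retrieval in LLM tool learning, so it is highly relevant to "Large Language Models" and "Tool Use" (10 points each), because it explicitly studies the LLM tool-learning paradigm. It is relevant to "LLM Agents" (8 points), because tool learning is an important component of LLM agents. It has some relevance to "Retrieval-Augmented Generation" (8 points), because the paper involves a retrieval-augmented generation process. Other keywords, such as MoE, SFT, and RLHF, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the drop in tool retrieval performance caused by vague real-world instructions in LLM tool learning. It proposes Tool Retrieval Bridge, a method that uses a bridge model to rewrite vague instructions, and reports substantial gains across multiple retrieval settings and all baseline retrievers.

Abstract (Chinese translation)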

工具学习已成为大型语言模型应对现实世界挑战的一种前景广阔的范式。由于工具数量庞大且更新频繁，工具检索对于选择所需工具子集至关重要。然而，当前的工具检索方法通常基于包含过度详细指令（如特定API名称和参数）的学术基准，而现实世界中的指令往往更为模糊。这种差异会阻碍工具检索在实际应用中的效果。本文首先构建了一个新基准VGToolBench，以模拟人类的模糊指令。在此基础上，我们进行了一系列初步分析，发现模糊指令确实会损害工具检索的性能。为此，我们提出了一种简单而有效的工具检索桥梁方法，以提升针对模糊指令的工具检索性能。TRB的核心原理是引入一个桥梁模型，将模糊指令重写为更具体的指令，从而缓解模糊指令与检索器偏好之间的差距。我们在多种常用检索设置下进行了广泛实验，结果表明TRB能有效降低模糊指令的歧义性，并在所有基线检索器上实现一致且显著的性能提升。例如，在TRB的辅助下，BM25检索器实现了高达111.51%的相对改进，即将平均NDCG分数从9.73提升至19.59。源代码与模型已公开于https://github.com/kfchenhn/TRB。

Abstract

Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.

Keywords: Tool Learning, Large Language Models, Tool Retrieval, Vague Instructions, Bridge Model, Retrieval Performance, VGToolBench, NDCG

153. ❌ AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Authors: Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07815v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

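Scoring rationale: The AsyncTLS paper focuses on efficiency optimization for long-context LLM inference, addressing attention complexity, KV cache memory, and compute efficiency. Highly relevant keywords: 1) "Large Language Models" (the paper directly studies LLM inference); 2) "Context Window Extension" (it targets long contexts of 48k to 96k tokens); 3) "KV Cache Compression" (it addresses KV cache memory); 4) "Speculative Decoding OR Inference Acceleration" (it reports 1.2x to 10.0x speedups); 5) "Mixture of Experts OR Sparse Models" (it uses a hierarchical sparse attention method). Other keywords, such as small models, training methods, alignment, and scientific applications, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes AsyncTLS, a system that combines hierarchical sparse attention with an asynchronous offloading engine to address KV cache memory and compute efficiency in long-context LLM inference. It reports significant inference speedups and throughput gains while keeping accuracy comparable to full attention.

Abstract (Chinese translation)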

大语言模型的长上下文推理面临注意力机制二次方复杂度与键值缓存内存占用的双重挑战。基于令牌粒度的稀疏注意力虽能提供更高精度，但其索引开销巨大；块级方法提升了效率，却牺牲了准确性。我们提出AsyncTLS，一种分层稀疏注意力系统，它结合了粗粒度块过滤与细粒度令牌选择以平衡精度与效率，并配有一个异步卸载引擎，该引擎通过利用时间局部性将键值缓存传输与计算过程重叠执行。在Qwen3与GLM-4.7-Flash模型上，针对GQA（分组查询注意力）与MLA（多头潜在注意力）架构的评估表明，在48k至96k上下文长度范围内，AsyncTLS达到了与全注意力相当的精度，同时实现了1.2倍至10.0倍的算子加速以及1.3倍至4.7倍的端到端吞吐量提升。

Abstract

Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.

Keywords: LLM inference, sparse attention, KV cache, long-context, asynchronous offloading, efficiency optimization, attention complexity, throughput improvement
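
The two-level selection described in the abstract above (coarse block filtering followed by fine token selection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AsyncTLS implementation: scoring a block by the dot product of the query with the block's mean key is an assumed heuristic, the function names `two_level_select` and `sparse_attention` are hypothetical, and the asynchronous KV cache offloading engine is not modeled at all.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_level_select(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    block_size: int,
    top_blocks: int,
    top_tokens: int,
) -> List[int]:
    """Return indices of the tokens kept after block-level then token-level filtering."""
    # Level 1 (coarse): score each block with one dot product against its mean key.
    block_scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        block_scores.append((dot(query, mean_key), start))
    kept_blocks = sorted(block_scores, reverse=True)[:top_blocks]

    # Level 2 (fine): score individual tokens, but only inside the kept blocks.
    candidates = [
        index
        for _, start in kept_blocks
        for index in range(start, min(start + block_size, len(keys)))
    ]
    ranked = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:top_tokens])


def sparse_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    selected: Sequence[int],
) -> List[float]:
    """Softmax attention restricted to the selected token indices."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    toy_keys = [
        [0.1, 0.0], [0.0, 1.0], [0.2, 0.0], [0.0, 0.5],
        [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.0],
    ]
    toy_query = [1.0, 0.0]
    picked = two_level_select(toy_query, toy_keys, block_size=4, top_blocks=1, top_tokens=2)
    print(picked, sparse_attention(toy_query, toy_keys, toy_keys, picked))
```

The coarse pass costs one dot product per block rather than per token, which is why a two-level scheme can approach token-level precision without paying token-level indexing cost over the whole context.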

154. ❌ GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

Authors: Kaiyuan Tian, Yu Tang, Gongqingjian Jiang, Baihui Liu, Yifu Gao, Xialin Su, Linbo Qiao, Dongsheng Li Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

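Scoring rationale: The GRASS paper focuses on efficient fine-tuning of large language models (LLMs), so it is highly relevant to "Large Language Models" and "Post-training/Supervised Fine-tuning" (10 points each). It proposes a parameter-efficient fine-tuning (PEFT) method, so it is highly relevant to "PEFT/LoRA/Parameter-efficient Fine-tuning" (10 points). The paper does not cover other keywords such as MoE, SLMs, Scaling Laws, Alignment, RAG, reasoning methods, agents, compression, or AI for Science, so these score 0.

!!! tip deepseek-chat TL;DR

This paper proposes GRASS, a gradient-based adaptive layer-wise importance sampling framework that targets the high memory cost of full-parameter fine-tuning of large language models. It reports lower memory usage together with better downstream task performance.

Abstract (Chinese translation)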

大语言模型的全参数微调受限于巨大的GPU内存需求。低秩适应方法通过仅更新部分参数缓解了这一挑战。然而，这些方法通常会限制模型表达能力，导致性能低于全参数微调。逐层微调方法作为一种替代方案应运而生，其通过静态的层重要性采样策略实现了内存高效的训练。然而，这些方法忽视了不同任务及训练阶段中层重要性的动态变化，导致在下游任务上表现欠佳。为应对这些局限，我们提出了GRASS，一种基于梯度的自适应逐层重要性采样框架。GRASS利用平均梯度范数作为任务感知与训练阶段感知的度量指标来评估层重要性。此外，GRASS通过自适应训练策略动态调整层采样概率。我们还引入了一种逐层优化器状态卸载机制，该机制通过重叠计算与通信进一步降低内存占用，同时保持可比的训练吞吐量。在多个模型与基准测试上的广泛实验表明，GRASS始终优于现有最先进的方法，平均准确率提升最高达4.38个百分点，内存使用量最高降低19.97%。

Abstract

Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97%.

Keywords: Large Language Models, Fine-tuning, Parameter-efficient Fine-tuning, Memory-efficient Training, Layer-wise Importance Sampling, Gradient-based Adaptation, Optimizer State Offloading, Performance Improvement
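
The abstract above says GRASS estimates layer importance from mean gradient norms and adjusts layer sampling probabilities accordingly. A minimal sketch of that general idea follows. The abstract does not specify the exact mapping from norms to probabilities or the adaptive schedule, so normalizing the norms proportionally is an assumption, and the function names (`mean_gradient_norm`, `layer_sampling_probs`, `sample_layers`) and toy gradients are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional


def mean_gradient_norm(tensor_grads: List[List[float]]) -> float:
    """Average L2 norm over a layer's flattened parameter gradients."""
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in tensor_grads]
    return sum(norms) / len(norms)


def layer_sampling_probs(layer_grads: Dict[str, List[List[float]]]) -> Dict[str, float]:
    """Map per-layer mean gradient norms to sampling probabilities.

    Assumption: probabilities are proportional to the norms; fall back to a
    uniform distribution when every gradient is zero.
    """
    scores = {name: mean_gradient_norm(grads) for name, grads in layer_grads.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


def sample_layers(probs: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k distinct layers to update in this step, weighted by importance."""
    remaining = dict(probs)
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        weights: Optional[List[float]] = [remaining[name] for name in names]
        if sum(weights) <= 0.0:
            weights = None  # only zero-probability layers left: pick uniformly
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


if __name__ == "__main__":
    toy_grads = {
        "layer0": [[3.0, 4.0]],
        "layer1": [[0.0, 5.0], [0.0, 15.0]],
    }
    probs = layer_sampling_probs(toy_grads)
    print(probs)
    print(sample_layers(probs, k=1, rng=random.Random(0)))
```

Recomputing the probabilities periodically during training is what would make the importance estimate depend on both the task and the training stage, in contrast to the static sampling strategies the abstract criticizes. The optimizer state offloading mechanism is not covered by this sketch.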

155. ❌ TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

Authors: Atahan Dokme, Benjamin Reichman, Larry Heck Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07801v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

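Scoring rationale: The paper's core topic is how emotional language affects the performance of large language models on quantitative reasoning tasks. It directly involves "Large Language Models" and "Chain of Thought/System 2 Thinking" (reasoning ability), so these three keywords receive 10 points. Other keywords, such as model training methods, optimization techniques, and specific application domains, are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

This paper studies whether emotional framing affects the quantitative reasoning of large language models. It finds that emotional wording lowers accuracy even when all numerical content is preserved, and that neutralizing the wording recovers most of the lost performance.

Abstract (Chinese translation)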

大型语言模型的训练与评估通常基于以简洁、情感中立语言表述的定量推理任务。然而，现实世界中的查询常伴随着沮丧、紧迫或热情等情绪色彩。当所有数值内容保持不变时，仅情感框架本身是否会削弱推理能力？为探究此问题，本研究开发了一种受控情感转换框架，该框架在保留所有数量与关系的前提下，将问题重写为情感化变体。利用此框架，我们在GSM8K、MultiArith和ARC-Challenge数据集上构建了Temper-5400（包含5,400对经语义验证的情感-中立配对），并对十八个模型（从10亿参数到前沿规模）进行了评估。研究得出两个核心结论：首先，即使所有数值内容保持不变，情感框架仍会使准确率降低2-10个百分点。其次，将情感化变体转换为中立表述可恢复大部分损失的性能，这表明性能下降与情感风格而非内容损坏相关，且中立化处理可作为轻量级的推理阶段缓解策略。非情感性复述不会导致此类性能下降，进一步证明影响源自情感内容而非表层语言变化。除情感因素外，本研究的基准构建流程为受控风格转换与鲁棒性评估提供了一个通用框架。

Abstract

Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion–neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.

Keywords: Large Language Models, Quantitative Reasoning, Emotional Framing, Benchmark Construction, Robustness Evaluation, Inference-time Mitigation, GSM8K, MultiArith

156. ❌ ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

Authors: Kenan Li, Qirui Jin, Liao Zhu, Xiaosong Huang, Yijia Wu, Yikai Zhang, Xin Zhang, Zijian Jin, Yufan Huang, Elsie Nallipogu, Chaoyun Zhang, Yu Kang, Saravan Rajmohan, Qingwei Lin, Wenke Lee, Dongmei Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

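Scoring rationale: The paper studies language model (LM) agents on software engineering (SWE) tasks, with a core focus on how agentic workflows and contextual information signals affect task success. It is highly relevant to "LLM Agents" (10 points), because it directly studies LM agents applied to SWE. It is relevant to "Tool Use" (8 points), because it involves tool calls such as API usage. It is relevant to "Large Language Models" (8 points), because LMs serve as the foundation of the agents. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how much different contextual information signals (such as tests and API usage) contribute to the performance of language model agents on software engineering tasks. It proposes Oracle-SWE, a method for quantifying the impact of these signals, with the aim of guiding research priorities for autonomous coding systems.

Abstract (Chinese translation)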

语言模型（LM）智能体领域的最新进展显著提升了自动化软件工程（SWE）的水平。先前的研究提出了多种智能体工作流程与训练策略，并分析了智能体系统在软件工程任务中的失败模式，重点关注了若干上下文信息信号：复现测试、回归测试、编辑位置、执行上下文以及API使用。然而，每个信号对整体成功的具体贡献仍未得到充分探究，尤其是在能够完美获取中间信息时各信号的理想贡献。为填补这一空白，我们提出了Oracle-SWE方法，这是一种从软件工程基准测试中隔离并提取预言机（oracle）信息信号的统一方法，用以量化每个信号对智能体性能的影响。为进一步验证该模式，我们评估了由强大语言模型提取的信号在提供给基础智能体时所带来的性能增益，以此近似模拟现实世界任务解决场景。这些评估旨在为自主编码系统的研究优先级提供指导。

Abstract

Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

Keywords: language model agents, software engineering, oracle information signals, agentic workflows, API usage, autonomous coding systems, performance quantification

157. ❌ PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context

Authors: Steven Au, Baihan Lin Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

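Scoring rationale: The paper mainly studies an evaluation framework for personalized review generation. It involves retrieval-augmented generation (RAG), through retrieval of graph-structured evidence, and hallucination mitigation, through Dissonance Analysis that measures deviation, so it has moderate relevance to these two keywords (5 to 8 points). The paper does not explicitly discuss large-model technical details, training methods, inference optimization, or agent systems, so the other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes PeReGrINE, a framework that uses graph-structured user-item evidence to evaluate the fidelity of personalized review generation. It finds that graph-derived evidence is the main driver of personalization and consistency.

Abstract (Chinese translation)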

我们推出PeReGrINE——一个基于图结构用户-物品证据的个性化评论生成基准与评估框架。PeReGrINE将Amazon Reviews 2023数据集重构为具有时间一致性的二分图，其中每个目标评论的生成均以明确时间截点下的用户历史、物品上下文及邻域交互的有限证据为条件。为表征用户持久偏好而避免直接依赖稀疏原始历史数据，我们计算了用户风格参数（User Style Parameter），该参数汇总了每位用户在过往评论中表现出的语言风格与情感倾向。此框架支持对四种图结构检索设定进行受控比较：仅产品证据、仅用户证据、仅邻域证据以及混合证据。除标准生成指标外，我们提出了不协调分析（Dissonance Analysis）——一种宏观评估框架，用于衡量生成内容与预期用户风格及产品层面共识的偏离程度。我们还研究了视觉证据作为辅助上下文来源的作用，发现其在某些设定中能提升文本质量，但图结构证据仍是个性化与一致性的主要驱动因素。PeReGrINE为跨产品类别研究证据构成如何影响检索条件语言模型的评论忠实度、个性化程度及事实依据性，提供了可复现的评估路径。

Abstract

We introduce PeReGrINE, a benchmark and evaluation framework for personalized review generation grounded in graph-structured user–item evidence. PeReGrINE restructures Amazon Reviews 2023 into a temporally consistent bipartite graph, where each target review is conditioned on bounded evidence from user history, item context, and neighborhood interactions under explicit temporal cutoffs. To represent persistent user preferences without conditioning directly on sparse raw histories, we compute a User Style Parameter that summarizes each user’s linguistic and affective tendencies over prior reviews. This setup supports controlled comparison of four graph-derived retrieval settings: product-only, user-only, neighbor-only, and combined evidence. Beyond standard generation metrics, we introduce Dissonance Analysis, a macro-level evaluation framework that measures deviation from expected user style and product-level consensus. We also study visual evidence as an auxiliary context source and find that it can improve textual quality in some settings, while graph-derived evidence remains the main driver of personalization and consistency. Across product categories, PeReGrINE offers a reproducible way to study how evidence composition affects review fidelity, personalization, and grounding in retrieval-conditioned language models.

Keywords: personalized review generation, graph-structured evidence, retrieval-conditioned language models, evaluation framework, user-item graph, fidelity assessment, dissonance analysis, Amazon Reviews

158. ❌ ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Authors: Hengyu An, Minxi Li, Jinghuai Zhang, Naen Xu, Chunyi Zhou, Changjiang Li, Xiaogang Xu, Tianyu Du, Shouling Ji Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07775v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

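Scoring rationale: The paper "ACIArena: Toward Unified Evaluation for Agent Cascading Injection" focuses on a security evaluation framework for multi-agent systems (MAS), specifically targeting Agent Cascading Injection attacks. Its core content concerns multi-agent systems, agent coordination, and security evaluation, so it is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" and "Multi-agent Systems OR Agent Coordination" (10 points each). However, the paper does not cover the technical principles of large models (such as LLM architecture, training methods, or inference optimization), AI for Science applications, or the other keywords, so the remaining keywords score 0. The weighted total is calculated as 20.0 (10×1.0 + 10×1.0).

!!! tip deepseek-chat TL;DR

This paper addresses the security risk of Agent Cascading Injection attacks in multi-agent systems and proposes ACIArena, a unified evaluation framework. Using systematic test suites and benchmark evaluation, it finds that assessing MAS robustness through topology alone is insufficient, that roles and interaction patterns need deliberate design, and that defenses developed in simplified environments may fail in real-world settings.

Abstract (Chinese translation)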

协作与信息共享赋能了多智能体系统（MAS），但也引入了名为智能体级联注入（Agent Cascading Injection，ACI）的关键安全风险。在此类攻击中，一个被攻陷的智能体利用智能体间的信任传播恶意指令，导致系统内发生级联故障。然而，现有研究仅考虑了有限的攻击策略和简化的MAS设置，限制了其普适性与全面评估能力。为弥补这一差距，我们提出了ACIArena——一个用于评估MAS鲁棒性的统一框架。ACIArena提供了涵盖多个攻击面（即外部输入、智能体配置文件、智能体间消息）和攻击目标（即指令劫持、任务破坏、信息窃取）的系统化评估套件。具体而言，ACIArena建立了一套统一规范，可同时支持MAS构建与攻防模块。它涵盖了六种广泛使用的MAS实现，并提供了包含1,356个测试用例的基准，用于系统评估MAS的鲁棒性。我们的基准测试结果表明，仅通过拓扑结构评估MAS鲁棒性是不充分的；鲁棒的MAS需要精心的角色设计和受控的交互模式。此外，在简化环境中开发的防御机制往往难以迁移到实际场景中；范围过窄的防御甚至可能引入新的漏洞。ACIArena旨在为推动更深入的MAS设计原则探索提供坚实基础。

Abstract

Collaboration and information sharing empower Multi-Agent Systems (MAS) but also introduce a critical security risk known as Agent Cascading Injection (ACI). In such attacks, a compromised agent exploits inter-agent trust to propagate malicious instructions, causing cascading failures across the system. However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. ACIArena offers systematic evaluation suites spanning multiple attack surfaces (i.e., external inputs, agent profiles, inter-agent messages) and attack objectives (i.e., instruction hijacking, task disruption, information exfiltration). Specifically, ACIArena establishes a unified specification that jointly supports MAS construction and attack-defense modules. It covers six widely used MAS implementations and provides a benchmark of 1,356 test cases for systematically evaluating MAS robustness. Our benchmarking results show that evaluating MAS robustness solely through topology is insufficient; robust MAS require deliberate role design and controlled interaction patterns. Moreover, defenses developed in simplified environments often fail to transfer to real-world settings; narrowly scoped defenses may even introduce new vulnerabilities. ACIArena aims to provide a solid foundation for advancing deeper exploration of MAS design principles.

Keywords: Multi-Agent Systems, Agent Cascading Injection, Security Evaluation, Robustness Benchmarking, Attack-Defense Framework, Inter-agent Trust, Cascading Failures, Unified Specification

159. ❌ An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Authors: Clarissa Miranda-Pena, Andrew Reeson, Cécile Paris, Josiah Poon, Jonathan K. Kummerfeld Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

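Scoring rationale: The paper centers on library hallucinations in LLM code generation, so it is highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10 points each). 'Self-Correction' is partially relevant (5 points), because static analysis can be viewed as an external correction mechanism. Other keywords such as MoE, SFT, and RAG are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies library hallucinations in LLM-generated code. Its empirical analysis finds that static analysis tools detect 16-70% of all errors, but manual analysis puts an upper bound of 48.5-77% on what such methods could plausibly catch, so static analysis cannot fully solve the problem.

Abstract (Chinese translation)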

尽管已有大量研究，大型语言模型在生成代码时仍持续产生幻觉现象，在使用库函数时尤为明显。在需要调用库的NL-to-code基准测试中，我们发现LLMs生成的代码中有8.1%-40%使用了不存在的库功能。一种直观的幻觉检测与缓解方法是静态分析。本文系统评估了静态分析工具的潜力，既涵盖其可解决的问题，也明确其局限性。研究表明，静态分析工具可检测出16%-70%的错误及14%-85%的库相关幻觉，其性能表现因LLM和数据集而异。通过人工分析，我们识别出静态方法理论上无法捕获的案例，由此推得其潜在检测上限为48.5%至77%。总体而言，我们证明静态分析是应对特定形式幻觉的低成本方法，并量化了该方法始终存在的固有局限范围。

Abstract

Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.

Keywords: Large Language Models, code generation, library hallucinations, static analysis, error detection, NL-to-code benchmarks, hallucination mitigation, empirical analysis
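To make the idea of checking generated code for non-existent library features concrete, here is a minimal sketch using Python's standard `ast` and `importlib` modules. It is not the tooling evaluated in the paper; the function name `find_missing_library_names` and the example snippet are hypothetical. It only handles plain `import x` statements and direct `module.attr` references, and it imports the library in order to inspect it, so it is an approximation of a purely static check.

```python
import ast
import importlib


def find_missing_library_names(source: str) -> list[str]:
    """Return imported modules or `module.attr` references that cannot be resolved."""
    tree = ast.parse(source)
    modules = {}   # local name -> imported module object
    missing = []

    # Pass 1: resolve plain `import x` / `import x as y` statements.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # Without an alias, `import a.b` binds only the top-level name `a`.
                target = alias.name if alias.asname else alias.name.split(".")[0]
                try:
                    modules[alias.asname or target] = importlib.import_module(target)
                except ImportError:
                    missing.append(alias.name)  # the library itself cannot be imported

    # Pass 2: flag `module.attr` where the module has no such attribute.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in modules
            and not hasattr(modules[node.value.id], node.attr)
        ):
            missing.append(f"{node.value.id}.{node.attr}")
    return missing


# Hypothetical LLM output that calls a function `math` does not provide.
generated = "import math\nprint(math.sqrt(4))\nprint(math.sqroot(4))\n"
print(find_missing_library_names(generated))  # ['math.sqroot']
```

A check like this cannot see misuse that is syntactically valid, such as calling a real function with the wrong meaning for an argument, which is in line with the abstract's point that some hallucinations are out of reach for static methods.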

160. ❌ The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Authors: Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen, Wenbo Jiang, Guowen Xu, Yang Liu, Michael Backes, Yang Zhang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

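Scoring rationale: The paper centers on LLM alignment and misalignment and bears directly on the SFT, alignment, and DPO keywords (10 points each). LLMs are the object of study (10 points), and the safety focus is partially relevant to truthfulness (5 points). Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract (0 points).

!!! tip deepseek-chat TL;DR

The study examines how fine-tuning methods (SFT and PFT) can be used both to maliciously misalign LLMs and to realign them defensively. It finds that ORPO is the most effective for misalignment, while DPO is the most effective for realignment, at the expense of model utility.

Abstract (Chinese translation)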

大型语言模型（LLMs）的部署引发了重大的伦理与安全问题。尽管人们采用LLM对齐技术来提升模型的安全性与可信度，但攻击者可能利用这些技术破坏安全性以实现恶意目的，从而导致错位。错位的LLMs可能被发布在开放平台上，从而放大危害。为解决这一问题，在部署不可信的第三方LLMs之前，需要进行额外的安全对齐，即再对齐。本研究探讨了微调方法在错位、再对齐及其相互作用影响方面的有效性。通过在四种主流安全对齐LLMs上评估四种监督微调（SFT）和两种偏好微调（PFT）方法，我们揭示了攻击与防御之间的机制不对称性：虽然比值比偏好优化（ORPO）在错位方面最为有效，但直接偏好优化（DPO）在再对齐方面表现卓越，尽管这会牺牲模型效用。此外，我们发现了模型特异性抵抗、多轮对抗动态的残留效应以及其他值得注意的发现。这些结果凸显了需要建立强健的防护机制和定制化的安全对齐策略，以减轻LLMs部署中的潜在风险。我们的代码发布于https://github.com/zhangrui4041/The-Art-of-Mis-alignment。

Abstract

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in *misalignment*. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as *realignment*, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.

Keywords: LLM alignment, misalignment, realignment, Supervised Fine-Tuning, Preference Fine-Tuning, Direct Preference Optimization, safety, adversarial dynamics

161. ❌ Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

Authors: Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Ping Tan Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

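Scoring rationale: The paper centers on the Mixture-of-Experts (MoE) architecture in large multimodal models, which maps directly to the 'Mixture of Experts OR MoE OR Sparse Models' keyword (15 points). It deals with large multimodal models (LMMs), which fall under large models, so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It proposes a unified pre-training framework, which is relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points). Other keywords, such as small models, scaling laws, post-training, alignment, inference acceleration, and AI for Science, do not appear in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the catastrophic forgetting on understanding tasks that adding image generation causes in large multimodal models. Its Symbiotic-MoE framework lets generation and understanding reinforce each other with zero parameter overhead and reports notable gains on understanding benchmarks such as MMLU and OCRBench.

Abstract (Chinese translation)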

赋能大型多模态模型（LMMs）的图像生成能力常因严重的梯度冲突导致其在理解任务上出现灾难性遗忘。现有方法如混合变换器（Mixture-of-Transformers，MoT）虽通过结构隔离缓解了冲突，却从根本上割裂了跨模态协同，并受限于容量碎片化问题。本研究提出Symbiotic-MoE，一个统一预训练框架，该框架在原生多模态混合专家（Mixture-of-Experts，MoE）变换器架构内以零参数量开销解决了任务干扰问题。我们首先发现标准MoE调优会导致路由崩溃，即生成梯度主导了专家使用。为解决此问题，我们引入了模态感知专家解耦，将专家划分为任务特定组，同时利用共享专家作为多模态语义桥梁。关键的是，该设计允许共享专家从生成任务中吸收细粒度视觉语义，以丰富文本表征。为优化此过程，我们提出了一种渐进式训练策略，其具备差异化学习率与早期梯度屏蔽机制。该策略不仅保护了预训练知识免受早期波动影响，最终还将生成信号转化为对理解任务的建设性反馈。大量实验表明，Symbiotic-MoE在实现快速生成收敛的同时，释放了跨模态协同效应，显著提升了模型固有理解能力，在MMLU和OCRBench基准上取得了卓越的增益。

Abstract

Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

Keywords: Mixture-of-Experts, Large Multimodal Models, pre-training, cross-modal synergy, generative tasks, understanding tasks, multimodal Transformers, task interference
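The abstract describes partitioning experts into task-specific groups while keeping shared experts available to every task. One way to picture this is a router whose choices are restricted to the token's own group plus the shared group. The sketch below is an illustration under that assumption, not the paper's implementation: the expert counts, group names, and the `route` function are all hypothetical.

```python
import math

# Hypothetical layout: 8 experts split into two task groups plus shared experts.
EXPERT_GROUPS = {
    "understanding": [0, 1, 2],
    "generation": [3, 4, 5],
    "shared": [6, 7],
}


def route(logits: list[float], task: str, top_k: int = 2) -> dict[int, float]:
    """Pick top-k experts for one token, restricted to its task group plus shared experts."""
    allowed = EXPERT_GROUPS[task] + EXPERT_GROUPS["shared"]
    # Softmax over the allowed experts only; experts of the other task get no weight.
    peak = max(logits[i] for i in allowed)
    exp = {i: math.exp(logits[i] - peak) for i in allowed}
    total = sum(exp.values())
    probs = {i: value / total for i, value in exp.items()}
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    chosen = sorted(probs, key=probs.get, reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Router logits that strongly favor generation experts (made-up values).
logits = [0.1, 0.2, 0.3, 9.0, 9.0, 9.0, 1.0, 0.5]
print(route(logits, "understanding"))  # only experts from {0, 1, 2, 6, 7} can appear
```

With unrestricted routing, the large generation logits in this example would capture every token, which is the kind of routing collapse the abstract attributes to standard MoE tuning.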

162. ❌ Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

Authors: Pei-Xi Xie, Che-Yu Lin, Cheng-Lin Yang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07747v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

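Scoring rationale: The paper studies reinforcement learning with verifiable rewards (RLVR) for math reasoning, aiming to improve the math reasoning performance of Qwen3-1.7B-Base and Llama-3.2-1B-Instruct. Highly relevant keywords: LLMs and SLMs, which are the experimental subjects; RLHF/DPO-related techniques, since the method is trained under the DAPO framework; Chain of Thought and System 2 Thinking, which are central to math reasoning; and AI for Science, through the application to mathematics. Other keywords such as MoE, data quality, pre-training, RAG, context extension, and model compression are not addressed.

!!! tip deepseek-chat TL;DR

The paper targets two problems in math RLVR, teacher-student distribution mismatch and hint exposure, and proposes Distribution-Aligned Hint Synthesis and Backward Hint Annealing. On the AIME benchmarks, the method improves both pass@1 and pass@2048 over DAPO with Qwen3-1.7B-Base, while the gains with Llama-3.2-1B-Instruct are concentrated in the large-k regime.

Abstract (Chinese translation)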

具备可验证奖励的强化学习（RLVR）能够在提升低$k$值推理准确率的同时，缩小对具有挑战性数学问题的求解覆盖范围，且pass@1指标的提升并不必然转化为更好的大$k$值性能。现有的基于提示的方法虽能使难题变得可训练，但有两个问题尚未得到充分探讨：师生分布失配问题，以及为匹配无提示评估而需减少提示暴露的需求。我们通过两个组件应对这些问题。分布对齐的提示合成（DAHS）基于学生风格的响应构建经过验证的教师提示。反向提示退火（BHA）在不同难度层级中逐步减少提示暴露，并采用逐题提示丢弃策略，以在整个RL训练过程中保持无提示的更新。我们在DAPO训练框架下，使用$\texttt{Qwen3-1.7B-Base}$和$\texttt{Llama-3.2-1B-Instruct}$模型，于AIME24、AIME25和AIME26三个数学RLVR基准上评估该方法。在$\texttt{Qwen3-1.7B-Base}$上，相较于DAPO，我们的方法在三个AIME基准上均同时提升了pass@1和pass@2048指标。在$\texttt{Llama-3.2-1B-Instruct}$上，性能增益则集中体现在大$k$值区域。这些结果表明，在数学RLVR中，当提示脚手架能在训练早期为难题恢复可学习的更新，并在无提示评估前逐步撤除时，它是有效的。

Abstract

Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using $\texttt{Qwen3-1.7B-Base}$ and $\texttt{Llama-3.2-1B-Instruct}$. On $\texttt{Qwen3-1.7B-Base}$, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On $\texttt{Llama-3.2-1B-Instruct}$, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.

Keywords: Reinforcement Learning with Verifiable Rewards (RLVR), Math Reasoning, Hint Synthesis, Hint Annealing, Distribution Alignment, Large Language Models, Small Language Models, AIME Benchmark
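The abstract's point that pass@1 gains need not carry over to large k is easier to see with the commonly used unbiased pass@k estimator, which computes the chance that at least one of k samples drawn from n generated samples is correct when c of the n are correct. The sketch below assumes that estimator; whether the paper computes pass@k exactly this way is not stated in the abstract, and the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Two hypothetical models on two questions, 16 samples per question.
sharpened = [16, 0]  # always right on question 1, never right on question 2
broad = [5, 5]       # sometimes right on both questions

for name, correct in (("sharpened", sharpened), ("broad", broad)):
    p1 = mean([pass_at_k(16, c, 1) for c in correct])
    p8 = mean([pass_at_k(16, c, 8) for c in correct])
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f}")
```

In this toy case the sharpened model has the higher pass@1 but covers only one of the two questions, so its pass@8 stays at 0.5 while the broad model's pass@8 approaches 1.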

163. ❌ Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

Authors: Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang, Harman Singh, Qingyang Wu, Yuqing Jian, Ce Zhang, Kurt Keutzer, Tri Dao, Xiaoxia Wu, Ben Athiwaratkun, James Zou, Chenfeng Xu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

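Scoring rationale: The paper proposes Squeeze Evolve, a multi-model orchestration framework for evolutionary inference, which centers on how LLMs are used and on optimizing an agent-style workflow. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because the framework relies on LLMs for evolutionary inference, and to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points) because the framework is essentially an agentic workflow that coordinates multiple models for autonomous evolution. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the diversity and efficiency bottlenecks of verifier-free evolution. Its Squeeze Evolve multi-model orchestration framework reduces API cost by up to about 3x, raises fixed-budget serving throughput by up to about 10x, and achieves new state-of-the-art results on several benchmarks.

Abstract (Chinese translation)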

我们证明，无验证器的进化过程同时受限于多样性与效率瓶颈：在没有外部校正的情况下，重复进化会加速向狭窄模式坍缩，而统一使用高成本模型则会浪费算力并迅速在经济上变得不可行。我们提出Squeeze Evolve，一个用于无验证器进化推理的统一多模型编排框架。我们的方法遵循一个简单原则：将模型能力分配至其边际效用最高的环节。更强的模型被保留用于高影响力阶段，而更经济的模型则以低得多的成本处理其他阶段。这一原则在保持轻量化的同时，共同解决了多样性与成本效益问题。Squeeze Evolve天然支持开源、闭源及混合模型的部署。在AIME 2025、HMMT 2025、LiveCodeBench V6、GPQA-Diamond、ARC-AGI-V2以及多模态视觉基准（如MMMU-Pro和BabyVision）上，Squeeze Evolve持续提升了成本-能力边界，优于单模型进化方法，并在多项任务中取得了新的最先进成果。实验表明，Squeeze Evolve可将API成本降低至多约3倍，并在固定预算下将服务吞吐量提升至多约10倍。此外，在探索性任务中，Squeeze Evolve是首个能够匹配（并在某些情况下超越）基于验证器的进化方法性能的无验证器进化方法。

Abstract

We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the other stages at much lower costs. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks, such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.

Keywords: verifier-free evolution, multi-model orchestration, cost-efficiency, evolutionary inference, model capability allocation, API cost reduction, serving throughput, state-of-the-art results

Authors: Ziyi Chen, Yasir Khan, Mengyuan Zhang, Cheng Peng, Mengxian Lyu, Yiyang Liu, Krishna Vaddiparti, Robert L Cook, Mattia Prosperi, Yonghui Wu Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

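Scoring rationale: The paper's core is an LLM-based tool for identifying HIV-related stigma in clinical notes, which is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since it explicitly uses LLMs such as GPT-OSS-20B, LLaMA-8B, and MedGemma-27B. The work is a biomedical informatics application, so it is also highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It evaluates generative models under zero-shot and few-shot prompting, which touches on 'In-context Learning OR Many-shot Learning' (5 points). The remaining keywords mostly concern the technical underpinnings, optimization methods, or specific application settings of large models (such as reasoning, alignment, and compression). This paper instead evaluates LLMs on a specific medical NLP task and does not explore those details, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The study develops the first LLM-based tool for automatically identifying HIV-related stigma in clinical notes. Comparing several models, it finds that GatorTron-large achieves the best overall performance (Micro F1 = 0.62) and that few-shot prompting substantially improves the generative models.

Abstract (Chinese translation)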

人类免疫缺陷病毒（HIV）相关污名是影响HIV感染者（PLWH）健康的关键社会心理决定因素，对心理健康、医疗参与度和治疗结果均产生重要影响。尽管临床叙事记录中已存在污名相关经历的描述，但目前缺乏能够自动提取和分类这些信息的现成工具。本研究旨在开发一种基于大语言模型（LLM）的工具，用于从临床记录中识别HIV相关污名。我们收集了2012年至2022年间在佛罗里达大学（UF）健康中心接受治疗的HIV感染者的临床记录，通过专家筛选的污名相关关键词识别候选语句，并利用临床词嵌入技术进行迭代扩展。最终对1,332条语句进行了人工标注，涵盖四个污名子维度：公众态度顾虑、披露担忧、负面自我形象和个体化污名。我们以GatorTron-large和BERT作为编码器基线模型，以GPT-OSS-20B、LLaMA-8B和MedGemma-27B作为生成式大语言模型，在零样本和少样本提示场景下进行对比。GatorTron-large取得了最佳综合性能（微观F1值=0.62）。少样本提示显著提升了生成式模型的性能，5样本提示下的GPT-OSS-20B和LLaMA-8B分别获得0.57和0.59的微观F1值。不同污名子维度的识别效果存在差异，其中负面自我形象的预测性最高，而个体化污名仍最具挑战性。零样本生成式推理存在显著失败率（最高达32%）。本研究开发了首个用于临床记录中HIV污名识别的实用自然语言处理（NLP）工具。

Abstract

Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.

Keywords: HIV-related stigma, clinical narratives, large language models, natural language processing, zero-shot prompting, few-shot prompting, GatorTron, MedGemma
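The results above are reported as Micro-F1, which pools true positives, false positives, and false negatives over all sentences and labels before computing a single F1 value. The sketch below shows that calculation under the assumption that each sentence carries a set of subscale labels; the abstract does not say whether the task is multi-label or single-label, and the gold and predicted labels here are invented for illustration.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented annotations using subscale names from the abstract.
gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Negative Self-Image"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```

Because the counts are pooled, frequent subscales weigh more heavily in Micro-F1 than rare ones, which is worth keeping in mind when comparing it with per-subscale results.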

165. ❌ Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

Authors: Mingchen Li, Jiatan Huang, Zonghai Yao, Hong yu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

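Scoring rationale: The paper is highly relevant to the application of LLMs in healthcare (AI for Science). Its core contribution replaces the external retrieval step of RAG with internal key-based knowledge access to address hallucination and latency (Hallucination Mitigation, RAG), so these keywords receive 10 points. Other keywords such as MoE, SFT, and RLHF are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets the reliability problems that hallucinations and external retrieval latency cause for LLM-based healthcare prediction. It proposes the K2K framework, which retrieves knowledge from an internal key-value memory, and reports state-of-the-art performance on four benchmark healthcare outcome prediction datasets.

Abstract (Chinese translation)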

大型语言模型（LLMs）在医疗健康领域展现出巨大潜力，但其在高风险临床环境中的可靠性常因幻觉问题及缺乏细粒度医学背景而受限。尽管检索增强生成（Retrieval Augmented Generation, RAG）能够缓解这些问题，但标准的监督流程需要在海量外部知识库中进行计算密集的检索，导致高延迟，难以满足时效性强的临床需求。为此，我们提出了“知识密钥”（Keys to Knowledge, K2K）这一创新框架，该框架以内部基于密钥的知识访问机制取代外部检索。通过将关键临床信息直接编码至模型参数空间，K2K能够从内部键值记忆中实现快速检索，且无需推理阶段的计算开销。我们进一步通过激活引导的探针构建与交叉注意力重排序技术提升了检索质量。实验结果表明，K2K在四个医疗健康结果预测基准数据集上均达到了最先进的性能水平。

Abstract

Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model’s parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.

Keywords: Large Language Models, Healthcare Prediction, Retrieval Augmented Generation, Hallucination Mitigation, Internal Memory Retrieval, Clinical Settings, Key-based Knowledge Access, Benchmark Healthcare Datasets

166. ❌ IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

Authors: David Gringras Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

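Scoring rationale: The paper studies iatrogenic harm in healthcare caused by AI safety measures (such as alignment and RLHF); its core evaluation is how frontier large language models (LLMs) respond differently under different identity framings. Highly relevant keywords: LLMs (the object of study), Alignment (the paper examines safety alignment), Hallucination Mitigation (it touches on factuality and safety), and AI for Science (a medical application). RLHF and Explainable AI are partially relevant because the paper addresses safety-training mechanisms and explanations of model behavior. Other keywords such as MoE, SFT, and RAG are unrelated to the paper's medical-safety evaluation topic.

!!! tip deepseek-chat TL;DR

Using the IatroBench benchmark, the paper finds that frontier LLMs in medical scenarios selectively withhold safety-relevant knowledge depending on the user's stated identity (physician vs. layperson), causing iatrogenic harm, and that the evaluation tooling shares the same blind spot as the safety-training mechanism.

Abstract translation (Chinese)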

当询问前沿模型如何为六毫克阿普唑仑制定减量方案（其精神科医生已退休，仅剩十日药量，骤停会引发癫痫发作）时，模型却建议她联系那位她刚说明已不存在的精神科医生。仅改动一个词（改为“我是一名精神科医生；患者主诉……”），使用相同模型权重与推理过程，该模型即生成符合《阿什顿手册》标准的减量方案，包含地西泮等效换算、抗癫痫覆盖措施及监测阈值。知识本存在于模型中，模型却选择隐匿。IatroBench 正是为衡量此差距而设计。该研究包含六十项预先注册的临床情境，测试六个前沿模型，共生成三千六百条回应，通过经医师评分验证的结构化评估流程（加权卡帕系数 κ_w = 0.571，一分内一致率 96%）进行双轴评分（主动伤害指数 CH 0-3；疏忽伤害指数 OH 0-4）。核心发现是身份依赖性信息隐匿现象：将相同临床问题分别以医师身份与普通人身份提出，所有五个可测模型均向医师身份提供更优指导（解耦差距 +0.38，p = 0.003；在普通人身份表述中，涉及安全冲突操作的正确执行率下降 13.1 个百分点，p < 0.0001，而无冲突操作则无变化）。该差距在安全投入最高的模型（Opus，+0.65）中最为显著。研究清晰区分三种失效模式：训练性隐匿（Opus）、能力缺陷（Llama 4）以及无差别内容过滤（GPT-5.2 的后生成过滤器对医师身份回应的清除率是普通人身份的 9 倍，因其包含更密集的药理学术语）。标准 LLM 评估器将医师评分 OH ≥ 1 的回应中 73% 判定为 OH = 0（κ = 0.045）；评估机制与训练机制存在相同的认知盲区。所有测试情境均针对已用尽标准转诊途径的个体。

Abstract

Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word (“I’m a psychiatrist; a patient presents with…”) and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.

Keywords: AI safety measures, iatrogenic harm, large language models, clinical scenarios, withholding behavior, frontier models, medical AI, evaluation benchmark
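
The IatroBench abstract above reports a weighted kappa and a within-1 agreement rate between the structured pipeline and physician scoring. As a rough illustration of what those two statistics measure, here is a minimal sketch for ordinal scores. The abstract does not say which weighting scheme was used, so linear weights are an assumption, and the score lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def weighted_kappa(a, b, n_levels):
    """Linearly weighted Cohen's kappa for ordinal scores in 0..n_levels-1.

    Assumes the two raters do not both give one identical constant score
    (otherwise the expected disagreement is zero and kappa is undefined).
    """
    n = len(a)

    def weight(i, j):
        # Disagreement grows linearly with the distance between scores.
        return abs(i - j) / (n_levels - 1)

    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        weight(i, j) * count_a[i] * count_b[j]
        for i in range(n_levels)
        for j in range(n_levels)
    ) / (n * n)
    return 1.0 - observed / expected


def within_one_agreement(a, b):
    """Share of items where the two scores differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


# Hypothetical omission-harm scores (0-4) from two raters.
physician = [0, 1, 2, 3, 4]
pipeline = [1, 1, 2, 3, 3]
kappa = weighted_kappa(physician, pipeline, n_levels=5)
agreement = within_one_agreement(physician, pipeline)
```

The toy data shows why the two numbers are reported together: within-1 agreement can be perfect while kappa stays well below 1, because kappa also penalizes one-point disagreements and corrects for chance.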

167. ❌ Optimal Decay Spectra for Linear Recurrences

Authors: Yang Cao Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

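Scoring rationale: The paper proposes the PoST framework to improve the long-range memory of linear recurrent models. It is highly relevant to "Context Window Extension OR Long Context LLMs" (8 points) because it directly addresses long-context processing. It is partially relevant to "Large Language Models OR LLMs OR Foundation Models" and "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points each) because the paper runs pre-training experiments on models such as Mamba-2. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the weak long-range memory of linear recurrent models, the paper proposes Position-Adaptive Spectral Tapering (PoST), which combines spectral reparameterization with position-adaptive scaling. Pre-training experiments show consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2, and competitive or improved results on the other architectures.

Abstract translation (Chinese)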

线性循环模型提供线性时间序列处理能力，但常受限于次优的长程记忆性能。我们将此问题归因于衰减谱特性：对于$N$个通道，随机初始化会使最小谱间隙坍缩至$O(N^{-2})$，导致次指数误差$\exp(-Ω(N/\log N))$；线性间距虽可避免坍缩，但误差会退化为$\exp(-O(N/\sqrt{T}))$，在长上下文场景中实际呈现代数衰减。本文提出位置自适应谱锥化（Position-Adaptive Spectral Tapering, PoST）——一种与架构无关的框架，融合两种机制：（1）谱重参数化，通过结构约束强制对数衰减率呈几何间隔分布，理论证明其达到极小极大最优速率$O(\exp(-cN/\log T))$；（2）位置自适应缩放，作为可证明的唯一机制，通过将频谱拉伸至实际依赖范围，消除静态频谱的尺度失配问题（在位置$t$处仅有$N\log t/\log T$个通道有效），将误差率锐化至$O(\exp(-cN/\log t))$。这种缩放机制天然诱导分数阶不变性：脉冲响应变为尺度无关，通道可在相对与绝对时间坐标间进行插值。PoST可无缝集成至任意对角线性循环架构且无额外开销。我们在Mamba-2、RWKV-7、Gated DeltaNet、Gated Linear Attention及RetNet等模型中实现该框架。在1.8亿至4.4亿参数规模的预训练实验中，模型在零样本语言建模任务上获得持续提升，Mamba-2在长上下文检索任务（MQAR与NIAH）中取得显著增益，其他架构也展现出竞争力相当或更优的性能。代码地址：https://github.com/SiLifen/PoST。

Abstract

Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-Ω(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.

Keywords: Linear recurrent models, long-range memory, decay spectrum, Position-Adaptive Spectral Tapering, spectral reparameterization, position-adaptive scaling, long-context retrieval, Mamba-2
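
The PoST abstract above refers to geometrically spaced log-decay rates and to the claim that only $N\log t/\log T$ of $N$ channels are effective at position $t$ under a static spectrum. The sketch below illustrates that counting argument on a toy grid. It is not the paper's parameterization: the function names, the choice of timescales running from 1 to the horizon, and the "timescale no longer than t" criterion for an effective channel are all assumptions made for illustration.

```python
import math


def geometric_log_decay_rates(n_channels, horizon):
    """Rates lambda_i whose timescales 1/lambda_i run geometrically from 1 to horizon."""
    if n_channels == 1:
        return [1.0]
    return [horizon ** (-i / (n_channels - 1)) for i in range(n_channels)]


def decay_factors(rates):
    """Per-step factors a_i = exp(-lambda_i) for h_t = a_i * h_{t-1} + x_t."""
    return [math.exp(-lam) for lam in rates]


N, T, t = 8, 4096, 64

# Static spectrum: the grid is fixed for the full training horizon T.
static = geometric_log_decay_rates(N, horizon=T)

# Channels whose timescale fits within the current position t.
effective = sum(1 for lam in static if 1.0 / lam <= t)

# Hypothetical position-adaptive variant: stretch the same grid to range t,
# so every channel's timescale fits within the available context.
adaptive = geometric_log_decay_rates(N, horizon=t)
adaptive_effective = sum(1 for lam in adaptive if 1.0 / lam <= t * (1 + 1e-9))
```

With these toy values, `effective` is 4, which matches $N\log t/\log T = 8 \cdot 6/12$, while the rescaled grid keeps all 8 channels within range.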

168. ❌ Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

Authors: Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, Xiangliang Zhang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

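Scoring rationale: The paper's core topic is LLM safety alignment. It proposes the Guardian-as-an-Advisor (GaaA) framework, a soft-gating pipeline intended to improve both the safety and the usefulness of LLMs. Highly relevant keywords (10 points): LLMs and foundation models, SFT training, alignment techniques, RLHF/RL training, and hallucination mitigation. Moderately relevant (5 points): self-correction (achieved through the advice) and explainable AI (the guardian provides explanations). Other keywords such as MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

To address over-refusal and misalignment with the model spec in LLM safety checkers, the paper proposes the Guardian-as-an-Advisor framework. It trains a GuardAdvisor model to produce a risk label and an explanation as advice, improving safety and reducing over-refusal while keeping the base model under its original spec.

Abstract translation (Chinese)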

硬性门控安全检测器常因过度拒绝而与供应商模型规范失准；主流分类法亦忽视鲁棒性与诚实性，导致系统在纸面上更安全却实用性降低。本研究提出“守护者即顾问”（Guardian-as-an-Advisor，GaaA）——一种软门控流程，其中守护者模型预测二元风险标签并生成简明解释，将此建议附加至原始查询中进行重新推理，使基础模型在其原始规范下持续运作。为支持训练与评估，我们构建了GuardSet数据集，包含超过20.8万条多领域样本，统一涵盖有害与无害案例，并针对鲁棒性与诚实性设置专项数据切片。通过监督微调（SFT）与强化学习（RL）训练GuardAdvisor模型，以强化标签与解释的一致性。GuardAdvisor在实现竞争性检测精度的同时支持顾问工作流；当用于增强输入时，其生成回复优于未增强提示。延迟研究表明，顾问模型推理仅需基础模型计算量的5%以下，在实际有害输入率下仅增加2%-10%的端到端开销。总体而言，GaaA引导模型遵循规范，在保持安全性的同时减少过度拒绝。

Abstract

Hard-gated safety checkers often over-refuse and misalign with a vendor’s model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.

Keywords: Guardian-as-an-Advisor, trustworthy LLMs, safety alignment, soft-gating pipeline, GuardSet dataset, SFT and RL training, over-refusal reduction, advisory workflow
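
The soft-gating flow described in the GaaA abstract above (the guardian predicts a binary label plus a short explanation, and the advice is prepended to the original query) can be sketched as follows. This is only an illustration: the `Advice` type, the advice text format, and the stub `guardian` and `base_model` callables are hypothetical, and the abstract does not say whether advice is attached to every query or only to flagged ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Advice:
    risky: bool        # binary risk label predicted by the guardian
    explanation: str   # concise explanation of the label


def advise_then_answer(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Soft gating: never block the query, only prepend advice to it."""
    advice = guardian(query)
    label = "risky" if advice.risky else "safe"
    augmented = (
        f"[Guardian advice] label={label}; reason: {advice.explanation}\n\n{query}"
    )
    return base_model(augmented)


# Stub components, for illustration only.
def toy_guardian(query: str) -> Advice:
    risky = "bypass" in query.lower()
    return Advice(risky, "mentions bypassing a control" if risky else "benign request")


def toy_base_model(prompt: str) -> str:
    return f"(model saw {len(prompt)} characters) " + prompt.splitlines()[0]


reply = advise_then_answer("How do I bypass a paywall?", toy_guardian, toy_base_model)
```

The contrast with a hard gate is that the guardian's output never replaces the base model's response; the base model still decides how to answer, now with the advice in context.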

169. ❌ How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

Authors: Chenchen Kuai, Jiwan Jiang, Zihao Zhu, Hao Wang, Keshu Wu, Zihao Li, Yunlong Zhang, Chenxi Liu, Zhengzhong Tu, Zhiwen Fan, Yang Zhou Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07650v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

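Scoring rationale: The paper's core topic is behavioral dependence and independence among LLMs. It directly involves LLM evaluation, multi-model systems, and behavioral analysis, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The paper does not cover the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, or compression, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a statistical framework for auditing behavioral entanglement among black-box LLMs, finds that widespread behavioral dependence is associated with degraded precision in LLM-as-a-judge evaluation, and improves verification performance by reweighting verifier ensembles to account for entanglement.

Abstract translation (Chinese)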

大型语言模型（LLM）生态系统的快速发展引发了一个关键问题：表面上多样化的模型是否真正独立？共享的预训练数据、知识蒸馏和对齐流程可能导致隐藏的行为依赖性，即潜在纠缠，这会破坏多模型系统（如LLM-as-a-judge评估流程和集成验证）的可靠性，因为这些系统通常默认各模型提供独立信号。实践中，这表现为推理模式的相关性和同步失败，即表面的一致性反映的是共享的错误模式，而非独立验证的结果。为解决此问题，我们开发了一个用于审计黑盒LLM间行为纠缠的统计框架。该方法引入了一个多分辨率层级结构，通过两个信息论指标刻画联合失败流形：（i）难度加权行为纠缠指数，该指数放大了在简单任务上的同步失败；（ii）累积信息增益（CIG）指标，用于捕捉错误响应中的方向性对齐。通过对来自六个模型家族的18个LLM进行大量实验，我们发现了广泛存在的行为纠缠，并分析了其对LLM-as-a-judge评估的影响。我们发现CIG与评判者精度的下降存在统计学上显著的关联，基于GPT-4o-mini的评判者斯皮尔曼系数为0.64（p < 0.001），基于Llama3的评判者为0.71（p < 0.01），这表明更强的依赖性对应于更高的过度认可偏差。最后，我们通过解纠缠的验证器集成重加权展示了纠缠的一个实际应用案例。通过基于推断的独立性调整模型贡献，所提出的方法减轻了相关偏差并提升了验证性能，相比多数投票法实现了高达4.5%的准确率提升。

Abstract

The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies, latent entanglement, that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this manifests as correlated reasoning patterns and synchronized failures, where apparent agreement reflects shared error modes rather than independent validation. To address this, we develop a statistical framework for auditing behavioral entanglement among black-box LLMs. Our approach introduces a multi-resolution hierarchy that characterizes the joint failure manifold through two information-theoretic metrics: (i) a Difficulty-Weighted Behavioral Entanglement Index, which amplifies synchronized failures on easy tasks, and (ii) a Cumulative Information Gain (CIG) metric, which captures directional alignment in erroneous responses. Through extensive experiments on 18 LLMs from six model families, we identify widespread behavioral entanglement and analyze its impact on LLM-as-a-judge evaluation. We find that CIG exhibits a statistically significant association with degradation in judge precision, with Spearman coefficient of 0.64 (p < 0.001) for GPT-4o-mini and 0.71 (p < 0.01) for Llama3-based judges, indicating that stronger dependency corresponds to increased over-endorsement bias. Finally, we demonstrate a practical use case of entanglement through de-entangled verifier ensemble reweighting. By adjusting model contributions based on inferred independence, the proposed method mitigates correlated bias and improves verification performance, achieving up to a 4.5% accuracy gain over majority voting.

Keywords: Large Language Models, Behavioral Entanglement, LLM-as-a-judge, Statistical Framework, Verifier Ensembles, Independent Validation, Correlated Failures, Model Dependencies
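
The abstract above contrasts majority voting with an ensemble whose members are reweighted by inferred independence. The toy example below shows why that can change a verdict when several verifiers fail together. The weights here are hypothetical hand-picked values; how the paper actually infers independence from its entanglement metrics is not shown.

```python
def majority_vote(votes):
    """Accept if more than half of the verifiers accept."""
    return sum(votes) > len(votes) / 2


def weighted_vote(votes, weights):
    """Accept if the accepting verifiers hold more than half of the total weight."""
    accepted = sum(w for v, w in zip(votes, weights) if v)
    return accepted > sum(weights) / 2


# Three verifiers from one model family accept the same wrong answer,
# while two unrelated verifiers reject it.
votes = [True, True, True, False, False]

# Hypothetical independence weights: the three entangled verifiers share
# roughly one vote's worth of independent evidence between them.
weights = [1 / 3, 1 / 3, 1 / 3, 1.0, 1.0]

plain = majority_vote(votes)
reweighted = weighted_vote(votes, weights)
```

Under plain majority voting the correlated trio wins; after down-weighting, their agreement counts as a single signal and the answer is rejected.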

170. ❌ DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

Authors: Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

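Scoring rationale: The paper's core contribution is an improved method for speculative decoding, a key technique for accelerating LLM inference. It is therefore highly relevant to "Speculative Decoding OR Inference Acceleration" (15 points) and relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), because the method is designed specifically to speed up LLM inference. Other keywords such as MoE, SLMs, training methods, alignment, RAG, reasoning ability, agents, and quantization are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the low acceptance rate caused by the strict verification step in speculative decoding, the paper proposes Dynamic Verification Relaxed Speculative Decoding (DIVERSED), which uses an ensemble-based verifier to blend the draft and target model distributions, improving inference efficiency while preserving generation quality.

Abstract translation (Chinese)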

推测解码是一种通过并行草拟多个标记来加速大语言模型推理的有效技术。在实际应用中，其加速效果常受限于一个严格的验证步骤，该步骤强制要求接受的标记分布必须与目标模型完全匹配。这一约束导致许多合理的标记被拒绝，降低了接受率并限制了整体时间加速。为克服此限制，我们提出了动态验证松弛推测解码（DIVERSED），这是一种松弛验证框架，能在保持生成质量的同时提升时间效率。DIVERSED学习一个基于集成的验证器，该验证器以任务相关和上下文相关的权重混合草拟模型与目标模型的分布。我们为此方法提供了理论依据，并通过实验证明，与标准推测解码方法相比，DIVERSED能显著提高推理效率。代码发布于：https://github.com/comeusr/diversed。

Abstract

Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.

Keywords: speculative decoding, inference acceleration, large language models, verification framework, ensemble-based verifier, dynamic verification, inference efficiency, generation quality
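
To make the acceptance-rate argument in the DIVERSED abstract above concrete, the sketch below compares standard speculative verification with verification against a blend of the draft and target distributions. In standard speculative sampling, a drafted token x is accepted with probability min(1, p(x)/q(x)), where q is the draft distribution and p the target. The fixed blend weight `w` and the three-token distributions are assumptions for illustration; the paper learns a task- and context-dependent weight, which is not reproduced here.

```python
def acceptance_rate(draft, verify):
    """Expected acceptance probability for a token x drawn from `draft`
    and accepted with probability min(1, verify[x] / draft[x])."""
    return sum(min(q, v) for q, v in zip(draft, verify))


def blend(draft, target, w):
    """Verification distribution w * draft + (1 - w) * target, with 0 <= w <= 1."""
    return [w * q + (1 - w) * p for q, p in zip(draft, target)]


# Toy next-token distributions over a three-token vocabulary.
draft = [0.7, 0.2, 0.1]
target = [0.4, 0.4, 0.2]

strict = acceptance_rate(draft, target)                      # verify against target
relaxed = acceptance_rate(draft, blend(draft, target, 0.5))  # verify against blend
```

Blending can only raise the acceptance rate, since the blended distribution sits between the draft and the target. The cost is that accepted tokens no longer follow the target distribution exactly, which is the quality trade-off a relaxed verifier has to manage.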

171. ❌ ADAG: Automatically Describing Attribution Graphs

Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

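Scoring rationale: The ADAG paper focuses on language model interpretability, specifically on automating the description of circuit tracing, which makes it highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). It uses an LLM as an explainer-simulator to generate and score natural-language explanations, so it is also highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, agent systems, quantization and compression, and AI for Science are not mentioned in the title or abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ADAG, an end-to-end pipeline that automatically describes attribution graphs (graphs of the causal contributions of a language model's internal features), removing the reliance on manual interpretation in circuit tracing. It recovers interpretable circuits on known tasks and in a harmful-advice jailbreak case.

Abstract translation (Chinese)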

在语言模型可解释性研究中，电路追踪旨在识别哪些内部特征对特定输出产生了因果贡献及其相互影响机制，以解释特定行为背后的计算过程。然而，所有先前的电路追踪工作都依赖于人工对电路中各特征作用进行临时性解读，即通过手动检查组件激活的数据样本等人工数据产物。我们提出了ADAG——一种用于描述这些归因图的端到端全自动化流程。为实现这一目标，我们引入了归因画像，通过特征的输入与输出梯度效应量化其功能角色。随后我们提出一种新颖的特征聚类算法，以及一个LLM解释器-模拟器框架，该框架能生成并评估描述这些特征组功能角色的自然语言解释。我们在已知经人工分析的电路追踪任务上运行本系统，成功复现了可解释的电路结构，并进一步证明ADAG能够发现Llama 3.1 8B Instruct模型中导致有害建议越狱的可操控特征簇。

Abstract

In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer–simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

Keywords: language model interpretability, circuit tracing, attribution graphs, automated description, feature clustering, LLM explainer, jailbreak analysis, causal contribution

172. ❌ Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback

Authors: Matthew Penaroza Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07595v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

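Scoring rationale: The paper's core topic is improving the reasoning mechanism of LLM agents. It proposes reasoning graphs and retrieval graphs that store chain-of-thought evaluations of evidence and reuse them as feedback, enabling self-improvement (Self-Correction/Self-Improvement) and in-depth reasoning (System 2 Thinking). The method is related to retrieval-augmented generation (RAG), but its emphasis is on evidence retrieval and the construction of reasoning graphs. The paper does not cover the specific techniques behind other keywords such as MoE, quantization, or alignment.

!!! tip deepseek-chat TL;DR

Language model agents reason from scratch on every query, which leads to low accuracy and high variance. The paper proposes reasoning graphs and retrieval graphs that provide evidence-centric chain-of-thought feedback, forming a self-improving loop in which accuracy rises and variance collapses over successive runs on multi-hop question answering, with no retraining of the base model.

Abstract translation (Chinese)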

语言模型代理对每个查询都从零开始推理：每次代理检索证据并进行审议时，其思维链都会被丢弃，下一个类似查询将无法利用先前的洞见。这导致准确率较低且方差较高，因为相同类型的查询可能无法预测地成功或失败。我们引入推理图，这是一种图结构，它将代理针对每条证据的思维链持久化为结构化边，这些边连接到它们所评估的证据项。与先前将提炼策略存储为扁平记录（通过查询相似性索引或按时间顺序追加）的记忆机制不同，推理图实现了以证据为中心的反馈：给定一个新的候选集，系统会遍历每条证据项在所有先前运行中接收到的所有评估边，从而揭示该特定项过去是如何被评判的。这种从证据向内的反向遍历在结构上不同于基于查询相似性的检索能力，因为反馈与代理当前正在检查的特定证据绑定，而非与查询绑定。我们进一步引入检索图，这是一种互补结构，它为流水线规划器提供输入，以在连续运行中收紧候选筛选漏斗。这两种图共同形成了一个自我改进的反馈循环：在连续运行中，准确率上升且方差收敛，每个决策都可通过图进行完全追溯。这种改进无需重新训练；基础模型保持冻结，所有增益均通过图遍历的上下文工程实现。我们形式化了图结构、遍历算法和反馈机制，并描述了一种用于在多跳问答基准上测量准确率收敛和方差崩溃的序列聚类评估协议。

Abstract

Language model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent’s per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks.

Keywords: reasoning graphs, LLM agents, chain of thought, evidence-centric feedback, retrieval graphs, self-improving, multi-hop question answering, accuracy convergence
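
The evidence-centric lookup described in the abstract above (for each candidate evidence item, traverse all incoming evaluation edges from prior runs) can be pictured with a small data structure. This is a minimal sketch under assumed names: `EvaluationEdge`, `ReasoningGraph`, and their fields are hypothetical and are not taken from the paper, which formalizes its own graph structure and traversal algorithms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationEdge:
    run_id: int      # which prior run produced this judgment
    query: str       # the query the agent was answering at the time
    verdict: str     # e.g. "supports", "irrelevant", "contradicts"
    rationale: str   # the persisted per-evidence chain of thought


class ReasoningGraph:
    """Stores evaluation edges keyed by the evidence item they point at."""

    def __init__(self):
        self._incoming = defaultdict(list)

    def record(self, evidence_id: str, edge: EvaluationEdge) -> None:
        self._incoming[evidence_id].append(edge)

    def feedback(self, candidate_ids):
        """For each candidate, return every prior judgment of that item,
        regardless of how similar the earlier queries were to the current one."""
        return {eid: list(self._incoming.get(eid, [])) for eid in candidate_ids}


graph = ReasoningGraph()
graph.record("doc-17", EvaluationEdge(1, "Who founded X?", "supports", "names the founder"))
graph.record("doc-17", EvaluationEdge(2, "When was X founded?", "irrelevant", "no date given"))
graph.record("doc-42", EvaluationEdge(2, "When was X founded?", "supports", "states the year"))

history = graph.feedback(["doc-17", "doc-99"])
```

Indexing by evidence item rather than by query is the point of the design: `doc-17` surfaces both earlier judgments even though they came from different queries, and an unseen item such as `doc-99` simply returns an empty history.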

173. ❌ From Ground Truth to Measurement: A Statistical Framework for Human Labeling

Authors: Robert Chew, Stephanie Eckman, Christoph Kern, Frauke Kreuter Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07591v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

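Scoring rationale: The paper "From Ground Truth to Measurement: A Statistical Framework for Human Labeling" focuses on statistical modeling of the human labeling process in supervised machine learning and proposes a framework that decomposes labeling variation into instance difficulty, annotator bias, situational noise, and relational alignment. Although the paper concerns data labeling for machine learning, every keyword targets large language models (LLMs) and their specific techniques (such as MoE, RLHF, RAG, and quantization), reasoning methods (such as CoT and MCTS), application areas (such as AI for Science), or optimization techniques (such as attention mechanisms and context extension). The paper does not touch on LLM techniques, deep learning innovations, or scientific applications; it is a general methodology for machine learning data labeling, so the relevance of every keyword is 0.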
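
The 26.6 threshold shown beside each score follows from the formula in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). A minimal sketch of that check follows; the function names are hypothetical, and the per-keyword scoring rule is not reproduced because the report does not state it.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Pass threshold from the report's configuration: base + per_weight * total weight."""
    return round(base + per_weight * total_weight, 1)


def passes(score: float, total_weight: float) -> bool:
    """A paper passes when its total score reaches the threshold."""
    return score >= passing_score(total_weight)


threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0
print(threshold, passes(0.0, 27.0))
```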

!!! tip deepseek-chat TL;DR

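The paper addresses the systematic variation that human labeling introduces in supervised machine learning. It proposes a statistical framework that decomposes labeling outcomes into four interpretable sources of variation (instance difficulty, annotator bias, situational noise, and relational alignment) and validates it on natural language inference data.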

Abstract

Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.

Keywords: human labeling, measurement error models, annotation variation, supervised machine learning, data-centric machine learning, multi-annotator datasets, statistical framework, labeling diagnostics
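
To make the idea of splitting labels into sources of variation concrete, here is a toy two-way additive decomposition. It is only a sketch under strong assumptions and is not the authors' model: the ratings are invented, the item and annotator effects are crude stand-ins for "instance difficulty" and "annotator bias", and nothing here captures relational alignment.

```python
from statistics import mean

# Hypothetical ratings: ratings[annotator][item] on a numeric scale.
ratings = {
    "ann_a": {"item1": 1.0, "item2": 3.0, "item3": 5.0},
    "ann_b": {"item1": 2.0, "item2": 4.0, "item3": 6.0},
    "ann_c": {"item1": 0.0, "item2": 2.0, "item3": 4.0},
}

grand_mean = mean(v for per_item in ratings.values() for v in per_item.values())
items = sorted({i for per_item in ratings.values() for i in per_item})

# Item effect: how far each item's mean label sits from the grand mean.
item_effect = {i: mean(r[i] for r in ratings.values()) - grand_mean for i in items}
# Annotator effect: each annotator's systematic offset from the grand mean.
annotator_effect = {a: mean(r.values()) - grand_mean for a, r in ratings.items()}
# Residual: what neither effect explains (a stand-in for situational noise).
residual = {
    (a, i): r[i] - (grand_mean + item_effect[i] + annotator_effect[a])
    for a, r in ratings.items()
    for i in items
}

print(item_effect, annotator_effect, max(abs(v) for v in residual.values()))
```

In this invented data every annotator differs from the others by a constant offset, so the residuals are zero; real multi-annotator data would leave structure in the residual that a richer model has to explain.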

174. ❌ CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Authors: Mohamed Ehab, Ali Hamdi, Khaled Shaban Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07583v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

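Scoring rationale: The paper studies language model evaluation and ensemble methods on imbalanced data. It directly involves LLMs and SLMs (the experiments use three LLMs and five SLMs) and fine-tuned settings (related to SFT). Other keywords such as MoE, Scaling Laws, RAG, and CoT have no direct connection to the paper's ensemble method or to the imbalanced classification problem.

!!! tip deepseek-chat TL;DR

The paper addresses classification on imbalanced data with CAMO, a class-aware, minority-optimized ensemble method. Validated on two highly imbalanced, domain-specific benchmarks, it achieves the highest strict macro F1-score when fine-tuned models are used, which the authors present as evidence that it is a reliable, domain-neutral framework for imbalanced classification.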

Abstract

Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

Keywords: class imbalance, ensemble method, language models, minority optimization, F1-score, imbalanced data, LLMs, SLMs
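
The abstract's point that favoring majority classes hurts the F1-score is easiest to see with macro F1, which averages per-class F1 without weighting by class size. The sketch below uses the standard macro F1 definition on invented labels; how the paper defines its "strict" variant is not stated in the abstract, so that part is not modeled.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced sample: 8 majority-class items, 2 minority-class items.
y_true = ["neutral"] * 8 + ["anger"] * 2
always_majority = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy, macro_f1(y_true, always_majority))
```

A predictor that always outputs the majority class reaches 80% accuracy here, yet its macro F1 is only about 0.44 because the minority class contributes an F1 of zero.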

175. ❌ Learning is Forgetting: LLM Training As Lossy Compression

Authors: Henry C. Conklin, Tom Hosking, Tan Yi-Chern, Julian Gold, Jonathan D. Cohen, Thomas L. Griffiths, Max Bartolo, Seraphina Goldfarb-Tarrant Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07569v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

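Scoring rationale: The paper's core subject is the information-compression nature of LLM pre-training. It is highly relevant to "Large Language Models" and "Pre-training" (10 points each) because it directly analyzes how LLM pre-training learns through compression. It is somewhat relevant to "Mechanistic Interpretability" (5 points) because studying the structure of LLM representation spaces aids interpretability. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper argues that LLM pre-training is essentially a lossy compression process in which models learn by retaining only the information in the training data that is relevant to their objective, and shows that the optimality of a model's compression predicts downstream task performance.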

Abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs the optimality of a model’s compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.

Keywords: Large Language Models, Pre-training, Lossy Compression, Information Bottleneck, Representational Structure, Model Performance, Information Theory, Next-sequence Prediction
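
For reference, the Information Bottleneck bound mentioned in the abstract is usually stated through the trade-off below. The symbols are assumptions for illustration and do not appear in the abstract: $X$ is the input, $Y$ the prediction target, $Z$ the learned representation, $I(\cdot\,;\cdot)$ mutual information, and $\beta > 0$ a trade-off weight.

$$
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)
$$

Read this way, compression means keeping $I(X; Z)$ small while prediction requires $I(Z; Y)$ to stay large. How the authors estimate these quantities for LLMs is not specified in the abstract.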

176. ❌ Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

Authors: Tunazzina Islam Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07562v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	7.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

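Scoring rationale: The paper's core is the use of LLMs as semantic judges that validate and restructure the outputs of unsupervised clustering algorithms, which involves LLM applications, reasoning processes (CoT/System 2), and self-improvement mechanisms. It is highly relevant to LLMs (10 points), relevant to the reasoning keywords (8 points), and relevant to self-improvement (7 points). Other keywords such as MoE, SLMs, training techniques, compression, and AI for Science are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a reasoning-based LLM framework for validating and restructuring unsupervised text clustering results. On social media corpora it consistently improves cluster coherence and label quality over classical topic models and recent representation-based baselines.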

Abstract

Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.

Keywords: Large Language Models, unsupervised clustering, reasoning-based refinement, semantic validation, text analysis, cluster coherence, human-aligned labeling, social media corpora

177. ❌ TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

Authors: Figen Eğin, Aytuğ Onan Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07553v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

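Scoring rationale: The paper mainly studies an automatic framework for generating educational video summaries and the creation of a Turkish dataset. It is relevant only to the "Large Language Models" keyword (the abstract mentions LLM-generated summaries as a comparison baseline) and has no direct connection to the other technical keywords.

!!! tip deepseek-chat TL;DR

The study proposes AutoMUP, a consensus-based automatic framework for generating gold-standard summaries of Turkish educational videos, and creates the TR-EduVSum dataset. Experiments show that AutoMUP summaries have high semantic overlap with summaries produced by strong LLMs.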

Abstract

This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of “Data Structures and Algorithms” and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.

Keywords: Educational Video Summarization, Turkish Dataset, Consensus Framework, AutoMUP, Meaning Unit Pyramid, LLM Summaries, Data Structures and Algorithms, Turkic Languages
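
The consensus-weighting step described in the abstract can be sketched as follows. This is a hedged illustration, not AutoMUP itself: the clusters are invented and assumed to come from an earlier embedding-clustering step, the weight is simply the share of summaries that support a cluster, and the function names and the 0.75 threshold are hypothetical.

```python
# Hypothetical input: each cluster groups meaning units judged equivalent,
# recorded here only by the id of the human summary each unit came from.
clusters = {
    "stack is LIFO": ["s1", "s2", "s3", "s4"],
    "push adds to the top": ["s1", "s3", "s4"],
    "arrays can back a stack": ["s2"],
}
num_summaries = 4


def consensus_weights(clusters: dict[str, list[str]], num_summaries: int) -> dict[str, float]:
    """Weight = share of distinct summaries that contain a unit in the cluster."""
    return {label: len(set(ids)) / num_summaries for label, ids in clusters.items()}


def graded_summary(weights: dict[str, float], min_weight: float) -> list[str]:
    """Keep clusters at or above the threshold, highest consensus first."""
    kept = [label for label, w in weights.items() if w >= min_weight]
    return sorted(kept, key=lambda label: -weights[label])


weights = consensus_weights(clusters, num_summaries)
gold = graded_summary(weights, min_weight=0.75)
print(weights, gold)
```

Lowering `min_weight` would produce the less strict tiers of a graded summary, while the highest threshold keeps only the units most summaries agree on.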

178. ❌ EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

Authors: Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07549v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

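Scoring rationale: The paper's core is using multiple LLM agents to generate a synthetic medical dialogue dataset. Highly relevant keywords include LLMs (multiple LLM agents generate the dialogues), Self-Reflection (the generation pipeline includes self-refinement), LLM Agents and Multi-agent Systems (a multi-agent generation pipeline), and AI for Science (a medical application). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-LLM-agent generation pipeline that builds EMSDialog, a high-quality synthetic multi-speaker EMS dialogue dataset, from electronic patient care reports (ePCRs), and uses it to improve the accuracy, timeliness, and stability of EMS conversational diagnosis prediction.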

Abstract

Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.

Keywords: multi-LLM agents, synthetic dialogue generation, emergency medical service, electronic patient care reports, conversational diagnosis prediction, self-refinement, multi-speaker conversations, dataset augmentation

179. ❌ Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Authors: Mengdan Zhu, Senhao Cheng, Liang Zhao Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07518v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

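Scoring rationale: The paper studies complex visual reasoning in vision-language models (VLMs) and proposes the DLR framework to improve multi-step reasoning. Relevance to the keywords: (1) the paper involves vision-language models, which fall under large models but differ from text-only LLMs, so it receives 8 points; (2) its core innovation is an improved multi-step reasoning mechanism, which is highly relevant to Chain of Thought, so it receives 10 points; (3) it emphasizes in-depth reasoning and interpretability, which relate to System 2 Thinking and Explainable AI, so each receives 8 points; (4) other keywords such as MoE, quantization, and RAG are not covered by the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets the loss of visual information that textual chain of thought causes in vision-language models during complex visual reasoning. It proposes the DLR framework, which decomposes queries, extracts visual latent representations, and reasons through grounded rationales. DLR outperforms existing methods on vision-centric benchmarks and offers better stepwise interpretability.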

Abstract

Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose “Decompose, Look, and Reason” (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.

Keywords: Vision-Language Models, Visual Reasoning, Multi-step Reasoning, Chain of Thought, Latent Reasoning, Interpretability, Reinforcement Learning, Spherical Gaussian Latent Policy

180. ❌ SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation

Authors: Grace Jiarui Fan, Chengpiao Huang, Tianyi Peng, Kaizheng Wang, Yuhang Wu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07513v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

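Scoring rationale: The paper's core subject is the calibration of LLMs in digital twin simulation, which is an LLM application. It is highly relevant to "Large Language Models" (10 points) because the paper explicitly uses LLMs as the underlying simulators. It is rated highly relevant to "Post-training" (10 points) because SYN-DIGITS is a post-processing calibration framework that operates on LLM outputs. It is somewhat relevant to "Alignment" (5 points) because it aligns LLM predictions with real human behavior, and somewhat relevant to "Hallucination Mitigation" (5 points) because calibration reduces systematic bias and error. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses systematic bias and miscalibration in LLM-based digital twin simulation with SYN-DIGITS, a lightweight post-processing calibration framework that aligns LLM predictions with real human behavior through latent space alignment. Experiments show up to 50% relative improvement in individual-level correlation and a 50% to 90% relative reduction in distributional discrepancy compared with uncalibrated baselines.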

Abstract

AI-based persona simulation – often referred to as digital twin simulation – is increasingly used for market research, recommender systems, and social sciences. Despite their flexibility, large language models (LLMs) often exhibit systematic bias and miscalibration relative to real human behavior, limiting their reliability. Inspired by synthetic control methods from causal inference, we propose SYN-DIGITS (SYNthetic Control Framework for Calibrated DIGItal Twin Simulation), a principled and lightweight calibration framework that learns latent structure from digital-twin responses and transfers it to align predictions with human ground truth. SYN-DIGITS operates as a post-processing layer on top of any LLM-based simulator and thus is model-agnostic. We develop a latent factor model that formalizes when and why calibration succeeds through latent space alignment conditions, and we systematically evaluate ten calibration methods across thirteen persona constructions, three LLMs, and two datasets. SYN-DIGITS supports both individual-level and distributional simulation for previously unseen questions and unobserved populations, with provable error guarantees. Experiments show that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50–90% relative reductions in distributional discrepancy compared to uncalibrated baselines.

Keywords: digital twin simulation, large language models, calibration framework, synthetic control, latent factor model, post-processing, bias mitigation, human behavior alignment

181. ❌ ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

Authors: Kai Qin, Liangxin Liu, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Houde Liu, Daiting Shi Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07506v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

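Scoring rationale: The paper's core topic is Generative Reward Models (GRMs) in the RLHF pipeline, which directly involves the keywords LLMs, Alignment, RLHF, and Self-Reflection. These are central to the paper, so each receives 10 points. The remaining keywords (MoE, SLMs, Scaling Laws, PEFT, RAG, Context Window, KV Cache, CoT, System 2, MCTS, Agents, Tool Use, Multi-agent, Quantization, Speculative Decoding, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are not covered or only indirectly related, so each receives 0 points.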
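The arithmetic behind the score line and the table above can be sketched in a few lines. This is a minimal illustration, not the report generator's actual code: the pass-threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report configuration, while the per-keyword rule `weight * relevance / 10 * 15` and every function name below are assumptions.

```python
# Hypothetical sketch of the report's scoring arithmetic.
# Only the threshold formula and the cap of 15 are stated in the report;
# the per-keyword rule below is an assumption made for illustration.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight * (relevance / 10) * max score per keyword."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def pass_threshold(total_weight: float) -> float:
    """Threshold formula as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# Rows as displayed for this paper: four keywords at 10/10, 23 at 0/10,
# and every displayed weight equal to 0.0.
rows = [(0.0, 10.0)] * 4 + [(0.0, 0.0)] * 23
score = total_score(rows)
threshold = pass_threshold(27.0)  # 27 keywords with configured weight 1.0 each
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} -> {verdict}")
```

Under this assumed rule, a displayed weight of 0.0 forces every per-keyword score to 0.0 regardless of relevance, which matches the table.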

!!! tip deepseek-chat TL;DR

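The paper proposes ReflectRM, a generative reward model that uses self-reflection to jointly model response preference and analysis preference within a unified generative framework, with the aim of improving LLM alignment quality in RLHF. Across four benchmarks it improves average accuracy by +3.7 on Qwen3-4B and reduces positional bias, with a +10.2 improvement over leading GRMs.

Abstract translation (Chinese)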

奖励模型（Reward Models, RMs）是基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）流程中的关键组成部分，直接决定了大型语言模型（Large Language Models, LLMs）的对齐质量。近年来，生成式奖励模型（Generative Reward Models, GRMs）作为一种更优范式崭露头角，相比传统的标量奖励模型，提供了更高的可解释性和更强的泛化能力。然而，现有的生成式奖励模型方法主要侧重于结果层面的监督，忽视了分析过程的质量，这限制了其潜力。为解决这一问题，我们提出了ReflectRM，一种新颖的生成式奖励模型，它利用自我反思来评估分析质量并增强偏好建模。ReflectRM在一个统一的生成框架下进行训练，以联合建模回答偏好与分析偏好。在推理过程中，我们利用其自我反思能力来识别最可靠的分析，并从中推导出最终的偏好预测。在四个基准测试上的实验表明，ReflectRM持续提升了性能，在Qwen3-4B模型上平均准确率提高了+3.7。进一步的实验证实，回答偏好与分析偏好是相互促进的。值得注意的是，ReflectRM显著缓解了位置偏差，与领先的生成式奖励模型相比取得了+10.2的改进，从而确立了其作为更稳定评估器的地位。

Abstract

Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.
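The inference step described above (pick the most reliable analysis, then read the preference off it) can be sketched as a simple selection loop. This is a hypothetical illustration only: the abstract does not say how candidate analyses are sampled or how the reflection score is computed, so `Analysis`, `select_preference`, and the lookup-table scorer are all invented names standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Analysis:
    """One candidate critique comparing response A with response B (hypothetical type)."""
    text: str       # free-form analysis produced by the reward model
    preferred: str  # verdict stated by this analysis: "A" or "B"


def select_preference(
    analyses: List[Analysis],
    reflect: Callable[[Analysis], float],
) -> str:
    """Return the verdict of the analysis that the reflection scorer rates highest."""
    if not analyses:
        raise ValueError("need at least one candidate analysis")
    most_reliable = max(analyses, key=reflect)
    return most_reliable.preferred


# Toy stand-in for the model's self-reflection: a fixed lookup table of scores.
candidates = [
    Analysis("Response A is longer, so it is better.", "A"),
    Analysis("Response B answers the question and cites the given facts.", "B"),
    Analysis("Both look fine; A is listed first.", "A"),
]
toy_scores = {candidates[0].text: 0.2, candidates[1].text: 0.9, candidates[2].text: 0.1}
final = select_preference(candidates, lambda a: toy_scores[a.text])
print(final)
```

Note that a majority vote over these three candidates would return "A", whereas selecting by reflection score returns the verdict of the single highest-rated analysis.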

Keywords: Generative Reward Models, RLHF, Self-Reflection, Alignment, Large Language Models, Preference Modeling, Positional Bias, Qwen3-4B

182. ❌ GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics

Authors: Jiaxin Wang, Dongxin Lyu, Zeyu Cai, Zhiyang Dou, Cheng Lin, Anpei Chen, Yuliang Xiu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08547v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

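Scoring rationale: This paper focuses on reconstructing and controlling 4D shape animation in computer graphics. It proposes a Scaffold-Skin Rigging system called Skelebones, which involves Gaussian representations, skeleton extraction, and motion matching. All scoring keywords concern large language models, the technical principles of deep learning, or AI applications in science, whereas this work belongs to computer graphics and computer vision and has no direct connection to them. The paper does not involve any large-model technique, training method, inference optimization, alignment technique, or AI for Science application, so every keyword receives a relevance of 0.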

!!! tip deepseek-chat TL;DR

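The paper proposes Skelebones, a Scaffold-Skin Rigging system that compresses deformable Gaussians into free-form bones, extracts a Mean Curvature Skeleton, and binds the two through partwise motion matching. This enables efficient reconstruction and controllable animation of the Level of Dynamics of 4D shapes, with significantly better reanimation of unseen poses on synthetic and real-world datasets.

Abstract translation (Chinese)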

自由形态骨骼虽能紧密贴合表面并有效捕捉非刚性形变，但缺乏直观控制所需的运动学结构。为此，我们提出名为“Skelebones”的支架-蒙皮绑定系统，包含三个核心步骤：（1）骨骼生成：将时间一致的可变形高斯模型压缩为自由形态骨骼，以逼近非刚性表面形变；（2）骨架提取：从规范高斯模型中提取平均曲率骨架并进行时序优化，确保获得与类别无关、运动自适应且拓扑正确的运动学结构；（3）绑定机制：通过非参数化部件运动匹配算法将骨架与骨骼绑定，通过对现有动作的匹配、检索与融合合成新颖骨骼运动。这三个步骤共同实现了将四维形状的动态层级压缩为兼具可控性与表现力的紧凑型骨骼系统。我们在合成与真实数据集上验证了该方法，在未见姿态的重动画性能上取得显著提升——相较于线性混合蒙皮算法提升17.3% PSNR，相较于Bag-of-Bones方法提升21.7%——同时保持优异的重建保真度，尤其对呈现复杂非刚性表面动力学的角色效果显著。我们的部件运动匹配算法对高斯模型与网格表征均展现出强大泛化能力，在低数据量场景（约1000帧）下尤为突出，其均方根误差较鲁棒性线性混合蒙皮降低48.4%，并超越基于GRU与MLP的学习方法超过20%。相关代码将于cookmaker.cn/gaussianimate公开以供研究使用。

Abstract

Free-form bones that conform closely to the surface can effectively capture non-rigid deformations, but they lack the kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed “Skelebones”, with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses, with 17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB), while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics. Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially in the low-data regime (~1000 frames), achieving 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by >20%. Code will be made publicly available for research purposes at cookmaker.cn/gaussianimate.
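Since the headline results above are reported as PSNR gains, a short reminder of how PSNR is computed may help when reading them. The sketch below uses the standard definition, PSNR = 10 · log10(MAX² / MSE), on flat lists of pixel intensities. The function name and the [0, 1] intensity range are assumptions for illustration, and the abstract does not specify the evaluation protocol or how the percentage gain is derived.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] intensity scale.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.6, 0.9, 0.35]
value = psnr(reference, rendered)
print(f"{value:.2f} dB")
```

Because PSNR is logarithmic, a percentage gain in dB is not the same as a percentage reduction in pixel error.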

Keywords: Gaussianimate, Skelebones, Scaffold-Skin Rigging, 4D shape animation, Free-form bones, Mean Curvature Skeleton, Partwise Motion Matching, Non-rigid deformation

183. ❌ When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08546v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

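Scoring rationale: The paper studies numerical alignment in text-to-video diffusion models and proposes the NUMINA framework to improve counting accuracy. It is unrelated to most keywords. Only "Alignment" (here meaning alignment between text and visuals) and "Hallucination Mitigation" (reducing generation errors) are moderately relevant (5 points each), because the paper addresses alignment between text prompts and generated videos and reduces counting errors, which can be viewed as a form of hallucination. The other keywords mainly concern LLM techniques, training methods, and inference optimization, whereas this paper focuses on visual generation alignment in diffusion models, which is a different technical area.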

!!! tip deepseek-chat TL;DR

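The paper addresses the inaccurate object counts produced by text-to-video diffusion models. It proposes NUMINA, a training-free identify-then-guide framework that improves counting accuracy (by up to 7.4% on Wan2.1-1.3B) while maintaining temporal consistency.

Abstract translation (Chinese)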

文本到视频扩散模型已实现开放式视频生成，但常难以准确生成提示词中指定数量的物体。我们提出NUMINA，一种无需训练的“识别-引导”框架以提升数值对齐能力。NUMINA通过筛选具有区分性的自注意力与交叉注意力头来识别提示-布局不一致性，从而推导出可计数的潜在布局。随后该框架对布局进行保守优化，并通过调制交叉注意力来引导视频重新生成。在新建的CountBench测试集上，NUMINA将Wan2.1-1.3B模型的计数准确率最高提升7.4%，在5B和14B模型上分别提升4.9%和5.5%。此外，该方法在保持时序一致性的同时提升了CLIP对齐分数。这些结果表明，结构引导能够与种子搜索和提示增强形成互补，为实现精确计数的文本到视频扩散提供了实用路径。代码发布于https://github.com/H-EmbodVis/NUMINA。

Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

Keywords: text-to-video diffusion models, numerical alignment, counting accuracy, attention mechanisms, training-free framework, visual generation, prompt-layout consistency, cross-attention modulation

184. ❌ ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets

Authors: Xiaoben Li, Jingyi Wu, Zeyu Cai, Yu Siyuan, Boqian Li, Yuliang Xiu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08548v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

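Scoring rationale: The paper "ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets" belongs to computer vision and graphics. It studies how to fit parametric body models (such as SMPL-X) to 3D point clouds of clothed humans in order to improve downstream tasks such as animation and texturing. Its core contribution is a new fitting method whose modular "undress" and "dense fit" stages use composable datasets (such as CLOTH3D, AMASS, and InterHand2.6M) to improve robustness and expressiveness across clothing, poses, and input completeness. All scoring keywords concern large language models, the technical principles of deep learning, or AI applications in science (such as bioinformatics). The paper touches none of these topics, so every keyword receives a relevance of 0.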

!!! tip deepseek-chat TL;DR

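The paper proposes ETCH-X, which uses a modular design and composable datasets to fit expressive body models robustly to 3D point clouds of clothed humans, with substantial gains over ETCH on both seen and unseen data.

Abstract translation (Chinese)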

人体拟合旨在将SMPL等参数化人体模型与穿着衣物的人体原始三维点云对齐，是动画与纹理映射等下游任务的关键预处理步骤。有效的拟合方法需同时具备局部表现力（捕捉手部、面部等精细细节）与全局鲁棒性，以应对服装动态、姿态变化以及噪声或部分输入等现实挑战。现有方法通常仅擅长单一维度，缺乏一体化解决方案。本研究将ETCH升级为ETCH-X，通过以下创新实现鲁棒且精细的全身拟合：采用紧密度感知拟合范式过滤服装动态影响（“去衣化”）；扩展SMPL-X模型以增强表现力；以隐式密集对应关系（“密集拟合”）替代对部分数据高度敏感的显式稀疏标记点，从而提升拟合的鲁棒性与细粒度精度。我们解耦的“去衣化”与“密集拟合”模块化阶段支持在可组合数据源上进行独立可扩展的训练，包括多样化模拟服装数据集（CLOTH3D）、大规模全身运动数据（AMASS）以及细粒度手势数据（InterHand2.6M），显著提升了模型对服装泛化能力及身体与手部姿态的鲁棒性。该方法在多样化服装、姿态及输入完整度条件下均实现了鲁棒且高表现力的拟合效果，相较于ETCH在以下两方面取得显著性能提升：1）已见数据，如4D-Dress（全身关节点平均误差MPJPE-All降低33.0%）与CAPE（手部顶点到顶点误差V2V-Hands降低35.8%）；2）未见数据，如BEDLAM2.0（MPJPE-All降低80.8%；全身顶点到顶点误差V2V-All降低80.5%）。代码与模型将在https://xiaobenli00.github.io/ETCH-X/ 发布。

Abstract

Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive (capturing fine details such as hands and facial features) and globally robust enough to handle real-world challenges, including clothing dynamics, pose variations, and noisy or partial inputs. Existing approaches typically excel in only one aspect, lacking an all-in-one solution. We upgrade ETCH to ETCH-X, which leverages a tightness-aware fitting paradigm to filter out clothing dynamics (“undress”), extends expressiveness with SMPL-X, and replaces explicit sparse markers (which are highly sensitive to partial data) with implicit dense correspondences (“dense fit”) for more robust and fine-grained body fitting. Our disentangled “undress” and “dense fit” modular stages enable separate and scalable training on composable data sources, including diverse simulated garments (CLOTH3D), large-scale full-body motions (AMASS), and fine-grained hand gestures (InterHand2.6M), improving outfit generalization and pose robustness of both bodies and hands. Our approach achieves robust and expressive fitting across diverse clothing, poses, and levels of input completeness, delivering a substantial performance improvement over ETCH on both: 1) seen data, such as 4D-Dress (MPJPE-All, 33.0% reduction) and CAPE (V2V-Hands, 35.8% reduction), and 2) unseen data, such as BEDLAM2.0 (MPJPE-All, 80.8% reduction; V2V-All, 80.5% reduction). Code and models will be released at https://xiaobenli00.github.io/ETCH-X/.
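The MPJPE figures quoted above are relative changes in a distance-based error. As a reading aid, the sketch below computes MPJPE with its standard definition (the mean Euclidean distance between predicted and ground-truth joints) and the relative reduction between two error values. The function names and the toy numbers are assumptions for illustration; the paper's exact joint sets and alignment protocol are not given in the abstract.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy example with two joints; coordinates are in arbitrary units.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted, ground_truth)   # distances are 3 and 4
drop = relative_reduction(10.0, 6.7)     # made-up baseline and new errors
print(error, round(drop, 1))
```

V2V (vertex-to-vertex) error follows the same pattern, with mesh vertices in place of joints.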

Keywords: human body fitting, SMPL-X, clothed humans, robust fitting, expressive fitting, composable datasets, 3D point clouds, modular stages

185. ❌ E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation

Authors: Mayur Deshmukh, Hiroyasu Akada, Helge Rhodin, Christian Theobalt, Vladislav Golyanik Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08543v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

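Scoring rationale: The paper focuses on 3D human pose estimation driven by event cameras, which belongs to computer vision and sensor fusion. It is entirely unrelated to the scoring keywords, all of which revolve around large models, the technical principles of deep learning, and their applications. The paper does not involve language models, model training or fine-tuning techniques, inference optimization, alignment methods, agent systems, or AI for Science.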

!!! tip deepseek-chat TL;DR

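The paper proposes E-3DPSM, a real-time, event-camera-based method for egocentric 3D human pose estimation that uses a continuous pose state machine to align event-stream dynamics with human motion. On two benchmarks it improves accuracy (MPJPE) by up to 19%, and it runs at 80 Hz.

Abstract translation (Chinese)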

事件相机在基于头戴设备的单眼第一人称三维人体姿态估计中具有多重优势，例如毫秒级时间分辨率、高动态范围以及可忽略的运动模糊。现有方法虽有效利用了这些特性，但其三维估计精度较低，难以满足许多应用场景（如沉浸式VR/AR）的需求。这源于现有设计未能完全适配事件流特性（例如其异步性与连续性），导致其对自遮挡和估计结果中的时间抖动高度敏感。本文重新审视了这一设定，提出E-3DPSM——一种用于基于事件的第一人称三维人体姿态估计的事件驱动连续姿态状态机。E-3DPSM将连续人体运动与细粒度事件动态对齐；它通过演化潜在状态并预测与观测事件相关联的三维关节位置的连续变化，再将其与直接三维人体姿态预测相融合，从而实现稳定且无漂移的最终三维姿态重建。E-3DPSM在单台工作站上能以80赫兹频率实时运行，并在两个基准测试中取得了新的最优性能，将精度（MPJPE）提升高达19%，并将时间稳定性提高至2.7倍。源代码与训练模型详见项目页面。

Abstract

Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.
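The idea of combining predicted pose changes with direct pose predictions can be illustrated with a simple weighted blend. The sketch below is a complementary-filter-style stand-in, not the paper's method: E-3DPSM evolves learned latent states, and the abstract does not describe its fusion rule. The function name, the fixed `alpha` weight, and the toy values are all assumptions.

```python
def fuse_pose(previous_pose, predicted_delta, direct_pose, alpha=0.2):
    """Blend an integrated pose (previous + delta) with a direct pose prediction.

    Each pose is a list of (x, y, z) joint positions. alpha is the weight given
    to the direct prediction; 1 - alpha goes to the integrated estimate.
    """
    fused = []
    for prev, delta, direct in zip(previous_pose, predicted_delta, direct_pose):
        # Integrate the predicted change onto the previous joint position.
        integrated = tuple(p + d for p, d in zip(prev, delta))
        # Pull the integrated estimate toward the direct prediction to limit drift.
        fused.append(tuple((1.0 - alpha) * i + alpha * z for i, z in zip(integrated, direct)))
    return fused


# Toy single-joint example: integration says x = 1.0, direct prediction says x = 2.0.
previous_pose = [(0.0, 0.0, 0.0)]
predicted_delta = [(1.0, 0.0, 0.0)]
direct_pose = [(2.0, 0.0, 0.0)]
result = fuse_pose(previous_pose, predicted_delta, direct_pose, alpha=0.5)
print(result)
```

In this kind of blend the integrated term follows fast motion between updates, while the direct term limits accumulated drift.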

Keywords: event cameras, 3D human pose estimation, egocentric vision, state machine, real-time processing, monocular estimation, temporal stability, pose reconstruction

186. ❌ ParseBench: A Document Parsing Benchmark for AI Agents

Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Simon Suo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08538v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

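Scoring rationale: The paper presents a benchmark for document parsing by AI agents, centered on evaluating semantic correctness. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it explicitly studies the document-parsing needs of AI agents. It has some relevance to "Tool Use OR Function Calling OR API Tool Use" (5 points) because document parsing can be viewed as a tool-use scenario for agents. It is weakly relevant to "Large Language Models OR LLMs OR Foundation Models" (5 points) because the paper includes LlamaParse (LLM-based) among the evaluated methods. The other keywords (such as MoE, Scaling Laws, and RLHF) are unrelated to the paper and receive 0 points.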

!!! tip deepseek-chat TL;DR

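The paper introduces ParseBench, a benchmark for evaluating the semantic correctness of document parsing for AI agents. It finds that current methods have fragmented capabilities across five dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding), and that no system is strong on all of them.

Abstract translation (Chinese)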

AI智能体正在改变文档解析的需求标准。其核心在于语义正确性：解析输出必须保留自主决策所需的结构与含义，包括正确的表格结构、精确的图表数据、具有语义意义的格式以及视觉定位。现有基准测试未能充分涵盖企业自动化场景下的这一要求，它们依赖狭窄的文档分布和基于文本相似度的评估指标，这些指标无法捕捉对智能体至关重要的错误类型。我们推出了 ParseBench 基准，该数据集包含约2,000页来自保险、金融和政府领域企业文档的人工验证页面，围绕五大能力维度构建：表格、图表、内容忠实度、语义格式化和视觉定位。通过对涵盖视觉语言模型、专用文档解析器和LlamaParse在内的14种方法进行评估，基准测试揭示了当前能力分布的碎片化现状：没有任何方法能在所有五个维度上持续表现优异。LlamaParse Agentic以\agenticoverall%的综合得分位居首位，同时该基准也凸显了现有系统在各维度上仍存在的能力差距。数据集与评估代码已发布于HuggingFace和GitHub平台。

Abstract

AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
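The claim that text-similarity metrics can miss agent-critical failures is easy to reproduce on a toy table. The sketch below is an illustration only and is not ParseBench's evaluation code: it compares a character-level similarity from Python's standard `difflib.SequenceMatcher` with a hypothetical per-cell accuracy on a two-row table whose values were swapped during parsing. The table contents and helper names are made up.

```python
import difflib


def text_similarity(reference: str, parsed: str) -> float:
    """Character-level similarity ratio in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, reference, parsed).ratio()


def parse_rows(table: str) -> dict:
    """Read 'key | value' lines into a dict (hypothetical toy table format)."""
    rows = {}
    for line in table.splitlines():
        key, value = (part.strip() for part in line.split("|", 1))
        rows[key] = value
    return rows


def cell_accuracy(reference: str, parsed: str) -> float:
    """Fraction of reference keys whose parsed value matches exactly."""
    ref, out = parse_rows(reference), parse_rows(parsed)
    correct = sum(1 for key, value in ref.items() if out.get(key) == value)
    return correct / len(ref)


reference = "Premium | 1200\nDeductible | 500"
parsed = "Premium | 500\nDeductible | 1200"  # values swapped between rows

similarity = text_similarity(reference, parsed)
accuracy = cell_accuracy(reference, parsed)
print(f"text similarity: {similarity:.2f}, cell accuracy: {accuracy:.2f}")
```

The parsed text shares most of its characters with the reference, yet every cell an agent would act on is wrong.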

Keywords: AI agents, document parsing, semantic correctness, benchmark, enterprise automation, vision-language models, LlamaParse, capability gaps

187. ❌ Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Authors: Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, Xiaowei Zhou Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08542v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

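Scoring rationale: The paper "Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction" addresses 3D scene reconstruction in computer vision. It proposes a neural global context representation that efficiently compresses and retains long-range scene information, realized through lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives. All scoring keywords relate directly to large language models (LLMs), the technical principles of deep learning (such as MoE, scaling laws, training and alignment methods, inference optimization, and agents), or specific scientific AI applications (such as bioinformatics). The paper's subject matter (3D reconstruction, neural representations, test-time training) lies in a different area from these keywords (natural language processing, large-model techniques, domain-specific scientific AI). It involves no language model, model training technique, alignment method, inference acceleration, or listed scientific AI application, so every keyword receives a relevance of 0.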

!!! tip deepseek-chat TL;DR

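The paper tackles large-scale 3D scene reconstruction from long video sequences and proposes a scalable test-time training method. Using a neural global context representation and self-supervised adaptation, it achieves leading pose accuracy and state-of-the-art 3D reconstruction accuracy on multiple large-scale benchmarks while remaining efficient.

Abstract translation (Chinese)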

本文针对从长视频序列进行大规模三维场景重建的任务展开研究。近期，前馈式重建模型通过直接从RGB图像回归三维几何结构，无需显式的三维先验或几何约束，已展现出有前景的结果。然而，由于内存容量有限且无法有效捕捉全局上下文信息，这些方法在长序列上往往难以保持重建的准确性与一致性。相比之下，人类能够自然地利用对场景的全局理解来指导局部感知。受此启发，我们提出了一种新颖的神经全局上下文表示方法，它能高效压缩并保留长距离场景信息，使模型能够利用广泛的上下文线索来提升重建的准确性和一致性。该上下文表示通过一组轻量级神经子网络实现，这些子网络在测试时通过自监督目标进行快速自适应，从而在不显著增加计算开销的情况下大幅提升了内存容量。在多个大规模基准数据集（包括KITTI Odometry~\cite{Geiger2012CVPR}和Oxford Spires~\cite{tao2025spires}数据集）上的实验表明，我们的方法在处理超大规模场景方面具有显著效果，在保持高效性的同时，实现了领先的位姿估计精度和最先进的三维重建精度。代码发布于https://zju3dv.github.io/scal3r。

Abstract

This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry [Geiger2012CVPR] and Oxford Spires [tao2025spires] datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.

Keywords: 3D reconstruction, large-scale scenes, test-time training, neural global context, self-supervised adaptation, video sequences, pose accuracy, memory efficiency

188. ❌ Fail2Drive: Benchmarking Closed-Loop Driving Generalization

Authors: Simon Gerstenecker, Andreas Geiger, Katrin Renz Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08535v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

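Rationale: The paper focuses on a benchmark for closed-loop driving generalization in autonomous driving, covering the CARLA simulator, paired routes, scenario classes, and expert-policy validation. Every keyword targets large language models, deep-learning techniques, or AI applications in science, whereas this paper evaluates autonomous driving systems and involves no large-model techniques, deep-learning innovations, or AI applications in bioinformatics or cheminformatics, so the relevance of every keyword is 0.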

!!! tip deepseek-chat TL;DR

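The paper introduces the Fail2Drive benchmark for evaluating the closed-loop driving generalization of autonomous driving systems under distribution shift, finds that the success rate of existing models drops by 22.8% on average, and uncovers unexpected failure modes.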

Abstract (Chinese translation)

分布偏移下的泛化能力仍是闭环自动驾驶的核心瓶颈。尽管CARLA等模拟器能够实现安全且可扩展的测试，但现有基准测试很少衡量真正的泛化能力：它们通常在测试时重复使用训练场景。因此，成功可能反映的是记忆而非稳健的驾驶行为。我们推出Fail2Drive——首个针对CARLA中闭环泛化能力的配对路线基准测试，包含200条路线和17个涵盖外观、布局、行为及鲁棒性偏移的新场景类别。每条偏移路线均与一条分布内（in-distribution）对应路线匹配，从而隔离偏移效应，并将定性故障转化为定量诊断。通过对多个前沿模型进行评估，我们发现其性能普遍下降，平均成功率降低22.8%。分析揭示了意料之外的故障模式，例如忽略激光雷达（LiDAR）中清晰可见的物体，以及未能掌握自由空间与占用空间的基本概念。为加速后续研究，Fail2Drive提供了开源工具箱，可用于创建新场景并通过特权专家策略验证可解性。这些组件共同为基准测试和提升闭环驾驶泛化能力建立了可复现的基础。所有代码、数据与工具均已开源：https://github.com/autonomousvision/fail2drive。

Abstract

Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive .
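The paired-route idea in the abstract (each shifted route matched with an in-distribution counterpart) can be illustrated with a small aggregation sketch. This is a minimal sketch under stated assumptions: the function name and the boolean success encoding are hypothetical, and the drop is reported in absolute terms. The abstract does not say whether the 22.8% figure is absolute or relative, and the benchmark's real toolbox may compute it differently.

```python
from typing import List, Tuple


def paired_success_drop(pairs: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Aggregate paired route outcomes.

    Each pair is (in_distribution_success, shifted_success) for one route pair.
    Returns (in-distribution success rate, shifted success rate, absolute drop).
    The encoding is a hypothetical illustration, not Fail2Drive's data format.
    """
    if not pairs:
        raise ValueError("at least one route pair is required")
    n = len(pairs)
    base_rate = sum(1 for in_dist, _ in pairs if in_dist) / n
    shifted_rate = sum(1 for _, shifted in pairs if shifted) / n
    return base_rate, shifted_rate, base_rate - shifted_rate


# Toy outcomes for four route pairs (made-up values for illustration only).
demo_pairs = [(True, True), (True, False), (True, True), (False, False)]
base, shifted, drop = paired_success_drop(demo_pairs)
print(f"in-distribution: {base:.2f}, shifted: {shifted:.2f}, drop: {drop:.2f}")
```

Because every shifted route has a matched counterpart, the difference between the two rates can be attributed to the shift itself rather than to differences in route difficulty.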

Keywords: closed-loop autonomous driving, generalization benchmark, distribution shift, CARLA simulator, paired-route evaluation, failure mode analysis, expert policy validation, driving robustness

189. ❌ Self-Improving 4D Perception via Self-Distillation

Authors: Nan Huang, Pengcheng Yu, Weijia Zeng, James M. Rehg, Angjoo Kanazawa, Haiwen Feng, Qianqian Wang | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

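Rationale: The paper proposes the SelfEvo framework, which lets 4D perception models improve themselves through self-distillation, so it is highly relevant to "Self-Correction OR Self-Improvement OR Self-Reflection" (10 points). It involves multi-view reconstruction models, a computer-vision application in the sciences, which gives some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It mentions using pretrained models, which gives some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points). The remaining keywords mainly concern large language models, reasoning, alignment, optimization, and similar topics unrelated to the paper's computer-vision and 4D-perception focus, so each receives 0.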
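The relationship between the relevance column, the weight column, and the pass mark can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section. The row formula (weight × relevance, capped) is an assumption, since the report does not state it, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def row_score(weight: float, relevance: float, cap: float = 15.0) -> float:
    """ASSUMED row formula: weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, cap)


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the row scores over (weight, relevance) pairs."""
    return sum(row_score(weight, relevance) for weight, relevance in rows)


# The three non-zero relevance rows of this entry, with the weight of 0.0 shown in the table.
rows = [(0.0, 5.0), (0.0, 10.0), (0.0, 5.0)]
print(round(pass_mark(27.0), 1))  # pass mark for a total weight of 27.0
print(total_score(rows))          # every row contributes 0.0 when its weight is 0.0
```

Under this assumed formula, a relevance of 10.0/10 still yields a row score of 0.0 when the weight column reads 0.0, which matches the totals shown in these entries.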

!!! tip deepseek-chat TL;DR

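The paper proposes the SelfEvo framework, which uses self-distillation and spatiotemporal context asymmetry so that pretrained multi-view reconstruction models can improve themselves on unlabeled videos, achieving significant gains in depth estimation and camera estimation on dynamic scenes.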

Abstract (Chinese translation)

大规模多视角重建模型已取得显著进展，但现有方法大多仍依赖带有真实3D/4D标注的全监督训练。此类标注成本高昂，且在动态场景中尤为稀缺，限制了方法的可扩展性。我们提出SelfEvo——一种自改进框架，能够利用未标注视频持续优化预训练的多视角重建模型。SelfEvo引入了一种基于时空上下文不对称性的自蒸馏机制，使得基于学习的4D感知无需外部标注即可实现自我提升。我们系统研究了实现有效自改进的关键设计要素，包括损失信号、不对称性形式及其他训练策略。在涵盖多样化数据集与领域的八个基准测试中，SelfEvo持续提升了预训练基线的性能，并能够泛化至不同基础模型（如VGGT和$π^3$），尤其在动态场景中取得显著增益。总体而言，在不使用任何标注数据的情况下，SelfEvo在视频深度估计任务中实现了最高36.5%的相对性能提升，在相机参数估计任务中提升了20.1%。项目页面：https://self-evo.github.io/。

Abstract

Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.

Keywords: Self-improving, Self-distillation, 4D perception, Multi-view reconstruction, Unlabeled videos, Spatiotemporal context asymmetry, Video depth estimation, Camera estimation

190. ❌ FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

Authors: Johanna Karras, Yuanhao Wang, Yingwei Li, Ira Kemelmacher-Shlizerman | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08526v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

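Rationale: The paper "FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On" focuses on the virtual try-on task in computer vision, in particular on synthesizing garment fit. Its core contribution is building a large-scale synthetic dataset (FIT) and training a baseline model, involving 3D garment generation, physics simulation, re-texturing, and identity preservation. Every scoring keyword targets large models, deep-learning techniques, or AI-for-science applications, whereas this paper is a conventional computer vision and graphics application: it does not touch large models, MoE, scaling laws, training and alignment, inference optimization, agents, or model compression, and it is not applied to bioinformatics or other scientific fields, so the relevance of every keyword is 0.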

!!! tip deepseek-chat TL;DR

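The paper addresses the neglect of garment fit in virtual try-on: by building FIT, a large-scale synthetic dataset with precise size measurements, and training a baseline model, it enables fit-aware virtual try-on and sets a new benchmark for the field.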

Abstract (Chinese translation)

给定一张人物图像和一件服装图像，虚拟试穿（VTO）的目标是合成该人物穿着该服装的真实感图像，同时保持其原始姿态和身份特征。尽管近期的VTO方法在呈现服装外观方面表现出色，但它们大多忽略了一个试穿体验中的关键方面：服装合身度的准确性——例如，描绘一件超大号衬衫穿在超小体型人物身上的效果。一个主要障碍在于缺乏提供精确服装与人体尺寸信息的数据集，特别是针对服装明显过大或过小的“不合身”情况。因此，当前的VTO方法默认生成合身的结果，而忽略了服装或人物尺寸的实际差异。
本文中，我们为解决这一开放性问题迈出了第一步。我们提出了FIT（Fit-Inclusive Try-on，包容性合身试穿）数据集，这是一个包含超过113万组试穿图像三元组的大规模VTO数据集，并附有精确的人体和服装尺寸信息。我们通过一种可扩展的合成策略克服了数据收集的挑战：（1）我们使用GarmentCode程序化生成3D服装，并通过物理模拟进行悬垂处理，以捕捉真实的服装合身效果。（2）我们采用一种新颖的重纹理框架，将合成渲染图转化为逼真的图像，同时严格保持几何形状。（3）我们在重纹理模型中引入了人物身份保持机制，以生成用于监督训练的配对人物图像（同一人物，不同服装）。最后，我们利用FIT数据集训练了一个具有合身感知能力的基线虚拟试穿模型。我们的数据和成果为合身感知虚拟试穿设定了新的技术标杆，并为未来研究提供了一个坚实的基准。我们将在项目页面公开所有数据和代码：https://johannakarras.github.io/FIT。

Abstract

Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit – for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for “ill-fit” cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.

Keywords: virtual try-on, garment fit, synthetic dataset, 3D garment generation, physics simulation, re-texturing, identity preservation, fit-aware model

191. ❌ UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

Authors: Joungbin An, Agrim Jain, Kristen Grauman | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

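Rationale: The paper proposes UniversalVTG, a universal and lightweight foundation model for video temporal grounding. Core relevant keywords: (1) "Large Language Models OR LLMs OR Foundation Models" (8 points): the paper does not use an LLM directly, but it falls within the scope of foundation models and compares against MLLM methods, so it counts as applied large-model research. (2) "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points): large-scale cross-dataset pretraining is central to the method and directly relevant. Other keywords such as MoE, SLMs, SFT, and RLHF are unrelated: the paper focuses on video temporal grounding and does not involve these techniques.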

!!! tip deepseek-chat TL;DR

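The paper addresses poor cross-domain generalization and high compute cost in video temporal grounding. Its UniversalVTG model uses large-scale cross-dataset pretraining and query unification to stay lightweight while matching or exceeding the performance of large multimodal language models.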

Abstract (Chinese translation)

视频时序定位通常采用针对特定数据集训练的模型，这些模型在跨领域和查询风格时迁移效果较差。近期为克服这一局限性，研究者尝试将大型多模态语言模型适配于视频时序定位任务，但其高昂的计算成本与有限的视频上下文理解能力仍阻碍着长视频定位的发展。本研究转而通过扩展统一监督信号的方式保持模型轻量化。我们提出了UniversalVTG——一个通过大规模跨数据集预训练的统一视频时序定位模型。离线查询统一器将异构查询格式规范化为共享的声明性空间，既减少了语言不匹配现象，也避免了简单联合训练中出现的负迁移效应。结合高效的定位头模块，UniversalVTG能够扩展到长时未修剪视频的处理。在多样化基准测试集（GoalStep-StepGrounding、Ego4D-NLQ、TACoS、Charades-STA和ActivityNet-Captions）上，单个UniversalVTG检查点相较于专用视频时序定位模型取得了最先进的性能表现。此外，尽管模型规模比近期基于大型多模态语言模型的方法小$>100$倍，UniversalVTG在多个基准测试中达到或超越了这些方法的准确率，为参数量庞大的大型多模态语言模型提供了实用的替代方案。

Abstract

Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.

Keywords: Video Temporal Grounding, Foundation Model, Cross-dataset Pretraining, Query Unifier, Lightweight Model, Multimodal Language Models, Long-video Grounding, State-of-the-art Performance

192. ❌ MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Authors: Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08516v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

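Rationale: The paper's core is an open visual web agent (MolmoWeb), an applied innovation of large models in a specific domain (web interaction). Highly relevant keywords: LLM Agents (core topic, 10 points), Tool Use (using tools to operate web pages, 8 points), Instruction Tuning (conditioning on task instructions, 8 points), Post-training/SFT (training on demonstration data, 8 points), and Large Language Models (foundation models, 8 points). Moderately relevant keywords: Small Language Models (4B/8B scale, 5 points), Scaling Laws (test-time scaling, 5 points), Pre-training (possibly involved, 5 points), and In-context Learning (implicit, 5 points). The remaining keywords, such as MoE, RLHF, and RAG, are not covered and receive 0.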

!!! tip deepseek-chat TL;DR

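To address existing web agents' reliance on closed-source models, the paper presents the open visual web agent MolmoWeb and its training dataset MolmoWebMix, which achieve state-of-the-art performance on several web-task benchmarks.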

Abstract (Chinese translation)

网络智能体——代表用户在网络上导航并执行任务的自主系统——有潜力彻底改变人类与数字世界的交互方式。然而，当前最强大的网络智能体依赖于训练数据和配方未公开的专有模型，这限制了科学理解、可复现性以及社区驱动的进展。我们坚信，为开放网络构建的智能体本身也应是开放的。为此，我们推出：（1）MolmoWebMix，一个大规模、多样化的浏览器任务演示与网络图形用户界面感知数据混合集；（2）MolmoWeb，一个完全开放的多模态网络智能体系列。具体而言，MolmoWebMix 整合了来自多个互补生成管道的超过10万条合成任务轨迹、3万条以上的人工演示、原子化网络技能轨迹以及图形用户界面感知数据（包括指代表达式定位和屏幕截图问答）。MolmoWeb 智能体作为指令条件化的视觉-语言动作策略运行：给定任务指令和网页截图，它们预测下一个浏览器动作，无需访问HTML、无障碍功能树或专用API。 MolmoWeb 智能体提供40亿和80亿参数两种规模。在 WebVoyager、Online-Mind2Web 和 DeepShop 等浏览器使用基准测试中，它们取得了最先进的结果，超越了同规模仅开放权重的模型，如 Fara-7B、UI-Tars-1.5-7B 和 Holo1-7B。MolmoWeb-8B 也超越了基于 GPT-4o 等更大规模闭源前沿模型构建的标记集（Set-of-Marks, SoM）智能体。我们进一步通过测试时扩展（即采用并行推演结合N选优策略）展示了性能的持续提升，在 WebVoyager 和 Online-Mind2Web 上分别实现了 94.7% 和 60.5% 的 pass@4 成功率（对比其 pass@1 成功率的 78.2% 和 35.3%）。我们将发布模型检查点、训练数据、代码以及统一的评估工具，以保障可复现性并加速网络智能体的开放研究。

Abstract

Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
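For readers unfamiliar with the pass@1 and pass@4 figures, the following sketch shows the unbiased pass@k estimator commonly used in code-generation evaluation. It is an illustration only: the abstract does not state how MolmoWeb computes pass@k, and its best-of-N selection step may differ from this sampling-based definition. The function name and arguments are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k rollouts, drawn without replacement
    from n recorded rollouts of which c succeeded, is a success."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example: 10 rollouts of one task, 5 of which succeeded.
print(pass_at_k(n=10, c=5, k=1))
print(pass_at_k(n=10, c=5, k=4))
```

Averaging this quantity over all tasks gives a benchmark-level pass@k. With k equal to n, it reduces to checking whether any rollout for the task succeeded.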

Keywords: web agents, visual-language action policies, multimodal, open-source, browser task demonstrations, state-of-the-art, instruction-conditioned, test-time scaling

193. ❌ When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations

Authors: Kabilan Elangovan, Daniel Ting | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

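Rationale: The paper studies how transfer learning and fine-tuning affect the stability of model explanations in medical image classification. Its core concerns supervised fine-tuning (SFT), explainable AI, and the application of AI in science (medicine). It explicitly studies the effect of fine-tuning on explanatory evidence, so it is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points). It examines the stability of attribution maps, which belongs to model interpretability, so it is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). The application is chest X-ray classification, i.e. medical AI, so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The other keywords (e.g. LLMs, MoE, RAG) concern large-model techniques or unrelated areas that the paper does not cover, so each receives 0.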

!!! tip deepseek-chat TL;DR

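The paper shows that in medical image classification, fine-tuning after transfer learning causes semantic drift in the evidence behind model explanations: classification performance stays stable, yet the attribution structure reorganizes depending on model architecture and optimization phase.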

Abstract (Chinese translation)

迁移学习结合微调因其在诊断性能上的持续提升而被广泛应用于医学图像分类。然而，在视觉特征重叠的多类别场景中，准确率的提高并不能保证模型用于支持预测的视觉证据的稳定性。我们将语义漂移定义为：在迁移学习与完整微调之间，支撑模型预测的归因结构发生的系统性变化，这反映了尽管分类性能稳定，但潜在的视觉推理可能已发生偏移。通过一项五分类胸部X光任务，我们在两阶段训练协议下评估了DenseNet201、ResNet50V2和InceptionV3模型，并采用无参考指标量化了漂移程度，这些指标捕捉了归因图的空间定位与结构一致性。在所有架构中，粗略的解剖定位保持稳定，而重叠交并比（IoU）则揭示了证据结构存在明显的、依赖于架构的重组。超越单一方法分析，在预测性能收敛的情况下，LayerCAM与GradCAM++的稳定性排序可能出现逆转，这表明解释稳定性是架构、优化阶段与归因目标三者相互作用的结果。

Abstract

Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model’s predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.
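The abstract names overlap IoU between attribution maps as one of its drift signals. Below is a minimal sketch of one way such an overlap could be computed, assuming each map is binarized by keeping its highest-valued 20% of pixels. The threshold, the binarization rule, and the function names are illustrative assumptions, not the paper's protocol.

```python
import numpy as np


def top_fraction_mask(attribution: np.ndarray, fraction: float = 0.2) -> np.ndarray:
    """Boolean mask of the highest-attribution pixels (ties may enlarge the mask)."""
    flat = attribution.ravel()
    k = max(1, int(round(fraction * flat.size)))
    threshold = np.partition(flat, -k)[-k]  # k-th largest value
    return attribution >= threshold


def attribution_iou(map_a: np.ndarray, map_b: np.ndarray, fraction: float = 0.2) -> float:
    """Intersection over union of the top-fraction regions of two attribution maps."""
    mask_a = top_fraction_mask(map_a, fraction)
    mask_b = top_fraction_mask(map_b, fraction)
    union = np.logical_or(mask_a, mask_b).sum()  # at least 1 because k >= 1
    return float(np.logical_and(mask_a, mask_b).sum() / union)


# Toy maps: a gradient and its 180-degree rotation, standing in for the
# attribution maps of the same image before and after full fine-tuning.
before = np.arange(16, dtype=float).reshape(4, 4)
after = before[::-1, ::-1]
print(attribution_iou(before, before))
print(attribution_iou(before, after))
```

A value near 1 means the two training stages highlight the same regions, while a low value with unchanged accuracy is the kind of reorganization the abstract calls semantic drift.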

Keywords: fine-tuning, semantic drift, medical image classification, attribution maps, explanation stability, chest X-ray, transfer learning, model interpretability

194. ❌ Visually-grounded Humanoid Agents

Authors: Hang Ye, Xiaoxuan Ma, Fan Lu, Wayne Wu, Kwan-Yee Lin, Yizhou Wang | Source: arXiv | Published: 2026-04-09 | arXiv link: http://arxiv.org/abs/2604.08509v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

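Rationale: The paper studies visually-grounded humanoid agents, which falls under embodied AI and autonomous agents. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because its core is creating autonomous humanoid agents. It is highly relevant to "World Models AND General World Models" (10 points) because it explicitly proposes a "World Layer" that reconstructs 3D scenes as a world model. It is somewhat relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (5 points each) because it describes agents performing iterative reasoning and spatially aware planning. The other keywords mainly concern LLM techniques, training methods, and optimization techniques that the paper does not cover, so they receive 0.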

!!! tip deepseek-chat TL;DR

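The paper studies how digital humans can act autonomously in novel 3D scenes from visual observations alone. It proposes a two-layer (world-agent) paradigm that yields autonomous humanoid agents with spatial awareness and iterative reasoning in reconstructed environments; experiments show higher task success rates and fewer collisions.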

Abstract (Chinese translation)

数字人生成技术已历经数十年研究，并支撑着广泛的现实应用。然而，现有系统大多依赖特权状态或脚本控制进行被动驱动，这限制了其在新环境中的可扩展性。我们转而提出：数字人如何仅凭视觉观察与指定目标，在新场景中主动产生行为？实现这一目标将能大规模地在任意三维环境中部署数字人，使其表现出自发、自然且目标导向的行为。为此，我们提出视觉基元人形智能体——一种耦合的双层（世界-智能体）范式，从多层次复现人类特质：它们能够像真实人类一样观察、感知、推理，并在真实三维场景中行动。世界层通过遮挡感知的流程从真实世界视频重建语义丰富的三维高斯场景，并适配可驱动的基于高斯模型的人体化身。智能体层将这些化身转化为自主人形智能体，为其配备第一人称RGB-D感知能力，使其能够基于空间认知与迭代推理执行精确的具身规划，进而通过底层的全身动作驱动其在场景中的行为。我们进一步构建了一个评估框架，用于在多样化的重建环境中评估人形智能体与场景的交互。实验表明，我们的智能体实现了稳健的自主行为，在任务成功率和碰撞规避方面均优于消融实验及现有先进规划方法。这项工作为实现主动式数字人群体部署提供了可能，并推动了以人为中心的具身人工智能发展。相关数据、代码与模型将开源发布。

Abstract

Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environments with digital humans at scale that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.

Keywords: Visually-grounded Humanoid Agents, autonomous agents, 3D Gaussian scenes, embodied planning, spatial awareness, iterative reasoning, human-scene interaction, embodied AI

195. ❌ Novel View Synthesis as Video Completion

Authors: Qi Wu, Khiem Vuong, Minsik Jeon, Srinivasa Narasimhan, Deva Ramanan Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

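Scoring rationale: The paper studies sparse-view novel view synthesis (NVS), using video diffusion models to treat multi-view image inputs as a low-frame-rate video completion problem. Its core is 3D reconstruction and generative modeling in computer vision; it does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or any specific AI for Science domain. All scoring keywords concern LLM techniques, training methods, inference optimization, AI agents, or scientific AI applications, whereas this paper focuses on adapting video models to a vision task, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper recasts sparse novel view synthesis as a video completion task and modifies a video diffusion model's architecture to make it invariant to input order, achieving competitive performance on sparse-view synthesis benchmarks.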

摘要翻译

我们致力于利用视频扩散模型解决稀疏新视角合成问题：给定场景的$K$（约5张）多视角图像及其相机位姿，我们预测目标相机位姿下的视图。先前许多方法借助通过扩散模型编码的生成式图像先验，但基于单张图像训练的模型缺乏多视角知识。我们认为，视频模型已隐含多视角知识，因此应更容易适应新视角合成任务。我们的核心思路是将稀疏新视角合成定义为低帧率视频补全任务。然而，一个挑战在于稀疏新视角合成的输入是无序集合，且通常过于稀疏而难以定义有效顺序，因此模型应对输入集合的排列具有$\textit{不变性}$。为此，我们提出FrameCrafter模型，它通过多项架构改进——包括逐帧潜在编码和去除时序位置嵌入——将自然训练于有序帧序列的视频模型适配为排列不变的新视角合成系统。实验结果表明，视频模型能够通过极少量监督轻松“遗忘”时序信息，在稀疏视角新视角合成基准测试中取得具有竞争力的性能。项目页面：https://frame-crafter.github.io/

摘要 (Abstract)

We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to “forget” about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/

Keywords: novel view synthesis, video diffusion models, sparse multi-view images, video completion, permutation invariance, FrameCrafter, 3D reconstruction, generative models
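
The abstract's point about removing temporal positional embeddings can be illustrated with a toy example. The sketch below is not FrameCrafter's architecture; it is a minimal single-query softmax attention in plain Python, with hypothetical names (`attend`, `views`, `temporal`) and made-up feature values. It only shows that attention over a set of frame features ignores input order unless index-dependent embeddings are added.

```python
import math

def attend(query, frames, pos_emb=None):
    """Softmax attention of one target-view query over per-frame features.

    frames:  list of feature vectors, one per input view (toy values).
    pos_emb: optional per-index temporal embeddings added to the frames.
    """
    if pos_emb is not None:
        frames = [[x + p for x, p in zip(f, e)] for f, e in zip(frames, pos_emb)]
    scores = [sum(q * k for q, k in zip(query, f)) for f in frames]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * f[i] for w, f in zip(weights, frames)) / total
            for i in range(len(query))]

def close(a, b, tol=1e-9):
    """Element-wise comparison with a small tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

query = [1.0, 0.2]
views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
shuffled = [views[2], views[0], views[1]]
temporal = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # index-dependent embeddings

# Without positional embeddings the result does not depend on input order.
print(close(attend(query, views), attend(query, shuffled)))
# With index-dependent embeddings the same set of views gives a different result.
print(close(attend(query, views, temporal), attend(query, shuffled, temporal)))
```

The first comparison holds because the softmax-weighted sum runs over a set; the second fails because each embedding is tied to a slot index rather than to a view.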

196. ❌ Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Authors: Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

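Scoring rationale: The paper focuses on improving the physical consistency of a video generation model (Phantom), jointly modeling visual content and latent physical dynamics to generate physically plausible videos. Although it involves generative models and AI applications, all keywords explicitly target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, CoT, and agents), whereas this paper studies video generation models and does not mention any language model or LLM technique. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies how jointly modeling visual content and latent physical dynamics can improve the physical consistency of video generation models; the proposed Phantom model outperforms existing methods in adherence to physical dynamics while delivering competitive perceptual fidelity.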

摘要翻译

近期，基于大规模数据集与强大架构的生成式视频建模技术取得了显著进展，实现了令人瞩目的视觉真实感。然而，越来越多的证据表明，单纯扩大数据与模型规模并不能使这些系统理解支配现实世界动态的底层物理规律。现有方法往往未能捕捉或强化这种物理一致性，导致生成的运动与动态效果不真实。在本研究中，我们探讨将潜在物理属性的推断直接整合到视频生成过程中，是否能使模型具备生成物理合理视频的能力。为此，我们提出Phantom——一种物理融合视频生成模型，它同时对视觉内容与潜在物理动态进行建模。在观测视频帧与推断物理状态的条件下，Phantom联合预测潜在物理动态并生成未来视频帧。该模型利用一种物理感知视频表征，作为底层物理的抽象而信息丰富的嵌入表示，从而无需显式指定复杂的物理动态与属性集合，即可促进物理动态与视频内容的联合预测。通过将物理感知视频表征的推断直接融入视频生成流程，Phantom生成的视频序列既具备视觉真实感，又保持物理一致性。在标准视频生成与物理感知基准测试上的定量与定性结果表明，Phantom不仅在遵循物理动态方面优于现有方法，同时提供了具有竞争力的感知保真度。

摘要 (Abstract)

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

Keywords: Physics-Infused Video Generation, Latent Physical Dynamics, Video Generation Model, Physical Consistency, Physics-aware Video Representation, Generative Video Modeling, Visual Realism, Physical Dynamics Prediction

197. ❌ LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

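Scoring rationale: The paper proposes LAMP, which uses image editing as a 3D prior to extract inter-object 3D transformations for open-world robotic manipulation. The abstract notes that LLMs and VLMs have limited 3D awareness, so the paper has some relevance to "Large Language Models OR LLMs OR Foundation Models" (5 points): it discusses the limitations of LLMs and proposes an alternative. The other keywords mainly cover specific large-model techniques (such as MoE, SFT, RLHF, and RAG), reasoning methods (such as CoT and System 2), agent systems, optimization techniques (such as quantization and inference acceleration), or specific scientific fields (such as bioinformatics); the paper does not address any of these, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of fine-grained 3D awareness in open-world robotic manipulation by proposing LAMP, which lifts image editing into a 3D prior to extract geometry-aware 3D transformation representations, achieving zero-shot generalization.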

摘要翻译

在开放世界中实现类人的泛化能力仍然是机器人操作领域的根本性挑战。现有的基于学习的方法，包括强化学习、模仿学习和视觉-语言-动作模型（Vision-Language-Action Models, VLAs），在处理新任务和未见环境时常常面临困难。另一个有前景的方向是探索可泛化的表征，以捕捉开放世界操作所需的细粒度空间与几何关系。尽管大语言模型（Large-Language-Models, LLMs）和视觉语言模型（Vision-Language-Models, VLMs）能够基于语言或标注的二维表征提供强大的语义推理能力，但其有限的三维感知能力制约了它们在细粒度操作任务中的应用。为解决这一问题，我们提出了LAMP，该方法将图像编辑提升为三维先验，以提取物体间的三维变换作为连续、几何感知的表征。我们的核心洞见在于，图像编辑过程本身编码了丰富的二维空间线索，将这些隐含线索提升至三维变换，能为开放世界操作提供细粒度且精确的指导。大量实验表明，LAMP 能够生成精确的三维变换，并在开放世界操作中实现了强大的零样本泛化能力。项目页面：https://zju3dv.github.io/LAMP/。

摘要 (Abstract)

Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.

Keywords: robotic manipulation, 3D priors, image-editing, open-world, zero-shot generalization, geometry-aware representations, vision-language-action models, spatial relations

198. ❌ SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Authors: Wenli Zhang, Xianglong Shi, Sirui Zhao, Xinqi Chen, Guo Cheng, Yifan Xu, Tong Xu, Yong Liao Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

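Scoring rationale: The paper studies multimodal adversarial attacks on audio-driven talking-head generation, which falls under computer vision, multimedia security, and adversarial machine learning. It has no direct connection to any scoring keyword (these mainly concern the principles of large models, training methods, inference optimization, alignment, and scientific applications), so all keywords score 0.

!!! tip deepseek-chat TL;DR

Targeting diffusion-based audio-driven talking-head generation models, the paper proposes SyncBreaker, a stage-aware multimodal adversarial attack framework that jointly perturbs portrait and audio inputs to effectively degrade lip synchronization and facial dynamics while preserving the perceptual quality of the inputs.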

摘要翻译

基于扩散模型的音频驱动说话头生成技术能够实现逼真的肖像动画，同时也带来了滥用风险，例如欺诈和错误信息传播。现有的保护方法大多局限于单一模态，无论是仅针对图像还是仅针对音频的攻击都无法有效抑制语音驱动的面部动态。为弥补这一不足，我们提出了SyncBreaker——一种阶段感知的多模态保护框架，该框架在特定模态感知约束下联合扰动肖像与音频输入。我们的核心贡献包括两方面：首先，针对图像流，我们引入了结合多区间采样（Multi-Interval Sampling, MIS）的消解监督机制，通过聚合来自多个去噪区间的引导信号，将生成过程导向静态参考肖像；其次，针对音频流，我们提出了跨注意力干扰（Cross-Attention Fooling, CAF）方法，以抑制特定区间内音频条件化的跨注意力响应。两个流经独立优化，在推理阶段结合以实现灵活部署。我们在白盒主动保护场景下评估SyncBreaker。大量实验表明，相较于强大的单模态基线方法，SyncBreaker能更有效地降低唇部同步性与面部动态质量，同时保持输入感知质量，并在净化攻击下保持鲁棒性。代码：https://github.com/kitty384/SyncBreaker。

摘要 (Abstract)

Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.

Keywords: audio-driven talking head generation, multimodal adversarial attacks, diffusion models, lip synchronization, perceptual constraints, cross-attention fooling, stage-aware protection, proactive protection

199. ❌ BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Authors: Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia, Dongsheng Luo, Kailun Yang, Zhiyong Li, Yaonan Wang Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

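Scoring rationale: BLaDA is a zero-shot framework that parses natural-language instructions into structured constraints for dexterous robotic manipulation. Its core innovations are a language-to-action reasoning chain and an interpretable modular architecture. Highly relevant keywords: "Chain of Thought" (the paper explicitly builds an interpretable reasoning chain), "LLM Agents" (the framework is essentially a language-driven autonomous agent), "Mechanistic Interpretability" (it emphasizes interpretability and modular design), and "Tool Use" (it converts language instructions into control constraints). "Large Language Models" is relevant (a language model parses the instructions), as is "System 2 Thinking" (deep spatial reasoning is involved). "AI for Science" is somewhat relevant (an application of AI in robotics). The other keywords (such as MoE, quantization, and RAG) are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study addresses how to turn open-vocabulary natural-language instructions into interpretable perceptual and control constraints for functional dexterous manipulation in unstructured environments. The proposed BLaDA framework builds a reasoning chain of language parsing, spatial localization, and execution decoding, significantly improving the accuracy and success rate of functional manipulation.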

摘要翻译

在非结构化环境中，实现功能性灵巧抓取需要语义理解、精确的三维功能定位与物理可解释执行的紧密结合。模块化分层方法相比端到端视觉语言动作模型更具可控性与可解释性，但现有方法仍依赖预定义的功能可供性标签，且缺乏功能性灵巧操作所需的紧密语义-位姿耦合。为此，我们提出BLaDA（在3DGS场中桥接语言与灵巧动作），这是一个可解释的零样本框架，能够将开放词汇指令落地为功能性灵巧操作的感知与控制约束。BLaDA通过知识引导语言解析模块首先将自然语言解析为结构化的六元组操作约束，从而建立可解释的推理链。为实现位姿一致的空间推理，我们提出三角功能点定位模块，该模块利用三维高斯溅射作为连续场景表示，并在三角几何约束下识别功能区域。最后，三维关键点抓取矩阵变换执行模块将这些语义-几何约束解码为物理合理的腕部位姿与手指级指令。在复杂基准测试上的大量实验表明，BLaDA在不同类别与任务的功能可供性定位精度和功能性操作成功率方面均显著优于现有方法。代码将在https://github.com/PopeyePxx/BLaDA公开。

摘要 (Abstract)

In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic–pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.

Keywords: functional dexterous manipulation, language grounding, interpretable reasoning chain, 3D Gaussian Splatting, zero-shot framework, open-vocabulary instructions, spatial reasoning, robot manipulation

200. ❌ SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

Authors: Chensheng Dai, Shengjun Zhang, Min Chen, Yueqi Duan Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

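Scoring rationale: The paper focuses on applying 3D Gaussian Splatting (3DGS) to sparse-view surface reconstruction and proposes a feed-forward framework named SurfelSplat. All scoring keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or applications of AI in science. The paper, however, belongs to computer vision and 3D reconstruction, covering 3D Gaussian representations, surface reconstruction, and sparse-view processing, and has no direct connection to the large-model keywords provided (such as LLMs, MoE, RLHF, and RAG) or to AI for Science (bioinformatics, cheminformatics). It does not involve any language model, model training technique, reasoning method, agent system, or AI application in a specific scientific field. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes SurfelSplat, a feed-forward framework that learns efficient and generalizable Gaussian surfel representations from sparse-view images, addressing the dense input views and time-consuming per-scene optimization that existing 3D Gaussian Splatting methods require for surface reconstruction. Experiments on the DTU benchmark show results comparable to state-of-the-art methods, with Gaussian surfels predicted within 1 second, a 100x speedup.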

摘要翻译

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）在三维场景重建中展现出卓越的性能。除新颖视角合成外，该方法在多视角表面重建领域亦显示出巨大潜力。现有研究多采用基于优化的重建流程，能够实现精确且完整的表面提取，但这些方法通常需要密集的输入视角，且针对每个场景的优化耗时较长。为克服这些局限，本文提出SurfelSplat——一种前馈式框架，能够从稀疏视角图像中生成高效且可泛化的像素对齐高斯面元表示。我们观察到，传统前馈结构难以恢复高斯面元的精确几何属性，因为像素对齐基元的空间频率超过了奈奎斯特采样率。为此，我们基于奈奎斯特采样定理提出了一种跨视角特征聚合模块。具体而言，我们首先利用空间采样率引导的低通滤波器调整高斯面元的几何形态，随后将滤波后的面元投影至所有输入视角以获取跨视角特征关联。通过专门设计的特征融合网络处理这些关联信息，最终能够回归出具有精确几何结构的高斯面元。在DTU重建基准上的大量实验表明，本模型取得了与先进方法相当的结果，且能在1秒内预测出高斯面元，在无需昂贵单场景训练的前提下实现了百倍的速度提升。

摘要 (Abstract)

3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and time-consuming per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves results comparable to state-of-the-art methods and predicts Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.

Keywords: 3D Gaussian Splatting, surface reconstruction, sparse-view images, feed-forward framework, Gaussian surfel representations, cross-view feature aggregation, Nyquist sampling theorem, DTU benchmark
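
The "sampling rate-guided low-pass filter" in the abstract can be made concrete with a generic sketch. The code below is not SurfelSplat's module. It assumes a pinhole camera, an isotropic Gaussian primitive, and a Gaussian filter whose width is tied to the pixel footprint; all function and parameter names (`pixel_footprint`, `low_pass_scale`, `strength`) are hypothetical. It relies only on the fact that convolving two Gaussians adds their variances.

```python
import math

def pixel_footprint(depth, focal_px):
    """World-space spacing between adjacent pixel samples at a given depth
    (pinhole camera, focal length expressed in pixels)."""
    return depth / focal_px

def nyquist_frequency(depth, focal_px):
    """Highest spatial frequency representable at that sample spacing."""
    return 1.0 / (2.0 * pixel_footprint(depth, focal_px))

def low_pass_scale(scale, depth, focal_px, strength=1.0):
    """Std of an isotropic Gaussian after convolution with a Gaussian filter
    whose std is `strength` pixel footprints (variances add under convolution)."""
    sigma_filter = strength * pixel_footprint(depth, focal_px)
    return math.sqrt(scale ** 2 + sigma_filter ** 2)

# A primitive much smaller than a pixel is widened to about one footprint;
# a primitive much larger than a pixel is left almost unchanged.
tiny = low_pass_scale(1e-5, depth=2.0, focal_px=1000.0)
large = low_pass_scale(0.5, depth=2.0, focal_px=1000.0)
print(tiny, large)
```

Under these assumptions the filter stops a primitive from carrying detail finer than the input views can sample, which is the aliasing concern the abstract raises.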

201. ❌ MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

Authors: Junyao Gao, Sibo Liu, Jiaxing Li, Yanan Sun, Yuanpeng Tu, Fei Shen, Weidong Zhang, Cairong Zhao, Jun Zhang Venue/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08364v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

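Scoring rationale: The paper mainly studies building a style dataset with large generative models (such as text-to-image models). It touches on large-model applications (LLMs/Foundation Models) and data quality (Scaling Laws AND Data Quality), but the principles of large-model techniques are not its core subject. It mentions fine-tuning a style encoder (Post-training/SFT) and training a FLUX-based model (Pre-training/Domain Adaptation), so these keywords receive 5 points. The other keywords, such as MoE, SLMs, alignment, inference acceleration, and scientific AI, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes MegaStyle, a method that uses large generative models to build a high-quality, diverse style dataset, and trains a style encoder and a style transfer model on it, significantly improving performance on style transfer tasks.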

摘要翻译

本文提出MegaStyle，一种新颖且可扩展的数据构建流程，用于创建风格内一致、风格间多样且高质量的风格数据集。我们通过利用当前大型生成模型所具备的一致文本-图像风格映射能力实现这一目标，该模型能够根据给定的风格描述生成相同风格的图像。在此基础上，我们构建了一个包含17万条风格提示词和40万条内容提示词的多样化、均衡的提示词库，并通过内容-风格提示词组合生成了大规模风格数据集MegaStyle-1.4M。基于MegaStyle-1.4M，我们提出风格监督对比学习来微调风格编码器MegaStyle-Encoder，以提取具有表现力且风格特定的表征；同时训练了基于FLUX的风格迁移模型MegaStyle-FLUX。大量实验证明了保持风格内一致性、风格间多样性和高质量数据对风格数据集的重要性，以及所提出的MegaStyle-1.4M的有效性。此外，当在MegaStyle-1.4M上训练时，MegaStyle-Encoder和MegaStyle-FLUX能够提供可靠的风格相似度度量和可泛化的风格迁移效果，为风格迁移领域作出重要贡献。更多结果请访问项目网站https://jeoyal.github.io/MegaStyle/。

摘要 (Abstract)

In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse, and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset MegaStyle-1.4M via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder MegaStyle-Encoder for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity, and high quality for style datasets, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.

Keywords: style dataset, text-to-image, large generative models, style transfer, contrastive learning, FLUX, data curation, intra-style consistency
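
The abstract does not spell out its "style-supervised contrastive learning" objective. As a rough illustration only, the sketch below implements the widely used supervised contrastive (SupCon-style) loss in plain Python, treating images that share a style prompt as positives. Using this exact loss, the temperature value, and the names `supcon_loss` and `style_ids` are assumptions, not details taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, style_ids, temperature=0.1):
    """Supervised contrastive loss: embeddings with the same style id are
    pulled together, all other embeddings in the batch are pushed apart."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and style_ids[j] == style_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy batch: two tight clusters in 2D.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = supcon_loss(emb, [0, 0, 1, 1])     # labels agree with the clusters
mismatched = supcon_loss(emb, [0, 1, 0, 1])  # labels cut across the clusters
print(matched < mismatched)
```

The loss is small when same-style embeddings already sit close together and large when they do not, which is the property a style encoder would be trained to exploit.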

202. ❌ Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Authors: Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, Xirong Li Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

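Scoring rationale: The paper focuses on medical image analysis (fundus images), an AI for Science (bioinformatics) application, and its core contribution is training a multimodal large language model (MLLM), so it is highly relevant to 'Large Language Models' and 'AI for Science' (relevance 10/10 each). It explicitly uses 'post-train' and 'supervised finetuning (SFT)', so it is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10/10). It proposes a RAG-based method for generating knowledge-aware reasoning traces, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG' (10/10). Its emphasis on 'knowledge-aware reasoning' and 'reasoning traces' relates closely to 'Chain of Thought OR CoT Reasoning' (10/10). Other keywords such as MoE, SLMs, Scaling Laws, RLHF and PEFT are not covered in the paper and get relevance 0.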
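
To illustrate why this entry ends at 0.0 despite several 10/10 relevance ratings, the sketch below recomputes the total. It assumes that each row's score is weight × relevance, capped at the per-keyword maximum of 15 from the report configuration, and uses the report's pass-mark formula 5.0 + 0.8 × total weight. The function names and the way the cap is applied are illustrative assumptions, not the report generator's actual code.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Score of one table row: weight x relevance, capped (assumed formula)."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) for the five rows with non-zero relevance in the table above;
# the remaining 22 rows have relevance 0.0 and contribute nothing.
rows = [(0.0, 10.0)] * 5
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords, total configured weight 27.0

print(f"total = {total:.1f}, pass mark = {threshold:.1f}, passed = {total >= threshold}")
```

Because the table lists a weight of 0.0 for every keyword, each row contributes 0 regardless of its relevance, so the total stays below the 26.6 pass mark.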

!!! tip deepseek-chat TL;DR

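The paper tackles the challenge of training a high-performing fundus-reading multimodal large language model using only public data. It proposes a RAG-based method for generating knowledge-aware reasoning traces and augments the RLVR reward with a process reward; the resulting Fundus-R1 model clearly outperforms baseline models on multiple benchmarks.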

Abstract

Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.

Keywords: Fundus image understanding, Multimodal large language model (MLLM), Retrieval-Augmented Generation (RAG), Knowledge-aware reasoning, Supervised fine-tuning (SFT), Reinforcement learning with verifiable rewards (RLVR), Public datasets, Ophthalmic knowledge

203. ❌ Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

Authors: Richard Petersen, Fredrik Kahl, Jennifer Alvén Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08313v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

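Scoring rationale: The paper focuses on 3D medical image segmentation, specifically lung nodule segmentation, which falls under AI for Science (bioinformatics / medical imaging), so that keyword gets relevance 10/10. It uses pretrained rectified flow and predictor models and fine-tunes the predictor, which touches on pre-training and fine-tuning, so the Pre-training and Post-training keywords get 5/10. The remaining keywords concern large language models, reasoning, alignment, compression and similar techniques, which have no direct connection to the paper's medical image segmentation topic, and get 0.

!!! tip deepseek-chat TL;DR

The paper proposes a weakly supervised lung nodule segmentation method that uses training-free guidance of a 3D rectified flow model and fine-tunes only the predictor with image-level labels, improving on baseline methods on the LUNA16 dataset.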

Abstract

Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying size and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

Keywords: weakly-supervised segmentation, lung nodule segmentation, 3D medical images, rectified flow, generative foundation models, training-free guidance, image-level labels, LUNA16 dataset

204. ❌ GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

Authors: Yishen Liu, Hongcang Chen, Pengcheng Zhao, Yunfan Bao, Yuxi Tian, Jieming Zhang, Hao Chen, Zheng Zhi, Yongchun Liu, Ying Li, Dongpu Cao Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08301v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

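Scoring rationale: The paper addresses anomaly detection in computer vision and proposes a diffusion-based few-shot anomaly synthesis method. It is unrelated to the vast majority of keywords, which cover LLM principles, training methods, inference optimization, alignment, agent systems and so on. Only 'AI for Science OR Bioinformatics OR Cheminformatics' is loosely related (relevance 5/10): the application scenario, visual anomaly inspection for industrial quality control, can broadly be seen as AI applied to industrial or engineering science. However, the paper does not target bioinformatics or cheminformatics and does not explicitly frame itself as 'AI for Science', so the connection is weak.

!!! tip deepseek-chat TL;DR

To address the scarcity of real anomalous samples in industrial visual anomaly inspection, the paper proposes GroundingAnomaly, a few-shot anomaly image generation framework. A Spatial Conditioning Module and a Gated Self-Attention Module provide precise spatial control over synthesized anomalies and stable few-shot adaptation, and the method achieves state-of-the-art performance on multiple downstream tasks.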

Abstract

The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.

Keywords: anomaly synthesis, few-shot learning, diffusion models, industrial visual inspection, spatial conditioning, anomaly detection, MVTec AD, VisA
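
The abstract above says that conditioning tokens are injected into a frozen U-Net via gated attention layers, preserving pretrained priors. One common way to realize this, assumed here and not confirmed by the abstract, is a residual branch scaled by a learnable tanh gate that starts at zero. The sketch below uses single-head attention with identity projections and toy 2-D tokens; all names and values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (for brevity)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in tokens])
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


def gated_self_attention(visual, cond, gate):
    """Attend over [visual; cond], keep the visual part, add it through a tanh gate."""
    attended = self_attention(visual + cond)[: len(visual)]
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(row, att)]
            for row, att in zip(visual, attended)]


visual = [[1.0, 0.0], [0.0, 1.0]]  # features from the frozen backbone (toy values)
cond = [[0.5, 0.5]]                # conditioning token, e.g. derived from a semantic map
print(gated_self_attention(visual, cond, gate=0.0))  # gate closed
print(gated_self_attention(visual, cond, gate=1.0))  # gate open
```

With the gate at zero the layer returns the backbone features unchanged, which is one way a frozen model's behaviour can be kept intact at the start of few-shot adaptation.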

205. ❌ CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild

Authors: Siyuan Yao, Hao Sun, Ruiqi Yu, Xiwei Jiang, Wenqi Ren, Xiaochun Cao Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08287v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

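Scoring rationale: The paper "CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild" belongs to computer vision, specifically benchmark construction and evaluation for video camouflaged object detection (VCOD). It covers dataset creation, evaluation of deep learning algorithms and analysis of challenges, but not large models, new deep learning principles, or applications of large models in other domains. All keywords concern large models, deep learning principles, or AI applications in science, whereas this is purely a computer vision dataset paper, so none of them applies.

!!! tip deepseek-chat TL;DR

The paper builds CAMotion, a high-quality benchmark dataset for camouflaged moving object detection in video, and evaluates existing SOTA models on it to advance research in this area.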

Abstract

Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategy. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark covers a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edge, occlusion, motion blur, and shape complexity, etc. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses on the camouflaged object’s motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn, we hope that our CAMotion can lead to further advancements in the research community.

Keywords: camouflaged object detection, video dataset, benchmark, deep learning, computer vision, motion analysis, evaluation, challenging scenarios

206. ❌ Revisiting Radar Perception With Spectral Point Clouds

Authors: Hamza Alsharif, Jing Gu, Pavol Jancura, Satish Ravindran, Gijs Dubbelman Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

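Scoring rationale: The paper studies radar perception models, comparing radar point clouds and spectra as input representations, and proposes enriching point clouds with spectral information. All scoring keywords concern large language models (LLMs), deep learning principles, or specific applications of AI in science (such as bioinformatics). This paper is about radar signal processing and computer vision, that is, sensor perception, and has no direct link to LLMs, new deep learning principles, or AI for Science (such as bioinformatics or cheminformatics). All keywords therefore get relevance 0.

!!! tip deepseek-chat TL;DR

The paper challenges the conventional view that dense spectral inputs outperform sparse point clouds in radar perception. By introducing the spectral point cloud paradigm and injecting spectral information, it shows that enriched point clouds can match or even surpass a dense range-Doppler benchmark, offering a candidate for a unified radar perception input representation.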

Abstract

Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that, point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches, that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.

Keywords: radar perception, spectral point clouds, range-Doppler spectra, input representation, sensor robustness, performance benchmark, foundation models, unified perception
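
The abstract above treats a point cloud as a sparse, compressed representation of the radar spectrum that can be enriched with spectral information. As a rough illustration only, the sketch below keeps the strongest cells of a toy range-Doppler power map as points and attaches the mean power of the surrounding patch as an extra feature. The top-k selection rule, the patch-mean enrichment and all names are assumptions; the abstract does not specify how the authors extract or enrich points.

```python
def spectral_point_cloud(rd_map, k, patch=1):
    """Keep the k strongest range-Doppler cells as points with spectral features."""
    rows, cols = len(rd_map), len(rd_map[0])
    cells = [(rd_map[r][c], r, c) for r in range(rows) for c in range(cols)]
    cells.sort(reverse=True)  # strongest power first
    points = []
    for power, r, c in cells[:k]:
        # Enrichment (assumed): mean power of the patch around the selected cell.
        neigh = [rd_map[i][j]
                 for i in range(max(0, r - patch), min(rows, r + patch + 1))
                 for j in range(max(0, c - patch), min(cols, c + patch + 1))]
        points.append({"range_bin": r, "doppler_bin": c, "power": power,
                       "patch_mean": sum(neigh) / len(neigh)})
    return points


# Toy range-Doppler power map (rows: range bins, columns: Doppler bins).
rd_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 9.0, 0.3, 0.1],
    [0.1, 0.3, 0.2, 4.0],
]
for point in spectral_point_cloud(rd_map, k=2):
    print(point)
```

Varying `k` changes the point cloud density, which corresponds loosely to the density levels the abstract compares against the dense range-Doppler benchmark.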

207. ❌ Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising

Authors: Panagiotis Gkotsis, Athanasios A. Rontogiannis Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	2.0/10	0.0

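Scoring rationale: The paper focuses on hyperspectral image denoising with Deep Image Prior (DIP) and proposes combining robust data fidelity with explicit sensitivity regularization to mitigate overfitting. Its core is deep learning applied to image processing (hyperspectral image denoising in particular), which belongs to computer vision and image processing rather than large language models (LLMs) or large-model techniques. All keywords relate directly to LLMs, large-model principles, training and alignment methods, inference optimization, agent systems and the like, none of which the paper covers. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because hyperspectral image analysis is used in scientific fields such as remote sensing and environmental science. Since the paper concentrates on method improvement rather than broad scientific AI applications, it gets relevance 2/10 (low). All other keywords are unrelated and get 0.

!!! tip deepseek-chat TL;DR

The paper proposes combining robust data fidelity with explicit sensitivity regularization to mitigate overfitting of Deep Image Prior in hyperspectral image denoising. Experiments show that the method effectively prevents overfitting and achieves better denoising performance than existing DIP-based methods.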

Abstract

Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by jointly combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.

Keywords: Deep Image Prior, Hyperspectral Image Denoising, Overfitting Prevention, Robust Data Fidelity, Sensitivity Regularization, Unsupervised Deep Learning, Inverse Imaging Problems, Image Restoration
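
The abstract above names a Smooth $\ell_1$ data term as the robust fidelity component but gives no formula. The sketch below uses the common Smooth L1 definition (quadratic for small residuals, linear for large ones, with threshold `beta`) to show why such a term is less sensitive to sparse outliers than a squared error. The threshold value, function names and toy pixel values are assumptions, and the divergence-based regularizer and input optimization from the paper are not shown.

```python
def smooth_l1(residual, beta=1.0):
    """Quadratic for |r| < beta, linear beyond (common Smooth L1 definition)."""
    a = abs(residual)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def data_term(output, noisy, beta=1.0):
    """Mean Smooth L1 discrepancy between network output and noisy observation."""
    return sum(smooth_l1(o - y, beta) for o, y in zip(output, noisy)) / len(output)


def mse(output, noisy):
    """Mean squared error, shown for comparison."""
    return sum((o - y) ** 2 for o, y in zip(output, noisy)) / len(output)


clean = [0.5, 0.5, 0.5, 0.5]
noisy = [0.52, 0.47, 5.5, 0.51]  # third pixel hit by a sparse outlier
print(data_term(clean, noisy), mse(clean, noisy))
```

The outlier contributes linearly to the Smooth L1 term but quadratically to the squared error, so a network fitting the noisy image has less incentive to reproduce sparse or stripe noise.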

208. ❌ $\oslash$ Source Models Leak What They Shouldn’t $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

Authors: Arnav Devalapally, Poornima Jain, Kartik Srinivas, Vineeth N. Balasubramanian Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08238v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

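Scoring rationale: The paper addresses domain adaptation and machine unlearning in computer vision, specifically privacy protection in source-free domain adaptation (SFDA). Its core contribution is the SCADA-UL setting together with an unlearning method that uses adversarial optimization to forget knowledge of source-exclusive classes. Although the paper involves deep learning models, all keywords target large language models (LLMs) and related techniques, while this work studies vision models. Only the 'Domain Adaptation' keyword is therefore highly relevant (10/10); all other keywords are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper studies how vision models in source-free domain adaptation inadvertently leak knowledge of source-exclusive classes. For the proposed SCADA-UL setting, it introduces a machine unlearning method based on adversarial optimization and a rescaled labeling strategy, which unlearns effectively and protects the privacy of source-domain data.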

Abstract

The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and to one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA

Keywords: Domain Adaptation, Machine Unlearning, Source-Free Domain Adaptation, Privacy Protection, Adversarial Optimization, Zero-Shot Transfer, Vision Models, Source-Exclusive Classes

209. ❌ Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges

Authors: Saniya M. Deshmukh, Kailash A. Hambarde, Hugo Proença Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

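Scoring rationale: This paper is a survey of cross-domain object detection (CDOD) that discusses how deep-learning object detectors generalize across domains, the adaptation strategies in use, and the remaining challenges. It is unrelated to almost all of the keywords, particularly those on large-model techniques. It is only partially related to "Pre-training OR Continual Pre-training OR Domain Adaptation", because it deals with domain adaptation, though for conventional deep-learning object detectors rather than large models. No other keyword appears in the title or abstract, and the topic is unrelated to core large-model techniques such as LLMs, MoE, SLMs, alignment, reasoning, and agents.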

!!! tip deepseek-chat TL;DR

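This survey systematically analyzes cross-domain object detection (CDOD). It organizes existing methods through a multi-stage problem formulation and a taxonomy, shows how domain shift propagates across detection stages and why this complicates adaptation, and outlines future research directions to guide the development of more robust detection systems.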

Abstract Translation (Chinese)

在源域上训练的目标检测模型部署于未见目标域时，常因传感条件、环境及数据分布等多种差异而出现显著的性能下降。因此，尽管基于深度学习的检测技术近期取得了突破性进展，跨域目标检测（Cross-Domain Object Detection, CDOD）仍是一个关键的研究领域。此外，现有文献较为零散，缺乏对域偏移背后结构性挑战及适应策略有效性的统一视角。本综述对CDOD进行了全面而系统的分析。我们首先从问题定义出发，强调域偏移下目标检测的多阶段特性。随后，通过概念性分类法对现有方法进行梳理，依据适应范式、建模假设及流程组件对各类方法进行归类。进一步，我们分析了域偏移如何在检测阶段间传播，并探讨了为何目标检测中的适应本质上比分类任务更为复杂。此外，本文回顾了常用数据集、评估协议和基准测试实践。最后，我们指出了当前面临的关键挑战，并展望了未来潜在的研究方向。整体上，本综述旨在为理解CDOD提供一个统一框架，并指导开发更具鲁棒性的检测系统。

Abstract

Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, due to various kinds of variations, such as sensing conditions, environments and data distributions. Hence, regardless the recent breakthrough advances in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start upon a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Cohesively, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.

Keywords: cross-domain object detection, domain shift, object detection, adaptation strategies, deep learning, survey, generalization, robust detection systems

210. ❌ Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Authors: Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Armstrong Aboah Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08212v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

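Scoring rationale: The paper applies vision-language foundation models to pavement condition assessment, a specific engineering domain. Its core contribution is improving performance on specialized tasks through domain-specific instruction tuning. It is therefore highly relevant to foundation models, instruction tuning, supervised fine-tuning (SFT), and AI applications in science and engineering (10 points each). The paper also evaluates and improves the model on reasoning tasks, so it is partially related to multi-step reasoning and in-depth reasoning (5 points each). It does not cover the techniques named by the other keywords (e.g., MoE, quantization, RAG, RLHF), so those keywords score 0.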

!!! tip deepseek-chat TL;DR

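This study examines whether domain-specific instruction tuning can enable a vision-language foundation model (PaveGPT) to perform comprehensive, accurate automated pavement condition assessment, a specialized engineering task. The results show that the approach significantly improves performance on perception, understanding, and reasoning tasks and produces outputs that comply with engineering standards.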

Abstract Translation (Chinese)

通用视觉语言模型在日常领域展现出强大性能，但在需要精确术语、结构化推理及符合工程标准的专业技术领域仍存在局限。本研究探讨了通过领域特定的指令微调能否实现基于视觉语言模型的综合路面状况评估。通过整合九个异构路面数据集的标注，我们构建了PaveInstruct数据集，其中包含涵盖32种任务类型的278,889个图像-指令-响应三元组。基于该数据集训练的路面基础模型PaveGPT，在感知、理解与推理任务上与前沿视觉语言模型进行了对比评估。指令微调显著改变了模型能力，在空间定位、推理和生成任务中实现了超过20%的性能提升，并能生成符合ASTM D6433标准的输出。这些成果使交通管理机构能够部署统一的对话式评估工具，替代多个专用系统，从而简化工作流程并降低专业技术门槛。该方法为开发面向基础设施领域（包括桥梁检测、铁路维护和建筑状况评估）的指令驱动型人工智能系统提供了可行路径。

Abstract

General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

Keywords: Vision-Language Foundation Models, Automated Pavement Condition Assessment, Domain-specific Instruction Tuning, PaveInstruct Dataset, PaveGPT, Engineering Standards Compliance, Infrastructure Domains, Conversational Assessment Tools

211. ❌ SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

Authors: You Hu, Chenzhuo Zhao, Changfa Mo, Haotian Liu, Xiaobai Li Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08211v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

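Scoring rationale: The paper presents a benchmark for detecting AI-generated scientific figures, which is an application of AI in science (AI for Science), so that keyword receives 8 points. The paper uses an agent-based data pipeline, which is partially related to LLM Agents (5 points). It also involves multimodal understanding and generation, which relates broadly to Large Language Models/Foundation Models as the basis of multimodal models (5 points). The remaining keywords concern large-model principles, training methods, and inference optimization, which are not directly related to the paper's detection benchmark, so they score 0.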

!!! tip deepseek-chat TL;DR

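The paper introduces SciFigDetect, the first benchmark for detecting AI-generated scientific figures. It builds a dataset of aligned real–synthetic figure pairs and evaluates existing detectors, finding that current methods fall well short in zero-shot transfer, cross-generator generalization, and robustness to image degradation.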

Abstract Translation (Chinese)

现代多模态生成器已能产出接近可发表质量的科学图表，这对视觉取证与研究诚信构成了新挑战。与传统人工智能生成的自然图像不同，科学图表具有结构化、文本密集且与学术语义紧密对齐的特点，使其成为一个独特而困难的检测目标。然而，现有的人工智能生成图像检测基准与方法几乎完全针对开放域图像开发，导致这一领域在很大程度上尚未被探索。我们提出了首个针对人工智能生成科学图表的检测基准。为构建该基准，我们开发了一种基于智能体的数据流程：检索已获授权的源论文，对论文文本与图表进行多模态理解，构建结构化提示，合成候选图表，并通过基于审阅的优化循环进行筛选。最终形成的基准覆盖了多种图表类别、多种生成来源以及对齐的真实-合成配对数据。我们在零样本、跨生成器及图像退化三种设置下对代表性检测器进行了基准测试。结果表明，现有方法在零样本迁移中表现严重不足，表现出强烈的生成器特定过拟合现象，且在常见后处理干扰下依然脆弱。这些发现揭示了当前人工智能生成图像检测能力与高质量科学图表新兴分布之间存在显著差距。我们希望该基准能为未来关于鲁棒且可泛化的科学图表取证研究奠定基础。数据集可通过 https://github.com/Joyce-yoyo/SciFigDetect 获取。

Abstract

Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources and aligned real–synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

Keywords: AI-generated scientific figure detection, benchmark, multimodal understanding, agent-based data pipeline, zero-shot transfer, cross-generator generalization, visual forensics, research integrity

Authors: Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi, Hao Chen, Bin Qin, Yongjie Yang, Zhenbo Luo, Chunhua Shen Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08209v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

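Scoring rationale: The paper focuses on a reinforcement learning post-training paradigm for multimodal (video-audio) models, built around a self-supervised temporal reordering proxy task, and is unrelated to most keywords. The only highly relevant keyword is "Post-training OR Supervised Fine-tuning OR SFT" (10 points), because the paper explicitly extends the reinforcement learning post-training paradigm to multimodal models. Other keywords, such as LLMs, MoE, Scaling Laws, RAG, CoT, Agents, and AI for Science, are not addressed.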

!!! tip deepseek-chat TL;DR

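This study proposes OmniJigsaw, a framework that extends reinforcement learning post-training to multimodal models through a temporal reordering proxy task, yielding substantial gains in video, audio, and collaborative reasoning.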

Abstract Translation (Chinese)

为将强化学习后训练范式扩展至全模态模型，以同步增强视频-音频理解与协同推理能力，我们提出OmniJigsaw——一个基于时序重排代理任务的通用自监督框架。该范式以打乱的视听片段时序重建为核心，通过三种差异化策略系统整合视觉与听觉信号，驱动跨模态融合：联合模态整合、样本级模态选择与片段级模态掩蔽。鉴于此类代理任务的效果本质上与拼图质量相关，我们设计了一个由粗到精的两阶段数据过滤流程，使OmniJigsaw能高效适配海量无标注全模态数据。我们的分析揭示了联合模态整合中存在的“双模态捷径现象”，并证明细粒度的片段级模态掩蔽能有效缓解此问题，其性能优于样本级模态选择。在15个基准测试上的广泛评估显示，该方法在视频、音频及协同推理任务中均取得显著提升，验证了OmniJigsaw作为一种可扩展的自监督全模态学习范式的有效性。

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a “bi-modal shortcut phenomenon” in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.

Keywords: omni-modal reasoning, reinforcement learning post-training, self-supervised learning, temporal reordering, audio-visual integration, modality masking, collaborative reasoning, proxy task
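
The temporal reordering proxy task described in the OmniJigsaw abstract can be illustrated with a toy data-construction sketch. It assumes that a clip is a (video, audio) pair of placeholder strings, that the target is the permutation restoring chronological order, and that clip-level modality masking hides exactly one modality of a clip at random. The function name `make_jigsaw_sample`, the `mask_prob` parameter, and the target format are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional, Tuple

Clip = Tuple[Optional[str], Optional[str]]  # (video, audio); None means masked


def make_jigsaw_sample(
    clips: List[Clip], mask_prob: float, rng: random.Random
) -> Tuple[List[Clip], List[int]]:
    """Shuffle clips and optionally mask one modality per clip.

    Returns the shuffled (and possibly masked) clips plus the target:
    for each chronological position, the slot of that clip after shuffling.
    """
    order = list(range(len(clips)))
    rng.shuffle(order)  # order[i] = original index of the clip shown in slot i

    shuffled: List[Clip] = []
    for original_index in order:
        video, audio = clips[original_index]
        if rng.random() < mask_prob:
            # Assumed masking rule: drop exactly one modality of this clip.
            if rng.random() < 0.5:
                video = None
            else:
                audio = None
        shuffled.append((video, audio))

    # Where each chronological clip ended up after shuffling.
    target = [order.index(t) for t in range(len(clips))]
    return shuffled, target
```

Indexing the shuffled list with the target (`[shuffled[j] for j in target]`) restores chronological order, which is the answer a model trained on this proxy task would have to produce.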

213. ❌ Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

Authors: Yunxiang Peng, Mengmeng Ma, Ziyu Yao, Xi Peng Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08192v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

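Scoring rationale: The paper focuses on evaluating the generalization of Vision Transformers and proposes metrics based on internal circuits. It is unrelated to the vast majority of keywords, which cover large-model techniques, training methods, inference optimization, and application domains. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI", because the paper assesses generalization by analyzing the model's internal mechanisms (circuit discovery), which falls within explainable AI. Large models are not its core subject, however, so the keyword is given 10 points (highly relevant but not core).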

!!! tip deepseek-chat TL;DR

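To evaluate how Vision Transformers generalize under distribution shift, the paper proposes two label-free proxy metrics based on internal circuit analysis. Both improve correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively.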

Abstract Translation (Chinese)

可靠的泛化度量指标是评估机器学习模型的基础。尤其在标注目标数据稀缺的高风险应用中，评估模型在分布偏移下的泛化性能是一项迫切需求。我们聚焦于两个实际场景：（1）部署前，如何为未标注目标数据选择最佳模型？（2）部署后，如何在分布偏移下监测模型性能？这两种情况的核心需求都是可靠且无需标签的代理指标。然而现有代理指标（如模型置信度或准确率线性关系）往往不可靠，因为它们仅评估模型输出而忽略了产生这些输出的内部机制。我们通过引入新视角来解决这一局限：利用模型的内在运作机制——即电路（circuits）——作为泛化性能的预测性度量。借助电路发现技术，我们提取内部表征间的因果交互作为电路，并由此推导出两种分别针对上述实际场景的指标。（1）部署前，我们引入依赖深度偏差（Dependency Depth Bias），用于衡量不同模型在目标数据上的泛化能力。（2）部署后，我们提出电路偏移分数（Circuit Shift Score），用于预测模型在不同分布偏移下的泛化表现。在多种任务中，这两个指标均显示出与泛化性能显著提升的相关性，分别平均优于现有代理指标13.4%和34.1%。我们的代码公开于https://github.com/deep-real/GenCircuit。

Abstract

Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models’ generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models’ generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model’s generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep-real/GenCircuit.

Keywords: generalization metrics, Vision Transformers, distribution shift, circuit discovery, internal mechanisms, label-free proxy, Dependency Depth Bias, Circuit Shift Score
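
The abstract judges a label-free proxy by how well it correlates with true generalization performance. One common way to quantify this is a rank correlation between proxy values and measured accuracies across models or shifts. The sketch below uses Spearman correlation as an assumption, since the abstract does not say which correlation measure the authors use; the function names are hypothetical.

```python
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(proxy: List[float], accuracy: List[float]) -> float:
    """Rank correlation between a proxy metric and measured accuracy."""
    return pearson(ranks(proxy), ranks(accuracy))
```

A proxy that orders models exactly as their target accuracy does scores 1.0, and one that reverses the order scores -1.0.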

214. ❌ On the Global Photometric Alignment for Low-Level Vision

Authors: Mingjia Li, Tianle Du, Hainuo Wang, Qiming Hu, Xiaojie Guo Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08172v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

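Scoring rationale: The paper addresses photometric alignment in low-level vision and proposes a new loss function (PAL) to handle photometric inconsistency in training data. The work belongs to computer vision, specifically image enhancement and restoration. All scoring keywords concern large models, deep-learning principles, or AI-for-science applications, none of which the paper touches: it does not discuss large models, language models, training techniques, reasoning methods, AI agents, or scientific AI applications. The relevance of every keyword is therefore 0.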

!!! tip deepseek-chat TL;DR

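The paper studies the optimization problem caused by photometric inconsistency in the training data of low-level vision tasks and proposes a new Photometric Alignment Loss (PAL), which reduces irrelevant photometric differences through closed-form affine color alignment. PAL consistently improves metrics and generalization across multiple tasks, datasets, and architectures.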

Abstract Translation (Chinese)

有监督的低层视觉模型依赖于针对配对参考图像的逐像素损失函数，然而配对训练集存在每对图像间的光度不一致性，例如不同图像对需要不同的全局亮度、色彩或白平衡映射。这种不一致性可能源于任务固有的光度传递（如低光增强）或无意的采集偏移（如去雨），无论何种情况都会导致优化病态。标准重建损失会将不成比例的梯度预算分配给相互冲突的每对光度目标，从而挤占内容恢复的优化空间。本文深入研究了该问题，并证明在最小二乘分解下，预测目标残差的光度分量与结构分量是正交的，且空间密集的光度分量主导了梯度能量。基于此分析，我们提出了光度对齐损失（Photometric Alignment Loss, PAL）。这一灵活的监督目标通过闭式仿射色彩对齐来消除干扰性的光度差异，同时保留与恢复任务相关的监督信号，仅需协方差统计量与可忽略开销的微型矩阵求逆运算。在6项任务、16个数据集和16种架构上的实验表明，PAL能持续提升评价指标与泛化性能。具体实现详见附录。

Abstract

Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency, say, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.

Keywords: photometric alignment, low-level vision, supervised learning, loss function, image enhancement, color alignment, optimization pathology, generalization improvement
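
The idea behind PAL, aligning the prediction's global photometry to the target in closed form before measuring the error, can be sketched for a single color channel. This is a simplified illustration and not the paper's implementation: it fits an independent gain and bias per channel by least squares, whereas the abstract describes a full affine color alignment that involves a small matrix inversion. The names `affine_fit` and `aligned_mse` are hypothetical.

```python
from typing import List, Tuple


def affine_fit(
    pred: List[float], target: List[float], eps: float = 1e-8
) -> Tuple[float, float]:
    """Least-squares gain a and bias b minimizing sum((a * pred + b - target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    var = sum((p - mean_p) ** 2 for p in pred) / n
    gain = cov / (var + eps)  # eps guards against a constant prediction
    bias = mean_t - gain * mean_p
    return gain, bias


def aligned_mse(pred: List[float], target: List[float]) -> float:
    """MSE for one channel after removing the best global gain/bias mismatch."""
    gain, bias = affine_fit(pred, target)
    return sum((gain * p + bias - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

If the prediction differs from the target only by a global brightness or contrast change, `aligned_mse` is close to zero while a plain MSE is not, which is the sense in which such a loss discounts nuisance photometric discrepancy. Whether gradients should flow through the fitted gain and bias during training is a design choice the abstract does not specify.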

215. ❌ T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

Authors: Pranjal Khadka Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08167v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

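Scoring rationale: The paper focuses on medical image segmentation and proposes a temporal adapter for Vision-Language Models (VLMs) to address noise and loss of continuity when segmenting the 2D slices of 3D medical scans. Its core is computer vision and medical image analysis, not innovation in large language models (LLMs) or deep-learning principles. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant, because the paper applies AI to biomedicine (specifically medical imaging), which falls under "AI for Science". The other 26 keywords mainly concern technical details of large language models (e.g., training, inference, alignment, agents) and are unrelated to the paper's topic of vision-language models and medical segmentation, so they score 0. The weighted total is computed as: 10.0 (AI for Science relevance) × 1.0 (weight) = 10.0.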

!!! tip deepseek-chat TL;DR

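Vision-Language Models produce noisy segmentations on 3D medical images because they ignore inter-slice continuity. To address this, the paper proposes a lightweight temporal adapter that injects adjacent-slice context to improve segmentation accuracy, achieving significant gains on multiple datasets and better cross-modality generalization.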

Abstract Translation (Chinese)

传统医学图像分割通常依赖于全监督的3D架构，这些架构需要临床专家提供大量密集的体素级标注，这是一个成本极其高昂的过程。视觉语言模型（Vision Language Models, VLMs）提供了一种强大的替代方案，它利用了从数十亿图像中学到的广泛视觉语义表征。然而，当独立应用于3D扫描的2D切片时，这些模型通常会产生噪声大且在解剖学上不合理的分割结果，违反了解剖结构固有的连续性。我们提出了一种时序适配器来解决这一问题，它将相邻切片上下文直接注入到模型的视觉令牌表征中。该适配器包含一个在令牌级别上跨固定上下文窗口进行注意力计算的时序变换器、一个细化切片内表征的空间上下文块，以及一个平衡时序特征与单切片特征的自适应门控机制。在FLARE22数据集的30个标注体积上进行训练后，我们的方法在13个腹部器官上实现了0.704的平均Dice分数，相比未使用时序上下文的基线VLM提升了+0.206。在BTCV和AMOS22数据集上的零样本评估分别取得了+0.210和+0.230的稳定提升，平均跨域性能下降从38.0%减少至24.9%。此外，在AMOS22 MRI数据上进行跨模态评估时（两个模型均未接受任何MRI监督），我们的方法实现了0.366的平均Dice分数，优于仅在CT上训练的全监督3D基线模型（DynUNet，0.224），这表明CLIP的视觉语义表征相比卷积特征，能更优雅地泛化到不同的成像模态之间。

Abstract

Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model’s visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP’s visual semantic representations generalize more gracefully across imaging modalities than convolutional features.

Keywords: Medical image segmentation, Vision Language Models, Temporal adapter, 3D scan, Anatomical continuity, Cross-domain evaluation, Cross-modality generalization, Dice score
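
The Dice scores quoted in the abstract follow the standard overlap definition, Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch for flat masks is shown below. The function names, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ.

```python
from typing import List


def dice(pred: List[int], truth: List[int]) -> float:
    """Dice coefficient for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred: List[int], truth: List[int], labels: List[int]) -> float:
    """Average per-label Dice over a multi-class label map (one label per organ)."""
    scores = [
        dice([int(p == k) for p in pred], [int(t == k) for t in truth])
        for k in labels
    ]
    return sum(scores) / len(scores)
```

Averaging the per-organ scores in this way is how a single "mean Dice across 13 abdominal organs" figure is typically obtained.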

216. ❌ Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

Authors: Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08147v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

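Scoring rationale: The paper focuses on audio-visual representation learning and proposes the TG-DP framework, which decouples the reconstruction and alignment objectives to improve multimodal pre-training. Its core is an application of deep learning to the audio-visual domain; it does not involve large language models (LLMs) or any of the listed techniques (MoE, RLHF, and so on). Only 'Pre-training' is partially relevant (the abstract mentions 'large-scale audio-visual pretraining'); all other keywords are unrelated.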

!!! tip deepseek-chat TL;DR

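The paper proposes a teacher-guided dual-path framework (TG-DP) that reduces semantic noise by decoupling the reconstruction and alignment objectives, improving audio-visual representation learning on zero-shot retrieval and linear probing.

Abstract (Chinese translation)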

视听表征学习的最新进展表明，将对比对齐与掩码重建相结合具有重要价值。然而，在单次前向传播中联合优化这些目标，会迫使对比分支依赖于为重建设计的随机可见图像块，而非跨模态对齐，从而引入语义噪声和优化干扰。我们提出TG-DP，一种教师引导的双路径框架，将重建与对齐解耦至独立的优化路径。通过分离两个分支的掩码机制，TG-DP使对比路径能够使用更适合跨模态对齐的可见性模式。教师模型进一步为该分支中可见标记的组织提供辅助引导，有助于减少干扰并稳定跨模态表征学习。TG-DP在零样本检索任务中取得了最先进的性能。在AudioSet数据集上，视频到音频检索的R@1从35.2%提升至37.4%，音频到视频检索的R@1从27.9%提升至37.1%。所学表征在语义上保持鲁棒性，在AS20K和VGGSound数据集上实现了最先进的线性探测性能。综上所述，我们的结果表明，解耦多模态目标并在对比路径中引入教师引导结构，为改进大规模视听预训练提供了一个有效框架。代码发布于https://github.com/wanglg20/TG-DP。

Abstract

Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.

Keywords: audio-visual representation learning, contrastive alignment, masked reconstruction, teacher-guided dual-path framework, semantic noise reduction, zero-shot retrieval, multimodal pretraining, cross-modal alignment
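
The retrieval results above are reported as R@1 in both directions (video-to-audio and audio-to-video). A minimal sketch of how Recall@k is typically computed from a similarity matrix follows. It assumes that query `i` and candidate `i` form the ground-truth pair; the matrix values are made up for illustration and the function name is hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match (same index) is in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Illustrative 3x3 matrix: rows are video queries, columns are audio candidates.
video_to_audio = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
# The reverse direction uses the transposed matrix.
audio_to_video = [list(col) for col in zip(*video_to_audio)]

print(recall_at_k(video_to_audio, k=1))
print(recall_at_k(audio_to_video, k=1))
```

Because the two directions rank over different axes of the same matrix, their R@1 values can differ, as they do in the abstract.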

217. ❌ Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval

Authors: Sharva Gogawale, Gal Grudka, Daria Vasyutinsky-Shapira, Omer Ventura, Berat Kurar-Barakat, Nachum Dershowitz Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

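Scoring rationale: The paper studies a computer-vision method (Bag of Bags) for retrieving images of manuscript fragments, using a sparse convolutional autoencoder and clustering for image representation and matching. All keywords concern large models, the principles of deep learning techniques, or AI applications in science, whereas this paper focuses on traditional computer vision and image processing and does not involve large language models, deep learning innovations, or any specific AI for Science application. All keyword relevance scores are therefore 0.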

!!! tip deepseek-chat TL;DR

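The paper proposes Bag of Bags, an image representation for retrieving fragment images that come from the same physical manuscript, and achieves higher retrieval accuracy than the traditional Bag of Words approach on the Cairo Genizah dataset.

Abstract (Chinese translation)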

“缀合”指被鉴定为源自同一原始手稿的一组残片集合。本研究致力于手稿缀合检索任务：给定一张残片查询图像，检索出源自同一物理手稿的其他残片。我们提出“词袋之袋”（BoB）方法，这是一种图像级表征方法，其用基于残片特定局部视觉词汇表替代了经典词袋模型（BoW）的全局视觉词典。我们的流程包括：在二值化残片图像块上训练稀疏卷积自编码器，编码每页中的连通分量，通过每幅图像的$k$-均值聚类对所得嵌入进行聚类，并利用其局部词汇表之间的集合对集合距离进行图像比对。在开罗秘库残片上的评估显示，最佳BoB变体（即Chamfer距离）的Hit@1达到0.78，MRR达到0.84；而最强的BoW基线方法（BoW-RawPatches-$χ^2$）分别为0.74和0.80，实现了top-1准确率6.1%的相对提升。我们进一步研究了一种融入聚类规模的加权变体BoB-OT，该变体将聚类样本量纳入原型匹配，并给出了形式化的近似保证，以约束其与完整分量级最优传输的偏差。采用BoW初筛后接BoB-OT重排序的两阶段流程，在检索效能与计算成本之间提供了实用平衡，支持该方法在更大型手稿收藏中的应用。

Abstract

A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image $k$-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.

Keywords: manuscript join retrieval, image retrieval, Bag of Bags, visual vocabulary, sparse convolutional autoencoder, optimal transport, Cairo Genizah, fragment matching
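
The best BoB variant above compares two fragments through a Chamfer distance between their per-image vocabularies. The sketch below shows one common form of that set-to-set distance: the mean nearest-neighbour Euclidean distance, summed over both directions. Whether the paper uses this exact variant (squared or unsquared distances, sum or average) is an assumption, and the prototype vectors and fragment names are made up.

```python
import math


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty sets of vectors."""
    def directed(src, dst):
        # Mean distance from each source vector to its nearest destination vector.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return directed(set_a, set_b) + directed(set_b, set_a)


def rank_by_chamfer(query_vocab, gallery):
    """Return gallery ids ordered from most to least similar to the query."""
    return sorted(gallery, key=lambda name: chamfer_distance(query_vocab, gallery[name]))


# Hypothetical 2-D cluster prototypes standing in for per-image k-means centroids.
query = [(0.0, 0.0), (1.0, 1.0)]
gallery = {
    "fragment_a": [(0.0, 0.1), (1.0, 1.0)],
    "fragment_b": [(5.0, 5.0), (6.0, 6.0)],
}
print(rank_by_chamfer(query, gallery))
```

Because each image carries its own small vocabulary, the comparison needs no shared codebook, which is the contrast with classical Bag of Words drawn in the abstract.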

218. ❌ PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Authors: Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08125v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

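Scoring rationale: PolySLGen focuses on reaction generation in multimodal human-machine interaction, specifically speaking and listening reactions in multi-party settings. The work involves computer vision, multimodal fusion, social signal processing, and real-time generation, but not large language models, innovations in deep learning principles, or AI applications in science. All scoring keywords relate directly to large-model technology, training methods, inference optimization, alignment, agent systems, and the like, while the core of this paper is a vision-speech multimodal interaction framework with no direct connection to these keywords.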

!!! tip deepseek-chat TL;DR

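The paper proposes the PolySLGen framework, which generates contextually appropriate and temporally coherent multimodal speaking and listening reactions (speech, body motion, speaking state) in real time for multi-party interaction, and outperforms existing baselines in experiments.

Abstract (Chinese translation)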

类人多模态反应生成对于人类与具身人工智能之间的自然群体交互至关重要。然而，现有方法仅限于双人交互中的单模态或纯言语反应，使其难以适用于真实社交场景。许多方法还忽视了非语言线索和多人交互的复杂动态，而这两者对参与度和对话连贯性都至关重要。在本研究中，我们提出了PolySLGen，一个用于多人多模态说话与倾听反应生成的在线框架。给定所有参与者的历史对话与动作，PolySLGen可为目标参与者生成未来的说话或倾听反应，包括语音、身体动作和说话状态评分。为有效建模群体交互，我们提出了姿态融合模块和社会线索编码器，共同聚合来自群体的动作信号与社会信号。大量实验结合定量与定性评估表明，PolySLGen能生成符合上下文且时序连贯的多模态反应，在动作质量、动作-语音对齐、说话状态预测以及人类感知的真实性方面均优于多个改编模型和前沿基线方法。

Abstract

Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

Keywords: multimodal reaction generation, polyadic interaction, speaking-listening, social cue encoder, pose fusion, real-time generation, embodied AI, group interactions

219. ❌ Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?

Authors: Yunusa Haruna, Adamu Lawan, Ibrahim Haruna Abdulhamid, Hamza Mohammed Dauda, Jiaquan Zhang, Chaoning Zhang, Shamsuddeen Hassan Muhammad Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08111v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

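Scoring rationale: The paper studies the fairness impact of machine unlearning in vision models (CLIP), in particular the bias redistribution phenomenon. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, reasoning, optimization, applications, and so on), whereas the paper's subject is a vision model (CLIP); it does not involve LLMs, innovations in deep learning principles, or applications of large models in other domains. All keywords are therefore unrelated and score 0.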

!!! tip deepseek-chat TL;DR

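The paper examines whether machine unlearning in vision models eliminates bias or redistributes it to correlated groups, and finds that unlearning mainly redistributes bias along gender boundaries, exposing a limitation of current unlearning methods.

Abstract (Chinese translation)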

机器遗忘技术使模型能够选择性遗忘训练数据，这主要受到GDPR和CCPA等隐私法规的推动。然而，其公平性影响仍未得到充分探讨：当模型遗忘某个人口统计群体时，它是消除了该概念，还是将其重新分配到相关群体中，从而可能加剧偏见？我们在CelebA数据集上，基于年龄和性别定义的交叉群体，在零样本分类设置下，使用CLIP模型（ViT/B-32、ViT-L/14、ViT-B/16）研究了这种偏见再分配现象。我们评估了三种遗忘方法——提示擦除（Prompt Erasure）、提示重加权（Prompt Reweighting）和拒绝向量（Refusal Vector），采用群体准确率变化、人口统计均等差距和再分配分数作为指标。结果表明，遗忘并未消除偏见，而是主要沿性别边界而非年龄边界重新分配偏见。特别是，移除主导的年轻女性群体后，性能会持续转移到老年女性群体，这一现象在所有模型规模中均存在，揭示了CLIP嵌入空间中存在性别主导的结构。虽然拒绝向量方法减少了再分配，但未能实现完全遗忘，并且显著降低了保留性能。这些发现凸显了当前遗忘方法的一个根本局限：若不考虑嵌入空间的几何结构，它们可能加剧保留群体中的偏见。

Abstract

Machine unlearning enables models to selectively forget training data, driven by privacy regulations such as GDPR and CCPA. However, its fairness implications remain underexplored: when a model forgets a demographic group, does it neutralize that concept or redistribute it to correlated groups, potentially amplifying bias? We investigate this bias redistribution phenomenon on CelebA using CLIP models (ViT-B/32, ViT-L/14, ViT-B/16) under a zero-shot classification setting across intersectional groups defined by age and gender. We evaluate three unlearning methods, Prompt Erasure, Prompt Reweighting, and Refusal Vector, using per-group accuracy shifts, demographic parity gaps, and a redistribution score. Our results show that unlearning does not eliminate bias but redistributes it primarily along gender rather than age boundaries. In particular, removing the dominant Young Female group consistently transfers performance to Old Female across all model scales, revealing a gender-dominant structure in CLIP’s embedding space. While the Refusal Vector method reduces redistribution, it fails to achieve complete forgetting and significantly degrades retained performance. These findings highlight a fundamental limitation of current unlearning methods: without accounting for embedding geometry, they risk amplifying bias in retained groups.

Keywords: machine unlearning, bias redistribution, fairness, CLIP models, demographic groups, zero-shot classification, embedding geometry, visual models
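
Two of the metrics named above, per-group accuracy shift and the demographic parity gap, have simple standard forms. The sketch below is illustrative only: it defines the gap as the spread in positive-prediction rate across groups, the group names and numbers are invented, and the paper's redistribution score is not reproduced because the abstract does not define it.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(predictions, groups):
        counts[group][0] += pred
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


def accuracy_shift(acc_before, acc_after):
    """Per-group change in accuracy after unlearning (after minus before)."""
    return {g: acc_after[g] - acc_before[g] for g in acc_before}


# Invented toy data: binary predictions for two groups.
preds = [1, 1, 0, 1, 0, 0]
group_ids = ["old_female", "old_female", "old_female",
             "old_male", "old_male", "old_male"]
print(demographic_parity_gap(preds, group_ids))

# Invented accuracies before and after forgetting one group.
before = {"young_female": 0.9, "old_female": 0.6}
after = {"young_female": 0.5, "old_female": 0.7}
print(accuracy_shift(before, after))
```

A drop for the forgotten group paired with a gain for a correlated group is the redistribution pattern the abstract describes.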

220. ❌ Coordinate-Based Dual-Constrained Autoregressive Motion Generation

Authors: Kang Ding, Hongsong Wang, Jie Gui, Liang Wang Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08088v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

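Scoring rationale: The paper studies text-to-motion generation and proposes a coordinate-based dual-constrained autoregressive motion generation framework (CDAMD), which belongs to computer vision and motion synthesis. It does not involve large language models (LLMs), innovations in deep learning principles, or applications of large models in other domains, and it mentions none of the techniques in the scoring keywords (MoE, quantization, RAG, alignment, and so on). Although the paper uses autoregressive and diffusion models, these are general-purpose generative models rather than techniques specific to large language models. All keyword relevance scores are therefore 0.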

!!! tip deepseek-chat TL;DR

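To address error amplification in diffusion models and mode collapse in autoregressive models for text-to-motion generation, the paper proposes a coordinate-based dual-constrained autoregressive motion generation framework (CDAMD) that achieves state-of-the-art fidelity and semantic consistency on newly established benchmarks.

Abstract (Chinese translation)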

文本驱动动作生成近年来在学术界受到日益广泛的关注，其在动画制作、虚拟现实、机器人学及人机交互等领域具有潜在应用价值。扩散模型与自回归模型是当前文本驱动动作生成领域两个主流且并行发展的研究方向。然而，扩散模型在噪声预测过程中常受误差放大问题困扰，而自回归模型则因动作离散化易出现模式坍塌现象。为克服这些局限，本文提出一种灵活、高保真且语义忠实度的文本驱动动作生成框架，命名为基于坐标的双约束自回归动作生成模型（Coordinate-based Dual-constrained Autoregressive Motion Generation, CDAMD）。该模型以动作坐标作为输入，遵循自回归范式，并借鉴扩散模型思想，通过多层感知机增强预测动作的保真度。此外，我们引入双约束因果掩码机制来引导自回归生成过程，其中动作标记作为先验信息与文本编码进行拼接。鉴于当前基于坐标的动作合成研究较为有限，我们为文本驱动动作生成及动作编辑任务建立了新的基准测试体系。实验结果表明，我们的方法在这些基准测试中，在动作保真度与语义一致性方面均达到了最先进的性能水平。

Abstract

Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.

Keywords: text-to-motion generation, autoregressive models, diffusion models, motion coordinates, dual-constrained causal mask, motion synthesis, motion editing, semantic consistency
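
The abstract above describes a causal mask in which conditioning tokens are concatenated with the sequence being generated. The paper's Dual-Constrained Causal Mask is not specified here, so the following is only a generic prefix-causal mask offered as an assumption-laden illustration: every position sees the conditioning prefix, and generated positions see only themselves and earlier positions. The function name and sizes are hypothetical.

```python
def prefix_causal_mask(num_prefix, num_generated):
    """Build a boolean attention mask; mask[i][j] is True when i may attend to j.

    Positions 0..num_prefix-1 hold conditioning tokens (for example text
    encodings); the remaining positions hold autoregressively generated tokens.
    """
    total = num_prefix + num_generated
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if j < num_prefix:
                row.append(True)      # everyone sees the conditioning prefix
            elif i >= num_prefix:
                row.append(j <= i)    # generated tokens see themselves and the past
            else:
                row.append(False)     # prefix tokens never see generated tokens
        mask.append(row)
    return mask


# Two conditioning tokens followed by three generated tokens.
for row in prefix_causal_mask(2, 3):
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid makes the structure visible: a fully visible block on the left for the prefix and a lower-triangular block for the generated tokens.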

221. ❌ EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

Authors: Junbo Wang, Liangyu Fu, Yuke Li, Yining Zhu, Xuecheng Wu, Kun Hu Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

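Scoring rationale: The paper focuses on micro-expression recognition and proposes EPIR, an improved Transformer-based framework that aims to balance high recognition performance with low computational complexity. Its core is a technical contribution in computer vision and deep learning for a specific application (micro-expression analysis), not large language models (LLMs) or general large-model technology. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partially relevant (micro-expression recognition can be viewed as an application of AI in behavioral science or psychology), but it is not the core of the paper, so it receives a relevance of 5. The other keywords concern large-model principles, training methods, inference optimization, agents, and so on, and are entirely unrelated to the paper's vision Transformer optimization and micro-expression recognition, so they all receive 0.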

!!! tip deepseek-chat TL;DR

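To address the high computational complexity of Transformer models and the difficulty of learning effective representations from small datasets in micro-expression recognition, the paper proposes the efficient EPIR framework, which uses dual norm shifted tokenization, token integration, and discriminative token extraction modules and achieves significant performance gains on several public datasets.

Abstract (Chinese translation)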

微表情识别能够获取个体当前时刻的真实情绪。尽管基于深度学习的方法，尤其是基于Transformer的方法已取得显著成果，但这些方法因多头自注意力机制中存在大量令牌而导致计算复杂度较高。此外，现有微表情数据集规模较小，使得基于Transformer的模型难以学习有效的微表情表征。为此，我们提出了一种新颖的高效令牌化、集成与表征框架（Efficient Patch tokenization, Integration and Representation framework, EPIR），该框架能够平衡高识别性能与低计算复杂度。具体而言，我们首先提出双范数移位令牌化（dual norm shifted tokenization, DNSPT）模块，通过学习面部区域相邻像素间的空间关系来实现，该模块通过精细的空间变换和双范数投影实现。随后，我们提出令牌集成模块，用于在多个级联的Transformer块之间集成部分令牌，从而在不损失信息的情况下减少令牌数量。此外，我们设计了一个判别性令牌提取器，该提取器首先改进Transformer块中的注意力机制，以减少注意力计算对自身令牌的不必要关注，并利用动态令牌选择模块（dynamic token selection module, DTSM）筛选关键令牌，从而捕获更具判别力的微表情表征。我们在四个主流公开数据集（即CASME II、SAMM、SMIC和CAS(ME)$^3$）上进行了大量实验。实验结果表明，我们的方法相较于现有最优方法取得了显著性能提升，例如在CAS(ME)$^3$数据集上UF1指标提升9.6%，在SMIC数据集上UAR指标提升4.58%。

Abstract

Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)$^3$). The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.

Keywords: Micro-expression Recognition, Transformer, Computational Complexity, Tokenization, Attention Mechanism, Discriminative Representation, Efficient Framework, Facial Expression Analysis
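
The gains above are reported in UF1 and UAR, which in the micro-expression literature usually denote the unweighted (macro-averaged) F1 score and the unweighted average recall. The following is a small sketch under that reading; the class labels and predictions are invented, and classes are taken from the ground-truth labels only.

```python
def _counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def uf1(y_true, y_pred):
    """Unweighted F1: the mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp, fp, fn = _counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of per-class recall values."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = _counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))  # tp + fn > 0 because label occurs in y_true
    return sum(recalls) / len(recalls)


# Invented three-class example with an imbalanced label distribution.
truth = ["negative", "negative", "negative", "positive", "surprise"]
guess = ["negative", "negative", "positive", "positive", "negative"]
print(round(uf1(truth, guess), 4), round(uar(truth, guess), 4))
```

Both metrics weight every class equally, which matters on small, imbalanced datasets of the kind the abstract mentions.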

222. ❌ DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Authors: Junbo Wang, Liangyu Fu, Yuke Li, Yining Zhu, Ya Jing, Xuecheng Wu, Jiangbin Zheng Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08084v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

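Scoring rationale: DiffVC focuses on video captioning and proposes a non-autoregressive framework based on a diffusion model. Although it involves deep learning (diffusion models, a non-autoregressive language model) and multimodal interaction (video and text), all scoring keywords center on large language models (LLMs) and related techniques (MoE, SFT, RAG, agents, and so on) or on AI applications in specific scientific fields (such as bioinformatics). The paper neither mentions nor uses any large language model, related training or alignment technique, reasoning method, agent system, or scientific application, so all keyword relevance scores are 0.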

!!! tip deepseek-chat TL;DR

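The paper proposes DiffVC, a non-autoregressive framework based on a diffusion model for video captioning. It addresses the slow generation and large cumulative error of autoregressive methods, outperforms previous non-autoregressive methods on several datasets, performs comparably to autoregressive methods, and generates faster.

Abstract (Chinese translation)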

当前视频描述生成方法通常采用编码器-解码器结构进行自回归文本生成。然而，自回归方法存在固有局限性，如生成速度慢、累积误差大。此外，少数非自回归方法因缺乏充分的多模态交互建模而存在生成质量缺陷。为此，我们提出一种基于扩散模型的非自回归视频描述生成框架（DiffVC）以解决这些问题。其并行解码机制能有效解决生成速度与累积误差问题，同时我们提出的判别式条件扩散模型可生成更高质量的文本描述。具体而言，我们首先将视频编码为视觉表征。在训练阶段，向真实描述文本的表征添加高斯噪声，随后以视觉表征作为条件约束，通过判别式去噪器生成新的文本表征。最后，我们将新文本表征输入非自回归语言模型以生成描述文本。在推理阶段，我们直接从高斯分布采样噪声进行生成。在MSVD、MSR-VTT和VATEX数据集上的实验表明，本方法性能优于现有非自回归方法，并达到与自回归方法相当的水平——例如在CIDEr指标上最高提升9.9分，在B@4指标上提升2.6分，同时保持更快的生成速度。源代码即将公开。

Abstract

Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.

Keywords: video captioning, non-autoregressive, diffusion model, multimodal interaction, parallel decoding, generation speed, discriminative denoiser, visual representation
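
The training step described above adds Gaussian noise to the caption's text representation before a denoiser reconstructs it. The abstract does not give the noise schedule, so the sketch below assumes the widely used DDPM-style closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with a linear beta schedule. The schedule values, function names, and toy vector are assumptions for illustration, not details of DiffVC.

```python
import math
import random


def cumulative_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_t for t = 0..num_steps-1 under a linear beta schedule."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t from the clean vector x0 in one shot; also return the noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    xt = [scale_signal * x + scale_noise * n for x, n in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(num_steps=1000)
text_embedding = [0.5, -1.0, 0.25]  # stand-in for a caption representation
noisy, eps = add_noise(text_embedding, alpha_bars[500], rng)
print(len(noisy), alpha_bars[0] > alpha_bars[-1])
```

At inference there is no clean caption to corrupt, which matches the abstract's statement that generation starts from noise sampled directly from a Gaussian.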

223. ❌ DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

Authors: Christof Leitgeb, Thomas Puchleitner, Max Peter Ronecker, Daniel Watzenig Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
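
The 26.6 pass mark shown in the score line follows from the formula stated in the report's configuration section (pass mark = 5.0 + 0.8 × total weight, with 27 keywords at weight 1.0 each). A minimal check of that arithmetic is sketched here; the function name is hypothetical and is not part of the report's tooling.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark per the report's stated formula: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


# 27 keywords, each with weight 1.0, as listed in the configuration section.
total_weight = 27 * 1.0
print(pass_mark(total_weight))  # 26.6
```

How the per-keyword scores in the table are derived from weight and relevance is not stated in the report, so it is not reproduced here.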

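Scoring rationale: The paper addresses radar-camera fusion for object detection in computer vision and uses the DINOv3 vision foundation model to extract features, but its content is judged unrelated to all scoring keywords (which concern large language models, the technical principles of deep learning, or AI for Science applications). It does not involve language models, model training or fine-tuning techniques, inference optimization, AI agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes DinoRADE, a radar-camera fusion object detection method built on the DINOv3 vision foundation model. It improves multi-class object detection in adverse weather and outperforms recent radar-camera fusion approaches on the K-Radar dataset by 12.1%.

Abstract (Chinese translation)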

可靠且具备天气鲁棒性的感知系统对于安全自动驾驶至关重要，这类系统通常采用多模态传感器配置以实现全面的环境感知。尽管近期基于汽车调频连续波雷达的方法在恶劣天气条件下的检测任务中取得了显著性能，但它们在解析细粒度空间细节方面仍存在局限，这对于检测体型较小且易受伤害的道路使用者尤为关键。此外，现有研究尚未充分解决在恶劣天气数据集（如K-Radar）中的VRU检测问题。本文提出DinoRADE，这是一个以雷达为中心的检测流程，它处理密集的雷达张量，并通过可变形交叉注意力在相机视角下围绕变换后的参考点聚合视觉特征。视觉特征由DINOv3视觉基础模型提供。我们在K-Radar数据集的所有天气条件下进行了全面的性能评估，并率先针对五个物体类别分别报告了检测性能。此外，我们将本方法与现有的单类别检测方法进行了比较，并以12.1%的优势超越了近期的雷达-相机融合方法。代码发布于https://github.com/chr-is-tof/RADE-Net。

Abstract

Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available under https://github.com/chr-is-tof/RADE-Net.

Keywords: Radar-camera fusion, Object detection, Adverse weather, DINOv3, Vision foundation model, K-Radar dataset, Multi-class detection, Autonomous driving
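
The abstract describes aggregating vision features around reference points that have been transformed into the camera perspective. A minimal sketch of that transformation step, assuming a standard pinhole camera model with a z-forward camera frame: the function name, the calibration values, and the identity extrinsics are illustrative assumptions, not taken from the paper or its repository, and the deformable cross-attention itself is not shown.

```python
import numpy as np


def project_points(points_radar, T_cam_from_radar, K):
    """Project (N, 3) radar-frame points to pixel coordinates (pinhole model).

    T_cam_from_radar: (4, 4) rigid transform from radar frame to camera frame.
    K: (3, 3) camera intrinsics.
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera.
    """
    n = points_radar.shape[0]
    homog = np.hstack([points_radar, np.ones((n, 1))])
    cam = (T_cam_from_radar @ homog.T).T[:, :3]
    in_front = cam[:, 2] > 0
    # Points behind the camera get a placeholder depth; callers should mask them out.
    z = np.where(in_front, cam[:, 2], 1.0)
    uvw = (K @ cam.T).T
    pixels = uvw[:, :2] / z[:, None]
    return pixels, in_front


# Hypothetical calibration: 1000 px focal length, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
T = np.eye(4)  # assumes radar and camera frames coincide, for illustration only
pts = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [0.0, 0.0, -5.0]])
pixels, valid = project_points(pts, T, K)
print(pixels[valid])
```

In a fusion pipeline of this kind, the projected pixel locations would serve as sampling centers on the image feature map.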

224. ❌ Tensor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels

Authors: Chia-Wei Hsing, Wei-Lin Tu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08072v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

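Scoring rationale: The paper studies an architectural improvement to convolutional neural networks (CNNs), proposing tensor-augmented convolution kernels to increase the expressivity of shallow networks. All scoring keywords relate directly to large language models (LLMs), large-model techniques, or AI applications in science, whereas this paper focuses on architectural innovation for conventional CNNs and does not touch large models, language models, alignment, reasoning, agents, compression, or the other keyword areas. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a tensor-augmented convolutional neural network (TACNN) that replaces conventional convolution kernels with generic tensors to increase the expressivity of shallow networks. With only two convolution layers it reaches 93.7% test accuracy on Fashion-MNIST, comparable to much deeper models.

Abstract (Chinese translation)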

卷积神经网络（CNNs）在分层提取局部特征方面表现卓越，但其捕捉复杂关联的性能严重依赖于深层架构，而这类架构通常计算成本高昂且难以解释。为解决这些问题，我们提出一种物理引导的浅层模型：张量增强卷积神经网络（TACNN），该模型使用通用张量替代传统卷积核以增强表示能力。这一选择的动机在于，一个$N$阶张量天然地编码了维度为$d^N$的希尔伯特空间中的任意量子叠加态（其中$d$为局部物理维度），从而提供了显著更丰富的表达能力。此外，在我们的设计中，每一层的卷积输出成为能够捕获高阶特征关联的多线性形式，从而使浅层多级架构具备与深度卷积神经网络相媲美的表达能力。在Fashion-MNIST基准测试中，TACNN相较于传统卷积神经网络展现出明显优势，仅用少数层数即实现了显著的准确率。特别地，仅包含两个卷积层的TACNN达到了93.7%的测试准确率，超越或匹配了如VGG-16（93.5%）和GoogLeNet（93.7%）等深度大得多的模型。这些发现凸显了TACNN作为一个有前景的框架，在保持架构简洁性的同时增强了模型表达能力，为构建更具可解释性和高效性的深度学习模型开辟了道路。

Abstract

Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive with that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7%, surpassing or matching considerably deeper models such as VGG-16 (93.5%) and GoogLeNet (93.7%). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.

Keywords: Tensor-Augmented CNN, Convolutional Neural Networks, Generic Tensor Kernels, Model Expressivity, Shallow Architecture, High-order Feature Correlations, Fashion-MNIST, Interpretable Models
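
The abstract's two central claims (an order-$N$ tensor has $d^N$ entries, and the layer output is a multilinear form) can be illustrated with a toy contraction. This is only a sketch under stated assumptions: the 2×2 patch size, the cosine/sine feature map, and all names are hypothetical choices for illustration and are not confirmed as the authors' construction.

```python
import numpy as np

d, N = 2, 4  # local dimension and tensor order (a 2x2 patch has N = 4 pixels)
rng = np.random.default_rng(0)
kernel = rng.normal(size=(d,) * N)  # generic order-N tensor with d**N entries


def local_feature(x):
    """Hypothetical d = 2 feature map for a pixel intensity x in [0, 1]."""
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])


def tensor_kernel_response(patch, kernel):
    """Contract the kernel with one feature vector per pixel (a multilinear form)."""
    v = [local_feature(x) for x in patch]
    return float(np.einsum("abcd,a,b,c,d->", kernel, *v))


patch = np.array([0.0, 0.25, 0.5, 1.0])
print(kernel.size)  # 16, i.e. d**N
print(tensor_kernel_response(patch, kernel))
```

For comparison, a conventional linear kernel over the same 2×2 patch has only 4 weights, whereas the tensor kernel's 16 entries also weight products of features across pixels.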

225. ❌ Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning

Authors: Emanuele Balloni, Emanuele Frontoni, Chiara Matti, Marina Paolanti, Roberto Pierdicca, Emiliano Santarnecchi Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

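Scoring rationale: Brain3D proposes a multimodal architecture that decodes visual information from electroencephalography (EEG) to reconstruct 3D representations, an application of large models in science. It explicitly uses a multimodal large language model to extract structured 3D-aware descriptions, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The work is also an application of AI to biomedicine and neuroscience, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The remaining keywords mainly cover large-model techniques (such as MoE, scaling laws, training methods, inference optimization, and agent systems) that the paper does not address, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper tackles decoding 3D visual representations directly from EEG signals. It proposes Brain3D, a multi-stage multimodal architecture that combines EEG-to-image decoding, 3D description generation guided by a multimodal large language model, and a diffusion model to achieve EEG-driven 3D reconstruction. Experiments report up to 85.4% 10-way Top-1 decoding accuracy and a CLIPScore of 0.648.

Abstract (Chinese translation)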

从脑电图（EEG）中解码视觉信息的研究近期取得了显著进展，但主要集中于从大脑活动中重建二维（2D）图像。然而，三维（3D）表征的重建在很大程度上仍未得到探索。这限制了几何理解，并降低了神经解码在不同情境下的适用性。为填补这一空白，我们提出了Brain3D——一种基于EEG到图像解码的多模态架构，用于实现EEG到3D的重建。该架构通过几何感知的生成式推理，逐步将神经表征转换至3D领域。我们的流程首先从EEG信号生成视觉基础图像，随后利用多模态大语言模型提取结构化的3D感知描述，这些描述引导一个基于扩散模型的生成阶段，其输出最终通过单图像到3D模型转换为连贯的3D网格。通过将问题分解为结构化阶段，所提出的方法避免了直接的EEG到3D映射，实现了可扩展的大脑驱动3D生成。我们进行了全面评估，将重建的3D输出与原始视觉刺激进行对比，同时评估语义对齐度和几何保真度。实验结果表明，所提架构性能优异，在10选一EEG解码准确率上达到85.4%，CLIPScore达到0.648，验证了多模态EEG驱动3D重建的可行性。

Abstract

Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.

Keywords: EEG decoding, 3D reconstruction, multimodal architecture, large language model, diffusion model, brain activity, visual representation, geometric fidelity
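
To put the reported 10-way Top-1 figure in context, the sketch below shows one common way such a metric is computed: each trial is scored against N candidate classes and counted correct when the highest-scoring candidate is the true one. The paper's exact evaluation protocol is not given in the abstract, so the function names and the toy scores here are assumptions.

```python
import numpy as np


def n_way_top1_accuracy(scores, true_idx):
    """scores: (trials, N) score of each trial against N candidates."""
    scores = np.asarray(scores)
    true_idx = np.asarray(true_idx)
    return float(np.mean(scores.argmax(axis=1) == true_idx))


def chance_level(n_way):
    """Expected Top-1 accuracy of uniform random guessing among n_way candidates."""
    return 1.0 / n_way


# Toy example with 3 trials and 2 candidates (made-up scores).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_truth = [1, 0, 0]
print(n_way_top1_accuracy(toy_scores, toy_truth))  # 2 of 3 trials correct
print(chance_level(10))  # 0.1, the random-guess baseline for a 10-way task
```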

226. ❌ EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

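Scoring rationale: EEG2Vision studies the reconstruction of visual stimuli from electroencephalography (EEG) signals, at the intersection of cognitive neuroscience and AI. It uses a multimodal large language model to extract semantic descriptions and an image-to-image diffusion model for post-reconstruction enhancement, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The work is also an application of AI in science, specifically neuroscience, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The other keywords, such as MoE, SLMs, scaling laws, training methods (pre-training, SFT, RLHF, and so on), inference optimization (KV cache, speculative decoding), agent systems, and model compression, are not covered or mentioned in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the EEG2Vision framework, which combines EEG-conditioned diffusion reconstruction with semantically guided post-processing by a multimodal large language model to improve the quality and feasibility of reconstructing 2D visual images from low-resolution EEG signals.

Abstract (Chinese translation)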

由于非侵入性脑电图（EEG）空间分辨率低且噪声高，尤其是在现实低密度电极配置下，从EEG重建视觉刺激仍具挑战性。为此，我们提出EEG2Vision，这是一个模块化、端到端的EEG到图像框架，该框架系统地评估了不同EEG分辨率（128、64、32和24通道）下的重建性能，并通过提示引导的后重建增强机制提升视觉质量。该框架从EEG条件化扩散重建出发，在增强阶段使用多模态大语言模型提取语义描述，并借助图像到图像扩散技术优化几何结构与感知连贯性，同时保留基于EEG的结构信息。实验表明，语义解码准确率随通道数减少显著下降（例如50分类Top-1准确率从89%降至38%），而重建质量仅轻微降低（如FID从76.77升至80.51）。所提出的增强机制在所有配置下均能持续改善感知指标，在低通道设置中实现最高9.71%的IS提升。用户研究证实了增强重建结果在感知层面具有明显优势。该方法显著提升了使用低分辨率EEG设备实现实时脑-图像应用的可行性，有望推动此类应用突破实验室环境的限制。

Abstract

Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases only slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-2-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.

Keywords: EEG, visual reconstruction, multimodal large language model, diffusion model, cognitive neuroscience, brain-to-image, semantic decoding, low-density EEG

227. ❌ ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Authors: Daichi Yashima, Shuhei Kurita, Yusuke Oda, Shuntaro Suzuki, Seitaro Otsuki, Komei Sugiura Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

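Scoring rationale: The paper focuses on multimodal large language models (MLLMs) for video captioning, an application of large models to a specific domain. The abstract explicitly mentions 'fully open multimodal large language models (MLLMs)', so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and is scored 10. The core contribution is ABMamba, which replaces the Transformer's quadratic attention with a linear-complexity state space model to process video sequences more efficiently. The paper does not address the specific techniques named by the other keywords (such as MoE, SLMs, scaling laws, training methods, inference acceleration, hallucination mitigation, interpretability, agents, or quantization), nor scientific domains such as bioinformatics, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

Targeting the high computational complexity of existing Transformer-based models for video captioning, the paper proposes ABMamba, which uses a linear-complexity state space model and a hierarchical bidirectional scan module to achieve roughly three times higher throughput while remaining competitive in performance.

Abstract (Chinese translation)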

本研究聚焦于完全开放的多模态大语言模型（MLLMs）在视频描述生成任务中的应用。由于视频序列具有复杂的时间依赖性和较长的序列长度，其视觉内容理解面临挑战。现有基于Transformer的方法其核心注意力机制的计算复杂度随序列长度呈二次方增长，导致计算成本高昂。为克服这些限制，我们提出了对齐分层双向扫描Mamba模型（Aligned Hierarchical Bidirectional Scan Mamba, ABMamba），这是一种具有线性计算复杂度的完全开放多模态大语言模型，能够实现对视频序列的可扩展处理。ABMamba以深度状态空间模型（Deep State Space Models）作为其语言主干网络，替代了计算代价高昂的二次方注意力机制，并采用了一种新颖的对齐分层双向扫描模块，该模块可在多时间分辨率下处理视频。在VATEX和MSR-VTT等标准视频描述生成基准测试中，ABMamba相较于典型的多模态大语言模型展现出具有竞争力的性能，同时实现了约三倍的吞吐量提升。

Abstract

In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

Keywords: Multimodal Large Language Model, Video Captioning, Aligned Hierarchical Bidirectional Scan, State Space Models, Linear Computational Complexity, Efficient Video Processing, Throughput Improvement
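
The contrast between quadratic attention and a linear-time scan can be made concrete with a toy recurrence. The sketch below is a generic scalar linear state-space scan run in both directions; it is not ABMamba's Aligned Hierarchical Bidirectional Scan module, and the parameter values and function names are illustrative assumptions.

```python
import numpy as np


def linear_scan(x, a, b, c):
    """Scalar recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, in O(L) steps."""
    h = 0.0
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t
        y[t] = c * h
    return y


def bidirectional_scan(x, a=0.9, b=1.0, c=1.0):
    """Sum of a forward scan and a backward scan over the same sequence."""
    x = np.asarray(x, dtype=float)
    forward = linear_scan(x, a, b, c)
    backward = linear_scan(x[::-1], a, b, c)[::-1]
    return forward + backward


seq_len = 8
print(bidirectional_scan(np.ones(seq_len)))
# Work grows linearly for the two scans, quadratically for pairwise attention scores.
print("recurrence steps:", 2 * seq_len, "| pairwise attention scores:", seq_len ** 2)
```

Doubling the sequence length doubles the scan work but quadruples the number of pairwise attention scores, which is the scaling gap the abstract refers to.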

228. ❌ Guiding a Diffusion Model by Swapping Its Tokens

Authors: Weijia Zhang, Yuehao Liu, Shanyan Guan, Wu Ran, Yanhao Ge, Wei Li, Chao Ma Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

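Scoring rationale: The paper focuses on an inference-time guidance technique for diffusion models (Self-Swap Guidance), which belongs to computer vision and generative modeling. It is unrelated to the listed large-model and deep-learning keywords (which center on language models, training methods, inference optimization, alignment, agents, and so on), so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes Self-Swap Guidance (SSG), which produces a perturbed prediction by swapping the most semantically dissimilar token latents in a diffusion model. This provides guidance similar to Classifier-Free Guidance without requiring text conditions, improves image fidelity and prompt alignment, and applies to both conditional and unconditional generation.

Abstract (Chinese translation)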

无分类器引导是一种广泛应用的推理时技术，用于提升扩散模型的图像生成质量。然而，其对于文本条件的依赖使其无法应用于无条件生成任务。本文提出一种简单方法，可在条件生成与无条件生成中实现类似无分类器引导的效果。其核心思想是通过简单的令牌交换操作生成扰动预测，并利用该扰动预测与原始干净预测之间的方向引导采样过程，使其朝向更高保真度的分布。具体实践中，我们在空间维度或通道维度上交换语义差异最大的成对令牌潜在表示。与现有在全局或约束较弱范围内施加扰动的方法不同，我们的方法选择性地交换并重组令牌潜在表示，从而能更精细地控制扰动及其对生成样本的影响。在MS-COCO 2014、MS-COCO 2017和ImageNet数据集上的实验表明，所提出的自交换引导（Self-Swap Guidance, SSG）应用于主流扩散模型时，在不同设置下均能在图像保真度与提示对齐度上超越先前的无条件方法。其细粒度的扰动控制也提升了鲁棒性，在更广泛的扰动强度范围内减少了副作用。总体而言，SSG将无分类器引导的应用范围扩展至包含条件生成与无条件生成的更广泛场景，并能作为即插即用模块便捷地嵌入任何扩散模型，获得即时性能提升。

Abstract

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.

Keywords: Diffusion Models, Classifier-Free Guidance, Token Swap, Self-Swap Guidance, Unconditional Generation, Image Fidelity, Perturbation, Sampling
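
The two steps named in the abstract (swap the most dissimilar token latents, then steer along the direction between the clean and perturbed predictions) can be sketched roughly as follows. Several details are assumptions rather than the authors' method: cosine similarity as the dissimilarity measure, greedy pair selection, swapping along the token dimension only, and a CFG-style extrapolation formula with a hypothetical `scale` parameter.

```python
import numpy as np


def swap_dissimilar_tokens(tokens, num_pairs=1):
    """Swap the least cosine-similar token pairs (greedy; an assumed selection rule).

    tokens: (T, C) array of token latents. Returns a perturbed copy.
    """
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, np.inf)  # never pair a token with itself
    out = tokens.copy()
    for _ in range(num_pairs):
        i, j = np.unravel_index(np.argmin(sim), sim.shape)
        if not np.isfinite(sim[i, j]):
            break  # no unused pair left
        out[[i, j]] = out[[j, i]]
        sim[[i, j], :] = np.inf  # each token takes part in at most one swap
        sim[:, [i, j]] = np.inf
    return out


def guided_prediction(clean, perturbed, scale):
    """CFG-style extrapolation away from the perturbed prediction (assumed form)."""
    return clean + scale * (clean - perturbed)


tokens = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
print(swap_dissimilar_tokens(tokens))  # rows 0 and 2 (opposite directions) trade places
print(guided_prediction(np.array([1.0]), np.array([0.5]), scale=2.0))  # [2.]
```

In an actual sampler, `clean` and `perturbed` would be the denoiser outputs computed with the original and the swapped latents at the same timestep.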

229. ❌ Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

Authors: Francesca Fati, Alberto Rota, Adriana V. Gregory, Anna Catozzo, Maria C. Giuliano, Mrinal Dhar, Luigi De Vitis, Annie T. Packard, Francesco Multinu, Elena De Momi, Carrie L. Langstraat, Timothy L. Kline Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

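Scoring rationale: The paper focuses on medical image segmentation and applies transfer learning with a pretrained DINOv3 vision foundation model, placing it in AI for Science (biomedical applications). It is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points) because its core is adapting a pretrained foundation model to a new domain, and highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points) because it addresses a medical image analysis problem. The other keywords mainly concern large language models (LLMs), reasoning, alignment, optimization, and related techniques, which are unrelated to the paper's computer vision and medical imaging focus, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a label-efficient segmentation framework based on a pretrained DINOv3 vision foundation model for segmenting adnexal masses in ultrasound images. It achieves state-of-the-art performance on a clinical dataset and remains accurate when training data are scarce.

Abstract (Chinese translation)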

基于超声的附件包块评估是一项具有挑战性的临床任务，常受限于主观判读及显著的观察者间差异。尽管自动分割是定量风险评估的基础步骤，但传统的全监督卷积架构通常需要大量像素级标注，且难以应对医学影像中常见的域偏移问题。本研究提出一种标签高效的分割框架，该框架利用预训练的DINOv3基础视觉变换器（vision transformer）骨干网络所具备的强健语义先验知识。通过将此骨干网络与密集预测变换器（Dense Prediction Transformer, DPT）风格解码器相结合，我们的模型能够分层重组多尺度特征，从而将全局语义表征与细粒度空间细节相融合。在一个包含112名患者、共7,777帧标注图像的临床数据集上进行评估，相较于包括U-Net、U-Net++、DeepLabV3和MAnet在内的成熟全监督基线模型，我们的方法取得了最先进的性能。具体而言，我们获得了0.945的Dice分数，并改善了边界贴合度，将95百分位豪斯多夫距离（Hausdorff Distance）相对于最强的卷积基线模型降低了11.4%。此外，我们进行了深入的效率分析，结果表明我们基于DINOv3的方法在数据匮乏条件下仍能保持显著更高的性能，即使仅使用25%的数据进行训练，也能维持强劲的结果。这些发现表明，利用大规模自监督基础模型为数据受限的临床环境中的医学图像分割提供了一个前景广阔且数据高效的解决方案。项目仓库：https://github.com/FrancescaFati/MESA

Abstract

Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA

Keywords: adnexal mass segmentation, foundation models, DINOv3, medical image segmentation, label-efficient learning, ultrasound images, domain adaptation, self-supervised learning
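
The two metrics reported in the abstract, the Dice score and the 95th-percentile Hausdorff distance (HD95), can be sketched for binary masks as below. This is a simplified illustration: HD95 is computed here over all foreground pixels rather than extracted boundaries, distances are in pixels, and both masks are assumed non-empty for HD95. The paper's exact metric implementation is not stated in the abstract.

```python
import numpy as np


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def hd95(pred, target):
    """Symmetric 95th-percentile Hausdorff distance over foreground pixels."""
    p = np.argwhere(pred.astype(bool))
    t = np.argwhere(target.astype(bool))
    d = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    return float(max(np.percentile(d.min(axis=1), 95),
                     np.percentile(d.min(axis=0), 95)))


pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:3] = 1      # 4 foreground pixels
target = np.zeros((4, 4), dtype=int)
target[1:3, 1:4] = 1    # 6 foreground pixels, one column wider
print(dice_score(pred, target))  # 0.8
print(hd95(pred, target))        # 1.0
```

Dice rewards overlap, while HD95 penalizes boundary outliers, which is why the abstract reports both.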

230. ❌ Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

Authors: Jun Li, Yingying Shi, Zhixuan Ruan, Nan Guo, Jianhua Xu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

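Scoring rationale: The paper addresses traffic object detection in computer vision and proposes a hybrid network that combines Mamba state-space models with deformable dilated convolutions. Although it involves Mamba (a state-space model), all scoring keywords target large language models (LLMs) and related techniques (such as training methods, inference optimization, alignment, and applications). The paper's subject matter (object detection, convolutional networks, feature pyramids) is unrelated to LLM techniques, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

To address multi-scale object detection in traffic scenes, the paper proposes MDDCNet, a hybrid network that combines Mamba state-space models with deformable dilated convolutions. Through improved hierarchical feature representation and multi-scale feature fusion, it outperforms existing detectors on public benchmarks and real-world datasets.

Abstract (Chinese translation)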

在真实交通场景中，多尺度目标通常分布于杂乱背景中，这对精确检测提出了巨大挑战。尽管当前基于Mamba的方法能够高效建模长程依赖关系，但仍难以捕捉具有丰富局部细节的小目标，这阻碍了局部结构与全局语义的联合建模。此外，由于采用平坦的序列建模方式及空间归纳偏置不足，状态空间模型表现出有限的分层特征表征能力和较弱的跨尺度交互性，导致其在复杂场景中的性能未达最优。为解决这些问题，本研究提出了一种结合可变形膨胀卷积的Mamba网络（MDDCNet），以实现精确的交通目标检测。在MDDCNet中，通过精心设计的混合骨干网络——依次堆叠的多尺度可变形膨胀卷积（MSDDC）模块与Mamba模块——实现了从局部细节到全局语义的分层特征表征。同时，本研究进一步设计了通道增强前馈网络（CE-FFN）以克服传统前馈网络通道交互能力有限的问题，并构建了基于Mamba的注意力聚合特征金字塔网络（A^2FPN），以增强多尺度特征的融合与交互。在公开基准数据集和真实场景数据集上的大量实验结果证明了本方法相较于各类先进检测器的优越性。代码已发布于https://github.com/Bettermea/MDDCNet。

Abstract

In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.

Keywords: traffic object detection, state-space models, Mamba, deformable dilated convolutions, multi-scale feature fusion, hierarchical feature representation, attention-aggregating feature pyramid network, channel-enhanced feed-forward network

231. ❌ Rotation Equivariant Convolutions in Deformable Registration of Brain MRI

Authors: Arghavan Rezvani, Kun Han, Anthony T. Wu, Pooya Khosravi, Xiaohui Xie Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

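Scoring rationale: This paper addresses medical image registration, using rotation-equivariant convolutions to improve brain MRI registration networks; it sits at the intersection of computer vision and medical image analysis. Its content is unrelated to the vast majority of keywords (which concern large-model principles, training methods, inference optimization, alignment techniques, and so on). It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image analysis can be viewed as an application of AI to science (biomedicine). However, the paper does not explicitly mention bioinformatics or cheminformatics and does not involve large-model techniques, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

By integrating rotation-equivariant convolutions into deformable brain MRI registration networks, this study addresses the lack of rotation equivariance in conventional CNNs. Experiments show that the approach improves registration accuracy, reduces parameter count, increases robustness to rotation, and improves sample efficiency.

Abstract (Chinese translation)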

图像配准是一项对齐图像间解剖结构的基础任务。虽然卷积神经网络表现良好，但其缺乏旋转等变性——旋转后的输入不会产生相应旋转的输出。这导致其无法利用解剖结构（尤其是脑部磁共振成像）固有的旋转对称性，从而限制了性能。在本研究中，我们将旋转等变卷积集成到可变形脑部磁共振成像配准网络中。我们通过在三种基线架构中用等变编码器替换标准编码器来评估该方法，并在多个公共脑部磁共振成像数据集上进行测试。
实验表明，等变编码器具有三个关键优势：1）在减少网络参数的同时实现了更高的配准精度，证实了这种解剖学归纳偏置的益处；2）在旋转输入对上表现优于基线模型，显示出对临床实践中常见方向变化的鲁棒性；3）在较少训练数据下表现出性能提升，表明其具有更高的样本效率。我们的研究结果证明，融入几何先验是构建更鲁棒、更精确、更高效配准模型的关键一步。

Abstract

Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
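The equivariance property named in the abstract ("a rotated input does not produce a correspondingly rotated output") can be checked numerically on a toy case. The sketch below is not the paper's method: it is a minimal 2D illustration restricted to 90-degree rotations, and the helper names `correlate2d_valid` and `is_rot90_equivariant` are hypothetical. It shows that an ordinary filter with a preferred orientation fails the check, while a filter that is itself unchanged by 90-degree rotation passes it.

```python
import numpy as np


def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Plain 2D cross-correlation in 'valid' mode, as applied by a CNN layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def is_rot90_equivariant(image: np.ndarray, kernel: np.ndarray) -> bool:
    """Check whether filtering a rotated image equals rotating the filtered image."""
    filtered_then_rotated = np.rot90(correlate2d_valid(image, kernel))
    rotated_then_filtered = correlate2d_valid(np.rot90(image), kernel)
    return bool(np.allclose(filtered_then_rotated, rotated_then_filtered))


image = np.arange(25, dtype=float).reshape(5, 5)

# Generic kernel: only the top-left tap is non-zero, so it has a preferred orientation.
generic_kernel = np.zeros((3, 3))
generic_kernel[0, 0] = 1.0

# Kernel that is unchanged by 90-degree rotations.
symmetric_kernel = np.array([[0.0, 1.0, 0.0],
                             [1.0, 4.0, 1.0],
                             [0.0, 1.0, 0.0]])

generic_ok = is_rot90_equivariant(image, generic_kernel)
symmetric_ok = is_rot90_equivariant(image, symmetric_kernel)
print(generic_ok, symmetric_ok)
```

Constraining every filter to be rotation-symmetric would be far too restrictive in practice; equivariant layers generally achieve the property through structured weight sharing across orientations instead, which this toy check does not attempt to reproduce.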

Keywords: image registration, rotation-equivariant convolutions, deformable registration, brain MRI, CNN, anatomical structures, inductive bias, sample efficiency

232. ❌ Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI

Authors: Minh Sao Khue Luu, Evgeniy N. Pavlovskiy, Bair N. Tuchinov Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

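Scoring rationale: This paper addresses small lesion segmentation in medical imaging (brain MRI) and proposes a loss function, CATMIL, that combines component-adaptive and lesion-level supervision. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, agents, and so on), because those keywords target large language models or general AI techniques, whereas this paper is an application-specific study in computer vision and medical imaging. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since medical image analysis can be viewed as an application of AI to science (biomedicine). However, the paper does not directly involve bioinformatics or cheminformatics, and its contribution lies in the segmentation method rather than large-model techniques, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

This study proposes CATMIL, a unified loss function combining component-adaptive and lesion-level supervision, to improve small lesion segmentation in brain MRI under class imbalance. Experiments show that it achieves a more balanced performance across segmentation accuracy, small lesion recall, and false-positive control.

Abstract (Chinese translation)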

我们提出一种名为CATMIL的统一目标函数，该函数在基础分割损失的基础上增加了两个在不同层级起作用的辅助监督项。第一项为组件自适应Tversky损失，它基于连通分量对体素贡献进行重新加权，以平衡不同尺寸病灶的影响。第二项基于多示例学习，通过促进每个病灶实例的检测来引入病灶级监督。这些项与标准nnU-Net损失相结合，共同优化体素级分割精度和病灶级检测性能。我们在MSLesSeg数据集上使用一致的nnU-Net框架和五折交叉验证对提出的目标函数进行评估。结果表明，CATMIL在分割精度、病灶检测和误差控制方面取得了最均衡的性能：相比标准损失函数，它提高了Dice分数（0.7834）并降低了边界误差；更重要的是，该方法显著提升了小病灶的召回率并减少了漏检，同时在对比方法中保持了最低的假阳性体积。这些发现证明，在统一目标函数中整合组件级和病灶级监督，为高度不平衡场景下的小病灶分割提供了一种有效且实用的改进途径。所有代码与预训练模型已发布于https://github.com/luumsk/SmallLesionMRI。

Abstract

We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
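For readers unfamiliar with the two overlap measures the abstract mentions, the sketch below computes the standard Dice score and the standard Tversky index on a toy binary mask. It is not the paper's Component-Adaptive Tversky term (the connected-component re-weighting and the Multiple Instance Learning term are not reproduced here), and the function names, the toy masks, and the convention that `alpha` weights false positives while `beta` weights false negatives are assumptions of this illustration.

```python
import numpy as np


def tversky_index(pred, target, alpha, beta, eps=1e-8):
    """Standard Tversky index: TP / (TP + alpha * FP + beta * FN)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    tp = np.sum(pred * target)            # true positives
    fp = np.sum(pred * (1.0 - target))    # false positives
    fn = np.sum((1.0 - pred) * target)    # false negatives
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def dice_score(pred, target, eps=1e-8):
    """Standard Dice score: 2 * |P ∩ T| / (|P| + |T|)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)


# Toy case: three lesion voxels, of which the prediction finds one and adds one false alarm.
target = np.array([1, 1, 1, 0, 0, 0, 0, 0])
pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

dice = dice_score(pred, target)
tversky_balanced = tversky_index(pred, target, alpha=0.5, beta=0.5)
tversky_fn_heavy = tversky_index(pred, target, alpha=0.3, beta=0.7)
print(dice, tversky_balanced, tversky_fn_heavy)
```

With `alpha = beta = 0.5` the Tversky index reduces to the Dice score; raising `beta` penalizes missed voxels more heavily, which is why Tversky-style losses are a common starting point when small structures are under-segmented.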

Keywords: brain MRI, small lesion segmentation, component-adaptive supervision, lesion-level supervision, multiple instance learning, imbalanced data, nnU-Net, CATMIL

233. ❌ Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Authors: Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, Fan Wu Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

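Scoring rationale: This paper focuses on applying and improving multimodal large language models (MLLMs) for spatio-temporal video grounding. Its core contribution is the Bridge-STG framework, which addresses the challenges MLLMs face in spatio-temporal alignment and visual token redundancy. It is therefore highly relevant only to the keyword 'Large Language Models OR LLMs OR Foundation Models' (relevance 10), because MLLMs are an extension of LLMs and the work is a novel application of large models to video understanding. The techniques covered by the other keywords (MoE, quantization, inference acceleration, alignment methods, and so on) and the specific application domains (such as AI for science) are not mentioned or discussed in the paper, so they receive 0.

!!! tip deepseek-chat TL;DR

To address entangled spatio-temporal alignment and visual token redundancy in multimodal large language models for spatio-temporal video grounding, the paper proposes the Bridge-STG framework. By decoupling temporal and spatial localization and adding a semantic bridging mechanism, it achieves state-of-the-art performance among MLLM-based methods on multiple benchmarks.

Abstract (Chinese translation)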

时空视频定位任务要求根据自然语言查询在时间与空间维度上联合定位目标物体，这对现有多模态大语言模型构成了根本性挑战。我们识别出两大核心挑战：其一是纠缠的时空对齐问题，源于将两个异构子任务耦合在同一自回归输出空间中；其二是双域视觉令牌冗余问题，即目标物体在时间和空间上同时呈现稀疏性，导致绝大多数视觉令牌与定位查询无关。为解决这些问题，我们提出Bridge-STG——一个在保持语义连贯性的同时解耦时间与空间定位的端到端框架。尽管解耦是应对此纠缠问题的自然方案，但它可能导致时序MLLM与空间解码器之间产生语义鸿沟。Bridge-STG通过两项关键设计解决此问题：具备显式时序对齐的时空语义桥接机制将MLLM的时序推理语境提炼为增强的桥接查询，构建为鲁棒的语义接口；而查询引导的空间定位模块则利用这些查询驱动一个专设的空间解码器，该解码器结合多层交互式查询与正/负帧采样策略，共同消除双域视觉令牌冗余。在多个基准测试上的大量实验表明，Bridge-STG在多模态大语言模型方法中实现了最先进的性能。在VidSTG数据集上，Bridge-STG将平均m_vIoU从$26.4$提升至$34.3$，并在统一的多任务训练框架下，于多种细粒度视频理解任务中展现出强大的跨任务迁移能力。

Abstract

Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM’s temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from $26.4$ to $34.3$ on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.

Keywords: Spatio-Temporal Video Grounding, Multimodal Large Language Models, Temporal Alignment, Spatial Localization, Visual Token Redundancy, Semantic Bridging, Video Understanding, Cross-task Transfer

234. ❌ SAT: Selective Aggregation Transformer for Image Super-Resolution

Authors: Dinh Phu Tran, Thao Do, Saad Wazir, Seongah Kim, Seon Kwon Kim, Daeyoung Kim Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

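Scoring rationale: The paper addresses image super-resolution in computer vision and proposes the Transformer-based Selective Aggregation Transformer (SAT), which reduces computational complexity through its Density-driven Token Aggregation algorithm. All scoring keywords relate to large language models, innovations in deep learning principles, or scientific applications, whereas this paper studies Transformer architecture optimization for image processing and is unrelated to those keywords.

!!! tip deepseek-chat TL;DR

This paper proposes the Selective Aggregation Transformer (SAT) for image super-resolution. By selectively aggregating the key-value matrices, it reduces the number of key-value tokens by 97%, lowers total FLOPs by up to 27%, and outperforms the existing method PFT by up to 0.22 dB.

Abstract (Chinese translation)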

基于Transformer的方法通过建模长距离依赖关系，彻底改变了图像超分辨率领域。然而，原始自注意力机制的二次计算复杂度带来了重大挑战，常常导致在效率与全局上下文利用之间做出妥协。近期基于窗口的注意力方法通过局部化计算缓解了这一问题，但其感受野往往受限。为克服这些局限性，我们提出了选择性聚合Transformer（SAT）。这一新型Transformer通过我们的密度驱动令牌聚合算法，在保持查询矩阵全分辨率的同时，选择性聚合键值矩阵（令牌数量减少97%），从而高效捕获长距离依赖并扩大模型感受野。该设计显著降低了计算成本，实现了更低复杂度，并能在不损害重建保真度的前提下进行可扩展的全局交互。SAT利用密度与隔离度指标识别每个聚类并以单一聚合令牌进行表征，确保关键的高频细节得以保留。实验结果表明，SAT以高达0.22dB的优势超越当前最优方法PFT，同时总FLOPs运算量可降低达27%。

Abstract

Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22dB, while the total number of FLOPs can be reduced by up to 27%.
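The core idea in the abstract, keeping the query matrix at full resolution while attending over a much smaller set of aggregated key-value tokens, can be sketched in a few lines. The example below is only an illustration under stated assumptions: it swaps the paper's Density-driven Token Aggregation for plain mean pooling over contiguous token groups, and the names `aggregate_tokens` and `attention` as well as the sizes (100 tokens, 3 groups) are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax."""
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / np.sum(weights, axis=axis, keepdims=True)


def aggregate_tokens(tokens, num_groups):
    """Mean-pool contiguous token groups (a simple stand-in for a learned or
    density-driven grouping, which this sketch does not implement)."""
    groups = np.array_split(tokens, num_groups, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])


def attention(queries, keys, values):
    """Scaled dot-product attention; also returns the score-matrix shape."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values, scores.shape


rng = np.random.default_rng(0)
num_tokens, dim, num_groups = 100, 16, 3  # 3 aggregated tokens = 97% fewer keys/values
queries = rng.normal(size=(num_tokens, dim))
keys = rng.normal(size=(num_tokens, dim))
values = rng.normal(size=(num_tokens, dim))

full_out, full_scores_shape = attention(queries, keys, values)
agg_out, agg_scores_shape = attention(
    queries,
    aggregate_tokens(keys, num_groups),
    aggregate_tokens(values, num_groups),
)
print(full_scores_shape, agg_scores_shape, agg_out.shape)
```

Every query still produces an output row, but the score matrix shrinks from `num_tokens x num_tokens` to `num_tokens x num_groups`. The attention-score cost falls accordingly, while other layers are unaffected, which is consistent with the overall FLOPs saving reported in the abstract being much smaller than the token reduction.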

Keywords: Image Super-Resolution, Transformer, Selective Aggregation, Density-driven Token Aggregation, Computational Complexity, Long-range Dependencies, Receptive Field, FLOPs Reduction

235. ❌ Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Authors: Yun Zhu, Jianjun Qian, Jian Yang, Jin Xie, Na Zhao Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

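Scoring rationale: The paper, FI3Det, addresses 3D object detection and uses vision-language models (VLMs) for few-shot incremental learning. All keywords target large language models (LLMs) and related techniques (MoE, alignment, reasoning, agents, and so on), whereas the paper does not involve LLMs, text generation, or related techniques and uses VLMs only to extract visual features, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes the FI3Det framework, which uses vision-language models for few-shot incremental 3D object detection. It can perceive unseen categories in dynamic indoor environments from only a few novel samples and outperforms baseline methods on the ScanNet V2 and SUN RGB-D datasets.

Abstract (Chinese translation)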

增量式三维物体感知是实现动态室内环境中具身智能的关键步骤。然而，现有的增量式三维检测方法依赖对新类别的大量标注才能获得满意性能。为应对这一局限，我们提出FI3Det——一个少样本增量式三维检测框架，该框架通过利用视觉-语言模型学习未见类别的知识，仅需少量新样本即可实现高效的三维感知。FI3Det在基础阶段引入了一个VLM引导的未知物体学习模块，以增强对未见类别的感知能力。具体而言，它利用视觉-语言模型挖掘未知物体并提取综合性表征，包括二维语义特征和类别无关的三维边界框。为减轻这些表征中的噪声，我们进一步设计了一种加权机制，根据每个框内的空间位置和特征一致性，对点级和框级特征的贡献进行重新加权。此外，FI3Det提出了一种门控多模态原型印记模块，其中类别原型通过对齐的二维语义特征和三维几何特征构建，用于计算分类得分，随后通过多模态门控机制进行融合，以实现新物体的检测。作为首个面向少样本增量式三维物体检测的框架，我们在ScanNet V2和SUN RGB-D两个数据集上建立了批处理和顺序评估设置，实验表明FI3Det相较于基线方法取得了显著且一致的性能提升。代码发布于https://github.com/zyrant/FI3Det。

Abstract

Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.

Keywords: Few-shot learning, Incremental 3D object detection, Vision-language models, Dynamic indoor environments, Multimodal prototype imprinting, Unknown object learning, ScanNet V2, SUN RGB-D

236. ❌ MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

Authors: Zile Guo, Zhan Chen, Enze Zhu, Kan Wei, Yongkang Zou, Xiaoxuan Liu, Lei Wang Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

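Scoring rationale: The paper's core contribution is MotionScape, a large-scale UAV video dataset for world models, which is highly relevant to 'World Models AND General World Models' (relevance 10). The paper uses large language models for semantic annotation, giving some connection to 'Large Language Models OR LLMs OR Foundation Models' (5). The dataset aims to improve the decision-making and planning of UAV agents, giving some connection to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5). The dataset addresses data quality and distribution bias, giving some connection to 'Scaling Laws AND Data Quality' (5). The research is an application of AI to science and engineering, giving some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5). Other keywords such as MoE, SFT, RAG, and quantization have no direct relation to the paper and receive 0.

!!! tip deepseek-chat TL;DR

Existing world models lack realistic motion-prior data for highly dynamic UAV viewpoints. To address this, the paper presents MotionScape, a large-scale, real-world, highly dynamic UAV video dataset, and shows experimentally that the dataset improves the ability of world models to simulate complex 3D dynamics and handle large viewpoint changes.

Abstract (Chinese translation)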

世界模型的最新进展已展现出模拟物理现实的强大能力，使其日益成为具身智能的重要基础。尤其对于无人机智能体而言，在无约束环境中实现自主导航与鲁棒决策，精准预测复杂的三维动态至关重要。然而，在无人机视角典型的高度动态相机轨迹下，现有世界模型往往难以保持时空物理一致性。一个关键原因在于当前训练数据的分布偏差：大多数现有数据集呈现受限的2.5维运动模式，例如地面约束的自动驾驶场景或相对平滑的以人为中心的第一人称视频，因而缺乏真实的高动态六自由度无人机运动先验。为填补这一空白，我们提出了MotionScape——一个面向世界建模的大规模高动态真实世界无人机视角视频数据集。MotionScape包含超过30小时的4K无人机视角视频，总计逾450万帧。该新颖数据集以语义与几何对齐的训练样本为特色，其中多样化的真实世界无人机视频与精确的六自由度相机轨迹及细粒度自然语言描述紧密耦合。为构建该数据集，我们开发了一套自动化多阶段处理流程，整合了基于CLIP的相关性过滤、时序分割、用于轨迹恢复的鲁棒视觉SLAM以及大语言模型驱动的语义标注。大量实验表明，引入此类语义与几何对齐的标注能有效提升现有世界模型模拟复杂三维动态及处理大视角变化的能力，从而有益于无人机智能体在复杂环境中的决策与规划。本数据集已公开于https://github.com/Thelegendzz/MotionScape。

Abstract

Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape

Keywords: World Models, UAV Dataset, 6-DoF Camera Trajectories, Highly Dynamic Motion, Large-scale Video Dataset, Embodied Intelligence, Autonomous Navigation, Semantic Annotation

237. ❌ SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Authors: Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen Journal/Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

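Scoring rationale: The paper introduces SceneScribe-1M, a large-scale video dataset that targets the data needs of 3D geometric perception and video synthesis and carries detailed geometric and semantic annotations. The scoring keywords all concern specific techniques such as large language models, deep learning principles, training optimization, inference acceleration, alignment, and AI agents, whereas this paper's core is dataset creation and benchmarking. It does not touch those large-model techniques or AI for Science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents SceneScribe-1M, a large-scale multi-modal dataset of one million videos annotated with detailed text descriptions, camera parameters, depth maps, and 3D point tracks. It aims to provide a unified resource for 3D perception and video generation, and demonstrates its value through benchmarks on several downstream tasks.

Abstract (Chinese translation)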

三维几何感知与视频合成的融合，对兼具丰富语义信息与时空信息的大规模视频数据产生了前所未有的需求。现有数据集虽已分别推动了三维理解或视频生成领域的发展，但仍缺乏能够大规模同时支持这两个领域的统一资源。为弥补这一空白，我们推出了SceneScribe-1M——一个全新的大规模多模态视频数据集。该数据集包含一百万段真实场景视频，每段视频均经过精细标注，配有详细的文本描述、精确的相机参数、稠密深度图以及一致的三维点轨迹。我们通过在一系列广泛的下游任务中建立基准测试，证明了SceneScribe-1M的多功能性与价值，这些任务包括单目深度估计、场景重建、动态点跟踪，以及生成式任务如文本到视频合成（无论是否包含相机控制）。通过开源SceneScribe-1M，我们旨在提供一个全面的基准和推动研究的催化剂，促进能够同时感知动态三维世界并生成可控、逼真视频内容的模型的发展。

Abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

Keywords: large-scale video dataset, 3D geometric perception, video synthesis, multi-modal annotations, monocular depth estimation, scene reconstruction, text-to-video synthesis, dynamic point tracking

238. ❌ DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction

Authors: Tingxi Chen, Zhengxue Cheng, Houqiang Zhong, Su Wang, Rong Xie, Li Song Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

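Scoring rationale: The paper focuses on 4D scene reconstruction in computer vision and graphics, proposing a dynamic probabilistic Gaussian decomposition framework (DP-DeGauss) that separates background, hands, and objects in egocentric video. It covers 3D Gaussian representations, dynamic decomposition, and rendering optimization, but does not involve large language models (LLMs), novel deep learning principles, or any topic among the scoring keywords (such as MoE, SFT, RAG, or quantization). The paper is unrelated to the large-model techniques and AI for Science applications that the keywords cover, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework that separates background, hands, and objects in egocentric 4D scene reconstruction, achieving state-of-the-art disentanglement and improved rendering quality.

Abstract (Chinese translation)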

以自我为中心的视频对于下一代4维场景重建至关重要，在增强现实/虚拟现实（AR/VR）与具身人工智能领域具有广泛应用。然而，由于复杂的自身运动、遮挡以及手-物交互，重建动态的第一人称场景极具挑战。现有的分解方法假设固定视点或将动态部分合并为单一前景，因此并不适用。为解决这些局限，我们提出了DP-DeGauss，一个用于自我中心4维重建的动态概率高斯分解框架。我们的方法基于COLMAP先验初始化一个统一的3D高斯集合，为每个高斯增加一个可学习的类别概率，并动态地将它们路由至专门处理背景、手部或物体建模的形变分支。我们采用特定类别的掩码以实现更好的解耦，并引入亮度和运动流控制以改进静态渲染与动态重建。大量实验表明，DP-DeGauss在峰值信噪比（PSNR）上平均优于基线方法+1.70dB，同时在结构相似性（SSIM）和学习感知图像块相似度（LPIPS）指标上均有提升。更重要的是，我们的框架首次实现了背景、手部和物体组件最先进的解耦，实现了显式、细粒度的分离，为更直观的自我中心场景理解与编辑铺平了道路。

Abstract

Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.

Keywords: egocentric video, 4D scene reconstruction, Gaussian decomposition, dynamic probabilistic framework, background-hand-object disentanglement, 3D Gaussian representation, AR/VR applications, embodied AI
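
The +1.70 dB PSNR gain quoted in the abstract can be restated as a change in mean squared error, since PSNR is defined as 10·log10(MAX² / MSE). The sketch below is only an illustration of that standard definition; the function names are hypothetical and it assumes images scaled to [0, 1], which the abstract does not specify.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB (assumes pixel values in [0, max_value])."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, `mse_ratio_for_gain(1.70)` is about 0.676, so a +1.70 dB gain corresponds to roughly a one-third reduction in mean squared error at equal peak value.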

239. ❌ Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching

Authors: Qihao Huang Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07980v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

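Scoring rationale: The paper focuses on classical stereo depth estimation for computer vision and autonomous driving, covering Block Matching, Semi-Global Matching, the Census transform, template matching, and online calibration. The scoring keywords all concern large language models, deep learning principles, AI for Science, and related topics, none of which this paper addresses, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes an integrated stereo ranging system for autonomous driving that combines dense disparity maps, object-centric Census-based template matching, and monocular geometric priors. It addresses the limitations of traditional methods in long-range accuracy, computational cost, and radiometric differences, and achieves robust real-time depth estimation.

Abstract (Chinese translation)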

精确的深度估计对于自动驾驶感知系统至关重要，尤其在高速公路远距离车辆检测场景中。传统的稠密立体匹配方法（如块匹配BM与半全局匹配SGM）虽能生成逐像素视差图，但存在计算成本高、对立体相机间辐射差异敏感，以及在远距离视差值较小时精度不足的问题。本报告提出一套综合立体测距系统，该系统在统一的检测-测距-跟踪流程中整合了三种互补的深度估计方法：稠密BM/SGM视差、以目标为中心的基于Census变换的模板匹配，以及单目几何先验。我们的核心贡献是一种新颖的以目标为中心的基于Census变换的模板匹配算法，该算法在检测到的边界框内直接执行GPU加速的稀疏立体匹配，采用了远近分治策略、前后向验证、遮挡感知采样和鲁棒的多区块聚合技术。我们进一步描述了一种在线校准优化框架，该框架结合了自动校正偏移搜索、基于雷达-立体投票的视差校正，以及目标级雷达-立体关联，以实现持续的外参漂移补偿。整套系统通过异步GPU流水线设计实现了实时性能，并在包括夜间、雨天和变化光照在内的多种驾驶条件下提供了鲁棒的测距能力。

Abstract

Accurate depth estimation is critical for autonomous driving perception systems, particularly for long-range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi-Global Matching (SGM) produce per-pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches: dense BM/SGM disparity, object-centric Census-based template matching, and monocular geometric priors, within a unified detection-ranging-tracking pipeline. Our key contribution is a novel object-centric Census-based template matching algorithm that performs GPU-accelerated sparse stereo matching directly within detected bounding boxes, employing a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation. We further describe an online calibration refinement framework that combines auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association for continuous extrinsic drift compensation. The complete system achieves real-time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.

Keywords: stereo ranging, autonomous driving, depth estimation, Census-based template matching, object-centric, online calibration, GPU acceleration, real-time performance
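
To make the idea of Census-based matching inside a bounding box concrete, here is a minimal CPU sketch in NumPy. It is not the author's algorithm: it omits the far-close strategy, forward-backward verification, occlusion-aware sampling, and multi-block aggregation named in the abstract, and every function name, the 3×3 window, and the box format are assumptions made for illustration. The last helper uses the standard pinhole stereo relation Z = f·B/d, which also shows why small disparities at long range make depth sensitive to matching error.

```python
import numpy as np


def census_transform(img, radius=1):
    """Encode each pixel as a bit string of 'neighbor < center' comparisons."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    codes = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            codes = (codes << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return codes


def hamming(a, b):
    """Per-pixel Hamming distance between two uint64 census code arrays."""
    x = np.ascontiguousarray(np.bitwise_xor(a, b))
    bits = np.unpackbits(x.view(np.uint8).reshape(x.shape + (8,)), axis=-1)
    return bits.sum(axis=-1)


def match_box_disparity(left, right, box, max_disp):
    """Return the one disparity that best aligns a left-image box with the right image."""
    x0, y0, x1, y1 = box  # half-open pixel box [x0, x1) x [y0, y1) in the left image
    census_left, census_right = census_transform(left), census_transform(right)
    template = census_left[y0:y1, x0:x1]
    costs = []
    for d in range(0, min(max_disp, x0) + 1):
        candidate = census_right[y0:y1, x0 - d:x1 - d]
        costs.append(int(hamming(template, candidate).sum()))
    return int(np.argmin(costs))


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo depth Z = f * B / d (rectified cameras assumed)."""
    return focal_px * baseline_m / disparity_px
```

Because the Census code depends only on the ordering of intensities in a window, a uniform brightness offset between the two cameras leaves the codes unchanged, which is the usual argument for its robustness to radiometric differences.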

240. ❌ Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Authors: Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

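Scoring rationale: The paper focuses on diffusion-based video generation, in particular controllable generation driven by 3D scene representations and rendered control signals. It does mention a "scene agent" that translates user instructions, but this differs from keyword concepts such as "LLM Agents" and "Autonomous Agents", which refer specifically to agents built on large language models. The paper's core is video generation, 3D scene understanding, and diffusion models in computer vision and graphics, with no direct connection to the keywords, which center on LLM principles, training methods, inference optimization, alignment, and agents. Every keyword is scored 0 because the paper does not involve large language models or related techniques.

!!! tip deepseek-chat TL;DR

The paper proposes LiVER, a framework that adds control signals rendered from a 3D scene representation and a lightweight conditioning module to address the weak controllability of scene factors (such as layout, lighting, and camera trajectory) in diffusion-based video generation, enabling high-fidelity, precisely controllable video generation.

Abstract (Chinese translation)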

扩散模型在视频生成领域取得了显著进展，但其可控性仍是主要局限。关键场景要素如布局、光照和摄像机轨迹往往相互纠缠或仅被弱建模，限制了其在电影制作和虚拟制片等需要显式场景控制的领域中的应用。我们提出LiVER，一个基于扩散模型的场景可控视频生成框架。为实现这一目标，我们引入了一种新颖框架，该框架以显式三维场景属性为条件进行视频合成，并依托一个包含物体布局、光照和摄像机参数密集标注的大规模新数据集。我们的方法通过从统一三维表征渲染控制信号来解耦这些属性。我们提出了轻量级条件模块和渐进式训练策略，将这些信号集成到基础视频扩散模型中，确保稳定收敛与高保真度。我们的框架支持广泛的应用，包括底层三维场景完全可编辑的图像到视频和视频到视频合成。为进一步提升可用性，我们开发了一个场景智能体，能够自动将高级用户指令转化为所需的三维控制信号。实验表明，LiVER在实现最先进的光照相片真实感与时间一致性的同时，能够对场景要素进行精确解耦控制，为可控视频生成设立了新标准。

Abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

Keywords: video generation, diffusion models, 3D scene representation, controllable synthesis, lighting control, scene agent, photorealism, temporal consistency

241. ❌ DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

Authors: Gyanendra Das, Sai Satyam Jena Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	8.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

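Scoring rationale: The paper addresses lifelong editing of vision language models (VLMs) and proposes DSCA, which uses orthogonal subspace decomposition to make precise, non-interfering knowledge edits. Relevance to the keywords: (1) LLMs/Foundation Models (5): a VLM is a kind of large model; (2) Post-training/SFT (8) and PEFT (8): highly relevant, because the method involves model editing and parameter-efficient fine-tuning; (3) Hallucination Mitigation (8): the method lowers hallucination by 3 to 5 percent; (4) Model Merging (8): it involves parameter merging techniques; (5) Pre-training/Domain Adaptation (5) and Instruction Tuning/Alignment (5): partly relevant, through continual instruction tuning and cross-modal alignment; (6) Mechanistic Interpretability (5): the method centers on representation-space decomposition and concept isolation. The remaining keywords, such as MoE, SLMs, RAG, and RLHF, are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

To address concept entanglement and interference in lifelong editing of vision language models, the paper proposes Dynamic Subspace Concept Alignment, which makes precise edits through orthogonal subspace decomposition. With the base model frozen, it achieves 98% single-edit success, stays above 95% after 1000 sequential edits, and lowers the hallucination rate.

Abstract (Chinese translation)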

模型编辑旨在无需重新训练即可更新知识，以添加新概念并修改相关信息。终身编辑是一项具有挑战性的任务，容易破坏先前学习到的概念，对于视觉语言模型（Vision Language Models, VLMs）而言尤其如此，因为连续的编辑可能导致推理能力下降和跨模态错位。现有的基于门控适配器、激活编辑和参数合并技术的VLM知识编辑方法，旨在解决全量微调中出现的灾难性遗忘问题；然而，这些方法仍在VLM的共享表示空间中操作，其中概念相互纠缠，因此编辑会干扰其他不相关的概念。我们假设这种不稳定性持续存在，是因为当前方法主要通过算法优化来控制编辑，而非在结构上分离知识。我们提出了动态子空间概念对齐（Dynamic Subspace Concept Alignment, DSCA），该方法通过设计将表示空间分解为一组正交的语义子空间，并仅在转换后的这些子空间内进行编辑，从而缓解上述局限。这些子空间是通过对联合视觉语言表示进行增量聚类和主成分分析（PCA）获得的。这一过程在结构上隔离了概念，通过将隔离从软性训练目标转变为架构属性，实现了精确且无干扰的编辑。这些精准的编辑由一个多目标损失函数引导，以保持任务保真度、编辑局部性和跨模态对齐。在基础模型参数冻结的情况下，我们的方法实现了98%的单次编辑成功率，在1000次连续编辑后仍保持在95%以上，将幻觉率降低了3%至5%，并在持续指令调优基准测试中取得了最佳的后向迁移（BWT）分数。大量实验表明，DSCA在各种数据集和基准测试的持续终身编辑任务中，具备最先进的稳定性和知识保留能力。

Abstract

Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.

Keywords: Vision Language Models, Model Editing, Lifelong Editing, Catastrophic Forgetting, Parameter-efficient Fine-tuning, Hallucination Mitigation, Continual Learning, Knowledge Retention
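
The claim that edits confined to orthogonal subspaces do not interfere can be checked with a few lines of linear algebra. The sketch below is an illustration under simplifying assumptions, not the DSCA method: it takes a single PCA basis from one feature matrix and splits its components into two hypothetical "concept" subspaces, whereas the abstract describes incremental clustering followed by PCA. All names are hypothetical.

```python
import numpy as np


def pca_basis(features, k):
    """Top-k principal directions of the rows of `features`, as orthonormal columns."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # shape (d, k)


def project_edit(delta, basis):
    """Restrict an update vector to the subspace spanned by `basis`."""
    return basis @ (basis.T @ delta)


def coordinates(vector, basis):
    """Coordinates of `vector` inside the subspace spanned by `basis`."""
    return basis.T @ vector
```

If `basis_a` and `basis_b` are disjoint column groups of the same `pca_basis` output, then adding `project_edit(delta, basis_a)` to a representation leaves `coordinates(..., basis_b)` unchanged up to floating-point error, which is the structural isolation property the abstract describes.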

242. ❌ Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Authors: Mu Nan, Muquan Yu, Weijian Mai, Jacob S. Prince, Hossein Adeli, Rui Zhang, Jiahang Cao, Benjamin Becker, John A. Pyles, Margaret M. Henderson, Chunfeng Song, Nikolaus Kriegeskorte, Michael J. Tarr, Xiaoqing Hu, Andrew F. Luo Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

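Scoring rationale: The paper studies cross-subject generalization in visual decoding from brain signals and proposes a meta-optimized in-context learning method. It belongs to AI for Science (neuroscience) and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). The method's core is in-context learning, which is highly relevant to "In-context Learning OR Many-shot Learning" (10). The paper does not involve large models, novel deep learning principles, or the other keywords, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

To address cross-subject generalization in visual decoding from fMRI brain signals, the paper proposes a meta-optimized in-context learning method that decodes robustly for new individuals without fine-tuning and shows strong generalization across subjects and scanners.

Abstract (Chinese translation)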

视觉解码从脑信号是计算机视觉与神经科学交叉领域的关键挑战，需要能够桥接神经表征与视觉计算模型的方法。该领域的共同目标是实现可泛化的跨被试模型。达成此目标的主要障碍在于个体间神经表征存在显著差异，迄今为止这通常需要为每位被试训练定制模型或进行单独微调。为应对这一挑战，我们提出一种基于元优化的语义视觉解码方法，能够泛化至新被试而无需任何微调。仅通过以新个体的一小组图像-脑激活示例作为条件，我们的模型便能快速推断其独特的神经编码模式，从而实现稳健高效的视觉解码。我们的方法经过显式优化，专门用于对新被试编码模型进行上下文学习，并通过分层推理（即编码器逆变换）执行解码。首先，针对多个脑区，我们通过构建多组刺激与响应的上下文来估计每个体素的视觉响应编码器参数。其次，我们构建一个包含多体素编码器参数与响应值的上下文，以执行聚合功能逆变换。实验证明，该方法在不同视觉骨干网络上均展现出强大的跨被试与跨扫描仪泛化能力，且无需重新训练或微调。此外，我们的方法既不需要解剖结构对齐，也不要求刺激重叠。这项工作是迈向非侵入式脑解码通用基础模型的关键一步。

Abstract

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject’s encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.

Keywords: visual decoding, fMRI, cross-subject generalization, meta-optimized approach, in-context learning, brain signals, neural representations, functional inversion

243. ❌ The Impact of Dimensionality on the Stability of Node Embeddings

Authors: Tobias Schumacher, Simon Reichelt, Markus Strohmaier Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

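Scoring rationale: The paper studies how the dimensionality of neural network-based node embeddings affects their stability and performance, evaluating ASNE, DGI, GraphSAGE, node2vec, and VERSE. The scoring keywords all target large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization) or specific applications (such as AI for Science). The paper does not involve LLMs, novel deep learning principles, or applications of large models in other domains; it focuses on conventional graph representation learning, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper studies how node embedding dimensionality affects the stability and downstream performance of graph representation learning methods. It finds that stability follows different patterns across methods as dimensionality changes, and that maximum stability does not necessarily coincide with optimal performance.

Abstract (Chinese translation)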

先前的研究已证实，基于神经网络的节点嵌入方法即使在相同数据集上使用相同参数进行训练，仅因采用不同的训练随机种子也会产生不同的结果。然而，关键超参数（如嵌入维度）如何影响这种不稳定性尚未得到深入分析。本研究探讨了节点嵌入维度的变化如何影响其稳定性与下游任务性能。我们系统评估了五种广泛应用的方法——ASNE、DGI、GraphSAGE、node2vec和VERSE——在多种数据集和嵌入维度下的表现。我们从表征视角和功能视角评估稳定性，并同步进行性能评价。实验结果表明，嵌入稳定性随维度变化而显著不同，且不同方法呈现相异规律：例如node2vec和ASNE等方法在更高维度下趋于更稳定，而其他方法并未表现出相同趋势。此外，我们发现最高稳定性并不一定对应最优任务性能。这些发现凸显了谨慎选择嵌入维度的重要性，并为图表示学习中稳定性、性能与计算效率之间的权衡关系提供了新的见解。

Abstract

Previous work has established that neural network-based node embeddings return different outcomes when trained with identical parameters on the same dataset, just from using different training seeds. Yet, it has not been thoroughly analyzed how key hyperparameters such as embedding dimension could impact this instability. In this work, we investigate how varying the dimensionality of node embeddings influences both their stability and downstream performance. We systematically evaluate five widely used methods – ASNE, DGI, GraphSAGE, node2vec, and VERSE – across multiple datasets and embedding dimensions. We assess stability from both a representational perspective and a functional perspective, alongside performance evaluation. Our results show that embedding stability varies significantly with dimensionality, but we observe different patterns across the methods we consider: while some approaches, such as node2vec and ASNE, tend to become more stable with higher dimensionality, other methods do not exhibit the same trend. Moreover, we find that maximum stability does not necessarily align with optimal task performance. These findings highlight the importance of carefully selecting embedding dimension, and provide new insights into the trade-offs between stability, performance, and computational effectiveness in graph representation learning.

Keywords: node embeddings, embedding dimensionality, stability, graph representation learning, downstream performance, neural network, hyperparameters
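
One common way to quantify representational stability between two embeddings trained with different seeds is to compare each node's nearest neighbors, since raw coordinates are not comparable across runs. The sketch below shows such a k-nearest-neighbor Jaccard score; it is an assumed, illustrative measure and may differ from the metrics used in the paper, and the function names are hypothetical.

```python
import numpy as np


def knn_sets(embedding, k):
    """For each node, the set of its k nearest neighbors by cosine similarity."""
    normed = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a node is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :k]
    return [set(row) for row in nearest.tolist()]


def knn_jaccard(emb_a, emb_b, k=10):
    """Mean Jaccard overlap of k-NN sets between two embeddings of the same nodes."""
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return float(np.mean(scores))
```

A score near 1 means the two runs agree on local neighborhoods even if the embeddings differ by a rotation; computing it per embedding dimension gives one stability curve of the kind the abstract discusses.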

244. ❌ Persistence-Augmented Neural Networks

Authors: Elena Xinyi Wang, Arnur Nigmetov, Dmitriy Morozov Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

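Rating rationale: This paper focuses on combining topological data analysis (TDA) with deep learning and proposes a persistence-based data augmentation framework that encodes local gradient flow regions and their hierarchical evolution. Its core contribution is integrating local topological features into deep learning pipelines to improve performance in scientific applications such as histopathology image classification and 3D porous material regression. It is therefore unrelated to most keywords (LLMs, MoE, SFT, RLHF, RAG, and so on), which mainly concern large language models and related techniques. Two keywords are partially relevant: 1) "Mechanistic Interpretability OR Explainable AI": the paper emphasizes the interpretability of its method, since topological features give structured insight into the shape of data; this matches the goals of explainable AI but is not the core focus, hence a score of 5. 2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper is evaluated on bioinformatics (histopathology images) and materials science (3D porous materials), which falls directly under AI for Science, hence a score of 8. No other keyword is covered.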

!!! tip deepseek-chat TL;DR

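This paper proposes a persistence-based data augmentation framework that integrates local topological features into deep learning pipelines, addressing the difficulty topological data analysis has in preserving local geometric structure. It consistently outperforms baselines on histopathology image classification and 3D porous material regression.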

摘要翻译

拓扑数据分析（Topological Data Analysis, TDA）提供了描述数据形状的工具，但将拓扑特征整合到深度学习流程中仍具挑战性，尤其是在需要保留局部几何结构而非对其进行全局概括时。我们提出了一种基于持续同调的数据增强框架，该框架利用莫尔斯-斯梅尔复形（Morse-Smale complex）对局部梯度流区域及其层次演化进行编码。这种表示方法兼容卷积神经网络与图神经网络，能够在多个尺度上保留空间局部化的拓扑信息。重要的是，该增强过程本身具有高效性，其计算复杂度为 $O(n \log n)$，使其适用于大规模数据集。我们在组织病理学图像分类和三维多孔材料回归任务上评估了所提方法，其表现始终优于基线方法以及全局TDA描述符（如持续同调图像和持续同调景观）。我们还表明，通过剪枝层次结构的基层，可以在保持竞争力的性能的同时减少内存使用。这些结果凸显了局部结构化拓扑增强在跨数据模态的可扩展与可解释学习方面的潜力。

摘要 (Abstract)

Topological Data Analysis (TDA) provides tools to describe the shape of data, but integrating topological features into deep learning pipelines remains challenging, especially when preserving local geometric structure rather than summarizing it globally. We propose a persistence-based data augmentation framework that encodes local gradient flow regions and their hierarchical evolution using the Morse-Smale complex. This representation, compatible with both convolutional and graph neural networks, retains spatially localized topological information across multiple scales. Importantly, the augmentation procedure itself is efficient, with computational complexity $O(n \log n)$, making it practical for large datasets. We evaluate our method on histopathology image classification and 3D porous material regression, where it consistently outperforms baselines and global TDA descriptors such as persistence images and landscapes. We also show that pruning the base level of the hierarchy reduces memory usage while maintaining competitive performance. These results highlight the potential of local, structured topological augmentation for scalable and interpretable learning across data modalities.

Keywords: Topological Data Analysis, Persistence-based Augmentation, Morse-Smale Complex, Local Geometric Structure, Deep Learning Pipelines, Histopathology Image Classification, 3D Porous Material Regression, Scalable Learning
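
As a rough illustration of what "persistence" measures (this is not the paper's Morse-Smale pipeline), the sketch below computes 0-dimensional sublevel-set persistence pairs for a 1D signal with a sort plus union-find, which runs in roughly $O(n \log n)$ time. The function name `sublevel_persistence_1d` and the sample signal are assumptions made for this example.

```python
def sublevel_persistence_1d(values):
    """Return sorted (birth, death) pairs of 0-dim sublevel-set persistence.

    Each local minimum starts a component; when two components meet, the one
    with the larger birth value dies (elder rule). The global minimum never
    dies, so its death is infinity.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # process low to high
    parent = [None] * n  # None means "not yet added to the filtration"
    birth = {}           # birth value, looked up at component roots

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                young, old = find(i), find(j)
                if young == old:
                    continue
                if birth[young] < birth[old]:
                    young, old = old, young
                # Skip zero-persistence pairs (component born at this value).
                if birth[young] < values[i]:
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.append((birth[find(order[0])], float("inf")))
    return sorted(pairs)


signal = [3, 1, 4, 0, 5, 2, 6]  # hypothetical sample with three local minima
pairs = sublevel_persistence_1d(signal)
```

Each finite pair records a local minimum and the height at which its basin merges into a deeper one; long-lived pairs correspond to prominent features, short-lived ones to noise.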

245. ❌ Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks

Authors: Haokai Ma, Lee Yan Zhen, Gang Yang, Yunshan Ma, Ee-Chien Chang, Tat-Seng Chua Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08454v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

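Rating rationale: The paper studies the confidence faithfulness of large language models in high-stakes tasks and proposes HyTuning, a hybrid post-training framework that combines Reasoning Distillation with Reinforcement Learning from Internal Feedback. The highly relevant keywords are: large language models (the core object of study), post-training (the HyTuning framework), chain-of-thought reasoning (guidance from reasoning traces), and hallucination mitigation (addressing overconfidence). System 2 thinking and self-improvement are partially relevant, because the paper involves the reasoning process and model self-optimization. Other keywords such as MoE, quantization, and RAG are unrelated to the paper.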

!!! tip deepseek-chat TL;DR

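To address unfaithful confidence of large language models in high-stakes tasks, this paper proposes HyTuning, a hybrid post-training framework that uses a Progressive Reasoning Gain metric to adaptively combine Reasoning Distillation and Reinforcement Learning from Internal Feedback, improving accuracy and achieving confidence faithfulness under limited supervision.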

摘要翻译

大型语言模型正日益被部署于高风险任务中，其自信但错误的推断可能导致严重的现实危害，这使得此前被忽视的置信忠实性问题重新回到前沿。一种有前景的解决方案是将无监督的内部反馈强化学习与推理轨迹引导的推理蒸馏联合优化，但该方法可能面临三个持续存在的挑战：高质量训练语料的稀缺性、事实依据不足的过度自信，以及不加区分的融合机制可能放大错误更新。受人类从不确定性到确定性的置信积累过程启发，我们提出渐进推理增益指标，用于衡量推理步骤是否逐步增强对最终答案的支持。进一步，我们引入HyTuning——一种混合式后训练框架，该框架通过渐进推理增益式指标自适应地重新加权推理蒸馏与内部反馈强化学习，利用稀缺的监督推理轨迹作为稳定锚点，同时挖掘大量未标注查询以实现可扩展性。在多个领域专用及通用基准测试上的实验表明，HyTuning在有限监督条件下既能提升准确性，又能实现置信忠实性，印证了“以少近似多”的实际效应。

摘要 (Abstract)

Large language models are increasingly deployed in high-stakes tasks, where confident yet incorrect inferences may cause severe real-world harm, bringing the previously overlooked issue of confidence faithfulness back to the forefront. A promising solution is to jointly optimize unsupervised Reinforcement Learning from Internal Feedback (RLIF) with reasoning-trace-guided Reasoning Distillation (RD), which may face three persistent challenges: scarcity of high-quality training corpora, factually unwarranted overconfidence and indiscriminate fusion that amplifies erroneous updates. Inspired by the human confidence accumulation from uncertainty to certainty, we propose Progressive Reasoning Gain (PRG) to measure whether reasoning steps progressively strengthen support for the final answer. Furthermore, we introduce HyTuning, a hybrid post-training framework that adaptively reweights RD and RLIF via a PRG-style metric, using scarce supervised reasoning traces as a stable anchor while exploiting abundant unlabeled queries for scalability. Experiments on several domain-specific and general benchmarks demonstrate that HyTuning improves accuracy while achieving confidence faithfulness under limited supervision, supporting a practical “Less Approximates More” effect.

Keywords: Large Language Models, Confidence Faithfulness, Post-training, Hybrid Framework, Reasoning Distillation, Reinforcement Learning from Internal Feedback, High-stakes Tasks, Progressive Reasoning Gain

246. ❌ What a Comfortable World: Ergonomic Principles Guided Apartment Layout Generation

Authors: Piotr Nieciecki, Aleksander Plocharski, Przemyslaw Musialski Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

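Rating rationale: The paper studies transformer-based apartment layout generation, improving generation quality by introducing architectural ergonomic principles as differentiable loss functions. Although it uses a transformer architecture, the paper targets a specific application in architectural design and involves neither large language models (LLMs) nor innovations in the underlying principles of deep learning. All scoring keywords relate directly to large language models, deep learning principles, or AI applications in science, whereas this paper is a cross-disciplinary application of generative models to architectural design and has no direct connection to these keywords.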

!!! tip deepseek-chat TL;DR

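This paper proposes a transformer-based apartment layout generation method that uses architectural ergonomic principles as differentiable loss functions to guide training, producing layouts that outperform baselines in ergonomic compliance while maintaining high structural validity.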

摘要翻译

当前数据驱动的平面图生成方法常复现现实训练数据集中存在的人体工学低效问题。为解决此问题，我们提出一种创新方法，将建筑设计原则直接整合到基于Transformer的生成过程中。我们依据文献中既定的建筑标准，构建了可微分的损失函数，以优化房间的相邻性与邻近度。通过在训练过程中运用这些人体工学先验知识引导模型，我们的方法生成的布局在宜居性指标上得到显著提升。对比评估表明，本方法在人体工学合规性方面优于基线模型，同时保持了较高的结构有效性。

摘要 (Abstract)

Current data-driven floor plan generation methods often reproduce the ergonomic inefficiencies found in real-world training datasets. To address this, we propose a novel approach that integrates architectural design principles directly into a transformer-based generative process. We formulate differentiable loss functions based on established architectural standards from literature to optimize room adjacency and proximity. By guiding the model with these ergonomic priors during training, our method produces layouts with significantly improved livability metrics. Comparative evaluations show that our approach outperforms baselines in ergonomic compliance while maintaining high structural validity.

Keywords: floor plan generation, transformer-based generative process, ergonomic principles, differentiable loss functions, architectural design, room adjacency optimization, livability metrics, data-driven methods

247. ❌ Provably Adaptive Linear Approximation for the Shapley Value and Beyond

Authors: Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08438v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

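Rating rationale: The paper studies approximation algorithms for semi-values such as the Shapley value. It belongs to cooperative game theory and machine learning interpretability, but does not involve large models, deep learning, AI for Science, or the other keywords. It focuses on algorithmic complexity, space constraints, and error analysis, and has no direct connection to any scoring keyword.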

!!! tip deepseek-chat TL;DR

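This paper tackles the exponential computational cost of semi-values such as the Shapley value in large-scale applications. It develops a linear-space algorithm that needs $O(\frac{n}{\epsilon^{2}}\log\frac{1}{\delta})$ utility queries, and introduces Adalina, an adaptive, linear-time, linear-space randomized algorithm that theoretically achieves improved mean square error.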

摘要翻译

沙普利值及其更广义的半值族在各类归因问题中备受关注。一个长期存在的基础性挑战在于其高效近似计算，因为精确计算通常需要随参与者数量$n$呈指数级增长的效用查询次数。为应对大规模应用的需求，本文探索在$\Theta(n)$空间约束下高效近似半值的极限。基于向量集中不等式，我们建立了一个理论框架，该框架能够为现有的无偏随机算法提供更严格的查询复杂度分析。在此框架内，我们系统性地提出了一种线性空间算法，该算法仅需$O(\frac{n}{\epsilon^{2}}\log\frac{1}{\delta})$次效用查询即可确保对于所有常用半值满足$P(\|\hat{\boldsymbol{\phi}}-\boldsymbol{\phi}\|_{2}\geq\epsilon)\leq\delta$。特别地，我们的框架自然地衔接了OFA、无偏核SHAP（unbiased kernelSHAP）、SHAP-IQ以及回归调整方法，并明确界定了配对采样何时具有优势。此外，该算法允许针对每个特定效用函数显式最小化均方误差。基于此，我们提出了首个自适应的、线性时间、线性空间的随机算法Adalina，该算法在理论上实现了更优的均方误差。所有理论结果均通过实验验证。

摘要 (Abstract)

The Shapley value, and its broader family of semi-values, has received much attention in various attribution problems. A fundamental and long-standing challenge is their efficient approximation, since exact computation generally requires an exponential number of utility queries in the number of players $n$. To meet the challenges of large-scale applications, we explore the limits of efficiently approximating semi-values under a $\Theta(n)$ space constraint. Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper query complexities for existing unbiased randomized algorithms. Within this framework, we systematically develop a linear-space algorithm that requires $O(\frac{n}{\epsilon^{2}}\log\frac{1}{\delta})$ utility queries to ensure $P(\|\hat{\boldsymbol{\phi}}-\boldsymbol{\phi}\|_{2}\geq\epsilon)\leq\delta$ for all commonly used semi-values. In particular, our framework naturally bridges OFA, unbiased kernelSHAP, SHAP-IQ and the regression-adjusted approach, and definitively characterizes when paired sampling is beneficial. Moreover, our algorithm allows explicit minimization of the mean square error for each specific utility function. Accordingly, we introduce the first adaptive, linear-time, linear-space randomized algorithm, Adalina, that theoretically achieves improved mean square error. All of our theoretical findings are experimentally validated.

Keywords: Shapley value, semi-values, approximation algorithms, linear-space algorithm, adaptive algorithm, utility queries, mean square error, randomized algorithms
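
To make the approximation problem in this abstract concrete, here is a minimal sketch of the classic permutation-sampling estimator for the Shapley value, compared against exact enumeration on a tiny game. It is a standard unbiased baseline that uses $\Theta(n)$ extra space, not the Adalina algorithm from the paper; the game `toy_utility` and all function names are assumptions made for this example.

```python
import itertools
import math
import random


def marginal_pass(order, utility, phi):
    """Add each player's marginal contribution along one ordering to phi."""
    coalition = set()
    prev = utility(frozenset(coalition))
    for player in order:
        coalition.add(player)
        cur = utility(frozenset(coalition))
        phi[player] += cur - prev
        prev = cur


def exact_shapley(n, utility):
    """Exact Shapley values via all n! orderings (feasible only for tiny n)."""
    phi = [0.0] * n
    for order in itertools.permutations(range(n)):
        marginal_pass(order, utility, phi)
    return [p / math.factorial(n) for p in phi]


def sampled_shapley(n, utility, num_permutations, seed=0):
    """Unbiased Monte Carlo estimate using n + 1 utility queries per ordering."""
    rng = random.Random(seed)
    phi = [0.0] * n
    players = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(players)
        marginal_pass(players, utility, phi)
    return [p / num_permutations for p in phi]


def toy_utility(coalition):
    """Hypothetical game: additive weights plus a bonus if players 0 and 1 join."""
    weights = [1.0, 2.0, 3.0, 4.0]
    bonus = 2.0 if {0, 1} <= coalition else 0.0
    return sum(weights[i] for i in coalition) + bonus


exact = exact_shapley(4, toy_utility)
approx = sampled_shapley(4, toy_utility, num_permutations=2000)
```

Because every sampled ordering telescopes to the value of the full coalition minus the value of the empty one, the estimates always sum to that difference; only the split among players carries sampling error.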

248. ❌ Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

Authors: Simon Zhang, Ryan P. DeMilt, Kun Jin, Cathy H. Xia Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

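Rating rationale: The paper studies out-of-distribution generalization in graph classification and proposes RIA, a regularization method based on adversarial training. It focuses on graph neural networks, data augmentation, adversarial training, and distribution shift, and has no direct connection to the scoring keywords, all of which concern large models, deep learning principles, or AI applications in science. The paper does not discuss large models, language models, model training techniques, reasoning methods, agent systems, or AI for Science applications.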

!!! tip deepseek-chat TL;DR

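To address covariate shift in graph classification, this paper proposes RIA, a regularization method based on adversarial label-invariant data augmentations and solved with an alternating gradient descent-ascent algorithm. In experiments on synthetic and natural distribution shifts, it achieves high accuracy compared with out-of-distribution baselines.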

摘要翻译

分布外泛化问题出现在表征学习遭遇分布偏移时。当训练数据与测试数据来自不同环境时，这种现象在实践中频繁发生。协变量偏移是一种仅发生在输入数据中而概念分布保持不变的分布偏移类型。我们提出RIA——基于对抗训练的不变性正则化方法，这是一种针对协变量偏移下分布外泛化的新方法。受$Q$学习的启发，该方法通过对训练数据环境进行对抗性探索来构建新环境。这些新环境由对抗性标签不变数据增强技术生成，可防止模型退化为仅适应训练分布的普通学习器。本方法可与许多现有处理协变量偏移的分布外泛化方法兼容，这些方法均可表述为约束优化问题。我们开发了交替梯度下降-上升算法来求解该问题，并在多种合成与自然分布偏移场景下进行了大规模的分布外图分类实验。实验表明，相较于现有分布外泛化基线方法，我们的方法能够实现更高的预测准确率。

摘要 (Abstract)

Out-of-distribution (OoD) generalization occurs when representation learning encounters a distribution shift. This occurs frequently in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in the input data, while the concept distribution stays invariant. We propose RIA (Regularization for Invariance with Adversarial training), a new method for OoD generalization under covariate shift. Motivated by an analogy to $Q$-learning, it performs an adversarial exploration for training data environments. These new environments are induced by adversarial label invariant data augmentations that prevent a collapse to an in-distribution trained learner. It works with many existing OoD generalization methods for covariate shift that can be formulated as constrained optimization problems. We develop an alternating gradient descent-ascent algorithm to solve the problem, and perform extensive experiments on OoD graph classification for various kinds of synthetic and natural distribution shifts. We demonstrate that our method can achieve high accuracy compared with OoD baselines.

Keywords: Out-of-distribution generalization, Graph classification, Adversarial training, Data augmentation, Covariate shift, Regularization, Alternating gradient descent-ascent, Distribution shift
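
The abstract mentions an alternating gradient descent-ascent algorithm. The sketch below shows only that generic optimizer pattern on a toy saddle-point problem, $\min_x \max_y \; \tfrac{1}{2}x^2 + xy - \tfrac{1}{2}y^2$, and says nothing about how RIA defines its learner, adversary, or constraints. The objective, step size, and names are assumptions chosen for illustration.

```python
def alternating_gda(grad_x, grad_y, x0, y0, lr=0.1, steps=500):
    """Alternate a descent step on x with an ascent step on y."""
    x, y = x0, y0
    for _ in range(steps):
        x = x - lr * grad_x(x, y)  # minimizing player moves first
        y = y + lr * grad_y(x, y)  # maximizing player reacts to the updated x
    return x, y


# Toy objective f(x, y) = 0.5 * x**2 + x * y - 0.5 * y**2 (saddle at the origin).
def grad_x(x, y):
    return x + y


def grad_y(x, y):
    return x - y


x_star, y_star = alternating_gda(grad_x, grad_y, x0=1.0, y0=-2.0)
```

On this strongly convex-concave toy problem the iterates spiral into the saddle point at the origin; on harder min-max problems, convergence generally depends on the step size and on problem structure.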

249. ❌ Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training

Authors: Constantin Le Cleï, Nils Thürey, Xiaoxiang Zhu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

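Rating rationale: The paper focuses on optimizing diffusion models for partial differential equation (PDE) emulation; its core contribution is an adaptive noise schedule and a proxy unrolled training method that improve accuracy and efficiency. The keywords all relate directly to large language models (LLMs), their training and alignment techniques, inference optimization, agent systems, or specific compression methods, whereas this paper studies diffusion models in scientific computing rather than LLMs. The only slightly relevant keyword is "AI for Science": the paper is a scientific AI application (PDE emulation) but not an application of LLMs to science, hence a score of 5 (partially relevant). All other keywords are unrelated to the paper's topic and score 0.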

!!! tip deepseek-chat TL;DR

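Targeting the limitations of autoregressive PDE diffusion models in single-step accuracy and unrolled-training cost, this paper proposes an adaptive noise schedule framework and a proxy unrolled training method that significantly improve short-term accuracy and long-term stability.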

摘要翻译

条件扩散模型是模拟复杂时空动力学的强大替代工具，但其在高精度任务中往往难以达到确定性神经模拟器的准确度。本研究针对自回归偏微分方程扩散模型的两个关键局限性展开：其欠佳的单步精度以及展开训练带来的过高计算成本。首先，我们分析了噪声调度、重构误差降低率与扩散暴露偏差之间的关系，证明标准调度方案会导致重构误差欠优。基于这一认识，我们提出一种自适应噪声调度框架，通过动态约束模型的暴露偏差来最小化推理重构误差。我们进一步证明，这种优化后的调度方案能够支持一种快速的代理展开训练方法，在不进行完整马尔可夫链采样的前提下稳定长期推演。在包括受迫纳维-斯托克斯方程、Kuramoto-Sivashinsky方程和跨音速流动在内的多种基准测试中，所提出的两种方法均在短期精度和长期稳定性上显著超越了扩散模型与确定性基线模型。

摘要 (Abstract)

Conditional Diffusion Models are powerful surrogates for emulating complex spatiotemporal dynamics, yet they often fail to match the accuracy of deterministic neural emulators for high-precision tasks. In this work, we address two critical limitations of autoregressive PDE diffusion models: their sub-optimal single-step accuracy and the prohibitive computational cost of unrolled training. First, we characterize the relationship between the noise schedule, the reconstruction error reduction rate and the diffusion exposure bias, demonstrating that standard schedules lead to suboptimal reconstruction error. Leveraging this insight, we propose an *Adaptive Noise Schedule* framework that minimizes inference reconstruction error by dynamically constraining the model’s exposure bias. We further show that this optimized schedule enables a fast *Proxy Unrolled Training* method to stabilize long-term rollouts without the cost of full Markov Chain sampling. Both proposed methods enable significant improvements in short-term accuracy and long-term stability over diffusion and deterministic baselines on diverse benchmarks, including forced Navier-Stokes, Kuramoto-Sivashinsky and Transonic Flow.

Keywords: Diffusion Models, PDE Emulation, Noise Schedule, Reconstruction Error, Unrolled Training, Exposure Bias, Spatiotemporal Dynamics, Navier-Stokes

250. ❌ EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment

Authors: Qiance Tang, Ziqi Wang, Jieyu Lin, Ziyun Li, Barbara De Salvo, Sai Qian Zhang Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

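Rating rationale: The paper "EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment" focuses on computer vision, specifically egocentric (first-person) video understanding, covering long-context video analysis, human attention signals (gaze data), and augmented reality (AR) applications. The title and abstract do not mention large language models (LLMs), deep learning principles, model training methods (such as pre-training, fine-tuning, or alignment), reasoning techniques (such as CoT or MCTS), model optimization (such as quantization or compression), agent systems, or AI for Science. All scoring keywords relate directly to large models and deep learning techniques, whereas the core of this paper is video dataset construction and benchmarking, a traditional computer vision task that involves neither large model techniques nor novel applications of them in science. All keyword relevance scores are therefore 0.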

!!! tip deepseek-chat TL;DR

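This paper presents the EgoEverything benchmark, which generates questions using human attention signals and thereby provides an evaluation dataset for long-context egocentric video understanding in AR environments that better reflects natural human behavior.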

摘要翻译

长上下文第一人称视频理解近期引发了广泛的研究关注，其中增强现实（AR）被强调为其最重要的应用领域之一。然而，由于需要对长时序背景和多样化、非结构化的活动进行推理，该任务仍极具挑战性。尽管已有若干基准数据集，但大多数第一人称数据集依赖于人体佩戴的摄像头，且主要关注视觉内容，在构建与视频相关的查询时，对用户潜在行为的考量有限。EgoEverything 是一个在生成问题时，通过利用从注视数据中抽象出的人类注意力信号，来明确考虑人类行为的基准。它包含超过5000个多项选择题答案对，涵盖超过100小时的视频。通过在问题生成过程中整合人类注意力信号，该基准更真实地捕捉了自然的人类行为，并为AR中的长上下文第一人称视频理解提供了一个现实的评估场景。

摘要 (Abstract)

Long context egocentric video understanding has recently attracted significant research attention, with augmented reality (AR) highlighted as one of its most important application domains. Nevertheless, the task remains highly challenging due to the need for reasoning over extended temporal contexts and diverse, unstructured activities. Although several benchmarks exist, most egocentric datasets rely on human worn cameras and focus mainly on visual content, with limited consideration of underlying user behavior when forming video-related queries. EgoEverything is a benchmark that explicitly considers human behavior by leveraging human attention signals, abstracted from gaze data, when generating questions. It comprises over 5,000 multiple choice question answer pairs, spanning more than 100 hours of video. By integrating human attention signals during question generation, it more faithfully captures natural human behavior and offers a realistic evaluation setting for long-context egocentric video understanding in AR.

Keywords: egocentric video understanding, long context, human behavior, attention signals, gaze data, augmented reality, benchmark, multiple choice questions

251. ❌ Leveraging Complementary Embeddings for Replay Selection in Continual Learning with Small Buffers

Authors: Danit Yanowsky, Daphna Weinshall Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08336v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

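Rating rationale: The paper addresses sample selection for the replay buffer in continual learning and proposes a graph-based method that combines supervised and self-supervised embeddings. Although it belongs to deep learning, its content is unrelated to the scoring keywords, which center on large model techniques, training methods, inference optimization, alignment, and applications. The paper does not involve large models, language models, training techniques, reasoning methods, or AI for Science applications; it is purely a continual learning study in computer vision.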

!!! tip deepseek-chat TL;DR

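This paper proposes MERS, a graph-based replay selection method that combines supervised and self-supervised embeddings. It consistently improves performance in memory-constrained continual learning and outperforms existing methods especially in low-memory regimes.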

摘要翻译

灾难性遗忘是持续学习中的一个核心挑战。在具有严格内存限制的基于回放的持续学习方法中，性能关键取决于回放缓冲区的样本选择策略。现有方法大多利用在监督目标下学习到的嵌入来构建记忆缓冲区。然而，与类别无关的自监督表征通常编码了丰富且与类别相关的语义信息，这一点常被忽视。我们提出了一种新方法——多嵌入回放选择，该方法采用基于图的方法替代原有的缓冲区选择模块，该方法整合了监督嵌入和自监督嵌入。实验结果表明，在一系列持续学习算法中，相较于当前最先进的选择策略，该方法均取得了稳定的性能提升，在低内存限制下的增益尤为显著。在CIFAR-100和TinyImageNet数据集上，MERS在不增加模型参数或回放样本量的情况下，超越了单一嵌入的基线方法，使其成为基于回放的持续学习的一种实用、即插即用的增强方案。

摘要 (Abstract)

Catastrophic forgetting remains a key challenge in Continual Learning (CL). In replay-based CL with severe memory constraints, performance critically depends on the sample selection strategy for the replay buffer. Most existing approaches construct memory buffers using embeddings learned under supervised objectives. However, class-agnostic, self-supervised representations often encode rich, class-relevant semantics that are overlooked. We propose a new method, Multiple Embedding Replay Selection, MERS, which replaces the buffer selection module with a graph-based approach that integrates both supervised and self-supervised embeddings. Empirical results show consistent improvements over SOTA selection strategies across a range of continual learning algorithms, with particularly strong gains in low-memory regimes. On CIFAR-100 and TinyImageNet, MERS outperforms single-embedding baselines without adding model parameters or increasing replay volume, making it a practical, drop-in enhancement for replay-based continual learning.

Keywords: Continual Learning, Replay Buffer, Sample Selection, Supervised Embeddings, Self-supervised Embeddings, Graph-based Method, Memory Constraints, Catastrophic Forgetting

252. ❌ An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations

Authors: Yichen Gao, Altay Unal, Akshay Rangamani, Zhihui Zhu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

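Scoring rationale: The paper studies machine unlearning methods, focusing on analysis of a model's internal representations, and is unrelated to most of the large-model keywords. It has some relevance to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points) because it involves fine-tuning methods, and some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points) because it analyzes internal feature representations. No other keyword is covered (0 points).

!!! tip deepseek-chat TL;DR

The paper finds that current machine unlearning methods create an illusion of unlearning mainly through feature-classifier misalignment, and proposes unlearning methods based on class-mean features that more effectively reduce forgotten information in the representations.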

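Abstract (translated)

Although numerous machine unlearning methods have been developed in recent years and show promising results in removing the influence of forgotten data, classes, or concepts, they are also markedly fragile: for example, simple fine-tuning can inadvertently reactivate erased concepts. This paper resolves the contradiction by examining the internal representations of unlearned models, in contrast to prior work that focused mainly on output-level behavior. Our analysis shows that many state-of-the-art machine unlearning methods appear successful mainly because of a misalignment between last-layer features and the classifier, which we call feature-classifier misalignment. In fact, the hidden features remain highly discriminative, and simple linear probing recovers accuracy close to that of the original model. Assuming neural collapse in the original model, we further show that adjusting only the classifier achieves negligible forget accuracy while preserving retain accuracy, and we confirm this with classifier-only fine-tuning experiments. Motivated by these findings, we propose machine unlearning methods based on a class-mean features classifier, which explicitly enforces alignment between features and classifier. Experiments on standard benchmarks show that class-mean-feature unlearning effectively reduces forgotten information in the representations while maintaining high retain accuracy, underscoring the need for faithful representation-level evaluation of machine unlearning methods.

Abstract

While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable; for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.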

Keywords: machine unlearning, internal representations, feature-classifier misalignment, neural collapse, class-mean features, forget accuracy, retain accuracy, linear probing
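The abstract does not spell out how the class-mean features (CMF) classifier is built, so the following is only a generic nearest-class-mean sketch of the idea that a classifier can be read directly off the features. The function names, the toy data, and the use of squared Euclidean distance are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict

def class_means(features, labels):
    """Average the feature vectors belonging to each class."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(means, x):
    """Assign x to the class whose mean is nearest (squared Euclidean distance, an assumption)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(means, key=lambda y: sqdist(means[y], x))

# Toy last-layer features for two classes (hypothetical data).
features = [[0.0, 0.1], [0.2, -0.1], [1.0, 1.1], [0.9, 0.8]]
labels = [0, 0, 1, 1]
means = class_means(features, labels)
```

Because such a classifier is a fixed function of the features, it cannot drift away from them the way a separately trained linear head can, which is the kind of alignment the abstract describes.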

253. ❌ Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations

Authors: Finn Sommer, Vamika Rathi, Sebastian Goetschel, Daniel Ruprecht Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

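Scoring rationale: The paper studies the approximation of the Basset force in the Maxey-Riley-Gatignol equations using neural networks and the concept of universal differential equations, which places it in computational fluid dynamics and scientific computing. The keywords all concern large models, deep-learning techniques, or specific AI applications (such as bioinformatics), whereas the paper uses a neural network only as an approximation tool and does not involve large models, novel deep-learning techniques, or the specific AI for Science applications named (bioinformatics, cheminformatics). Apart from 'AI for Science', which receives 5 points for the broad scientific application, every keyword receives 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes approximating the Basset force in the Maxey-Riley-Gatignol equations with neural networks and universal differential equations, turning the original problem into a system of ordinary differential equations that standard numerical methods can solve.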

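Abstract (translated)

The Maxey-Riley-Gatignol equations (MaRGE) describe the motion of spherical inertial particles in a fluid. They contain the Basset force, an integral term that models history effects caused by wake formation and boundary-layer effects. As a result, the force acting on a particle depends on its past trajectory, which complicates the numerical solution of MaRGE. The Basset force is therefore often neglected, even though substantial evidence shows that it affects the motion patterns of modelled particles both quantitatively and qualitatively. Using the concept of universal differential equations, we propose approximating the history term with neural networks, so that MaRGE is approximated by a system of ordinary differential equations that can be solved with standard numerical solvers such as Runge-Kutta methods.

Abstract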

The Maxey-Riley-Gatignol equations (MaRGE) model the motion of spherical inertial particles in a fluid. They contain the Basset force, an integral term which models history effects due to the formation of wakes and boundary layer effects. This causes the force that acts on a particle to depend on its past trajectory and complicates the numerical solution of MaRGE. Therefore, the Basset force is often neglected, despite substantial evidence that it has both quantitative and qualitative impact on the movement patterns of modelled particles. Using the concept of universal differential equations, we propose an approximation of the history term via neural networks which approximates MaRGE by a system of ordinary differential equations that can be solved with standard numerical solvers like Runge-Kutta methods.

Keywords: Maxey-Riley-Gatignol equations, Basset force, universal differential equations, neural networks, history term approximation, numerical solution, inertial particles, fluid dynamics
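To make the "ODE plus neural closure" idea concrete, here is a minimal sketch of a universal differential equation integrated with a classical Runge-Kutta (RK4) step. Everything specific is an assumption: the right-hand side is a toy linear drag on a scalar relative velocity rather than the actual MaRGE terms, and `history_net` is an untrained, randomly initialised stand-in for whatever network the authors train.

```python
import math
import random

random.seed(0)
HIDDEN = 4
# Untrained, randomly initialised weights of a one-hidden-layer network (illustrative only).
W1 = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
B1 = [0.0] * HIDDEN
W2 = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]

def history_net(w):
    """Hypothetical neural closure standing in for the history term."""
    return sum(w2 * math.tanh(w1 * w + b1) for w1, b1, w2 in zip(W1, B1, W2))

def rhs(w, tau=1.0):
    """Toy right-hand side: linear drag on the relative velocity w plus the closure."""
    return -w / tau + history_net(w)

def rk4_step(f, w, dt):
    """One classical fourth-order Runge-Kutta step for dw/dt = f(w)."""
    k1 = f(w)
    k2 = f(w + 0.5 * dt * k1)
    k3 = f(w + 0.5 * dt * k2)
    k4 = f(w + dt * k3)
    return w + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def integrate(w0, dt=0.01, steps=500):
    """Integrate from w0 and return the whole trajectory."""
    traj = [w0]
    for _ in range(steps):
        traj.append(rk4_step(rhs, traj[-1], dt))
    return traj

trajectory = integrate(1.0)
```

The point of the construction is that the state at each step depends only on the current value, so no integral over the past trajectory has to be stored or re-evaluated.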

254. ❌ Introducing Echo Networks for Computational Neuroevolution

Authors: Christian Kroos, Fabian Küch Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

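Scoring rationale: The paper studies small recurrent neural networks for extreme edge computing (Echo Networks) and their evolutionary algorithms, which belongs to neural network architecture and evolutionary computation. Its content is entirely unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). The only possible relevance is to 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is evaluated on ECG signal classification, which falls within bioinformatics or scientific AI applications. However, the core of the paper is the network architecture and the evolutionary method rather than an AI method developed specifically for bioinformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes Echo Networks, a compact recurrent neural network architecture for extreme edge computing whose matrix encoding allows more systematic evolutionary operators, and applies it successfully to electrocardiogram signal classification.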

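Abstract (translated)

In extreme edge applications, minimal networks of only a few dozen artificial neurons are highly valuable for event detection and classification in discrete-time signals. Feed-forward networks, recurrent neural networks, and convolutional networks evolved with evolutionary algorithms can all succeed in this setting, but with the standard direct genetic encoding of weights (as in the classic NEAT algorithm) mutation and recombination lack systematicity. We therefore propose Echo Networks, a recurrent network structure that consists solely of a connection matrix: the source neurons of the synapses correspond to rows, the destination neurons to columns, and the weights to the matrix entries. The network has no layers, connections between neurons can be bidirectional, and technically all connections are recurrent. Input and output can be assigned to any neuron and only require an additional (optional) function in the computational path, for example a sigmoid to obtain a binary classification output. We have applied Echo Networks successfully to electrocardiogram signal classification, but their most promising property is that the genome can be represented as a single matrix, which allows matrix operations and factorizations to serve as mutation and recombination operators.

Abstract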

For applications on the extreme edge, minimal networks of only a few dozen artificial neurons for event detection and classification in discrete time signals would be highly desirable. Feed-forward networks, RNNs, and CNNs evolved through evolutionary algorithms can all be successful in this respect but pose the problem of allowing little systematicity in mutation and recombination if the standard direct genetic encoding of the weights is used (as for instance in the classic NEAT algorithm). We therefore introduce Echo Networks, a type of recurrent network that consists of the connection matrix only, with the source neurons of the synapses represented as rows, destination neurons as columns and weights as entries. There are no layers, and connections between neurons can be bidirectional but are technically all recurrent. Input and output can be arbitrarily assigned to any of the neurons and only use an additional (optional) function in their computational path, e.g., a sigmoid to obtain a binary classification output. We evaluated Echo Networks successfully on the classification of electrocardiography signals but see the most promising potential in their genome representation as a single matrix, allowing matrix computations and factorisations as mutation and recombination operators.

Keywords: Echo Networks, computational neuroevolution, recurrent networks, evolutionary algorithms, edge computing, ECG classification, matrix encoding, genome representation
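The abstract fixes only the layout (rows are source neurons, columns are destination neurons, entries are weights) and the optional sigmoid on the output. The sketch below fills the gaps with assumptions: a synchronous update, a `tanh` activation, and additive injection of the input sample into one neuron. It is meant to show how a single matrix can define the whole network, not to reproduce the authors' update rule.

```python
import math

def step(state, W, inputs):
    """One discrete-time update; W[i][j] is the weight from source neuron i to destination neuron j."""
    n = len(state)
    new_state = []
    for j in range(n):
        total = sum(state[i] * W[i][j] for i in range(n)) + inputs.get(j, 0.0)
        new_state.append(math.tanh(total))  # activation choice is an assumption
    return new_state

def run(W, signal, input_neuron=0, output_neuron=2):
    """Feed a signal sample by sample, then read one neuron through a sigmoid."""
    state = [0.0] * len(W)
    for sample in signal:
        state = step(state, W, {input_neuron: sample})
    return 1.0 / (1.0 + math.exp(-state[output_neuron]))

# A three-neuron genome: the matrix is the entire network description (hypothetical weights).
W = [[0.0, 0.8, 0.0],
     [-0.5, 0.0, 1.2],
     [0.3, 0.0, 0.1]]
prob = run(W, [0.0, 1.0, 0.5, -0.2])
```

Since the genome is just `W`, a mutation or recombination operator can be any matrix-valued function of one or two such matrices.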

255. ❌ Long-Term Embeddings for Balanced Personalization

Authors: Andrii Dzhoha, Egor Malykh Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08181v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

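Scoring rationale: The paper studies long-term preference modelling and feature consistency in sequential recommender systems, covering transformer architectures, embedding methods, and production deployment issues. Although it mentions transformers and causal language modeling, its core content has no direct connection to the themes of the scoring keywords (large-model techniques, training methods, inference optimization, alignment, agents, scientific AI applications, and so on). None of the keywords appears in the title or abstract, nor are the related concepts addressed, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address short-term intent dominance and feature-version inconsistency in sequential recommender systems, the paper proposes the Long-Term Embeddings framework, which models stable long-term preferences and ensures cross-version compatibility; online experiments show significant gains in user engagement and business metrics.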

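Abstract (translated)

Modern transformer-based sequential recommenders excel at capturing short-term intent but are often affected by recency bias and overlook stable long-term interests. Extending the sequence length is an intuitive fix, but it is computationally inefficient, and recent interactions tend to dominate the model's attention. We therefore propose long-term embeddings as a high-inertia contextual anchor to bridge this gap. We study a key production challenge: the point-in-time consistency problem caused by infrastructure constraints, since feature stores typically maintain only a single "live" version of each feature. This produces an offline-online mismatch during deployments and rollbacks, because the model is forced to process evolved representations it never saw during training. To solve this, we introduce a long-term embedding framework that constrains embeddings to a fixed semantic basis of content-based item representations, ensuring cross-version compatibility. We also investigate integration strategies for causal language modeling to handle the data leakage that can occur when the long-term embedding and the transformer's short-term sequence share a temporal horizon. We evaluate two representations: a heuristic average, and an asymmetric autoencoder with a fixed decoder grounded in the semantic basis, which allows behavioral fine-tuning while maintaining stability. Online A/B tests on the Zalando platform show that integrating the long-term embedding into the model as a contextual prefix token with a lagged window significantly improves user engagement and business metrics.

Abstract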

Modern transformer-based sequential recommenders excel at capturing short-term intent but often suffer from recency bias, overlooking stable long-term preferences. While extending sequence lengths is an intuitive fix, it is computationally inefficient, and recent interactions tend to dominate the model’s attention. We propose Long-Term Embeddings (LTE) as a high-inertia contextual anchor to bridge this gap. We address a critical production challenge: the point-in-time consistency problem caused by infrastructure constraints, as feature stores typically host only a single “live” version of features. This leads to an offline-online mismatch during model deployments and rollbacks, as models are forced to process evolved representations they never saw during training. To resolve this, we introduce an LTE framework that constrains embeddings to a fixed semantic basis of content-based item representations, ensuring cross-version compatibility. Furthermore, we investigate integration strategies for causal language modeling, considering the data leakage issue that occurs when the LTE and the transformer’s short-term sequence share a temporal horizon. We evaluate two representations: a heuristic average and an asymmetric autoencoder with a fixed decoder grounded in the semantic basis to enable behavioral fine-tuning while maintaining stability. Online A/B tests on Zalando demonstrate that integrating LTE as a contextual prefix token using a lagged window yields significant uplifts in both user engagement and financial metrics.

Keywords: sequential recommenders, long-term embeddings, transformer, recency bias, feature consistency, causal language modeling, offline-online mismatch, A/B testing
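One way to read the "heuristic average" combined with a "lagged window" is sketched below: the long-term embedding is the mean of fixed content vectors for items the user interacted with before a cutoff, so it does not overlap the recent sequence the transformer sees. The cutoff semantics, the function name, and the toy vectors are assumptions; the abstract does not give the exact construction.

```python
def long_term_embedding(history, item_vectors, now, lag):
    """Average the content vectors of items interacted with before (now - lag).

    history: list of (timestamp, item_id) pairs.
    item_vectors: fixed content-based item representations (the "semantic basis").
    Returns None when no interaction is old enough.
    """
    cutoff = now - lag
    old = [item_vectors[item] for ts, item in history if ts < cutoff]
    if not old:
        return None
    dim = len(old[0])
    return [sum(v[d] for v in old) / len(old) for d in range(dim)]

# Hypothetical data: three items, one of them too recent to be included.
item_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
history = [(1, "a"), (2, "b"), (9, "c")]
lte = long_term_embedding(history, item_vectors, now=10, lag=3)
```

Keeping the item vectors fixed is what makes embeddings computed under different model versions comparable, and the lag keeps the prefix from leaking information about the interactions being predicted.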

256. ❌ Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

Authors: Teng Pang, Zhiqiang Dong, Yan Zhang, Rongjian Xu, Guoqiang Wu, Yilong Yin Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08174v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

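Scoring rationale: The paper focuses on offline multi-agent reinforcement learning (MARL) and proposes VGM²P, a flow-based policy learning framework that guides agent collaboration with global advantage values and uses classifier-free guidance MeanFlow to improve policy expressiveness and inference efficiency. It is unrelated to most of the keywords, which mainly concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, compression, and so on) or scientific AI applications. The only relevant keyword is 'Multi-agent Systems OR Agent Coordination', because the paper's core topic is policy learning and collaboration in multi-agent systems; it scores 10 points (highly relevant, core content). All other keywords score 0 points (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes VGM²P, a flow-based framework for offline multi-agent reinforcement learning that uses global advantage values to guide agent collaboration together with conditional behavior cloning, achieving performance comparable to state-of-the-art methods on discrete and continuous action-space tasks while improving training and inference efficiency.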

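Abstract (translated)

Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, which requires balancing the maximization of global return against the mitigation of distribution shift in the offline data. Recent work uses diffusion or flow generative models to capture complex joint policy behavior among agents, but these methods usually rely on multi-step iterative sampling, which lowers training and inference efficiency. Later work improves sampling efficiency through methods such as distillation, yet remains sensitive to the behavior regularization coefficient. To address these issues, we propose the Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that achieves efficient action generation through coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration and treats optimal policy learning as a conditional behavior cloning problem. In addition, to improve policy expressiveness and inference efficiency in multi-agent settings, the framework uses classifier-free guidance MeanFlow in both policy training and execution. Experiments on tasks with discrete and continuous action spaces show that, even when trained solely through conditional behavior cloning, VGM$^2$P efficiently reaches performance comparable to state-of-the-art methods.

Abstract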

Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, requiring a trade-off between maximizing global returns and mitigating distribution shift from offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, thereby reducing training and inference efficiency. Although further research improves sampling efficiency through methods like distillation, it remains sensitive to the behavior regularization coefficient. To address the above-mentioned issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent scenarios, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.

Keywords: Offline Multi-Agent Reinforcement Learning, Flow-based Policy Learning, Value Guidance, Conditional Behavior Cloning, Multi-agent Systems, MeanFlow, Global Advantage Values, Inference Efficiency
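As a rough illustration of why a mean-velocity model avoids iterative sampling, the sketch below combines the usual classifier-free guidance mix with a single sampling step from noise to action. The guidance formula is the standard one from the generative-modeling literature and the one-step rule assumes a flow that runs from data at time 0 to noise at time 1; whether VGM$^2$P uses exactly these forms is not stated in the abstract, and the velocity values here are hard-coded placeholders rather than network outputs.

```python
def guided_velocity(u_cond, u_uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [uu + scale * (uc - uu) for uc, uu in zip(u_cond, u_uncond)]

def one_step_sample(noise, avg_velocity):
    """Single-step generation: subtract the average velocity over [0, 1] from the noise sample."""
    return [z - u for z, u in zip(noise, avg_velocity)]

# Placeholder predictions for a 2-D action (in practice these come from a trained network).
u_cond = [1.0, 2.0]      # conditioned on a high advantage value (hypothetical)
u_uncond = [0.0, 1.0]    # unconditioned
guided = guided_velocity(u_cond, u_uncond, scale=2.0)
action = one_step_sample([0.5, -0.5], guided)
```

With `scale=1.0` the guided velocity reduces to the conditional prediction, and larger values push the sampled action further toward behavior associated with the conditioning signal.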

257. ❌ Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data

Authors: Anders S. Olsen, Miriam L. Navarro, Claus Svarer, Jesper L. Hinrich, Morten Mørup, Gitte M. Knudsen Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08161v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

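Scoring rationale: The paper develops a non-negative matrix factorization method for dynamic neuroimaging data, involving signal processing, medical imaging, and computational neuroscience. The keywords all concern large models, deep-learning techniques, or their scientific applications, and the paper is unrelated to them except for 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is an application of AI in science (specifically neuroscience and medical imaging), so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a shift- and stretch-invariant non-negative matrix factorization framework that handles temporal delays and scale differences in dynamic neuroimaging data, and shows on synthetic data and brain emission tomography data that it characterizes brain tissue structure in more detail.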

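Abstract (translated)

Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion-like properties. These properties introduce distance-dependent temporal delays, scale differences, and stretching effects, which limit the effectiveness of conventional linear modelling and decomposition methods. We therefore propose the shift- and stretch-invariant non-negative matrix factorization framework. The method estimates integer and non-integer temporal shifts as well as temporal stretching, with all computation carried out in the frequency domain, where shifts correspond to phase modifications and stretching is handled by zero-padding or truncation. The model is implemented in PyTorch (https://github.com/anders-s-olsen/shiftstretchNMF). On synthetic data and brain emission tomography data, we show that the model can compensate for stretching effects and thereby provide a finer characterization of brain tissue structure.

Abstract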

Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion-like properties. These introduce distance-dependent temporal delays, scale-differences, and stretching effects that limit the effectiveness of conventional linear modeling and decomposition methods. To address this, we present the shift- and stretch-invariant non-negative matrix factorization framework. Our approach estimates both integer and non-integer temporal shifts as well as temporal stretching, all implemented in the frequency domain, where shifts correspond to phase modifications, and where stretching is handled via zero-padding or truncation. The model is implemented in PyTorch (https://github.com/anders-s-olsen/shiftstretchNMF). We demonstrate on synthetic data and brain emission tomography data that the model is able to account for stretching to provide more detailed characterization of brain tissue structure.

Keywords: non-negative matrix factorization, shift-invariant, stretch-invariant, emission tomography, brain tissue delineation, temporal delays, frequency domain, neuroimaging data
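The statement that shifts "correspond to phase modifications" is the Fourier shift theorem, and a small standard-library sketch shows why it also covers non-integer delays. This is not the authors' PyTorch code: it uses a naive O(n²) transform, a circular shift, and an odd-length signal (so there is no Nyquist bin to special-case), all chosen here for brevity.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def shift(x, delay):
    """Circularly delay x by `delay` samples (integer or not) by multiplying with a phase ramp."""
    n = len(x)
    X = dft(x)
    shifted = []
    for k in range(n):
        freq = k if k <= n // 2 else k - n  # signed frequency index
        shifted.append(X[k] * cmath.exp(-2j * math.pi * freq * delay / n))
    return [v.real for v in idft(shifted)]

signal = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = shift(signal, 2)                    # impulse moves from index 1 to index 3
half_twice = shift(shift(signal, 0.5), 0.5)   # two half-sample delays
one_sample = shift(signal, 1.0)               # one whole-sample delay
```

Because the delay enters only through the phase factor, it is a continuous parameter, which is what makes non-integer shifts estimable alongside the factorization.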

258. ❌ A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

Authors: Zhen Li, Gilles Stoltz Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

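Scoring rationale: The paper studies contextual multi-armed bandits with latent state dynamics, which belongs to reinforcement learning and online learning, and focuses on classical machine learning theory such as linear bandit models, hidden Markov chains, and regret-bound analysis. The scoring keywords all relate directly to large models, deep-learning techniques, or AI applications in science, none of which the paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies a finite-armed linear contextual bandit model whose state dynamics follow a hidden Markov chain, proposes an adaptive strategy that handles the dependence on hidden states directly, and obtains high-probability regret bounds that do not depend on the reward functions.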

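Abstract (translated)

We revisit the finite-armed linear bandit model proposed by Nelson et al. (2022), in which contexts and rewards are governed by a finite hidden Markov chain (HMM). Nelson et al. (2022) handle this model by reducing it to linear contextual bandits, but in doing so they in fact introduce a simplification: rewards are defined as linear functions of the posterior probabilities of the hidden states given the observed contexts, rather than as functions of the hidden states themselves. Their analysis (though not their algorithm) also ignores the estimation of the HMM parameters and treats only expected regret bounds rather than high-probability ones, and these bounds additionally suffer from unnecessarily complex dependencies on the model (such as reward gaps). We instead study the more natural model, which includes a direct dependence on the hidden states (in addition to the dependence on contexts that is natural for contextual bandits), and obtain stronger, high-probability regret bounds for a fully adaptive strategy that estimates the HMM parameters online. These bounds do not depend on the reward functions and are tied to the model only through the estimation of the HMM parameters.

Abstract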

We revisit the finite-armed linear bandit model by Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model by a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which suffer in addition from unnecessary complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.

Keywords: contextual bandits, hidden Markov chain, linear bandit model, regret bounds, adaptive strategy, HMM parameter estimation, high-probability bounds, latent state dynamics
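The "posterior probabilities over the hidden states given the observed contexts" mentioned in the abstract can be computed with the standard HMM forward (filtering) recursion, sketched below. This is textbook filtering with known parameters, shown only to make that quantity concrete; it is not the paper's strategy, which additionally estimates the HMM parameters online. The numbers are made up.

```python
def forward_filter(prior, transition, likelihood):
    """One filtering step: propagate the belief through the chain, then reweight by the observation.

    prior[i]: belief that the previous hidden state was i.
    transition[i][j]: probability of moving from state i to state j.
    likelihood[j]: probability of the observed context under state j.
    """
    n = len(prior)
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hidden states with hypothetical parameters.
prior = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.2, 0.8]]
likelihood = [0.7, 0.1]  # the observed context is far more likely under state 0
posterior = forward_filter(prior, transition, likelihood)
```

Feeding each new posterior back in as the next prior gives the running belief over hidden states as contexts arrive.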

259. ❌ DeepForestSound: a multi-species automatic detector for passive acoustic monitoring in African tropical forests, a case study in Kibale National Park

Authors: Gabriel Dubus, Théau d’Audiffret, Claire Auger, Raphaël Cornette, Sylvain Haupert, Innocent Kasekendi, Raymond Katumba, Hugo Magaldi, Lise Pernel, Harold Rugonge, Jérôme Sueur, John Justice Tibesigwa, Sabrina Krief Source: arXiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

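Scoring rationale: The paper focuses on audio signal processing and biodiversity monitoring, using a deep-learning model (Audio Spectrogram Transformer) for species detection. It is unrelated to most of the large-model keywords, but it explicitly uses supervised fine-tuning (SFT) and LoRA parameter-efficient fine-tuning, so those two keywords receive 10 points each. The study is also an application of AI in science (bioinformatics and ecology), so the 'AI for Science' keyword receives 10 points. No other keyword is covered, and each receives 0 points.

!!! tip deepseek-chat TL;DR

The study develops the DeepForestSound model, which combines semi-supervised clustering, manual validation, and LoRA-based supervised fine-tuning to substantially improve multi-species (birds, primates, elephants) passive acoustic monitoring in African tropical forests, outperforming existing tools in evaluations across time and location.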

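Abstract (translated)

Passive acoustic monitoring is widely used for biodiversity assessment. Its application in African tropical forests is limited by the scarcity of annotated data, which degrades the performance of general-purpose ecoacoustic models on underrepresented taxa. This study presents DeepForestSound, a multi-species automatic detection model designed for passive acoustic monitoring in African tropical forests. The model uses a semi-supervised pipeline: unannotated recordings are first clustered and manually validated, and an Audio Spectrogram Transformer is then fine-tuned in a supervised manner with low-rank adaptation and compared with a linear baseline that keeps the backbone frozen. The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. The training data were collected in the Sebitoli area of Kibale National Park, Uganda, and the model was evaluated on an independent dataset recorded two years later at different locations in the same forest, which tests generalization across time and recording sites within a single tropical forest ecosystem. On 8 of the 12 taxa, DFS outperforms existing automatic detection tools, especially for non-avian taxa, with average AP values of 0.964 for primates and 0.961 for elephants. The results further show that fine-tuning based on low-rank adaptation substantially outperforms linear probing across taxa. Overall, these results confirm that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and they highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.

Abstract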

Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxons, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.

关键词: Passive Acoustic Monitoring, multi-species detection, Audio Spectrogram Transformer, supervised fine-tuning, LoRA, biodiversity assessment, tropical forests, African rainforests
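
The abstract contrasts LoRA fine-tuning with a frozen-backbone linear probe. As a rough illustration of what low-rank adaptation does to a single linear layer, here is a minimal pure-Python sketch. It is not the DFS implementation: the shapes, the row-vector convention `y = x W + (alpha / r) (x A) B`, and all names (`lora_forward`, `lora_param_count`) are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Return x @ W + (alpha / r) * (x @ A) @ B; only A and B would be trained."""
    rank = len(lora_b)  # B has shape (r, d_out)
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair (A: d_in x r, B: r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical toy sizes, chosen only to keep the example small.
rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2
x = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(2)]
w_frozen = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op

y = lora_forward(x, w_frozen, lora_a, lora_b, alpha=4.0)
```

With `lora_b` initialised to zeros the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behaviour. For a 768 x 768 projection at rank 8, `lora_param_count` gives 12,288 trainable values against 589,824 in the full matrix, whereas a linear probe leaves every backbone layer untouched and trains only the classification head.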

260. ❌ Multimodal Latent Reasoning via Predictive Embeddings

作者: Ashutosh Adhikari, Mirella Lapata 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.08065v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Pearl框架，专注于视觉语言模型（VLMs）的多模态推理，核心创新在于通过预测嵌入学习替代显式工具调用，以解决工具增强推理中的效率、监督和错误问题。与关键词的相关性分析如下：1）与"Tool Use"高度相关（10分），因为论文核心是改进工具使用；2）与"Chain of Thought"和"System 2 Thinking"中度相关（8分），涉及多步推理和深度推理；3）与"Post-training"弱相关（5分），因涉及监督微调比较；4）其他关键词如LLMs、MoE、RAG等均无关（0分），因论文聚焦VLMs而非纯语言模型，且未涉及其他技术。

!!! tip deepseek-chat TL;DR

论文提出Pearl框架，通过预测嵌入学习在潜在空间中模拟工具使用轨迹，解决了视觉语言模型在工具增强推理中的效率低下和错误调用问题，在多个感知基准上达到或超越了传统方法。

摘要翻译

工具增强的多模态推理使视觉语言模型（VLMs）能够通过与外部工具（例如裁剪、深度估计）交互来提升感知能力。然而，此类方法会产生显著的推理开销，需要专门的监督，且容易产生错误的工具调用。我们提出Pearl（潜在空间推理的预测性嵌入对齐），这是一个受JEPA启发的框架，完全在潜在空间中从专家工具使用轨迹中学习，从而在推理时无需显式调用工具。与基于重建的潜在推理方法（其自回归生成潜在标记，存在训练-推理不匹配问题且对多步骤工具使用的支持有限）不同，Pearl直接从多模态轨迹中学习预测性嵌入，同时保留了标准的视觉语言生成流程：它具有模型无关性、训练简单，并天然支持包含多次工具调用的轨迹。在多个感知基准上的实验表明，Pearl达到或超越了标准监督微调及基于重建的潜在推理方法的性能。此外，我们提供的实证证据表明，基于重建的方法主要在潜在空间中学习嵌入而非图像编辑，这促使预测性嵌入学习成为一种更具原则性的替代方案。

摘要 (Abstract)

Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.

关键词: multimodal reasoning, visual language models, tool-augmented reasoning, predictive embeddings, latent space learning, tool-use trajectories, perception benchmarks, JEPA-inspired framework

261. ❌ Automating aggregation strategy selection in federated learning

作者: Dian S. Y. Pang, Endrias Y. Ergetu, Eric Topham, Ahmed E. Fetit 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.08056v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究联邦学习中聚合策略选择的自动化框架，其中单次试验模式明确使用大语言模型（LLMs）从数据特征推断合适的策略，因此与“Large Language Models OR LLMs OR Foundation Models”高度相关（8分）。论文未涉及其他关键词的具体技术或应用，如MoE、SLMs、训练方法、推理优化、代理系统等，这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种自动化选择联邦学习聚合策略的框架，通过大语言模型推断和遗传搜索优化，提高了非独立同分布数据下的鲁棒性和泛化能力。

摘要翻译

联邦学习支持无需集中数据的协作式模型训练，但其效果受聚合策略选择的影响显著。这一选择并非易事，因为不同数据集、异构程度和计算约束下的性能表现差异巨大。本文提出一种端到端框架，用于自动化、简化和自适应地选择联邦学习的聚合策略。该框架包含两种运行模式：单次试验模式，即利用大语言模型根据用户提供或自动检测的数据特征推断合适的策略；以及多次试验模式，即在有限资源约束下通过轻量级遗传搜索高效探索备选方案。在多种数据集上的大量实验表明，我们的方法在非独立同分布条件下提升了模型的鲁棒性和泛化能力，同时减少了对人工干预的需求。总体而言，本研究通过自动化联邦学习中最关键的设计决策之一——聚合策略的选择，推动了可访问与自适应联邦学习的发展。

摘要 (Abstract)

Federated Learning enables collaborative model training without centralising data, but its effectiveness varies with the selection of the aggregation strategy. This choice is non-trivial, as performance varies widely across datasets, heterogeneity levels, and compute constraints. We present an end-to-end framework that automates, streamlines, and adapts aggregation strategy selection for federated learning. The framework operates in two modes: a single-trial mode, where large language models infer suitable strategies from user-provided or automatically detected data characteristics, and a multi-trial mode, where a lightweight genetic search efficiently explores alternatives under constrained budgets. Extensive experiments across diverse datasets show that our approach enhances robustness and generalisation under non-IID conditions while reducing the need for manual intervention. Overall, this work advances towards accessible and adaptive federated learning by automating one of its most critical design decisions, the choice of an aggregation strategy.

关键词: Federated Learning, Aggregation Strategy, Large Language Models, Automation, Non-IID Data, Genetic Search, Robustness, Generalization
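
To make "aggregation strategy" concrete, the sketch below shows two widely used server-side rules: sample-size-weighted averaging (FedAvg) and a coordinate-wise median. It does not reproduce the paper's framework, its LLM-based selection, or its genetic search. The flat-list parameter representation and the `STRATEGIES` registry are assumptions for illustration only.

```python
from statistics import median

def fedavg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its number of samples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

def coordinate_median(client_weights, client_sizes=None):
    """Take the per-parameter median across clients; sample counts are ignored."""
    return [median(column) for column in zip(*client_weights)]

# Hypothetical registry from which a selector (human, LLM, or search) could choose.
STRATEGIES = {"fedavg": fedavg, "median": coordinate_median}

clients = [[1.0, 2.0], [3.0, 4.0], [100.0, 0.0]]  # third client is an outlier
sizes = [1, 3, 1]
aggregated = {name: fn(clients, sizes) for name, fn in STRATEGIES.items()}
```

The toy data hints at why the choice matters: the weighted mean is pulled toward the outlying client, while the median is not, so the better rule depends on how heterogeneous or noisy the clients are.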

262. ❌ PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

作者: Mohsen Amiri, Mohsen Amiri, Ali Beikmohammadi, Sindri Magnússon, Mehdi Hosseinzadeh 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.08036v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于强化学习（RL）和模型预测控制（MPC）在部分可观测系统中的应用，提出了一种特权规划器引导的RL方法（PriPG-RL）。论文内容涉及POMDP、MPC、SAC等传统RL和控制系统技术，但未涉及任何大语言模型（LLM）、深度学习技术原理创新或AI for Science的具体应用。所有评分关键词均与大模型、深度学习技术或科学AI应用相关，而本文研究领域为机器人控制与强化学习，与这些关键词完全无关，因此所有关键词相关度评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种特权规划器引导的强化学习方法（PriPG-RL），通过利用训练期间可用的特权规划器来缓解部分可观测性，提高了样本效率和策略性能，并在四足机器人导航任务中进行了验证。

摘要翻译

本文针对部分可观测环境下强化学习策略的训练问题，提出了一种利用仅在训练阶段可用的、具备特权信息的任意时间可行规划器智能体的方法。我们将此问题形式化为一个部分可观测马尔可夫决策过程（Partially Observable Markov Decision Process, POMDP），其中规划器智能体能够访问近似动力学模型和特权状态信息，并以此指导一个仅能观测到真实状态有损投影的学习智能体。为实现该框架，我们引入了一种作为规划器智能体的任意时间可行模型预测控制（Model Predictive Control, MPC）算法。对于学习智能体，我们提出了规划器到策略的柔性演员-评论家方法（Planner-to-Policy Soft Actor-Critic, P2P-SAC），该方法通过蒸馏规划器智能体的特权知识来缓解部分可观测性，从而同时提升样本效率和最终策略性能。我们通过严格的理论分析为此框架提供了支撑。最后，我们在仿真中使用NVIDIA Isaac Lab验证了所提方法，并成功将其部署于现实世界的Unitree Go2四足机器人上，使其能够在复杂、多障碍物的环境中完成导航任务。

摘要 (Abstract)

This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent’s privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.

关键词: reinforcement learning, partial observability, model predictive control, POMDP, privileged planner, soft actor-critic, quadruped navigation, anytime-feasible MPC

263. ❌ Preference Redirection via Attention Concentration: An Attack on Computer Use Agents

作者: Dominik Seip, Matthias Hein 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.08005v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究计算机使用代理（CUAs）的安全漏洞，属于大模型在具体应用领域的研究。高度相关的关键词：1）‘Large Language Models OR LLMs OR Foundation Models’（10分）- 论文明确提到基于多模态基础模型构建CUAs；2）‘LLM Agents OR Autonomous Agents OR Agentic Workflow’（10分）- 论文核心研究计算机使用代理，属于自主代理范畴。其他关键词与论文内容无直接关联，均给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为PRAC的新型攻击方法，通过操纵视觉语言模型的注意力机制来重定向计算机使用代理的内部偏好，从而在在线购物平台上操控代理选择特定目标产品。

摘要翻译

多模态基础模型的进步推动了计算机使用代理（Computer Use Agents，CUAs）的发展，使其能够自主与图形用户界面环境交互。由于CUAs不受特定工具限制，它们能够自动化更复杂的代理任务，但同时也带来了新的安全漏洞。以往研究主要集中于语言模态，而视觉模态的脆弱性受到的关注较少。本文提出一种新型攻击方法PRAC：与先前直接针对视觉语言模型输出的攻击不同，该方法通过将模型注意力重定向至隐蔽的对抗性补丁来操纵模型的内部偏好。我们证明，PRAC能够在在线购物平台上操控CUA的选择过程，使其偏向选定的目标商品。尽管构造攻击需要白盒模型访问权限，但我们的研究表明该攻击可泛化至同一模型的微调版本。鉴于多家公司正基于开放权重模型开发定制化CUAs，此类攻击构成了严重威胁。

摘要 (Abstract)

Advancements in multimodal foundation models have enabled the development of Computer Use Agents (CUAs) capable of autonomously interacting with GUI environments. As CUAs are not restricted to certain tools, they allow to automate more complex agentic tasks but at the same time open up new security vulnerabilities. While prior work has concentrated on the language modality, the vulnerability of the vision modality has received less attention. In this paper, we introduce PRAC, a novel attack that, unlike prior work targeting the VLM output directly, manipulates the model’s internal preferences by redirecting its attention toward a stealthy adversarial patch. We show that PRAC is able to manipulate the selection process of a CUA on an online shopping platform towards a chosen target product. While we require white-box access to the model for the creation of the attack, we show that our attack generalizes to fine-tuned versions of the same model, presenting a critical threat as multiple companies build specific CUAs based on open weights models.

关键词: Computer Use Agents, multimodal foundation models, security vulnerabilities, attention redirection, adversarial patch, preference manipulation, autonomous agents

264. ❌ Benchmarking Deep Learning for Future Liver Remnant Segmentation in Colorectal Liver Metastasis

作者: Anthony T. Wu, Arghavan Rezvani, Kela Liu, Roozbeh Houshyar, Pooya Khosravi, Whitney Li, Xiaohui Xie 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.07999v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于医学影像分割（肝脏和转移瘤），使用深度学习模型（nnU-Net、SwinUNETR、STU-Net）进行基准测试，属于AI在生物医学领域的应用。因此，仅与关键词“AI for Science OR Bioinformatics OR Cheminformatics”有一定关联（评5分），因为医学影像分析可视为AI for Science在生物医学方向的应用。其他关键词均涉及大语言模型（LLM）及其相关技术（如MoE、对齐、推理、代理等），而本文未涉及任何语言模型或文本生成任务，故评0分。

!!! tip deepseek-chat TL;DR

该研究解决了结直肠肝转移手术规划中未来肝脏残体分割缺乏高质量基准的问题，通过创建首个公开验证数据集并比较多种深度学习模型，发现级联nnU-Net在分割任务中表现最佳。

摘要翻译

未来肝脏残体（FLR）的精确分割对于结直肠癌肝转移（CRLM）手术规划至关重要，可预防致命的肝切除术后肝功能衰竭。然而，由于复杂的切除边界、盘绕的肝血管系统和弥漫性转移病灶，该分割任务在技术上具有挑战性。开发自动化人工智能工具的一个主要瓶颈在于缺乏高保真、经过验证的数据。我们通过手动精修来自公开数据集CRLM-CT-Seg的全部197个三维影像，填补了这一空白，为此任务创建了首个开源且经过验证的基准。随后，我们使用nnU-Net、SwinUNETR和STU-Net模型，比较了级联式（肝脏->CRLM->FLR）与端到端（E2E）策略，首次建立了该任务的分割基线。研究发现，级联式nnU-Net实现了最佳的最终FLR分割Dice系数（0.767），而经过预训练的STU-Net则提供了更优的CRLM分割结果（0.620 Dice），并且对级联误差的鲁棒性显著更强。这项工作提供了首个经过验证的基准和一个可复现的框架，以加速人工智能辅助手术规划领域的研究。

摘要 (Abstract)

Accurate segmentation of the future liver remnant (FLR) is critical for surgical planning in colorectal liver metastases (CRLM) to prevent fatal post-hepatectomy liver failure. However, this segmentation task is technically challenging due to complex resection boundaries, convoluted hepatic vasculature and diffuse metastatic lesions. A primary bottleneck in developing automated AI tools has been the lack of high-fidelity, validated data. We address this gap by manually refining all 197 volumes from the public CRLM-CT-Seg dataset, creating the first open-source, validated benchmark for this task. We then establish the first segmentation baselines, comparing cascaded (Liver->CRLM->FLR) and end-to-end (E2E) strategies using nnU-Net, SwinUNETR, and STU-Net. We find a cascaded nnU-Net achieves the best final FLR segmentation Dice (0.767), while the pretrained STU-Net provides superior CRLM segmentation (0.620 Dice) and is significantly more robust to cascaded errors. This work provides the first validated benchmark and a reproducible framework to accelerate research in AI-assisted surgical planning.

关键词: Future Liver Remnant Segmentation, Colorectal Liver Metastases, Deep Learning Benchmark, nnU-Net, SwinUNETR, STU-Net, Surgical Planning, Medical Image Segmentation
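
The reported scores (0.767 for FLR, 0.620 for CRLM) are Dice coefficients. For readers unfamiliar with the metric, here is a minimal sketch on flattened binary masks. Returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract, and the function name is illustrative.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flattened binary masks of 0s and 1s."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / denominator

# Toy 4-voxel example: one voxel overlaps, each mask marks two voxels.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
score = dice_coefficient(pred_mask, true_mask)
```

Because the denominator counts only foreground voxels, Dice is sensitive to small structures: missing a few voxels of a small metastasis costs far more than missing the same number in the whole liver, which is consistent with the CRLM scores being lower than the FLR scores.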

265. ❌ Are we still able to recognize pearls? Machine-driven peer review and the risk to creativity: An explainable RAG-XAI detection framework with markers extraction

作者: Alin-Gabriel Văduva, Simona-Vasilica Oprea, Adela Bâra 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.07964v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLMs在科学同行评审中的应用风险，并提出了一个结合RAG和XAI的检测框架。因此，与“Large Language Models”高度相关（10分），因为LLMs是研究的核心对象；与“Retrieval-Augmented Generation”高度相关（10分），因为框架中明确使用了RAG组件；与“Mechanistic Interpretability OR Explainable AI”高度相关（10分），因为框架旨在提供可解释性（XAI）。与“AI for Science”有一定关联（5分），因为研究涉及AI在科学出版领域的应用，但并非生物信息学或化学信息学等具体科学领域。其他关键词（如MoE、SFT、量化等）未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文探讨了大型语言模型（LLMs）驱动的同行评审可能压制非常规研究、损害科研创造力的风险，并提出了一种可解释的RAG-XAI框架来检测自动化评审模式；其中XGBoost、随机森林和LightGBM分类器在测试集上达到了99.61%的准确率。

摘要翻译

将大型语言模型（LLMs）整合到同行评审中，引发了一个超越作者身份与检测的担忧：整个编辑流程可能出现的级联自动化风险。随着评审意见部分或完全由机器生成，编辑决策也可能被委托给算法系统，从而形成全自动化的评估流程。这有可能重塑科学工作的评价标准。本文认为，机器驱动的评估可能系统性地偏向标准化、符合模式的研究，而惩罚那些需要人类情境化判断的非传统与范式颠覆性思想。我们认为，这种转变可能导致认知同质化，即研究人员被隐性激励去优化其工作以迎合算法认可，而非追求真正的科学发现。为应对此风险，我们引入了一个可解释的框架（RAG-XAI），该框架利用LLM标记提取器评估评审质量并检测自动化模式，旨在维护科学领域的透明度、问责制与创造性。所提出的框架实现了近乎完美的检测性能：在测试集上，XGBoost、随机森林和LightGBM达到了99.61%的准确率，AUC-ROC高于0.999，F1分数为0.9925，同时保持了极低的误报率（<0.23%）和漏报率（约0.8%）。相比之下，逻辑回归基线模型表现显著较差（准确率89.97%，F1分数0.8314）。特征重要性与SHAP分析表明，个人信号的缺失和重复模式是主要的预测因子。此外，RAG组件实现了90.5%的top-1检索准确率，并在嵌入空间中呈现出强烈的同类聚类，进一步支持了框架输出结果的可靠性。

摘要 (Abstract)

The integration of large language models (LLMs) into peer review raises a concern beyond authorship and detection: the potential cascading automation of the entire editorial process. As reviews become partially or fully machine-generated, it becomes plausible that editorial decisions may also be delegated to algorithmic systems, leading to a fully automated evaluation pipeline. They risk reshaping the criteria by which scientific work is assessed. This paper argues that machine-driven assessment may systematically favor standardized, pattern-conforming research while penalizing unconventional and paradigm-shifting ideas that require contextual human judgment. We consider that this shift could lead to epistemic homogenization, where researchers are implicitly incentivized to optimize their work for algorithmic approval rather than genuine discovery. To address this risk, we introduce an explainable framework (RAG-XAI) for assessing review quality and detecting automated patterns using markers LLM extractor, aiming to preserve transparency, accountability and creativity in science. The proposed framework achieves near-perfect detection performance, with XGBoost, Random Forest and LightGBM reaching 99.61% accuracy, AUC-ROC above 0.999 and F1-scores of 0.9925 on the test set, while maintaining extremely low false positive rates (<0.23%) and false negative rates (~0.8%). In contrast, the logistic regression baseline performs substantially worse (89.97% accuracy, F1-score 0.8314). Feature importance and SHAP analyses identify absence of personal signals and repetition patterns as the dominant predictors. Additionally, the RAG component achieves 90.5% top-1 retrieval accuracy, with strong same-class clustering in the embedding space, further supporting the reliability of the framework’s outputs.

关键词: large language models, peer review, explainable AI, RAG, automation detection, creativity risk, machine-generated reviews, transparency
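
The abstract reports accuracy, F1, false positive rate, and false negative rate side by side. The sketch below shows how all four follow from one binary confusion matrix, treating "machine-generated review" as the positive class. The counts are made up for illustration and are not the paper's data. The function assumes every denominator is non-zero.

```python
def detection_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),   # harmonic mean of precision and recall
        "false_positive_rate": fp / (fp + tn),  # human reviews flagged as machine
        "false_negative_rate": fn / (fn + tp),  # machine reviews that slip through
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = detection_metrics(tp=8, fp=2, tn=88, fn=2)
```

In this toy case accuracy is 0.96 while the false negative rate is 0.2, which illustrates why accuracy alone can look strong on imbalanced data and why the abstract's separate error rates are informative.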

266. ❌ Is your algorithm unlearning or untraining?

作者: Eleni Triantafillou, Ahmed Imtiaz Humayun, Monica Ribero, Alexander Matt Turner, Michael C. Mozer, Georgios Kaissis 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.07962v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文讨论的是机器学习中的“遗忘学习”（machine unlearning）概念，主要关注如何从已训练模型中删除特定数据或行为，并区分了“遗忘”（unlearning）和“反训练”（untraining）两个概念。论文内容属于机器学习模型训练后的修改和优化范畴，但并未涉及大模型、深度学习技术原理、科学应用或任何评分关键词中的具体技术（如LLM、MoE、SFT、RAG等）。所有关键词均与大模型技术、训练方法、推理优化、应用领域等相关，而该论文聚焦于通用机器学习模型的遗忘问题，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

该论文区分了机器学习中'遗忘学习'（unlearning）和'反训练'（untraining）两个概念，前者旨在从模型中移除整个数据分布，后者仅移除特定训练数据的影响，以澄清文献中的混淆并推动该领域研究。

摘要翻译

随着模型规模日益扩大且训练数据量不断增长，如何在训练完成后从模型中“删除”特定数据点或行为模式引发了广泛关注。这一目标被称为“机器遗忘”（machine unlearning）。本文指出，“遗忘”（unlearning）这一术语存在含义过载：不同研究实际对应两种截然不同的问题设定，但现有文献尚未注意到或承认这一区别。这导致了一系列问题，包括算法适用场景不明确、比较不同算法时采用的度量标准与基线不当、结果难以解读，以及关键研究方向被错失等。为解决这一问题，本文对两个概念作出根本性区分，即“遗忘”（unlearning）与“反训练”（untraining），如图1所示。简言之，“反训练”旨在逆转模型在给定遗忘集上训练所产生的效果，即消除该遗忘集样本在训练过程中对模型的影响；而“遗忘”的目标不仅在于消除给定样本的影响，更在于借助这些样本更广泛地移除其所采样自的整个底层分布（例如样本所代表的概念或行为）。我们讨论了这些问题的技术定义，并将文献中研究的问题设定归入相应类别。我们希望借此推动关于厘清技术定义的讨论，并指出一系列被忽视的研究问题；我们相信，这是加速“遗忘”领域发展所缺失的关键一步。

摘要 (Abstract)

As models are getting larger and are trained on increasing amounts of data, there has been an explosion of interest into how we can “delete” specific data points or behaviours from a trained model, after the fact. This goal has been referred to as “machine unlearning”. In this note, we argue that the term “unlearning” has been overloaded, with different research efforts spanning two distinct problem formulations, but without that distinction having been observed or acknowledged in the literature. This causes various issues, including ambiguity around when an algorithm is expected to work, use of inappropriate metrics and baselines when comparing different algorithms to one another, difficulty in interpreting results, as well as missed opportunities for pursuing critical research directions. In this note, we address this issue by establishing a fundamental distinction between two notions that we identify as unlearning and untraining, illustrated in Figure 1. In short, untraining aims to reverse the effect of having trained on a given forget set, i.e. to remove the influence that that specific forget set examples had on the model during training. On the other hand, the goal of unlearning is not just to remove the influence of those given examples, but to use those examples for the purpose of more broadly removing the entire underlying distribution from which those examples were sampled (e.g. the concept or behaviour that those examples represent). We discuss technical definitions of these problems and map problem settings studied in the literature to each. We hope to initiate discussions on disambiguating technical definitions and identify a set of overlooked research questions, as we believe that this is a key missing step for accelerating progress in the field of “unlearning”.

关键词: machine unlearning, untraining, forget set, data deletion, model modification, training influence, distribution removal, algorithm distinction

267. ❌ Rethinking Residual Errors in Compensation-based LLM Quantization

作者: Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, Kejie Huang 期刊/来源: arxiv 发布日期: 2026-04-09 arXiv链接: http://arxiv.org/abs/2604.07955v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

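Scoring rationale: The paper focuses on LLM quantization. Its core contribution is improving weight-compensation-based quantization techniques (such as GPTQ and GPTAQ) by redefining the residual error and introducing a 'compensation-aware error' to improve quantization performance. It is therefore highly relevant to 'Large Language Models' and 'Quantization' (10 points each); other keywords such as MoE, SLMs, Scaling Laws, Alignment, and RAG are not addressed (0 points).

!!! tip deepseek-chat TL;DR

This paper revisits how the residual error is defined in weight-compensation-based LLM quantization, proposes the notion of a compensation-aware error, and refines the calibration objective, significantly improving the quantization performance of methods such as GPTQ and GPTAQ.

Abstract Translation (Chinese)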

基于权重补偿的方法通过迭代应用量化和权重补偿来最小化输出误差，近期在大型语言模型（LLM）量化中取得了显著成功。代表性工作GPTQ引入了几项关键技术，使得此类迭代方法能够适用于参数规模达数十亿的LLM。GPTAQ通过引入非对称校准过程扩展了这一方法，该过程将每个量化层的输出与其全精度对应层对齐，并将残差误差纳入权重补偿框架。在本研究中，我们重新审视了残差误差的构建方式。我们发现现有方法在校准目标上存在次优性：在层内校准过程中，它们将量化输出与补偿后权重的输出对齐，而非与原始全精度模型的真实输出对齐。因此，我们重新定义了校准目标，旨在使量化模型每一步的输出都精确对齐原始全精度模型的输出。随后，我们揭示残差误差不仅源于前一层的输出差异，还来自每层内补偿权重与原始权重之间的偏差，我们将此称为“补偿感知误差”。通过继承GPTAQ中的神经元分解技术，我们可以高效地将这种补偿感知误差整合到权重更新过程中。在不同LLM和量化设置下的大量实验表明，我们提出的改进方案能够与GPTQ和GPTAQ无缝集成，显著提升了它们的量化性能。我们的代码公开于https://github.com/list0830/ResComp。

Abstract

Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs). The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters. GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full-precision counterpart, incorporating a residual error into the weight compensation framework. In this work, we revisit the formulation of the residual error. We identify a sub-optimal calibration objective in existing methods: during the intra-layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original full-precision model. Therefore, we redefine the objective to precisely align the quantized model’s output with the original output of the full-precision model at each step. We then reveal that the residual error originates not only from the output difference of the preceding layer but also from the discrepancy between the compensated and original weights within each layer, which we name the ‘compensation-aware error’. By inheriting the neuron decomposition technique from GPTAQ, we can efficiently incorporate this compensation-aware error into the weight update process. Extensive experiments on various LLMs and quantization settings demonstrate that our proposed enhancements integrate seamlessly with both GPTQ and GPTAQ, significantly improving their quantization performance. Our code is publicly available at https://github.com/list0830/ResComp.

Keywords: LLM Quantization, Weight Compensation, Residual Error, GPTQ, GPTAQ, Model Compression, Calibration Objective, Compensation-aware Error
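
The abstract above distinguishes two calibration targets: the output of the compensated weights and the output of the original full-precision weights. The toy sketch below only illustrates that these two references can give different output errors for the same quantized weights. It is not the GPTQ, GPTAQ, or ResComp algorithm: the round-to-nearest quantizer, the `w_compensated` values, and all function names are assumptions made for illustration.

```python
def dot(x, w):
    """Output of a single linear neuron for one input vector."""
    return sum(a * b for a, b in zip(x, w))

def quantize(w, step=0.5):
    """Round-to-nearest uniform quantization (a simple stand-in, not GPTQ)."""
    return [round(v / step) * step for v in w]

def output_error(inputs, w_ref, w_q):
    """Sum of squared differences between reference and quantized outputs."""
    return sum((dot(x, w_ref) - dot(x, w_q)) ** 2 for x in inputs)

# Made-up calibration inputs and weights.
X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
w_original = [0.30, -0.70]      # full-precision weights
w_compensated = [0.30, -0.55]   # hypothetical weights after earlier compensation steps
w_quant = quantize(w_compensated)

# Error measured against the compensated weights versus the original weights.
err_vs_compensated = output_error(X, w_compensated, w_quant)
err_vs_original = output_error(X, w_original, w_quant)
```

In this toy setup the error against the original weights is larger than the error against the compensated weights, which is the kind of mismatch the abstract describes when it argues for aligning with the original full-precision output.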

268. ❌ Pruning Extensions and Efficiency Trade-Offs for Sustainable Time Series Classification

Authors: Raphael Fischer, Angus Dempster, Sebastian Buschjäger, Matthias Jakobs, Urav Maniar, Geoffrey I. Webb Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07953v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

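Scoring rationale: The paper focuses on model pruning, energy-efficiency evaluation, and sustainability in time series classification (TSC), and does not involve large language models, deep learning technical principles, or AI-for-science applications. All keywords concern large models, deep learning, AI for science, or related techniques (such as MoE, RLHF, and RAG), whereas this paper studies pruning and efficiency optimization of traditional machine learning models (Hydra, Quant), a specific area of time series analysis that is unrelated to the given keywords.

!!! tip deepseek-chat TL;DR

This paper studies how model pruning affects energy efficiency in time series classification and proposes Hydrant, a new prunable combination of Hydra and Quant. Experiments show that pruning can cut energy consumption by up to 80% while usually keeping the accuracy loss below 5%.

Abstract Translation (Chinese)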

时间序列分类（TSC）支持重要的应用场景，但目前缺乏对模型、数据集和硬件之间性能权衡的统一理解。尽管该领域对资源消耗的关注日益增长，但TSC方法尚未在能源效率方面得到严格评估。本文引入了一个整体评估框架，旨在明确探索TSC中预测性能与资源消耗之间的平衡。为提升效率，我们将一种理论上有界的剪枝策略应用于领先的混合分类器——Hydra和Quant，并提出Hydrant，一种新颖的、可剪枝的二者结合模型。通过在20个MONSTER数据集、13种方法和三种计算设置上进行的超过4000次实验配置，我们系统分析了模型设计、超参数和硬件选择如何影响实际TSC性能。我们的结果表明，剪枝策略能在保持竞争力的预测质量的同时，显著降低高达80%的能耗，通常仅使模型损失不到5%的准确率。所提出的方法、实验结果及配套软件将推动TSC朝着可持续和可复现的方向发展。

Abstract

Time series classification (TSC) enables important use cases, but the field lacks a unified understanding of performance trade-offs across models, datasets, and hardware. While resource awareness has grown in the field, TSC methods have not yet been rigorously evaluated for energy efficiency. This paper introduces a holistic evaluation framework that explicitly explores the balance of predictive performance and resource consumption in TSC. To boost efficiency, we apply a theoretically bounded pruning strategy to leading hybrid classifiers - Hydra and Quant - and present Hydrant, a novel, prunable combination of both. With over 4000 experimental configurations across 20 MONSTER datasets, 13 methods, and three compute setups, we systematically analyze how model design, hyperparameters, and hardware choices affect practical TSC performance. Our results showcase that pruning can significantly reduce energy consumption by up to 80% while maintaining competitive predictive quality, usually costing the model less than 5% of accuracy. The proposed methodology, experimental results, and accompanying software advance TSC toward sustainable and reproducible practice.

Keywords: Time series classification, Pruning, Energy efficiency, Model compression, Sustainable AI, Hydrant, Resource consumption, Predictive performance

269. ❌ Fraud Detection System for Banking Transactions

Authors: Ranya Batsyas, Ritesh Yaduwanshi Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07952v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

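Scoring rationale: The paper focuses on detecting fraud in banking transactions with traditional machine learning methods (such as logistic regression, random forest, XGBoost, and decision trees), applying SMOTE to handle class imbalance and GridSearchCV for hyperparameter optimization. Its content centers entirely on applying traditional machine learning in FinTech and does not involve large language models (LLMs), deep learning, large-model technical principles, AI for Science, or any of the frontier techniques described by the keywords. Because the keywords all concern large models, deep learning, AI-for-science applications, or related technical innovations, none of them applies to this paper, and every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

This study proposes a bank-transaction fraud detection framework based on traditional machine learning. Using the PaySim dataset, SMOTE to handle class imbalance, and GridSearchCV for hyperparameter tuning, it compares several baseline models and offers a robust, scalable fraud-prevention solution for FinTech transaction systems.

Abstract Translation (Chinese)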

数字支付系统的扩张提升了在线金融交易的规模与复杂程度，从而增加了对欺诈活动的脆弱性。攻击策略的不断演变以及真实交易与欺诈交易之间的显著差异，使得有效检测欺诈变得尤为复杂。本研究提出一种基于机器学习的欺诈检测框架，该框架采用PaySim合成金融交易数据集。遵循CRISP-DM方法论，研究内容包括假设驱动的探索性分析、特征优化，以及对逻辑回归等基线模型与随机森林、XGBoost、决策树等基于树的分类器的比较评估。为应对类别不平衡问题，研究采用SMOTE技术，并通过GridSearchCV进行超参数调优以提升模型性能。所提出的框架为增强金融科技交易系统的反欺诈能力提供了一个稳健且可扩展的解决方案。关键词：欺诈检测，不平衡数据，超参数优化，SMOTE

Abstract

The expansion of digital payment systems has heightened both the scale and intricacy of online financial transactions, thereby increasing vulnerability to fraudulent activities. Detecting fraud effectively is complicated by the changing nature of attack strategies and the significant disparity between genuine and fraudulent transactions. This research introduces a machine learning-based fraud detection framework utilizing the PaySim synthetic financial transaction dataset. Following the CRISP-DM methodology, the study includes hypothesis-driven exploratory analysis, feature refinement, and a comparative assessment of baseline models such as Logistic Regression and tree-based classifiers like Random Forest, XGBoost, and Decision Tree. To tackle class imbalance, SMOTE is employed, and model performance is enhanced through hyperparameter tuning with GridSearchCV. The proposed framework provides a robust and scalable solution to enhance fraud prevention capabilities in FinTech transaction systems. Keywords: fraud detection, imbalanced data, HPO, SMOTE

Keywords: fraud detection, machine learning, banking transactions, imbalanced data, SMOTE, hyperparameter optimization, PaySim dataset, FinTech
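
The abstract above names SMOTE as its remedy for class imbalance but does not say which implementation or settings were used. As a rough, dependency-free sketch of the core idea (synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbours), the following may help; the function name `smote_sketch`, the parameter values, and the toy data are illustrative assumptions rather than anything taken from the paper.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic

# Made-up minority-class (fraud) feature vectors.
fraud = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.3], [1.1, 5.1]]
new_points = smote_sketch(fraud, n_new=6)
```

Each synthetic point lies on the line segment between two real minority samples, so oversampling adds variety without leaving the region the minority class already occupies. In practice, oversampling is usually applied only to the training split so that evaluation data stays untouched.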

270. ❌ Investigation of Automated Design of Quantum Circuits for Imaginary Time Evolution Methods Using Deep Reinforcement Learning

Authors: Ryo Suzuki, Shohei Watabe Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07951v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

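Scoring rationale: The paper studies an automated framework for quantum circuit design that uses deep reinforcement learning (DDQN) to optimize circuit structures for the Variational Imaginary Time Evolution (VITE) method. This is an application of AI to science (quantum computing) and is moderately related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large models, deep learning technical principles, or any other keyword, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a deep-reinforcement-learning method for automatically designing quantum circuits that optimizes circuit structures for Variational Imaginary Time Evolution (VITE), achieving substantial reductions in gate count and depth on Max-Cut problems and molecular hydrogen simulation while reaching high accuracy.

Abstract Translation (Chinese)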

高效基态搜索是推进组合优化问题与量子化学研究的基础。虽然变分虚时间演化方法为变分量子本征求解器和量子近似优化算法提供了有效的替代方案，但其在噪声中等规模量子设备上的实现受到手动设计拟设的门数量和电路深度的严重限制。本文提出了一种基于双深度Q网络的变分虚时间演化电路自动设计框架。我们的方法将电路构建视为多目标优化问题，同时最小化能量期望值并优化电路复杂度。通过引入自适应阈值，我们实现了显著的硬件开销降低。在最大割问题中，智能体自主发现的电路相较于标准硬件高效拟设，平均门数量减少约37%，电路深度降低43%。对于氢分子，双深度Q网络同样达到了完全组态相互作用极限，同时保持了显著更浅的电路深度。这些结果表明，深度强化学习有助于发现非直观的最优电路结构，为高效且硬件感知的量子算法设计提供了可行路径。

Abstract

Efficient ground state search is fundamental to advancing combinatorial optimization problems and quantum chemistry. While the Variational Imaginary Time Evolution (VITE) method offers a useful alternative to the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA), its implementation on Noisy Intermediate-Scale Quantum (NISQ) devices is severely limited by the gate counts and depth of manually designed ansatz. Here, we present an automated framework for VITE circuit design using Double Deep-Q Networks (DDQN). Our approach treats circuit construction as a multi-objective optimization problem, simultaneously minimizing energy expectation values and optimizing circuit complexity. By introducing adaptive thresholds, we demonstrate significant hardware overhead reductions. In Max-Cut problems, our agent autonomously discovered circuits with approximately 37% fewer gates and 43% less depth than standard hardware-efficient ansatz on average. For molecular hydrogen ($H_2$), the DDQN also achieved the Full-CI limit while maintaining a significantly shallower circuit. These results suggest that deep reinforcement learning can help find non-intuitive, optimal circuit structures, providing a pathway toward efficient, hardware-aware quantum algorithm design.

Keywords: Quantum Circuit Design, Deep Reinforcement Learning, Variational Imaginary Time Evolution, Double Deep-Q Networks, Hardware-aware Optimization, Max-Cut Problems, Molecular Hydrogen Simulation, Circuit Complexity Reduction
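
The abstract above names Double Deep-Q Networks but gives no details of the state, action, or reward design used for circuit construction. The sketch below therefore shows only the standard Double DQN bootstrap target, in which the online network selects the next action and the target network evaluates it. The function name and the toy Q-values are illustrative assumptions, not taken from the paper.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for one transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]

# Made-up Q-values for three candidate actions (e.g. three gate choices).
q_online = [0.2, 0.9, 0.4]
q_target = [0.3, 0.5, 0.8]
y = double_dqn_target(1.0, q_online, q_target, gamma=0.9)
y_terminal = double_dqn_target(1.0, q_online, q_target, gamma=0.9, done=True)
```

Decoupling selection from evaluation is what separates Double DQN from plain DQN, which would take the maximum of the target network's own Q-values and tends to overestimate them.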

271. ❌ A Systematic Framework for Tabular Data Disentanglement

Authors: Ivan Tjuawinata, Andre Gunawan, Anh Quan Tran, Nitish Kumar, Payal Pote, Harsh Bansal, Chu-Hung Chi, Kwok-Yan Lam, Parventanis Murthy Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07940v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

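Scoring rationale: The paper 'A Systematic Framework for Tabular Data Disentanglement' focuses on a framework for tabular data disentanglement, covering modular components for data extraction, data modeling, model analysis, and latent representation extrapolation. Although the paper belongs to data science and machine learning, all keywords concern specific topics such as large models (LLMs), deep learning technical principles, and AI for Science. The paper mentions no large models, deep learning techniques, or AI-for-science applications and discusses only traditional tabular data methods (such as factor analysis, CT-GAN, and VAE), so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a systematic framework for disentangling the complex attribute interdependencies in tabular data and demonstrates its applicability through a case study on synthetic data generation.

Abstract Translation (Chinese)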

表格数据广泛应用于工业控制系统、金融和供应链等诸多领域，其属性间常存在复杂的相互关联。数据解耦旨在将此类数据转化为相互依赖性降低的潜在变量，从而促进更高效的数据处理。尽管针对图像、文本或音频数据的数据解耦已有广泛研究，但由于表格数据通常具有更为错综复杂的属性交互，其解耦方法可能需要进一步探索。此外，由于属性间关系高度复杂，直接套用其他数据域的方法会导致解耦效果欠佳。现有的表格数据解耦方法，如因子分析、CT-GAN和VAE等，面临着可扩展性不足、模式崩溃和外推能力弱等局限。本文提出采用一个框架来系统化审视表格数据解耦，该框架将解耦过程模块化为四个核心组件：数据提取、数据建模、模型分析和潜在表示外推。我们相信这项工作能深化对表格数据解耦及现有方法的理解，并为未来开发鲁棒、高效且可扩展的数据解耦技术奠定基础。最后，我们通过合成表格数据生成的案例研究，展示了该框架的适用性，并揭示了其在数据合成这一具体下游任务中的潜力。

Abstract

Tabular data, widely used in various applications such as industrial control systems, finance, and supply chain, often contains complex interrelationships among its attributes. Data disentanglement seeks to transform such data into latent variables with reduced interdependencies, facilitating more effective and efficient processing. Despite the extensive studies on data disentanglement over image, text, or audio data, tabular data disentanglement may require further investigation due to the more intricate attribute interactions typically found in tabular data. Moreover, due to the highly complex interrelationships, direct translation from other data domains results in suboptimal data disentanglement. Existing tabular data disentanglement methods, such as factor analysis, CT-GAN, and VAE face limitations including scalability issues, mode collapse, and poor extrapolation. In this paper, we propose the use of a framework to provide a systematic view on tabular data disentanglement that modularizes the process into four core components: data extraction, data modeling, model analysis, and latent representation extrapolation. We believe this work provides a deeper understanding of tabular data disentanglement and existing methods, and lays the foundation for potential future research in developing robust, efficient, and scalable data disentanglement techniques. Finally, we demonstrate the framework’s applicability through a case study on synthetic tabular data generation, showcasing its potential in the particular downstream task of data synthesis.

Keywords: tabular data, data disentanglement, latent variables, framework, synthetic data generation, factor analysis, VAE, CT-GAN

272. ❌ Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions

Authors: Jing Wang, Yu-Yang Qian, Ke Xue, Chao Qian, Peng Zhao, Zhi-Hua Zhou Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07931v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

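Scoring rationale: The paper focuses on output-length prediction for efficient LLM serving; its core contribution is a prediction method based on modeling heavy-tailed distributions. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are the object of study. It is moderately related to 'Speculative Decoding OR Inference Acceleration' (5 points), because length prediction is a key part of inference acceleration and batching optimization, although the paper does not directly address speculative decoding. Other keywords (such as MoE, SFT, RAG, and quantization) are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses unreliable output-length prediction in LLM serving. It shows that prompt-conditioned output lengths follow a heavy-tailed distribution and proposes ProD, a distribution-modeling method that significantly improves prediction quality.

Abstract Translation (Chinese)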

输出长度预测对于高效的大语言模型服务至关重要，因为它直接影响批处理、内存预留与调度。在仅基于提示的长度预测任务中，现有方法大多使用单次采样长度作为标签，这隐含地将每个提示视为仅对应一个确定的目标长度。我们证明这种做法并不可靠：即使在固定的模型和解码设置下，同一提示所诱导的是一个提示条件输出长度分布，而非一个确定性标量，且该分布呈现出重尾特性。基于此，我们将长度预测问题重新定义为从重尾的提示条件长度分布中进行鲁棒估计。我们提出了提示条件长度分布方法，该方法通过同一提示的多次独立生成结果构建训练目标。我们开发了两种变体以复用服务中大语言模型的隐藏状态：ProD-M采用基于中位数的目标以实现鲁棒的点预测，而ProD-D则使用保留提示条件不确定性的分布目标。通过分析替代模型下的估计误差，我们提供了理论依据。在不同场景下的实验表明，所提方法在预测质量上取得了持续提升。

Abstract

Output-length prediction is important for efficient LLM serving, as it directly affects batching, memory reservation, and scheduling. For prompt-only length prediction, most existing methods use a one-shot sampled length as the label, implicitly treating each prompt as if it had one true target length. We show that this is unreliable: even under a fixed model and decoding setup, the same prompt induces a prompt-conditioned output length distribution, not a deterministic scalar, and this distribution is consistent with heavy-tailed behavior. Motivated by this, we cast length prediction as robust estimation from heavy-tailed prompt-conditioned length distributions. We propose prompt-conditioned length distribution (ProD) methods, which construct training targets from multiple independent generations of the same prompt. Two variants are developed to reuse the served LLM’s hidden states: ProD-M, which uses a median-based target for robust point prediction, and ProD-D, which uses a distributional target that preserves prompt-conditioned uncertainty. We provide theoretical justifications by analyzing the estimation error under a surrogate model. Experiments across diverse scenarios show consistent gains in prediction quality.

Keywords: LLM serving, output-length prediction, prompt-conditioned distribution, heavy-tailed behavior, robust estimation, batching optimization, memory reservation, scheduling efficiency
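
The abstract above argues that a single sampled length is an unreliable label and that a median over repeated generations is a more robust point target. The sketch below illustrates only that statistical point on made-up numbers. It is not the ProD-M or ProD-D method, which according to the abstract reuses the served LLM's hidden states; the function name and the sample lengths are assumptions.

```python
import statistics

def length_targets(sampled_lengths):
    """Summarize repeated generations of one prompt into candidate labels."""
    return {
        "single_sample": sampled_lengths[0],           # the one-shot label
        "mean": statistics.mean(sampled_lengths),      # sensitive to tail samples
        "median": statistics.median(sampled_lengths),  # robust to tail samples
    }

# Made-up output lengths (in tokens) from seven generations of one prompt;
# one generation runs very long, mimicking a heavy tail.
lengths = [120, 135, 128, 140, 131, 2048, 126]
targets = length_targets(lengths)
```

With one extreme generation in the sample, the mean is pulled far above the typical length while the median stays close to it. Had the long generation been the single sampled label, the one-shot target would have been even further off.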

273. ❌ Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting

Authors: Tao Hana, Zhibin Wen, Zhenghao Chen, Fenghua Lin, Junyu Gao, Song Guo, Lei Bai Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07928v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

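Scoring rationale: The paper focuses on applying AI to weather prediction (numerical weather prediction, NWP) and proposes GSSA-ViT, a new framework based on 3D Gaussian modeling and a scale-aware attention mechanism for high-resolution atmospheric field forecasting and downscaling. Its core is a cross-disciplinary application of computer vision (3D Gaussian modeling, vision transformers) to meteorology rather than an innovation in large language models (LLMs) or deep learning technical principles. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is related, because weather prediction is an application of AI to science, although the paper does not involve bioinformatics or cheminformatics. The other keywords concern LLM-specific techniques such as alignment, reasoning, agents, and efficiency optimization and are unrelated, so they score 0. The weighted total is computed as 8.0 (only one keyword scores).

!!! tip deepseek-chat TL;DR

This paper proposes GSSA-ViT, a generative framework based on 3D Gaussian modeling and a scale-aware attention mechanism, to address the computational cost of high-resolution atmospheric field forecasting and downscaling. Experiments show superior performance on arbitrary-resolution forecasting and downscaling tasks.

Abstract Translation (Chinese)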

尽管基于人工智能的数值天气预报（NWP）能够实现快速预测，但由于有限的多尺度适应性和低效的数据表示，生成高分辨率输出在计算上仍然要求很高。我们提出了基于3D高斯溅射的尺度感知视觉变换器（GSSA-ViT），这是一种用于高维大气场任意分辨率预报和灵活降尺度的新型框架。具体而言，经纬度网格点被视为3D高斯分布的中心。我们引入了一种生成式3D高斯预测方案，用于估计未见样本的关键参数，包括协方差、属性和不透明度，从而提升泛化能力并缓解过拟合。此外，设计了一个尺度感知注意力模块来捕捉跨尺度依赖关系，使模型能够有效整合不同降尺度比例下的信息，并支持连续分辨率适应。据我们所知，这是首个将生成式3D高斯建模与尺度感知注意力相结合以实现统一多尺度预测的NWP方法。在ERA5数据集上的实验表明，所提方法能够以任意分辨率准确预报87个大气变量，而在ERA5和CMIP6上的评估则证明了其在降尺度任务中的优越性能。该框架为高分辨率、多尺度大气预测和降尺度提供了一个高效且可扩展的解决方案。代码发布于：https://github.com/binbin2xs/weather-GS。

Abstract

While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.

Keywords: 3D Gaussian splatting, scale-aware attention, numerical weather prediction, atmospheric downscaling, arbitrary-resolution forecasting, vision transformer, multi-scale prediction, high-dimensional atmospheric fields

274. ❌ Sinkhorn doubly stochastic attention rank decay analysis

Authors: Michela Lapenna, Rita Fioresi, Bahman Gharesifard Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07925v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

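Scoring rationale: The paper presents a mathematical analysis of self-attention in Transformers, specifically theoretical proofs and experimental validation of how Sinkhorn doubly stochastic attention mitigates rank decay. Although attention is a core Transformer component, the paper concentrates on mathematical theory (rank decay, entropy regularization, the Sinkhorn algorithm) rather than on innovations in large-model techniques or concrete applications. All keywords target the large-model technology stack (training, alignment, inference, applications, and so on) and are unrelated to the paper's theoretical focus.

!!! tip deepseek-chat TL;DR

The paper studies rank decay in Transformer self-attention and shows, through theoretical analysis and experiments, that Sinkhorn doubly stochastic attention mitigates rank decay more effectively than standard Softmax attention. It also derives a theoretical bound on the rank decay.

Abstract Translation (Chinese)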

自注意力机制是Transformer架构成功的关键。然而，研究表明，标准的行随机注意力会因层间信号显著退化而受到影响。具体而言，它可能引发秩崩溃，导致令牌表征趋于均匀化；同时也会引发熵崩溃，表现为注意力分布高度集中。近期研究指出，双重随机注意力作为一种熵正则化形式具有优势，它能促进更均衡的注意力分布，从而提升实际性能。本文研究了网络深度中的秩崩溃现象，并证明采用Sinkhorn算法归一化的双重随机注意力矩阵比标准的Softmax行随机矩阵能更有效地保持秩。正如先前针对Softmax的研究所示，跳跃连接对缓解秩崩溃至关重要。我们在情感分析和图像分类任务上实证验证了这一现象。此外，我们推导了使用Sinkhorn归一化时纯自注意力秩衰减的理论界，发现秩会随深度呈双指数级衰减至一，这一现象在Softmax中已被证实。

Abstract

The self-attention mechanism is central to the success of Transformer architectures. However, standard row-stochastic attention has been shown to suffer from significant signal degradation across layers. In particular, it can induce rank collapse, resulting in increasingly uniform token representations, as well as entropy collapse, characterized by highly concentrated attention distributions. Recent work has highlighted the benefits of doubly stochastic attention as a form of entropy regularization, promoting a more balanced attention distribution and leading to improved empirical performance. In this paper, we study rank collapse across network depth and show that doubly stochastic attention matrices normalized with the Sinkhorn algorithm preserve rank more effectively than standard Softmax row-stochastic ones. As previously shown for Softmax, skip connections are crucial to mitigate rank collapse. We empirically validate this phenomenon on both sentiment analysis and image classification tasks. Moreover, we derive a theoretical bound for the pure self-attention rank decay when using Sinkhorn normalization and find that rank decays to one doubly exponentially with depth, a phenomenon that has already been shown for Softmax.

Keywords: self-attention mechanism, rank collapse, doubly stochastic attention, Sinkhorn algorithm, Transformer architectures, entropy regularization, theoretical analysis, skip connections
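
To make the contrast in the abstract concrete, the following minimal sketch compares row-wise Softmax normalization with Sinkhorn normalization on a small score matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 3×3 scores, and the fixed iteration count are all hypothetical.

```python
import math

def softmax_rows(scores):
    """Row-stochastic attention: every row sums to 1."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sinkhorn(scores, n_iters=50):
    """Approximately doubly stochastic attention via alternating
    row and column normalization (Sinkhorn iterations)."""
    n = len(scores)
    m = max(max(row) for row in scores)
    mat = [[math.exp(s - m) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Normalize each column to sum to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat

# Hypothetical attention scores in which every query prefers token 0.
scores = [[2.0, 0.5, 0.1], [1.5, 0.3, 0.2], [1.8, 0.1, 0.4]]
A_soft = softmax_rows(scores)
A_sink = sinkhorn(scores)

col_soft = [sum(A_soft[i][j] for i in range(3)) for j in range(3)]
col_sink = [sum(A_sink[i][j] for i in range(3)) for j in range(3)]
row_sink = [sum(row) for row in A_sink]
```

Softmax rows sum to one but its columns need not: here the first column absorbs most of the attention mass. The Sinkhorn matrix has rows and columns that both sum to approximately one, which is the "doubly stochastic" property the abstract refers to.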

275. ❌ Dynamic Attentional Context Scoping: Agent-Triggered Focus Sessions for Isolated Per-Agent Steering in Multi-Agent LLM Orchestration

Authors: Nickson Patel Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07911v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

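Scoring rationale: The paper's core topic is context pollution in multi-agent LLM orchestration, and it proposes the DACS mechanism for agent-triggered context isolation. Highly relevant keywords: LLM Agents (10), since the paper studies the orchestration of autonomous LLM agents; Multi-agent Systems (10), since multi-agent coordination and pollution are the core subject; Context Window Extension (8), since the work involves context-window management and efficiency. Other keywords such as MoE, SFT, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address context pollution in multi-agent LLM orchestration, the paper proposes Dynamic Attentional Context Scoping (DACS), which uses agent-triggered focus sessions for isolated per-agent steering. Experiments show steering accuracy rising from 21–60% to 90–98.4%.

Abstract Translation

Multi-agent LLM orchestration systems commonly suffer from context pollution: when N concurrent agents compete for the orchestrator's context window, each agent's task state, partial outputs, and pending questions contaminate the steering interactions of every other agent, degrading decision quality. We propose Dynamic Attentional Context Scoping (DACS), a mechanism in which the orchestrator operates in two asymmetric modes. In Registry mode, the orchestrator holds only lightweight per-agent status summaries (≤200 tokens each) while staying responsive to all agents and the user. When an agent issues a steering request, the orchestrator enters Focus mode, injecting that agent's full context while compressing every other agent's context to its registry entry. This context isolation is agent-triggered, asymmetric, and deterministic: during steering, the context window contains exactly the target agent's full context plus the other agents' registry entries, which eliminates cross-agent contamination without relying on context compression or retrieval. We evaluate DACS in four experimental phases totalling 200 trials: Phase 1 tests agent counts N ∈ {3, 5, 10} (60 trials); Phase 2 tests agent heterogeneity and adversarial dependencies (60 trials); Phase 3 tests decision density up to D=15 (40 trials); Phase 4 uses autonomous LLM agents on free-form questions (40 trials, Claude Haiku 4.5). Across all 8 synthetic scenarios, DACS achieves 90.0–98.4% steering accuracy versus 21.0–60.0% for the flat-context baseline (p < 0.0001 in every phase), wrong-agent contamination falls from 28–57% to 0–14%, and context efficiency ratios reach up to 3.53x. The accuracy advantage widens as N and D grow; keyword matching is validated by LLM-as-judge in all phases (mean kappa = 0.909). In Phase 4, DACS beats the flat-context baseline by +17.2 percentage points at N=3 (p = 0.0023) and by +20.4 percentage points at N=5 (p = 0.0008), and two independent judges confirm that the advantage grows with N.

Abstract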

Multi-agent LLM orchestration systems suffer from context pollution: when N concurrent agents compete for the orchestrator’s context window, each agent’s task state, partial outputs, and pending questions contaminate the steering interactions of every other agent, degrading decision quality. We introduce Dynamic Attentional Context Scoping (DACS), a mechanism in which the orchestrator operates in two asymmetric modes. In Registry mode it holds only lightweight per-agent status summaries (<=200 tokens each), remaining responsive to all agents and the user. When an agent emits a SteeringRequest, the orchestrator enters Focus(a_i) mode, injecting the full context of agent a_i while compressing all other agents to their registry entries. Context isolation is agent-triggered, asymmetric, and deterministic: the context window contains exactly F(a_i) + R_{-i} during steering, eliminating cross-agent contamination without requiring context compression or retrieval. We evaluate DACS across four experimental phases totalling 200 trials: Phase 1 tests N in {3,5,10} (60 trials); Phase 2 tests agent heterogeneity and adversarial dependencies (60 trials); Phase 3 tests decision density up to D=15 (40 trials); Phase 4 uses autonomous LLM agents for free-form questions (40 trials, Claude Haiku 4.5). Across all 8 synthetic scenarios, DACS achieves 90.0–98.4% steering accuracy versus 21.0–60.0% for a flat-context baseline (p < 0.0001 throughout), with wrong-agent contamination falling from 28–57% to 0–14% and context efficiency ratios of up to 3.53x. The accuracy advantage grows with N and D; keyword matching is validated by LLM-as-judge across all phases (mean kappa=0.909). DACS outperforms the flat-context baseline by +17.2pp at N=3 (p=0.0023) and +20.4pp at N=5 (p=0.0008) in Phase 4, with the advantage growing with N confirmed by two independent judges.

Keywords: Multi-agent LLM orchestration, Context pollution, Dynamic Attentional Context Scoping, Agent-triggered steering, Context isolation, LLM agents, Steering accuracy, Context window management
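
The two modes described in the abstract can be sketched as a simple context-assembly function. This is a hypothetical illustration only: the `Agent` class, the `build_context` helper, and the bracketed labels are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    registry_entry: str  # short status summary (the abstract caps these at <=200 tokens)
    full_context: str    # full task state, partial outputs, pending questions

def build_context(agents, focus=None):
    """Assemble the orchestrator's context window.

    focus=None   -> Registry mode: only the per-agent summaries.
    focus="name" -> Focus mode: full context for that agent,
                    registry entries for every other agent.
    """
    parts = []
    for agent in agents:
        if focus is not None and agent.name == focus:
            parts.append(f"[FOCUS {agent.name}]\n{agent.full_context}")
        else:
            parts.append(f"[REGISTRY {agent.name}] {agent.registry_entry}")
    return "\n".join(parts)

agents = [
    Agent("a1", "parsing logs, 40% done", "a1 full task state and partial outputs"),
    Agent("a2", "waiting on a schema choice", "a2 full task state and pending question"),
]
registry_view = build_context(agents)           # Registry mode
focus_view = build_context(agents, focus="a2")  # Focus(a2) mode
```

In the focus view only `a2` contributes its full context, so the window holds F(a_2) plus the registry entry of `a1`, mirroring the F(a_i) + R_{-i} composition stated in the abstract.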

276. ❌ Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

Authors: Mingqing Xiao, Yansen Wang, Dongqi Han, Caihua Shan, Dongsheng Li Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07904v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

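Scoring rationale: The paper applies a neuro-inspired oscillatory synchronization mechanism to Vision Transformers. This is a deep learning architecture innovation, but the keywords all point explicitly to LLM-related techniques or applications. The paper concentrates on vision models (Vision Transformers) and a neuroscience-inspired mechanism (Kuramoto oscillators), and it does not touch LLMs, alignment, reasoning, agents, compression, or the other keyword areas. Although the research background mentions "innovation in the technical principles of large models and deep learning", the paper's innovation is confined to the Vision Transformer architecture and has no direct connection to the large-model techniques in the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes Kuramoto oscillatory Phase Encoding (KoPE), a neuro-inspired method that adds an oscillatory synchronization mechanism to Vision Transformers to improve learning efficiency, and validates it on several vision tasks.

Abstract Translation

Spatiotemporal neural dynamics and oscillatory synchronization play a broad role in biological information processing and are hypothesized to support flexible coordination mechanisms such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of firing rate and phase. This work introduces **Kuramoto oscillatory Phase Encoding (KoPE)** as an additional, evolving phase state in Vision Transformers, incorporating a neuro-inspired synchronization mechanism to improve learning efficiency. We show that KoPE improves the training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. KoPE also benefits tasks that require structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE accelerates attention concentration, which improves learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.

Abstract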

Spatiotemporal neural dynamics and oscillatory synchronization are widely implicated in biological information processing and have been hypothesized to support flexible coordination such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of rate and phase. In this work, we introduce Kuramoto oscillatory Phase Encoding (KoPE) as an additional, evolving phase state to Vision Transformers, incorporating a neuro-inspired synchronization mechanism to advance learning efficiency. We show that KoPE can improve training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. Moreover, KoPE benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE can accelerate attention concentration for learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.

Keywords: Kuramoto oscillatory Phase Encoding, Vision Transformers, neuro-inspired synchronization, learning efficiency, attention concentration, spatiotemporal neural dynamics, oscillatory synchronization, few-shot abstract visual reasoning
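
For readers unfamiliar with the synchronization dynamics the method is named after, here is a minimal sketch of the classical Kuramoto model, dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i). It shows only the textbook oscillator model, not the KoPE layer from the paper; the initial phases, coupling strength, step size, and function names are assumptions chosen for illustration.

```python
import cmath
import math

def order_parameter(phases):
    """Magnitude r of the mean phase vector; r close to 1 means synchrony."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))

def kuramoto_step(phases, omegas, coupling, dt):
    """One explicit Euler step of the classical Kuramoto dynamics."""
    n = len(phases)
    updated = []
    for i in range(n):
        # Each oscillator is pulled toward the phases of all the others.
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n))
        updated.append(phases[i] + dt * (omegas[i] + coupling / n * pull))
    return updated

phases = [0.0, 0.5, 1.5, 2.5, 4.0]  # hypothetical, initially spread out
omegas = [1.0] * len(phases)        # identical natural frequencies
r_start = order_parameter(phases)
for _ in range(2000):
    phases = kuramoto_step(phases, omegas, coupling=2.0, dt=0.05)
r_end = order_parameter(phases)
```

With identical natural frequencies and positive coupling, the order parameter rises from a low starting value toward one as the oscillators phase-lock.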

277. ❌ Visual Perceptual to Conceptual First-Order Rule Learning Networks

Authors: Kun Gao, Davide Soldà, Thomas Eiter, Katsumi Inoue Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07897v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

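Scoring rationale: The paper proposes the γILP framework, which learns first-order logic rules from image data to improve the interpretability and reasoning ability of deep learning. Relevance to the keywords: (1) moderate relevance to "Large Language Models" (5), because the abstract cites enhancing the reasoning capabilities of large language models as a motivation for rule learning; (2) fairly strong relevance to "Chain of Thought" and "System 2 Thinking" (8 each), because rule learning relates directly to multi-step and in-depth reasoning; (3) high relevance to "Mechanistic Interpretability" (10), because explainable AI is one of the paper's core motivations; (4) other keywords such as MoE, quantization, and RAG are unrelated to the paper's visual rule learning topic and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of automatically learning first-order logic rules from unlabeled image data. It proposes the γILP framework, which performs well on both symbolic relational datasets and image datasets such as Kandinsky patterns.

Abstract Translation

Learning rules plays a key role in deep learning, especially in explainable artificial intelligence and in improving the reasoning capabilities of large language models. Existing rule learning methods are designed mainly for symbolic data; learning rules from image data without supporting image labels, and inventing predicates automatically, remains a challenge. This paper tackles these inductive rule learning problems from images with a framework called γILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments show that γILP performs well not only on classical symbolic relational datasets but also on relational image data and pure image datasets such as Kandinsky patterns.

Abstract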

Learning rules plays a crucial role in deep learning, particularly in explainable artificial intelligence and enhancing the reasoning capabilities of large language models. While existing rule learning methods are primarily designed for symbolic data, learning rules from image data without supporting image labels and automatically inventing predicates remains a challenge. In this paper, we tackle these inductive rule learning problems from images with a framework called γILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments demonstrate that γILP achieves strong performance not only on classical symbolic relational datasets but also on relational image data and pure image datasets, such as Kandinsky patterns.

Keywords: rule learning, explainable AI, first-order logic, image data, inductive learning, differentiable pipeline, reasoning capabilities, Kandinsky patterns

278. ❌ Non-variational supervised quantum kernel methods: a review

Authors: John Tanner, Chon-Fai Kam, Jingbo Wang Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07896v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

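Scoring rationale: The paper focuses on quantum kernel methods in quantum machine learning, at the intersection of quantum computing and machine learning. The keywords all concern large models, deep learning, and their technical principles, optimization methods, and applications, whereas this paper studies supervised learning within a quantum computing framework and has no direct connection to classical deep learning or large-model techniques. The only plausible link is "AI for Science", because quantum machine learning can be viewed as a subfield of AI for scientific computing. However, the paper does not explicitly discuss specific scientific applications such as bioinformatics or cheminformatics, so it receives 5 (some relevance). The remaining keywords are unrelated to quantum computing and kernel methods and receive 0.

!!! tip deepseek-chat TL;DR

This review systematically analyzes non-variational supervised quantum kernel methods, covering their theoretical foundations, constructions, conditions for advantage, and practical challenges, with the aim of clarifying the regimes in which quantum kernel methods may offer a genuine advantage.

Abstract Translation

Quantum kernel methods have become an important framework for supervised quantum machine learning. Unlike variational quantum algorithms, which rely on gradient-based optimization and can suffer from problems such as barren plateaus, non-variational quantum kernel methods use fixed quantum feature maps and perform model selection classically through convex optimization and cross-validation. Separating quantum feature embedding from classical training keeps optimization stable while using quantum circuits to encode data in high-dimensional Hilbert spaces. This review gives a comprehensive analysis of non-variational supervised quantum kernel methods, covering their foundations in classical kernel theory, the construction of fidelity and projected quantum kernels, and techniques for estimating them in practice. We examine theoretical frameworks for assessing quantum advantage, including generalization bounds and necessary conditions for separation from classical models, and analyze key challenges such as exponential concentration, dequantization via tensor-network methods, and the spectral properties of kernel integral operators. We further discuss structured problem classes that may allow an advantage and synthesize insights from comparative and hardware studies. Overall, the review aims to clarify the regimes in which quantum kernel methods may offer a genuine advantage and to identify the conceptual, methodological, and technical obstacles that practical quantum-enhanced learning must overcome.

Abstract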

Quantum kernel methods (QKMs) have emerged as a prominent framework for supervised quantum machine learning. Unlike variational quantum algorithms, which rely on gradient-based optimisation and may suffer from issues such as barren plateaus, non-variational QKMs employ fixed quantum feature maps, with model selection performed classically via convex optimisation and cross-validation. This separation of quantum feature embedding from classical training ensures stable optimisation while leveraging quantum circuits to encode data in high-dimensional Hilbert spaces. In this review, we provide a thorough analysis of non-variational supervised QKMs, covering their foundations in classical kernel theory, constructions of fidelity and projected quantum kernels, and methods for their estimation in practice. We examine frameworks for assessing quantum advantage, including generalisation bounds and necessary conditions for separation from classical models, and analyse key challenges such as exponential concentration, dequantisation via tensor-network methods, and the spectral properties of kernel integral operators. We further discuss structured problem classes that may enable advantage, and synthesise insights from comparative and hardware studies. Overall, this review aims to clarify the regimes in which QKMs may offer genuine advantages, and to delineate the conceptual, methodological, and technical obstacles that must be overcome for practical quantum-enhanced learning.

Keywords: quantum kernel methods, supervised quantum machine learning, non-variational algorithms, quantum feature maps, quantum advantage, generalization bounds, kernel integral operators, quantum-enhanced learning
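
As a toy illustration of the fidelity kernel mentioned in the abstract, the sketch below classically simulates a fixed single-qubit feature map and computes k(x, y) = |⟨φ(x)|φ(y)⟩|². The choice of feature map (a single RY rotation applied to |0⟩) and all function names are assumptions made for this example; the review itself covers far more general constructions.

```python
import math

def feature_state(x):
    """Amplitudes of RY(x)|0> = cos(x/2)|0> + sin(x/2)|1> (real-valued)."""
    return (math.cos(x / 2), math.sin(x / 2))

def fidelity_kernel(x, y):
    """Squared overlap between the two embedded states."""
    a, b = feature_state(x), feature_state(y)
    overlap = a[0] * b[0] + a[1] * b[1]
    return overlap ** 2

def gram_matrix(xs):
    """Kernel matrix that a classical convex solver would consume."""
    return [[fidelity_kernel(a, b) for b in xs] for a in xs]

xs = [0.0, 0.7, 2.1]  # hypothetical scalar inputs
K = gram_matrix(xs)
```

For this particular feature map the kernel reduces to cos²((x − y)/2). The feature map stays fixed, and only the classical model trained on `K` (for example a support vector machine or kernel ridge regression) is fitted, which is the non-variational split the abstract describes.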

279. ❌ Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs

Authors: Binxing Xu, Hao Gu, Lujun Li, Hao Wang, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Xintong Yang, Chao Li, Sirui Han, Yike Guo Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07888v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

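Scoring rationale: The paper's core topic is low-bit quantization-aware training for LLMs (Quantization/Model Compression), which directly matches that keyword (15). It explicitly studies LLMs (10) and involves optimization for inference acceleration (Speculative Decoding/Inference Acceleration) (5). Other keywords such as MoE, SFT, and RAG are not covered (0).

!!! tip deepseek-chat TL;DR

The paper proposes Bit-by-Bit, a progressive quantization-aware training framework that combines block-wise progressive training, nested integer quantization grids, and rounding-aware outlier channel splitting to address unstable convergence and high training cost in low-bit LLM training. Under the W2A2 setting it significantly outperforms the baselines.

Abstract Translation

Training large language models at ultra-low precision remains a formidable challenge. Direct low-bit quantization-aware training (QAT) often suffers from unstable convergence and high training cost, and quantization noise from heavy-tailed outlier channels together with error accumulation across layers makes these difficulties worse. To address this, we propose Bit-by-Bit, a progressive QAT framework with outlier channel splitting. The method integrates three core components: (1) block-wise progressive training, which lowers precision stage by stage to give low-bit optimization a stable initialization; (2) a nested structure of integer quantization grids, which enables a "train once, deploy any precision" paradigm so that a single model supports multiple bit-widths without retraining; (3) rounding-aware outlier channel splitting, which reduces quantization error while acting as an identity transform that leaves the quantized outputs unchanged. In addition, we adopt microscaling groups with E4M3 scales, in line with OCP/NVIDIA standards, to capture dynamic activation ranges. Because efficient 2-bit kernels are lacking, we develop custom operators for the W2A2 and W2A16 configurations, achieving up to 11× speedup over BF16. Under the W2A2 setting, Bit-by-Bit significantly outperforms baselines such as BitDistiller and EfficientQAT on Llama2/3 models, with a WikiText2 perplexity loss of only 2.25 relative to full-precision models.

Abstract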

Training LLMs at ultra-low precision remains a formidable challenge. Direct low-bit QAT often suffers from convergence instability and substantial training costs, exacerbated by quantization noise from heavy-tailed outlier channels and error accumulation across layers. To address these issues, we present Bit-by-Bit, a progressive QAT framework with outlier channel splitting. Our approach integrates three key components: (1) block-wise progressive training that reduces precision stage by stage, ensuring stable initialization for low-bit optimization; (2) nested structure of integer quantization grids to enable a “train once, deploy any precision” paradigm, allowing a single model to support multiple bit-widths without retraining; (3) rounding-aware outlier channel splitting, which mitigates quantization error while acting as an identity transform that preserves the quantized outputs. Furthermore, we follow microscaling groups with E4M3 scales, capturing dynamic activation ranges in alignment with OCP/NVIDIA standards. To address the lack of efficient 2-bit kernels, we developed custom operators for both W2A2 and W2A16 configurations, achieving up to 11× speedup over BF16. Under W2A2 settings, Bit-by-Bit significantly outperforms baselines like BitDistiller and EfficientQAT on both Llama2/3, achieving a loss of only 2.25 WikiText2 PPL compared to full-precision models.

Keywords: Quantization-Aware Training, Low-bit LLMs, Progressive Training, Outlier Channel Splitting, Model Compression, Inference Acceleration, W2A2 Quantization
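
The abstract describes outlier channel splitting as an identity transform. The sketch below shows the generic version of that idea for a linear layer: an outlier input channel is halved into two channels and the matching weight column is duplicated, so the output is unchanged while the largest activation magnitude shrinks. This is a hedged illustration of plain channel splitting, not the paper's rounding-aware variant; the helper names and toy values are assumptions.

```python
def matvec(weights, x):
    """Dense matrix-vector product y = W x using plain lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def split_channel(weights, x, j):
    """Split input channel j into two half-magnitude channels and
    duplicate weight column j so that W x is preserved."""
    x_split = x[:j] + [x[j] / 2, x[j] / 2] + x[j + 1:]
    w_split = [row[:j] + [row[j], row[j]] + row[j + 1:] for row in weights]
    return w_split, x_split

W = [[1.0, 2.0, 3.0],
     [0.5, -1.0, 4.0]]
x = [0.5, 8.0, -0.25]  # channel 1 is the hypothetical outlier

y_before = matvec(W, x)
W_split, x_split = split_channel(W, x, j=1)
y_after = matvec(W_split, x_split)

range_before = max(abs(v) for v in x)
range_after = max(abs(v) for v in x_split)
```

Because the dynamic range falls from 8.0 to 4.0 while the output stays the same, a uniform quantizer with a fixed bit-width can use a finer step on the split tensor, at the cost of one extra channel.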

280. ❌ Ensembles at Any Cost? Accuracy-Energy Trade-offs in Recommender Systems

Authors: Jannik Nitschke, Lukas Wegmeth, Joeran Beel Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07869v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

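Scoring rationale: The paper studies the trade-off between ensemble methods and energy efficiency in recommender systems, using traditional machine learning models (such as SVD++) and recommender-system libraries (Surprise, LensKit). It does not involve large language models, innovations in deep learning techniques, or AI for Science. The paper is unrelated to the large-model, deep learning, and scientific-AI topics that the keywords target, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the trade-off in recommender systems between the accuracy gains of ensemble methods (0.3%–5.7%) and their added energy consumption (19%–2,549%), and finds that selective ensembles are more energy efficient than exhaustive averaging.

Abstract Translation

Ensemble methods are used frequently in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize mainly for accuracy and robustness rather than energy efficiency. This paper measures the accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: (1) explicit rating prediction with the Surprise library (measured by RMSE) and (2) implicit-feedback ranking with the LensKit library (measured by NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with the EMERS tool using a smart plug and converted to CO2 equivalents. Across all settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy consumption by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% over SVD++ at an 18.8% energy overhead. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% but uses 103% more energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), raising emissions from 2.6 to 53.8 mg CO2 equivalents, while the LensKit ensembles fail because of memory limits. Overall, selective ensembles are more energy efficient than exhaustive averaging.

Abstract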

Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: 1. explicit rating prediction with Surprise (RMSE) and 2. implicit feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, and LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy efficient than exhaustive averaging.

Keywords: Ensemble Methods, Recommender Systems, Accuracy-Energy Trade-offs, Energy Efficiency, Model Comparison, CO2 Emissions, Performance Evaluation, Memory Limits
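
The overhead percentages in the abstract are relative increases over the single-model baseline. As a quick sanity check, the sketch below recomputes them from the rounded Anime figures quoted above; the helper name is an assumption, and the inputs are the rounded values from the abstract rather than the authors' raw measurements.

```python
def pct_increase(baseline, new):
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Rounded figures quoted in the abstract for the Anime dataset.
energy_overhead = pct_increase(0.01, 0.21)  # Wh: roughly 2000%
co2_overhead = pct_increase(2.6, 53.8)      # mg CO2e: roughly 1969%
```

Both results land close to the reported 2,005%. The small gaps are what one would expect when the published energy and emission values have been rounded to two and one decimal places.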

281. ❌ QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch

Authors: Hao Gu, Hao Wang, Jiacheng Liu, Lujun Li, Qiyuan Zhu, Bei Liu, Binxing Xu, Lei Wang, Xintong Yang, Sida Lin, Sirui Han, Yike Guo Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07853v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

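Scoring rationale: The paper's core topic is quantization-based acceleration of LLM reinforcement learning training. It is highly relevant to "Large Language Models" (10), because the whole paper centers on LLM RL training; highly relevant to "Quantization" (10), because it studies the training-inference mismatch caused by quantization; relevant to "Mixture of Experts" (8), because the experiments use the Qwen3-30B-A3B MoE model; and relevant to "Speculative Decoding" (8), because it studies accelerating decoding through quantization. Other keywords such as RLHF, PEFT, and RAG are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

To address the training-inference mismatch caused by quantization-accelerated rollouts in LLM reinforcement learning, the paper proposes the QaRL method and the TBPO objective, achieving a +5.5 improvement on math problems while preserving low-bit throughput benefits.

Abstract Translation

Reinforcement learning (RL) pipelines for large language models (LLMs) are often bottlenecked by rollout generation, which makes end-to-end training slow. Recent work mitigates this by running rollouts with quantization to speed up decoding, the most expensive stage of the RL loop. However, this setup widens the gap between training and inference and destabilizes optimization: rollouts run at low precision, while learning updates are computed at full precision. To address this challenge, we propose QaRL (Rollout Alignment Quantization-Aware RL), which aligns the training-side forward pass with the quantized rollout to minimize the mismatch. We further identify a failure mode of quantized rollouts: long-form responses tend to produce repetitive, garbled tokens (error tokens). To mitigate this, we introduce TBPO (Trust-Band Policy Optimization), a sequence-level objective with dual clipping for negative samples, designed to keep updates within the trust region. On math problems with the Qwen3-30B-A3B MoE model, QaRL outperforms quantized-rollout training by +5.5 while improving stability and preserving the throughput benefits of low-bit rollouts.

Abstract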

Large language model (LLM) reinforcement learning (RL) pipelines are often bottlenecked by rollout generation, making end-to-end training slow. Recent work mitigates this by running rollouts with quantization to accelerate decoding, which is the most expensive stage of the RL loop. However, these setups destabilize optimization by amplifying the training-inference gap: rollouts are operated at low precision, while learning updates are computed at full precision. To address this challenge, we propose QaRL (Rollout Alignment Quantization-Aware RL), which aligns training-side forward with the quantized rollout to minimize mismatch. We further identify a failure mode in quantized rollouts: long-form responses tend to produce repetitive, garbled tokens (error tokens). To mitigate these problems, we introduce TBPO (Trust-Band Policy Optimization), a sequence-level objective with dual clipping for negative samples, aimed at keeping updates within the trust region. On Qwen3-30B-A3B MoE for math problems, QaRL outperforms quantized-rollout training by +5.5 while improving stability and preserving low-bit throughput benefits.

Keywords: Large Language Models, Reinforcement Learning, Quantization, Training-Inference Mismatch, Rollout Generation, MoE, Decoding Acceleration, Policy Optimization

282. ❌ Information-Theoretic Requirements for Gradient-Based Task Affinity Estimation in Multi-Task Learning

Authors: Jasper Zhang, Bryan Cheng Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07848v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

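Scoring rationale: The paper analyzes gradient conflict in multi-task learning. It belongs to machine learning, but it is not an innovation in large-model or deep-learning technical principles. Its validation uses bioinformatics datasets, so it is partially related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), though this is not the core of the paper. The other keywords concern large-model techniques, training methods, inference optimization, agent systems, and similar topics that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper identifies a basic requirement for gradient analysis in multi-task learning: tasks must share training samples for gradients to reflect task relationships accurately. Below 30% sample overlap the gradient signal is indistinguishable from noise, while above 40% it reliably recovers biological structure, which explains years of inconsistent multi-task learning results.

Abstract (Chinese translation)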

多任务学习呈现出显著不一致的结果——有时联合训练能大幅提升性能，有时却会严重损害效果——然而该领域始终缺乏预测这些结果的理论框架。我们发现了基于梯度的任务分析中一个基础却未被明示的前提：任务必须共享训练样本，梯度冲突才能揭示真实的任务关联。当任务基于相同输入进行测量时，梯度对齐反映的是共享的机制结构；当任务在互斥的输入集上测量时，任何表面信号都会混淆任务关系与分布偏移。我们进一步发现这种样本重叠要求存在明显的相位转变：当重叠度低于30%时，梯度与任务相关性在统计上与噪声无法区分；当重叠度超过40%时，梯度分析能稳定复现已知的生物学结构。在多个数据集上的综合验证实现了强相关性，并成功还原了生物通路组织结构。现有标准基准系统性地违背了这一要求——MoleculeNet数据集的重叠度低于5%，TDC数据集仅为8-14%——远低于梯度分析产生意义的临界阈值。这为七年来多任务学习结果的不一致性提供了首个理论解释。

Abstract

Multi-task learning shows strikingly inconsistent results – sometimes joint training helps substantially, sometimes it actively harms performance – yet the field lacks a principled framework for predicting these outcomes. We identify a fundamental but unstated assumption underlying gradient-based task analysis: tasks must share training instances for gradient conflicts to reveal genuine relationships. When tasks are measured on the same inputs, gradient alignment reflects shared mechanistic structure; when measured on disjoint inputs, any apparent signal conflates task relationships with distributional shift. We discover this sample overlap requirement exhibits a sharp phase transition: below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Comprehensive validation across multiple datasets achieves strong correlations and recovers biological pathway organization. Standard benchmarks systematically violate this requirement – MoleculeNet operates at <5% overlap, TDC at 8-14% – far below the threshold where gradient analysis becomes meaningful. This provides the first principled explanation for seven years of inconsistent MTL results.

Keywords: Multi-task Learning, Gradient Conflict, Task Affinity, Sample Overlap, Biological Structure, MoleculeNet, TDC, Phase Transition
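The overlap thresholds in the abstract can be turned into a simple pre-check before running gradient-based task analysis. The sketch below is a minimal illustration: the abstract does not say how overlap is measured, so the Jaccard definition, the function names, and the label "transition" for the 30-40% band are assumptions.

```python
import math


def sample_overlap(ids_a, ids_b):
    """Share of training instances common to two tasks (Jaccard, an assumed definition)."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def affinity_regime(overlap, low=0.30, high=0.40):
    """Map an overlap value onto the regimes reported in the abstract."""
    if overlap < low:
        return "noise"        # gradient signal indistinguishable from noise
    if overlap > high:
        return "reliable"     # gradient analysis recovers known structure
    return "transition"       # the abstract makes no claim for this band


def gradient_alignment(grad_a, grad_b):
    """Cosine similarity between two task gradients given as flat lists of floats."""
    dot = sum(x * y for x, y in zip(grad_a, grad_b))
    norm = math.sqrt(sum(x * x for x in grad_a)) * math.sqrt(sum(y * y for y in grad_b))
    return dot / norm if norm else 0.0
```

Under this reading, benchmarks with less than 5% overlap (the figure quoted for MoleculeNet) fall in the "noise" regime, so `gradient_alignment` values computed there should not be interpreted as task affinity.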

283. ❌ Automatic Generation of Executable BPMN Models from Medical Guidelines

Authors: Praveen Kumar Menaka Sekar, Ion Matei, Maksym Zhenirovskyy, Hon Yung Wong, Sayuri Kohmura, Shinji Hotta, Akihiro Inomata Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

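Scoring rationale: The paper explicitly uses LLMs to convert medical guidelines into executable BPMN models, an application of large models in healthcare, so it is highly related to 'Large Language Models' (10 points). It mentions 'uncertainty detection' and an 'entropy-based' method, which is partially related to 'Self-Correction' (5 points). Applying LLMs to medical guidelines is treated as a bioinformatics application under 'AI for Science' (8 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes an end-to-end pipeline that uses large language models (LLMs) to automatically convert healthcare policy documents into executable BPMN models for simulation-based policy evaluation. On diabetic nephropathy prevention guidelines it reaches up to a 100% ground-truth match and more than 92% per-patient decision agreement.

Abstract (Chinese translation)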

本文提出一种端到端处理流程，利用大语言模型将医疗政策文件转化为可执行的、数据感知的业务流程模型与标注（BPMN）模型，以支持基于仿真的政策评估。我们通过四项核心贡献应对政策自动化数字化的主要挑战：基于数据的BPMN生成与语法自动校正、可执行性增强、关键绩效指标（KPI）植入以及基于熵的不确定性检测。该流程在日本三个城市的糖尿病肾病防治指南上进行了验证，针对三种大语言模型后端各生成100个模型，并对每个模型使用1,000名合成患者进行仿真执行。在结构清晰的政策文件上，流程实现了100%的基准匹配度与完美的个体决策一致性。在所有测试条件下，原始个体决策一致率超过92%，且熵值分数随文件复杂度单调上升，证实了检测器能够可靠地区分明确政策与需要针对性人工澄清的政策。

Abstract

We present an end-to-end pipeline that converts healthcare policy documents into executable, data-aware Business Process Model and Notation (BPMN) models using large language models (LLMs) for simulation-based policy evaluation. We address the main challenges of automated policy digitization with four contributions: data-grounded BPMN generation with syntax auto-correction, executable augmentation, KPI instrumentation, and entropy-based uncertainty detection. We evaluate the pipeline on diabetic nephropathy prevention guidelines from three Japanese municipalities, generating 100 models per backend across three LLMs and executing each against 1,000 synthetic patients. On well-structured policies, the pipeline achieves a 100% ground-truth match with perfect per-patient decision agreement. Across all conditions, raw per-patient decision agreement exceeds 92%, and entropy scores increase monotonically with document complexity, confirming that the detector reliably separates unambiguous policies from those requiring targeted human clarification.

Keywords: large language models, BPMN models, medical guidelines, policy digitization, simulation-based evaluation, diabetic nephropathy, entropy-based uncertainty detection, synthetic patients
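The abstract does not spell out how the entropy score is computed. One plausible reading, sketched below, is to run every generated model on the same synthetic patient, take the Shannon entropy of the resulting decisions, and average over patients. The function names and the input layout are hypothetical.

```python
import math
from collections import Counter


def decision_entropy(decisions):
    """Shannon entropy (bits) of the decisions that generated models give one patient."""
    counts = Counter(decisions)
    total = len(decisions)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mean_patient_entropy(decisions_by_patient):
    """Average per-patient entropy; higher values suggest a more ambiguous policy text."""
    values = [decision_entropy(d) for d in decisions_by_patient.values()]
    return sum(values) / len(values)
```

If all generated models agree on every patient the score is 0. Disagreement raises it, which is the behaviour one would want from a detector that flags policies needing human clarification.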

284. ❌ Intensity Dot Product Graphs

Authors: Giulio Valentino Dalla Riva, Matteo Dalla Riva Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07810v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

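Scoring rationale: The paper studies a theoretical extension of random graph models (Intensity Dot Product Graphs), which belongs to mathematical statistics and graph theory. It is unrelated to all scoring keywords, which concern large models, deep-learning techniques, and their applications. The paper does not involve any AI model, training method, inference technique, or AI-for-science application.

!!! tip deepseek-chat TL;DR

The paper proposes a new random graph model, Intensity Dot Product Graphs, which extends Random Dot Product Graphs by replacing fixed latent positions with a Poisson point process. It links continuous latent structure to finite observed graphs and proves theoretical results including spectral consistency.

Abstract (Chinese translation)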

潜在位置随机图模型通常在选定样本量后将节点集视为固定，而基于图函数和随机测度的构造虽允许更多随机性，却以削弱几何可解释性为代价。本文提出强度点积图（Intensity Dot Product Graphs，简称IDPGs），该模型通过将随机点积图（Random Dot Product Graphs）中的固定潜在位置集合替换为欧几里得潜在空间上的泊松点过程，实现了节点群体的随机性、RDPG风格的点积亲和性，以及连接连续潜在结构与有限观测图的群体层面强度函数。我们定义了作为概率矩阵连续类比的热力图与期望算子，证明了邻接矩阵奇异值与算子谱相关联的谱一致性结果，比较了该构造与图函数及有向图函数表示的关系，并展示了经典RDPG如何在集中极限下自然产生。由于该模型由演化强度参数化，通过偏微分方程的时间扩展自然生成。

Abstract

Latent-position random graph models usually treat the node set as fixed once the sample size is chosen, while graphon-based and random-measure constructions allow more randomness at the cost of weaker geometric interpretability. We introduce Intensity Dot Product Graphs (IDPGs), which extend Random Dot Product Graphs by replacing a fixed collection of latent positions with a Poisson point process on a Euclidean latent space. This yields a model with random node populations, RDPG-style dot-product affinities, and a population-level intensity that links continuous latent structure to finite observed graphs. We define the heat map and the desire operator as continuous analogues of the probability matrix, prove a spectral consistency result connecting adjacency singular values to the operator spectrum, compare the construction with graphon and digraphon representations, and show how classical RDPGs arise in a concentrated limit. Because the model is parameterized by an evolving intensity, temporal extensions through partial differential equations arise naturally.

Keywords: Intensity Dot Product Graphs, Random Dot Product Graphs, Poisson point process, latent positions, graphon, spectral consistency, adjacency singular values, temporal extensions
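To make the construction concrete, the sketch below samples a simplified graph in the spirit of the abstract: the number of nodes is Poisson, latent positions are random, and each edge appears with probability equal to the dot product of its endpoints. The homogeneous intensity, the scaled cube used as latent space, the undirected edges, and all function names are simplifying assumptions, not the paper's construction.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson(rate) count with Knuth's method (adequate for moderate rates)."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_idpg(rate, dim, rng):
    """Sample node positions from a homogeneous Poisson process, then dot-product edges.

    Positions live in [0, 1/sqrt(dim)]^dim so every dot product is a valid
    probability in [0, 1].
    """
    n = sample_poisson(rate, rng)  # the node count is itself random
    scale = 1.0 / math.sqrt(dim)
    positions = [[rng.random() * scale for _ in range(dim)] for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p_edge = sum(a * b for a, b in zip(positions[i], positions[j]))
            if rng.random() < p_edge:
                edges.append((i, j))
    return positions, edges
```

Calling `sample_idpg(15.0, 2, random.Random(0))` returns a list of 2-dimensional positions and an edge list. Repeated calls with different seeds give graphs of different sizes, which is the main contrast with a fixed-size RDPG.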

285. ❌ PolicyLong: Towards On-Policy Context Extension

Authors: Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, TingHao Yu, Feng Zhang, Songlin Hu Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	15.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

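Scoring rationale: The paper's core topic is LLM context window extension, so it is highly related to 'Context Window Extension OR Long Context LLMs' (15 points). It explicitly studies LLMs, so it is highly related to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It involves data screening and the training distribution, which is partially related to 'Scaling Laws AND Data Quality' (5 points). It mentions a retrieval mechanism, which is partially related to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (5 points). The other keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the training-distribution drift caused by static data construction in LLM context window extension. It proposes PolicyLong, a dynamic on-policy data construction method, and reports that it outperforms existing methods on several long-context benchmarks.

Abstract (Chinese translation)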

扩展大型语言模型上下文窗口的瓶颈在于高质量长上下文数据的稀缺。近期研究通过信息论验证合成具有真实长程依赖关系的数据，其核心是筛选能够降低基础模型预测熵的上下文。然而，这些方法采用固定模型的单次离线构建方式，产生了根本性的离策略差距：静态筛选环境与模型动态演进的能力不匹配，导致训练分布发生偏移。我们提出PolicyLong方法，将数据构建转向动态在策略范式。该方法通过迭代使用当前模型重新执行数据筛选（包括熵计算、检索与验证），确保训练分布能够追踪模型能力的演进，从而形成一种涌现的自适应课程。关键在于，正例与困难负例上下文均源自当前模型的熵分布格局，使模型学习利用和抵抗的内容得以协同演化。在RULER、HELMET和LongBench-v2（Qwen2.5-3B）上的实验表明，PolicyLong持续优于EntropyLong与NExtLong方法，且在更长上下文场景中增益更为显著（如在RULER的128K长度上提升+2.54），这证实了在策略数据演进机制的价值。

Abstract

Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model’s predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model’s evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model’s entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.

Keywords: Context Window Extension, Long Context LLMs, On-policy Data Construction, Entropy-based Screening, Training Distribution Drift, Self-curriculum Learning, Dynamic Data Evolution, Information-theoretic Verification
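The screening step in the abstract (keep contexts that reduce the model's predictive entropy) can be sketched as a small filter. Everything below is illustrative: `entropy_fn` is a hypothetical callable standing in for a real model's entropy computation, and the `min_drop` threshold is an assumption. The paper's retrieval, verification, and hard-negative steps are not modelled.

```python
def screen_contexts(entropy_fn, target, candidates, min_drop=0.1):
    """Keep candidate contexts that lower predictive entropy on `target`.

    entropy_fn(target, context) is a hypothetical callable returning the
    model's predictive entropy on `target` given `context` (None = no context).
    Returns the kept contexts, largest entropy drop first.
    """
    base = entropy_fn(target, None)
    scored = [(base - entropy_fn(target, ctx), ctx) for ctx in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ctx for drop, ctx in scored if drop >= min_drop]
```

In an on-policy setup as described in the abstract, `entropy_fn` would wrap the current checkpoint rather than a fixed base model, and the filter would be rerun at each round so that the selected data tracks the model as it trains.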

286. ❌ Order-Optimal Sequential 1-Bit Mean Estimation in General Tail Regimes

Authors: Ivan Lau, Jonathan Scarlett Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07796v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

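Scoring rationale: The paper studies mean estimation under strict 1-bit communication constraints, which belongs to statistical inference and communication theory. It is unrelated to all scoring keywords, which concern large models, deep learning, and their technical principles and applications. The paper does not involve large models, deep learning, AI techniques, or AI-for-science applications in any form; it is purely a problem in statistical estimation theory.

!!! tip deepseek-chat TL;DR

The paper studies adaptive mean estimation from randomized threshold queries under strict 1-bit communication constraints. It proposes an estimator whose sample complexity is order-optimal for any distribution with a bounded k-th central moment, and it establishes both a fundamental limit of 1-bit quantization in the finite-variance case and a significant efficiency gap between adaptive and non-adaptive methods.

Abstract (Chinese translation)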

本文研究了严格1比特通信约束下的均值估计问题。我们提出了一种仅基于随机化阈值查询的新型自适应均值估计器，其中每个1比特输出指示给定样本是否超过一个序列选择的阈值。对于任意具有有界均值$μ\in [-λ, λ]$及有界$k$阶中心矩$\mathbb{E}[|X-μ|^k] \le σ^k$（$k > 1$为任意固定值）的分布，我们的估计器具有$(ε, δ)$-PAC性质。关键的是，我们的样本复杂度在所有此类尾部条件下（即对每个$k$值）均达到阶数最优。当$k \neq 2$时，估计器的样本复杂度与未量化情形下的极小极大下界相匹配，仅增加了一个不可避免的$O(\log(λ/σ))$定位代价。对于有限方差情形（$k=2$），估计器的样本复杂度存在一个额外的乘法因子$O(\log(σ/ε))$惩罚，我们通过建立新的信息论下界证明该惩罚是1比特量化的本质极限。我们还揭示了一个显著的适应性差距：无论是阈值查询还是更一般的区间查询，任何非自适应估计器的样本复杂度都必须随搜索空间参数$λ/σ$线性增长，这使其样本效率远低于我们的自适应方法。最后，我们提出了几种算法变体：（i）处理未知的采样预算；（ii）在给定（可能宽松）边界条件下自适应于未知尺度参数$σ$；（iii）仅需两阶段适应性，但代价是使用更复杂的通用1比特查询。

Abstract

In this paper, we study the problem of mean estimation under strict 1-bit communication constraints. We propose a novel adaptive mean estimator based solely on randomized threshold queries, where each 1-bit outcome indicates whether a given sample exceeds a sequentially chosen threshold. Our estimator is $(ε, δ)$-PAC for any distribution with a bounded mean $μ \in [-λ, λ]$ and a bounded $k$-th central moment $\mathbb{E}[|X-μ|^k] \le σ^k$ for any fixed $k > 1$. Crucially, our sample complexity is order-optimal in all such tail regimes, i.e., for every such $k$ value. For $k \neq 2$, our estimator’s sample complexity matches the unquantized minimax lower bounds plus an unavoidable $O(\log(λ/σ))$ localization cost. For the finite-variance case ($k=2$), our estimator’s sample complexity has an extra multiplicative $O(\log(σ/ε))$ penalty, and we establish a novel information-theoretic lower bound showing that this penalty is a fundamental limit of 1-bit quantization. We also establish a significant adaptivity gap: for both threshold queries and more general interval queries, the sample complexity of any non-adaptive estimator must scale linearly with the search space parameter $λ/σ$, rendering it vastly less sample efficient than our adaptive approach. Finally, we present algorithmic variants that (i) handle an unknown sampling budget, (ii) adapt to an unknown scale parameter $σ$ given (possibly loose) bounds, and (iii) require only two stages of adaptivity at the expense of more complicated general 1-bit queries.

Keywords: 1-bit communication, mean estimation, adaptive estimator, threshold queries, sample complexity, order-optimal, tail regimes, information-theoretic lower bound
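To illustrate what a randomized threshold query is, the sketch below implements a simple non-adaptive baseline, not the paper's adaptive estimator. It assumes, more strongly than the abstract, that every sample lies in $[-λ, λ]$. With a threshold $T$ drawn uniformly from $[-λ, λ]$, the bit $\mathbb{1}[X > T]$ has expectation $(μ + λ)/(2λ)$, so rescaling the average bit recovers $μ$. The function name is hypothetical.

```python
import random


def one_bit_mean_estimate(samples, lam, rng):
    """Non-adaptive 1-bit estimator, assuming every sample lies in [-lam, lam].

    Each sample is compared with a fresh uniform threshold and only the
    resulting bit is kept. Since E[bit] = (mu + lam) / (2 * lam), the mean is
    recovered by rescaling the average bit.
    """
    bits = [1 if x > rng.uniform(-lam, lam) else 0 for x in samples]
    return lam * (2.0 * sum(bits) / len(bits) - 1.0)
```

The variance of this baseline grows with `lam`, which matches the abstract's point that non-adaptive schemes pay for a large search range, and motivates choosing thresholds sequentially instead.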

287. ❌ SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Authors: Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

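Scoring rationale: The paper proposes the SEARL framework for self-evolving agents; its core concerns agent architecture, tool use, and memory mechanisms. Relevance to the keywords breaks down as follows. 1) Highly related (8-10 points): 'LLM Agents/Autonomous Agents/Agentic Workflow' (10 points) and 'Tool Use/Function Calling/API Tool Use' (10 points) are the paper's core topics; 'Self-Correction/Self-Improvement/Self-Reflection' (8 points) relates directly to self-evolving learning. 2) Moderately related (5 points): 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' relate to the paper's integration of planning and execution; 'Large Language Models/LLMs/Foundation Models' (5 points) is mentioned only as a dependency of existing methods and is not the focus of the paper's contribution. 3) Unrelated (0 points): the remaining keywords (such as MoE, quantization, and RAG) have no connection to the paper. The paper's main contribution is the agent framework design rather than large-model technology itself.

!!! tip deepseek-chat TL;DR

The paper targets low learning efficiency and sparse rewards for self-evolving agents in resource-constrained environments. It proposes the SEARL framework, which builds a structured experience memory to integrate planning with execution, and reports more practical and efficient learning on knowledge reasoning and mathematics tasks.

Abstract (Chinese translation)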

近期，可验证奖励强化学习在单轮推理任务中展现出显著潜力。随着范式向自进化智能体学习转变，模型日益被期望能通过合成工具或积累显式经验从轨迹中学习。然而，主流方法通常依赖于大规模语言模型或多智能体框架，这限制了其在资源受限环境中的部署。基于结果奖励的固有稀疏性也构成重大挑战，因为智能体通常仅在任务完成后才能获得反馈。为应对这些局限，我们提出了一种基于工具-记忆的自进化智能体框架SEARL。与直接利用交互经验的方法不同，我们的方法构建了一种结构化经验记忆，将规划与执行相结合。这提供了一种新颖的状态抽象，有助于在类似情境（如工具复用）中实现泛化。因此，智能体既能从历史数据中提取显式知识，又能利用轨迹间关联来稠密化奖励信号。我们在知识推理和数学任务上评估了该框架，证明了其在实现更实用、高效学习方面的有效性。

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.

Keywords: self-evolving agents, tool-memory, reinforcement learning, reward densification, state abstraction, knowledge reasoning, mathematics tasks, agentic framework

Authors: Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

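Scoring rationale: The paper studies cross-modal emotion transfer for emotion editing in talking face videos, at the intersection of computer vision and audio processing. It uses a large-scale pretrained audio encoder, which is partially related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because it applies a pretrained model. It does not involve large language models, innovations in deep-learning technical principles, or scientific applications, so it is unrelated to the vast majority of keywords (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Cross-Modal Emotion Transfer (C-MET), which models emotion semantic vectors between speech and visual feature spaces to overcome the limitations of emotion editing in talking face videos. On the MEAD and CREMA-D datasets it improves emotion accuracy by 14% over existing methods.

Abstract (Chinese translation)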

说话人脸生成作为生成模型的核心应用已获得广泛关注。为增强合成视频的表现力与真实感，说话人脸视频中的情感编辑起着关键作用。然而现有方法常限制表现灵活性，且难以生成扩展情感。基于标签的方法使用离散类别表征情感，无法捕捉广泛的情感谱系。基于音频的方法可利用富含情感的语音信号——甚至受益于富有表现力的文本转语音（TTS）合成——但因情感与语言内容在情感语音中相互纠缠，往往无法准确表达目标情感。另一方面，基于图像的方法依赖目标参考图像引导情感迁移，但需要高质量正面视角图像，且在获取扩展情感（如讽刺）的参考数据时面临挑战。为突破这些局限，我们提出跨模态情感迁移（C-MET）这一创新方法，该方法通过建模语音与视觉特征空间之间的情感语义向量，实现基于语音驱动的人脸表情生成。C-MET利用大规模预训练音频编码器与解耦的面部表情编码器，学习表征跨模态不同情感嵌入之间差异的情感语义向量。在MEAD和CREMA-D数据集上的大量实验表明，本方法将情感准确率较现有最优方法提升14%，同时能生成富有表现力的说话人脸视频——即使对于未见过的扩展情感亦有效果。代码、检查点及演示可见于https://chanhyeok-choi.github.io/C-MET/

Abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/

Keywords: Cross-Modal Emotion Transfer, Talking Face Generation, Emotion Editing, Audio-Visual Learning, Disentangled Facial Expression, Emotion Semantic Vectors, MEAD Dataset, CREMA-D Dataset

289. ❌ Toward Generalizable Graph Learning for 3D Engineering AI: Explainable Workflows for CAE Mode Shape Classification and CFD Field Prediction

Authors: Tong Duy Son, Kohta Sugiura, Marc Brughmans, Andrey Hense, Zhihao Liu, Amirthalakshmi Veeraraghavan, Ajinkya Bhave, Jay Masters, Paolo di Carlo, Theo Geluk Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

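Scoring rationale: The paper focuses on a graph learning framework for 3D engineering AI, applied to CAE vibration mode classification and CFD aerodynamic field prediction in automotive engineering. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), because those keywords target language models and general AI techniques, whereas this paper studies a domain-specific engineering AI application. Only two keywords are related. 1. 'Mechanistic Interpretability OR Explainable AI' (5 points): the paper mentions 'explainable mode classification' and 'explainable workflows', but explainability is a property of the framework rather than its core innovation. 2. 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): the paper applies AI to science and engineering (automotive engineering), but it leans toward engineering rather than pure science, so it is not a full match. No other keyword is involved.

!!! tip deepseek-chat TL;DR

The paper proposes a general graph learning framework for 3D engineering AI. It converts heterogeneous engineering assets into physics-aware graph representations, processes them with graph neural networks, and applies the approach to automotive CAE vibration mode classification and CFD aerodynamic field prediction, yielding explainable and reusable engineering AI workflows.

Abstract (Chinese translation)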

汽车工程开发日益依赖于异构三维数据，包括有限元模型、白车身表征、CAD几何与CFD网格。与此同时，工程团队面临缩短开发周期、提升性能及加速创新的持续压力。尽管人工智能在该领域的应用探索不断深入，当前许多方法仍局限于特定任务、可解释性不足且难以跨开发阶段复用。本文提出一种面向三维工程人工智能的实用图学习框架，该框架将异构工程资产转化为物理感知的图结构表示，并通过图神经网络进行处理。该框架设计用于同时支持分类与预测任务，并在两项汽车工程应用中得到验证：CAE振动模态分类与CFD空气动力学场预测。在CAE振动模态分类中，区域感知的白车身图结构支持在标签稀缺条件下对车辆及有限元变体进行可解释的模态分类；在CFD空气动力学场预测中，基于物理信息的代理模型可预测不同空气动力学车身变体的压力与壁面剪切应力，而保持对称性的降采样方法能以更低计算成本维持精度。本框架还提出了数据生成指导原则，可帮助工程师识别哪些后续仿真数据或标签具有重要收集价值。这些成果展示了一种实用且可复用的工程人工智能工作流程，能为CAE与CFD决策支持提供更可信的辅助。

Abstract

Automotive engineering development increasingly relies on heterogeneous 3D data, including finite element (FE) models, body-in-white (BiW) representations, CAD geometry, and CFD meshes. At the same time, engineering teams face growing pressure to shorten development cycles, improve performance and accelerate innovation. Although artificial intelligence (AI) is increasingly explored in this domain, many current methods remain task-specific, difficult to interpret, and hard to reuse across development stages. This paper presents a practical graph learning framework for 3D engineering AI, in which heterogeneous engineering assets are converted into physics-aware graph representations and processed by Graph Neural Networks (GNNs). The framework is designed to support both classification and prediction tasks. The framework is validated on two automotive applications: CAE vibration mode shape classification and CFD aerodynamic field prediction. For CAE vibration mode classification, a region-aware BiW graph supports explainable mode classification across vehicle and FE variants under label scarcity. For CFD aerodynamic field prediction, a physics-informed surrogate predicts pressure and wall shear stress (WSS) across aerodynamic body shape variants, while symmetry preserving down sampling retains accuracy with lower computational cost. The framework also outlines data generation guidance that can help engineers identify which additional simulations or labels are valuable to collect next. These results demonstrate a practical and reusable engineering AI workflow for more trustworthy CAE and CFD decision support.

Keywords: graph learning, 3D engineering AI, Graph Neural Networks, CAE mode shape classification, CFD field prediction, physics-aware graph representations, explainable workflows, automotive engineering
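
The abstract does not say how engineering assets are converted into graphs. As a rough illustration of the idea, the sketch below turns a triangle surface mesh into a node feature matrix and an edge list, a common input format for graph neural networks. The function name `mesh_to_graph`, the use of raw coordinates as node features, and the toy mesh are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def mesh_to_graph(vertices, faces):
    """Convert a triangle mesh into (node_features, edge_index).

    node_features: (num_vertices, 3) array of xyz coordinates (assumed features).
    edge_index:    (2, num_directed_edges) array, both directions of each mesh edge.
    """
    vertices = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces, dtype=int)
    # Each triangle (a, b, c) contributes the edges ab, bc and ca.
    pairs = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    pairs = np.sort(pairs, axis=1)         # treat edges as undirected
    undirected = np.unique(pairs, axis=0)  # drop edges shared by two triangles
    # Message-passing layers usually expect both directions of every edge.
    edge_index = np.concatenate([undirected, undirected[:, ::-1]]).T
    return vertices, edge_index


# Toy mesh: a unit square split into two triangles.
vertices = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
faces = [[0, 1, 2], [0, 2, 3]]
node_features, edge_index = mesh_to_graph(vertices, faces)
print(node_features.shape, edge_index.shape)
```

A physics-aware representation in the sense of the abstract would presumably attach further attributes (region labels, material properties, boundary conditions) to nodes and edges; those are omitted here because the abstract does not specify them.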

290. ❌ Structured Distillation of Web Agent Capabilities Enables Generalization

Authors: Xing Han Lù, Siva Reddy Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07776v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

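Scoring rationale: The paper builds web agents on large language models (LLMs) and trains a 9B-parameter student model with supervised fine-tuning (SFT), which places it within LLM agent research. It uses reasoning traces, which relates it to multi-step reasoning. The model is small (9B parameters) and intended for local deployment, which relates it to small language models. Other keywords such as MoE, data quality, alignment, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a structured trajectory synthesis framework that uses supervised fine-tuning to train a 9B-parameter, locally deployable web agent. The agent surpasses closed-source models such as Claude 3.5 Sonnet and GPT-4o on WebArena and generalizes to unseen environments.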

Abstract (Chinese translation)

前沿大语言模型能够驾驭复杂网站，但其成本及对第三方API的依赖使得本地部署难以实现。我们提出“智能体即标注者”框架，该框架通过类比人类标注角色（将任务设计者、标注者和监督者替换为模块化的大语言模型组件），为网页智能体构建了合成轨迹生成流程。以Gemini 3 Pro作为教师模型，我们在六个网页环境中生成了3,000条轨迹，并对通过质量筛选的2,322条轨迹采用纯监督学习方式微调了一个90亿参数的学生模型。所得模型在WebArena基准测试中达到41.5%的成功率，在同一评估协议下超越了Claude 3.5 Sonnet（36.0%）和GPT-4o（31.5%）等闭源模型，并将此前最佳开放权重模型（Go-Browse，21.7%）的性能提升近一倍。该能力可迁移至未见环境：在训练中从未接触的企业平台WorkArena L1上获得18.2个百分点的性能提升，并在另外三个基准测试中均取得稳定改进。消融实验证实流程中各组件均具有显著贡献，其中评判过滤、评估提示和推理轨迹各自带来可量化的性能增益。这些结果表明，仅通过单一前沿教师模型进行结构化轨迹合成，即可训练出具有竞争力且可本地部署的网页智能体。项目页面：https://agent-as-annotators.github.io

Abstract

Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io

Keywords: web agents, structured trajectory generation, supervised fine-tuning, LLM components, local deployment, generalization, 9B-parameter model, Agent-as-Annotators
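
The headline figures in the abstract can be checked with a few lines of arithmetic. The snippet below uses only numbers quoted in the abstract; the variable names are illustrative.

```python
# Trajectory filtering: 3,000 generated, 2,322 kept after quality filtering.
generated = 3000
kept = 2322
pass_rate = kept / generated

# WebArena success rates (percent) as reported in the abstract.
webarena = {
    "student_9b": 41.5,
    "claude_3_5_sonnet": 36.0,
    "gpt_4o": 31.5,
    "go_browse": 21.7,
}
# "Nearly doubling" the previous best open-weight result.
ratio_vs_open_weight = webarena["student_9b"] / webarena["go_browse"]
# Absolute margin over each closed-source model, in percentage points.
margin_vs_claude = webarena["student_9b"] - webarena["claude_3_5_sonnet"]
margin_vs_gpt4o = webarena["student_9b"] - webarena["gpt_4o"]

print(f"filter pass rate: {pass_rate:.1%}")
print(f"ratio vs Go-Browse: {ratio_vs_open_weight:.2f}x")
print(f"margins: {margin_vs_claude:.1f} and {margin_vs_gpt4o:.1f} points")
```

About 77% of the generated trajectories survive filtering, and the student's score is roughly 1.9 times the Go-Browse result, which is consistent with the abstract's "nearly doubling".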

291. ❌ Generative optimal transport via forward-backward HJB matching

Authors: Haiqian Yang, Vishaal Krishnan, Sumit Sinha, L. Mahadevan Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

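Scoring rationale: The paper "Generative optimal transport via forward-backward HJB matching" studies stochastic optimal control, Schrödinger bridge theory, and non-equilibrium statistical mechanics, focusing on solving optimal transport for stochastic systems by matching Hamilton-Jacobi-Bellman (HJB) equations. Every scoring keyword concerns large language models (LLMs), the principles of deep learning techniques, or applications of AI in science, whereas this paper's core content belongs to mathematical physics and stochastic control theory. It does not involve any specific large-model, deep learning, or AI for Science technique, method, or application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses how to compute the minimum-work stochastic optimal transport process from a disordered reference state to a structured target ensemble when only samples of the target are available. By establishing a time-reversal duality, it recasts the intractable value function of the backward dynamics as the solution of a forward-in-time HJB equation that can be computed directly from forward relaxation trajectories, which removes the need for backward simulation.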

Abstract (Chinese translation)

在多体随机系统中，控制其从无序参考态演化至以经验样本表征的结构化目标系综，这一问题自然出现在非平衡统计力学与随机控制领域。此类系统在扩散驱动下的自然弛豫过程是从结构化目标态趋向无序参考态。随之而来的核心问题是：给定一个结合空间惩罚与控制代价的路径泛函，能够逆转该弛豫过程的最小功随机过程是什么？计算这一最优过程需要已知已采样目标系综的轨迹——而这正是试图构建的对象。我们通过建立时间反演对偶性解决了这一难题：支配困难逆向动力学的值函数满足一个等效的顺向时间哈密顿-雅可比-贝尔曼方程，其解可直接从易于处理的顺向弛豫轨迹中读取。通过科尔-霍普夫变换及其关联的费曼-卡茨表示，该顺向势函数被计算为这些顺向轨迹（即易于模拟的弛豫路径）上的路径空间自由能平均值——无需任何逆向模拟，也无需除样本外对目标的先验知识。所得框架从路径空间自由能、风险敏感控制和空间代价几何的角度，为随机输运提供了物理解释性描述。我们通过数值算例阐释该理论，可视化习得的值函数及其诱导的受控扩散过程，展示空间代价场如何类似非均匀介质中的费马原理塑造输运几何。我们的研究结果在随机最优控制、薛定谔桥理论与非平衡统计力学之间建立了统一的联系。

Abstract

Controlling the evolution of a many-body stochastic system from a disordered reference state to a structured target ensemble, characterized empirically through samples, arises naturally in non-equilibrium statistical mechanics and stochastic control. The natural relaxation of such a system - driven by diffusion - runs from the structured target toward the disordered reference. The natural question is then: what is the minimum-work stochastic process that reverses this relaxation, given a pathwise cost functional combining spatial penalties and control effort? Computing this optimal process requires knowledge of trajectories that already sample the target ensemble - precisely the object one is trying to construct. We resolve this by establishing a time-reversal duality: the value function governing the hard backward dynamics satisfies an equivalent forward-in-time HJB equation, whose solution can be read off directly from the tractable forward relaxation trajectories. Via the Cole-Hopf transformation and its associated Feynman-Kac representation, this forward potential is computed as a path-space free energy averaged over these forward trajectories - the same relaxation paths that are easy to simulate - without any backward simulation or knowledge of the target beyond samples. The resulting framework provides a physically interpretable description of stochastic transport in terms of path-space free energy, risk-sensitive control, and spatial cost geometry. We illustrate the theory with numerical examples that visualize the learned value function and the induced controlled diffusions, demonstrating how spatial cost fields shape transport geometry analogously to Fermat’s Principle in inhomogeneous media. Our results establish a unifying connection between stochastic optimal control, Schrödinger bridge theory, and non-equilibrium statistical mechanics.

Keywords: optimal transport, stochastic control, Hamilton-Jacobi-Bellman equation, Schrödinger bridge, non-equilibrium statistical mechanics, path-space free energy, controlled diffusion, time-reversal duality
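
The Cole-Hopf step mentioned in the abstract can be illustrated in the textbook setting of path-integral control. The following is a sketch under assumptions that are not taken from the paper: a controlled diffusion $dX_t = u_t\,dt + \sigma\,dW_t$, a running cost $q(x) + \tfrac{1}{2}|u|^2$, and a terminal cost $\phi$. The symbols $\sigma$, $q$, $\phi$, and $\psi$ are illustrative, and the paper's forward-in-time formulation may differ in signs and boundary conditions.

$$
\begin{aligned}
-\partial_t V &= q - \tfrac{1}{2}\,|\nabla V|^2 + \tfrac{\sigma^2}{2}\,\Delta V, \qquad V(x,T) = \phi(x), \\
V = -\sigma^2 \log \psi \;\;&\Longrightarrow\;\; -\partial_t \psi = \tfrac{\sigma^2}{2}\,\Delta \psi - \tfrac{q}{\sigma^2}\,\psi, \qquad \psi(x,T) = e^{-\phi(x)/\sigma^2}, \\
\psi(x,t) &= \mathbb{E}\!\left[\exp\!\left(-\tfrac{1}{\sigma^2}\int_t^T q(X_s)\,ds - \tfrac{1}{\sigma^2}\,\phi(X_T)\right) \,\middle|\, X_t = x\right], \qquad dX_s = \sigma\,dW_s .
\end{aligned}
$$

The substitution cancels the quadratic term $|\nabla V|^2$, so the nonlinear HJB equation becomes a linear equation for $\psi$. The Feynman-Kac formula then expresses $\psi$ as an average over uncontrolled trajectories, and $V = -\sigma^2 \log \psi$ takes the form of a path-space free energy, with optimal control $u^* = -\nabla V = \sigma^2 \nabla \log \psi$.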

292. ❌ Sparse $ε$ insensitive zone bounded asymmetric elastic net support vector machines for pattern classification

Authors: Haiyan Du, Hu Yang Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07748v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

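Scoring rationale: The paper studies an improved variant of the classical support vector machine (SVM), combining the elastic net loss with a robust loss framework to build a sparse and robust SVM. Its content lies entirely within SVM optimization in classical machine learning: loss function design, optimization algorithms, and pattern classification. Every scoring keyword concerns large language models (LLMs), the principles of deep learning techniques, or their applications in various domains, and the paper touches none of large models, deep learning, language models, or AI for Science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the noise sensitivity and lack of sparsity of conventional support vector machines, the paper proposes an SVM variant based on an ε-insensitive zone bounded asymmetric elastic net loss (ε-BAEN-SVM), and shows through theoretical proofs and experiments that it balances sparsity and robustness better in noisy settings.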

Abstract (Chinese translation)

现有支持向量机（SVM）模型对噪声敏感且缺乏稀疏性，这限制了其性能。为解决这些问题，我们将弹性网络损失与鲁棒损失框架相结合，构建了一种稀疏的$\varepsilon$不敏感有界非对称弹性网络损失，并将其与SVM结合，建立了基于$\varepsilon$不敏感区域有界非对称弹性网络损失的支持向量机（$\varepsilon$-BAEN-SVM）。$\varepsilon$-BAEN-SVM兼具稀疏性与鲁棒性。通过证明位于$\varepsilon$不敏感带内的样本并非支持向量，验证了其稀疏性；而影响函数的有界性从理论上保证了其鲁棒性。针对该非凸优化问题，我们设计了一种基于裁剪对偶坐标下降的半二次算法，将原问题转化为一系列加权子问题，并通过$\varepsilon$参数提升了计算效率。在模拟与真实数据集上的实验表明，$\varepsilon$-BAEN-SVM优于传统及现有鲁棒SVM模型，能在噪声环境中良好平衡稀疏性与鲁棒性。统计检验结果证实了其优越性。在高斯核下，该模型实现了更高的精度与噪声不敏感性，验证了其有效性与实用价值。

Abstract

Existing support vector machine (SVM) models are sensitive to noise and lack sparsity, which limits their performance. To address these issues, we combine the elastic net loss with a robust loss framework to construct a sparse $\varepsilon$-insensitive bounded asymmetric elastic net loss, and integrate it with SVM to build $\varepsilon$ Insensitive Zone Bounded Asymmetric Elastic Net Loss-based SVM ($\varepsilon$-BAEN-SVM). $\varepsilon$-BAEN-SVM is both sparse and robust. Sparsity is proven by showing that samples inside the $\varepsilon$-insensitive band are not support vectors. Robustness is theoretically guaranteed because the influence function is bounded. To solve the non-convex optimization problem, we design a half-quadratic algorithm based on clipping dual coordinate descent. It transforms the problem into a series of weighted subproblems, improving computational efficiency via the $\varepsilon$ parameter. Experiments on simulated and real datasets show that $\varepsilon$-BAEN-SVM outperforms traditional and existing robust SVMs. It balances sparsity and robustness well in noisy environments. Statistical tests confirm its superiority. Under the Gaussian kernel, it achieves better accuracy and noise insensitivity, validating its effectiveness and practical value.

Keywords: Support Vector Machines, Sparse Models, Robust Loss, Elastic Net, Pattern Classification, Noise Insensitivity, Half-Quadratic Algorithm, ε-Insensitive Band
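
The sparsity argument in the abstract rests on an ε-insensitive band: samples that fall inside the band incur no loss and therefore do not become support vectors. The sketch below shows this effect with the classic ε-insensitive loss from support vector regression, $\max(0, |r| - \varepsilon)$. It is not the paper's bounded asymmetric elastic net loss, whose exact form the abstract does not give; the residual values are made up for illustration.

```python
import numpy as np


def eps_insensitive_loss(residual, eps):
    """Classic epsilon-insensitive loss: zero inside the band |r| <= eps."""
    return np.maximum(0.0, np.abs(residual) - eps)


# Hypothetical residuals for five samples.
residuals = np.array([-0.30, -0.05, 0.00, 0.08, 0.50])
losses = eps_insensitive_loss(residuals, eps=0.1)

# Samples with zero loss lie inside the band and exert no pull on the solution.
inside_band = losses == 0.0
print(losses)
print(inside_band)
```

Widening the band (a larger `eps`) puts more samples at zero loss, which is the mechanism behind the link between ε and sparsity that the abstract describes.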

Authors: Jingye Tan, Govinda Anantha Padmanabha, Steven J. Yang, Nikolaos Bouklas Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

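Scoring rationale: The paper focuses on AI for materials science, specifically constitutive model discovery using sparse representations and physics-augmented finite element model updating. It has no direct connection to most large-model keywords (LLMs, MoE, RLHF, and so on), but it does involve sparse models (sparsification) and explainable AI (interpretability), so each of those two keywords receives 5 points. The paper falls under AI for Science because it applies AI to a materials modeling problem, so that keyword receives 8 points. No other keyword is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes physics-augmented finite element model updating (paFEMU), which combines AI-enabled constitutive modeling, sparse representations, and adjoint optimization on multi-modal data to enable rapid discovery of material constitutive models.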

Abstract (Chinese translation)

人工智能赋能的本构模型建模近期进展，已从纯数据驱动范式转向强化物理约束与力学原理的实施，这一概念被称为物理增强。经典唯象方法依赖于选择预定义模型并标定其参数，而机器学习方法则常聚焦于模型本身的发现。稀疏回归方法介于两者之间，其在标定过程中探索大型预定义模型库。在上述范式中，以及在神经网络架构背景下，稀疏化已被证明能够实现模型可解释性与不确定性量化，同时由于所得模型具有低维特性，也促进了异构软件集成。现有大多数人工智能赋能的本构建模研究集中于单一来源数据，但实际材料建模工作流可能包含来自多种不同来源的数据（多模态数据），以及来自同一材料类别内其他材料测试的数据（多保真度数据）。本研究提出物理增强有限元模型更新（paFEMU）作为一种迁移学习方法，该方法融合了人工智能赋能的本构建模、面向可解释模型发现的稀疏化技术，以及基于有限元的伴随优化，并充分利用多模态数据。通过结合简单力学测试数据（可能来自不同材料）与数字图像相关型全场数据采集，最终实现快速本构模型发现。稀疏表达的简洁性使得神经本构模型易于集成到现有有限元工作流中，并在迁移学习过程中实现低维更新。

Abstract

Recent progress in AI-enabled constitutive modeling has concentrated on moving from a purely data-driven paradigm to the enforcement of physical constraints and mechanistic principles, a concept referred to as physics augmentation. Classical phenomenological approaches rely on selecting a pre-defined model and calibrating its parameters, while machine learning methods often focus on discovery of the model itself. Sparse regression approaches lie in between, where large libraries of pre-defined models are probed during calibration. Sparsification in the aforementioned paradigm, but also in the context of neural network architecture, has been shown to enable interpretability, uncertainty quantification, but also heterogeneous software integration due to the low-dimensional nature of the resulting models. Most works in AI-enabled constitutive modeling have also focused on data from a single source, but in reality, materials modeling workflows can contain data from many different sources (multi-modal data), and also from testing other materials within the same materials class (multi-fidelity data). In this work, we introduce physics augmented finite element model updating (paFEMU), as a transfer learning approach that combines AI-enabled constitutive modeling, sparsification for interpretable model discovery, and finite element-based adjoint optimization utilizing multi-modal data. This is achieved by combining simple mechanical testing data, potentially from a distinct material, with digital image correlation-type full-field data acquisition to ultimately enable rapid constitutive modeling discovery. The simplicity of the sparse representation enables easy integration of neural constitutive models in existing finite element workflows, and also enables low-dimensional updating during transfer learning.

Keywords: constitutive modeling, physics augmentation, sparse regression, finite element model updating, multi-modal data, transfer learning, interpretable model discovery, adjoint optimization
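
The abstract describes sparse regression as probing a large library of pre-defined model terms during calibration. A minimal sketch of that idea is sequentially thresholded least squares: fit all library terms, zero out the small coefficients, and refit the rest. This is a generic technique chosen for illustration, not the paFEMU procedure; the polynomial library, the toy stress-stretch law, and the threshold are assumptions.

```python
import numpy as np


def stlsq(library, target, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares over a library of candidate terms."""
    coef = np.linalg.lstsq(library, target, rcond=None)[0]
    for _ in range(iterations):
        small = np.abs(coef) < threshold
        coef[small] = 0.0  # prune terms that contribute little
        active = ~small
        if not active.any():
            break
        # Refit using only the surviving terms.
        coef[active] = np.linalg.lstsq(library[:, active], target, rcond=None)[0]
    return coef


# Hypothetical 1D data: "stress" generated from a linear plus a cubic term.
stretch = np.linspace(-1.0, 1.0, 50)
library = np.column_stack([stretch, stretch**2, stretch**3])
stress = 2.0 * stretch + 0.5 * stretch**3

coef = stlsq(library, stress)
print(coef)
```

The quadratic term is pruned and the two true terms are recovered, giving the kind of low-dimensional, readable model that the abstract associates with interpretability and easy integration into finite element workflows.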

294. ❌ The Condition-Number Principle for Prototype Clustering

Authors: Romano Li, Jianfei Cao Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

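Scoring rationale: The paper develops a geometric theory of prototype clustering, studying the relationship between a clustering condition number and misclassification error, and belongs to classical clustering theory in machine learning. All keywords concern large models, the principles of deep learning techniques, or AI applications in science, which are unrelated to the paper's purely theoretical clustering analysis, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a geometric framework that links objective accuracy to structural recovery through a clustering condition number. It proves that when the condition number is small, any solution with a small suboptimality gap also has a small misclassification error relative to a benchmark partition, and it reveals a fundamental trade-off between robustness and sensitivity to cluster imbalance.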

Abstract (Chinese translation)

我们构建了一个几何框架，将原型聚类中的目标精度与结构恢复联系起来。该分析独立于具体算法，适用于一大类可接受的损失函数。我们定义了一个聚类条件数，用于比较簇内尺度与将点移过簇边界所需的最小损失增量。当该数值较小时，任何具有较小次优性间隙的解相对于基准划分必然也具有较小的误分类误差。该框架还阐明了鲁棒性与对簇不平衡敏感度之间的基本权衡，从而在不同目标下呈现出精确恢复的急剧相变。这些保证是确定性的且非渐近的，并将算法精度的作用与问题实例固有的几何难度分离开来。我们进一步证明，误差集中在簇边界附近，且在加强的局部间隔条件下，足够深的簇核心能被精确恢复。这些结果共同提供了一个几何原理，用于将低目标值解释为存在有意义聚类结构的可靠证据。

Abstract

We develop a geometric framework that links objective accuracy to structural recovery in prototype-based clustering. The analysis is algorithm-agnostic and applies to a broad class of admissible loss functions. We define a clustering condition number that compares within-cluster scale to the minimum loss increase required to move a point across a cluster boundary. When this quantity is small, any solution with a small suboptimality gap must also have a small misclassification error relative to a benchmark partition. The framework also clarifies a fundamental trade-off between robustness and sensitivity to cluster imbalance, leading to sharp phase transitions for exact recovery under different objectives. The guarantees are deterministic and non-asymptotic, and they separate the role of algorithmic accuracy from the intrinsic geometric difficulty of the instance. We further show that errors concentrate near cluster boundaries and that sufficiently deep cluster cores are recovered exactly under strengthened local margins. Together, these results provide a geometric principle for interpreting low objective values as reliable evidence of meaningful clustering structure.

Keywords: prototype clustering, condition number, geometric framework, misclassification error, cluster boundaries, deterministic guarantees, phase transitions, algorithm-agnostic analysis
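
The abstract defines the clustering condition number only in words, as a comparison of within-cluster scale with the minimum loss increase needed to move a point across a cluster boundary. The sketch below computes one possible proxy for squared-Euclidean (k-means-style) loss. The specific choices here, mean squared distance as the scale and the smallest reassignment cost as the margin, are assumptions for illustration and are not the paper's definition.

```python
import numpy as np


def condition_number_proxy(points, centers, labels):
    """Ratio of within-cluster scale to the smallest reassignment margin."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Squared distance from every point to every prototype.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    idx = np.arange(len(points))
    own = d2[idx, labels]
    # Cost of the best alternative prototype for each point.
    others = d2.copy()
    others[idx, labels] = np.inf
    margin = (others.min(axis=1) - own).min()
    scale = own.mean()
    return scale / margin


# Two tight, well-separated clusters.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
centers = [[0.0, 0.5], [10.0, 0.5]]
labels = [0, 0, 1, 1]
kappa = condition_number_proxy(points, centers, labels)
print(kappa)
```

A small value means the clusters are tight relative to the cost of crossing a boundary, which is the regime in which the abstract says a near-optimal objective value implies a small misclassification error.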

295. ❌ Efficient Dataset Selection for Continual Adaptation of Generative Recommenders

Authors: Cathy Jiao, Juan Elenter, Praveen Ravichandran, Bernd Huber, Joseph Cauteruccio, Todd Wasson, Timothy Heath, Chenyan Xiong, Mounia Lalmas, Paul Bennett Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

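Scoring rationale: The paper studies data selection strategies for coping with temporal distribution drift in recommendation systems, focusing on data sampling, representation learning, and training efficiency. It does not involve large models, the principles of deep learning techniques, or AI applications in science, and it is unrelated to all of the given large-model keywords.

!!! tip deepseek-chat TL;DR

The paper studies how data selection strategies that combine gradient-based representations with distribution matching can handle temporal distribution drift efficiently in recommendation systems, improving training efficiency while preserving robustness to drift.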

Abstract (Chinese translation)

推荐系统必须持续适应不断演变的用户行为，然而大规模流式环境中产生的数据量使得频繁的完整模型重训练难以实现。本研究探讨了如何通过定向数据选择来缓解由时间分布漂移引起的性能下降，同时保持系统的可扩展性。我们评估了一系列表征选择与采样策略，旨在从用户交互数据中筛选出体量小但信息量大的子集。结果表明，基于梯度的表征方法结合分布匹配策略，能够提升下游模型性能，在保持对分布漂移鲁棒性的同时显著提高训练效率。这些发现凸显了数据筛选作为一种实用机制，可在生产级推荐系统中实现可扩展的监控与自适应模型更新。

Abstract

Recommendation systems must continuously adapt to evolving user behavior, yet the volume of data generated in large-scale streaming environments makes frequent full retraining impractical. This work investigates how targeted data selection can mitigate performance degradation caused by temporal distributional drift while maintaining scalability. We evaluate a range of representation choices and sampling strategies for curating small but informative subsets of user interaction data. Our results demonstrate that gradient-based representations, coupled with distribution-matching, improve downstream model performance, achieving training efficiency gains while preserving robustness to drift. These findings highlight data curation as a practical mechanism for scalable monitoring and adaptive model updates in production-scale recommendation systems.

Keywords: recommendation systems, data selection, temporal distributional drift, gradient-based representations, distribution-matching, training efficiency, scalability, adaptive model updates
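
The abstract names distribution matching over gradient-based representations but gives no algorithm. One simple way to picture it is greedy mean matching: repeatedly add the candidate that moves the mean of the selected subset closest to the mean of the target distribution. The sketch below assumes that each interaction already has a feature vector (standing in for a gradient-based representation); the function name and toy vectors are hypothetical, and the paper's actual selection rule may differ.

```python
import numpy as np


def greedy_mean_matching(candidates, target_mean, budget):
    """Pick `budget` candidates whose running mean tracks `target_mean`."""
    candidates = np.asarray(candidates, dtype=float)
    target_mean = np.asarray(target_mean, dtype=float)
    remaining = list(range(len(candidates)))
    selected = []
    running_sum = np.zeros_like(target_mean)
    for step in range(1, budget + 1):
        # Distance between the subset mean (if candidate i were added) and the target.
        def gap(i):
            return np.linalg.norm((running_sum + candidates[i]) / step - target_mean)

        best = min(remaining, key=gap)
        selected.append(best)
        remaining.remove(best)
        running_sum += candidates[best]
    return selected


# Toy representations; index 3 is an outlier far from the target distribution.
candidates = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.4], [5.0, 5.0]]
target_mean = [0.5, 0.5]
selected = greedy_mean_matching(candidates, target_mean, budget=3)
print(selected)
```

The outlier is never chosen because adding it would pull the subset mean away from the target, which is the intuition behind curating a small subset that still reflects the current data distribution.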

296. ❌ Needle in a Haystack – One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology

Authors: Swarnadip Chatterjee, Vladimir Basic, Arrigo Capitanio, Orcun Goksel, Joakim Lindblad Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.07722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

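Scoring rationale: The paper focuses on detecting rare malignant cells in computational cytology with one-class representation learning methods (such as DSVDD and DROC), which relates it to applications of deep learning in biomedical image analysis. However, it does not involve large language models (LLMs), model architectures (such as MoE), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or model compression. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to bioinformatics (specifically computational pathology), but its core contribution is not in large-model techniques, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the challenge of detecting rare malignant cells in computational cytology and evaluates one-class representation learning methods (DSVDD and DROC). At ultra-low witness rates (≤1%), DSVDD achieves state-of-the-art instance-level abnormality ranking and in some cases even outperforms fully supervised learning.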

Abstract (Chinese translation)

在计算细胞学中，全切片图像上的恶性细胞检测十分困难，因为恶性细胞形态多样，且在大量正常细胞的背景中极为罕见。由于严重的类别不平衡和有限的标注，准确检测这些极其稀少的恶性细胞仍然具有挑战性。传统的弱监督方法，如多示例学习（Multiple Instance Learning, MIL），通常在实例层面泛化能力不足，尤其是在恶性细胞比例（见证率）极低的情况下。本研究探索了使用单类表征学习技术来检测低见证率场景下的恶性细胞。这些方法仅使用阴性切片区块进行训练，无需任何实例级监督。具体而言，我们评估了两种单类分类（One-Class Classification, OCC）方法——DSVDD和DROC，并将其与FS-SIL、WS-SIL以及近期的ItS2CLR方法进行了比较。单类方法学习正常性的紧凑表征，并在测试时检测其偏离。在公开可用的骨髓细胞形态学数据集（TCIA）和一个内部口腔癌细胞学数据集上的实验表明，DSVDD在实例级异常排序方面取得了最先进的性能，特别是在超低见证率（≤1%）的情况下，并且在某些情况下甚至优于全监督学习——由于详尽的实例级标注在全切片细胞学中通常不切实际，全监督学习通常并非可行选择。DROC在极端稀有性条件下也表现出竞争力，这得益于其分布增强的对比学习。这些发现凸显了单类表征学习作为一种稳健且可解释的优越选择，在极端稀有性条件下进行恶性细胞检测时优于多示例学习。

摘要 (Abstract)

In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1%$) and, in some cases, even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust and interpretable superior choice to MIL for malignant cell detection under extreme rarity.

Keywords: computational cytology, malignant cell detection, one-class representation learning, DSVDD, DROC, low-witness-rate, instance-level abnormality ranking, whole-slide images

297. ❌ A Quasi-Regression Method for the Mediation Analysis of Zero-Inflated Single-Cell Data

Authors: Seungjun Ahn, Donald Porchia, Panos Roussos, Maaike van Gerwen, Qing Lu, Zhigang Li Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

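Scoring rationale: The paper develops a causal mediation analysis method for single-cell data (the QuasiMed framework) and belongs to bioinformatics. It is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is somewhat relevant, because the work falls under bioinformatics; however, the paper is not about applying large models or deep learning to science but about a traditional statistical method. That keyword therefore receives 5 points (some relevance), and all others receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes QuasiMed, a quasi-regression method for causal mediation analysis of zero-inflated single-cell data. By relaxing strict distributional assumptions, it achieves high statistical power, false discovery rate control, and computational efficiency in real-data-inspired simulations, and it is applied to the ROSMAP single-cell data.

Abstract (Chinese translation)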

单细胞技术的近期进展提升了我们在单细胞分辨率下对基因调控和细胞异质性的理解。单细胞数据同时包含基因表达水平与表达细胞比例，这使其在结构上区别于批量数据。目前，针对单细胞数据的因果中介分析方法研究仍较为有限，且通常需要特定的分布假设。为应对这一挑战，我们提出了QuasiMed——一个专为单细胞数据设计的中介分析框架。该方法包含三个步骤：（1）通过惩罚回归与边际模型（类似于确定独立筛选）筛选潜在中介变量；（2）基于平均表达水平与表达细胞比例估计间接效应；（3）在多重检验控制下进行假设检验。QuasiMed的核心优势在于其仅通过拟回归框架设定中介模型的均值函数，从而放宽了严格的分布假设。通过基于真实数据启发的模拟实验，该方法展现出较高的统计功效、错误发现率控制能力及计算效率。最后，我们将QuasiMed应用于ROSMAP单细胞数据集，以展示其在识别中介因果通路方面的潜力。相关R包已在GitHub仓库（https://github.com/sjahnn/QuasiMed）开源提供。

Abstract

Recent advances in single-cell technologies have improved our understanding of gene regulation and cellular heterogeneity at single-cell resolution. Single-cell data contain both gene expression levels and the proportion of expressing cells, which makes them structurally different from bulk data. Currently, methodological work on causal mediation analysis for single-cell data remains limited and often requires specific distributional assumptions. To address this challenge, we present QuasiMed, a mediation framework specialized for single-cell data. Our proposed method comprises three steps: (i) screening mediator candidates through penalized regression and marginal models (similar to sure independence screening), (ii) estimation of indirect effects through the average expression and the proportion of expressing cells, and (iii) hypothesis testing with multiplicity control. The key benefit of QuasiMed is that it specifies only the mean functions of the mediation models through a quasi-regression framework, thereby relaxing strict distributional assumptions. The method's performance was evaluated through real-data-inspired simulations, which demonstrated high power, false discovery rate control, and computational efficiency. Lastly, we applied QuasiMed to ROSMAP single-cell data to illustrate its potential to identify mediating causal pathways. An R package is freely available on GitHub at https://github.com/sjahnn/QuasiMed.
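
The third step, hypothesis testing with multiplicity control, can be illustrated with the Benjamini-Hochberg step-up procedure, a standard way to control the false discovery rate. Whether QuasiMed uses this particular procedure is an assumption; the abstract only states that multiplicity is controlled. The function name and the p-values below are made up for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # i/m * alpha for rank i
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        rejected[order[: k + 1]] = True            # reject everything up to that rank
    return rejected


# Hypothetical p-values for ten candidate mediators.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(p_vals, alpha=0.05)
print("rejected mediator indices:", np.nonzero(mask)[0].tolist())
```

The step-up rule rejects every hypothesis ranked at or below the largest rank whose p-value meets its threshold, which is less conservative than a Bonferroni correction when many mediators are screened.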

Keywords: mediation analysis, single-cell data, quasi-regression, causal pathways, zero-inflated data, statistical method, ROSMAP, QuasiMed

298. ❌ Quantifying the Spatiotemporal Dynamics of Engineered Cardiac Microbundles

Authors: Hiba Kobeissi, Samuel J. DePalma, Javiera Jilberto, David Nordsletten, Brendon M. Baker, Emma Lejeune Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

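Scoring rationale: The paper focuses on cardiac tissue engineering and develops a computational pipeline for quantifying the contractile dynamics of cardiac microbundles, using biomedical engineering methods such as image analysis, displacement tracking, and strain reconstruction. It is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on). Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is somewhat relevant, since the study is an application in bioinformatics/computational biology; the paper itself does not use or mention any large-model or deep-learning technique and relies on traditional image processing and computational mechanics.

!!! tip deepseek-chat TL;DR

The study addresses the lack of a standardized analysis framework in cardiac tissue engineering by developing an open-source computational pipeline that quantifies the contractile dynamics of cardiac microbundles. Applied to 670 samples, it reveals continuous variation in contractile phenotypes and identifies a core set of 10 metrics.

Abstract (Chinese translation)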

明场延时成像在心脏组织工程中被广泛应用，然而缺乏标准化、可解释的分析框架限制了研究的可重复性与跨平台比较。本研究提出一个开放、可扩展的计算流程，用于量化人类诱导多能干细胞来源的心脏微束在显微视频中的时空收缩动力学。基于我们开源工具“MicroBundleCompute”和“MicroBundlePillarTrack”，我们定义了一套包含16项可解释的结构、功能及时空指标，用以捕捉组织形变、同步性与异质性。该框架将全场位移追踪、应变重构、空间配准、降维以及基于拓扑的矢量场分析整合于统一工作流中。将其应用于涵盖20种实验条件的670个心脏微束数据集，该流程揭示了收缩表型呈现连续变化而非离散的条件特异性聚类，且同一条件内的变异常超过条件间的差异。冗余分析识别出一个由10项指标组成的精简核心集合，在保留大部分信息内容的同时最小化了多重共线性。对去噪位移场的分析表明，收缩主要由全局各向同性模式主导，约半数样本中存在局部鞍型形变模式。所有软件与工作流均已开源发布，以支持动态组织力学的可重复、可扩展分析。

Abstract

Brightfield time-lapse imaging is widely used in cardiac tissue engineering, yet the absence of standardized, interpretable analytical frameworks limits reproducibility and cross-platform comparison. We present an open, scalable computational pipeline for quantifying spatiotemporal contractile dynamics in microscopy videos of human induced pluripotent stem cell-derived cardiac microbundles. Building on our open-source tools “MicroBundleCompute” and “MicroBundlePillarTrack,” we define a suite of 16 interpretable structural, functional, and spatiotemporal metrics that capture tissue deformation, synchrony, and heterogeneity. The framework integrates full-field displacement tracking, strain reconstruction, spatial registration, dimensionality reduction, and topology-based vector-field analysis within a unified workflow. Applied to a dataset of 670 cardiac microbundles spanning 20 experimental conditions, the pipeline reveals continuous variation in contractile phenotypes rather than discrete condition-specific clustering, with intra-condition variability often exceeding inter-condition differences. Redundancy analysis identifies a reduced core set of 10 metrics that retain most informational content while minimizing multicollinearity. Analysis of denoised displacement fields shows that contraction is dominated by a global isotropic mode, with localized saddle-type deformation patterns present in approximately half of the samples. All software and workflows are released openly to enable reproducible, scalable analysis of dynamic tissue mechanics.
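
To make the "strain reconstruction" step concrete, the sketch below computes a Green-Lagrange strain field from a gridded 2D displacement field using finite differences. It assumes this particular strain measure and a regular pixel grid; the abstract does not say which formulation MicroBundleCompute uses, and `green_lagrange_strain` is a hypothetical name rather than a function from that tool.

```python
import numpy as np


def green_lagrange_strain(ux, uy, dx=1.0, dy=1.0):
    """Green-Lagrange strain E = 0.5 * (F^T F - I) at every grid point.

    ux, uy: displacement components on a (rows=y, cols=x) grid.
    Returns an array of shape ux.shape + (2, 2).
    """
    # np.gradient returns derivatives along axis 0 (y) first, then axis 1 (x).
    dux_dy, dux_dx = np.gradient(ux, dy, dx)
    duy_dy, duy_dx = np.gradient(uy, dy, dx)

    # Deformation gradient F = I + grad(u).
    F = np.empty(ux.shape + (2, 2))
    F[..., 0, 0] = 1.0 + dux_dx
    F[..., 0, 1] = dux_dy
    F[..., 1, 0] = duy_dx
    F[..., 1, 1] = 1.0 + duy_dy

    C = np.swapaxes(F, -1, -2) @ F   # right Cauchy-Green tensor F^T F
    return 0.5 * (C - np.eye(2))


# Toy field: uniform 5% isotropic contraction toward the origin.
y, x = np.mgrid[0:5, 0:6].astype(float)
ux = -0.05 * x
uy = -0.05 * y
E = green_lagrange_strain(ux, uy)
print("E_xx at the center:", E[2, 3, 0, 0])
```

A uniform contraction like this toy field yields equal normal strains and zero shear, which is the kind of global isotropic mode the abstract reports as dominant.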

Keywords: cardiac tissue engineering, contractile dynamics, microscopy video analysis, displacement tracking, strain reconstruction, dimensionality reduction, open-source software, reproducible analysis

299. ❌ Predicting Activity Cliffs for Autonomous Medicinal Chemistry

Authors: Michael Cuccarese Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07560v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

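Scoring rationale: The paper focuses on activity cliff prediction in computational medicinal chemistry, using a traditional machine learning approach (an 11-feature model) to analyze 25 million molecular pairs; it is an application of AI in science (specifically cheminformatics). It does not involve large models, deep-learning principles, or any other technique in the keyword list (such as MoE, RLHF, or RAG) and is highly relevant only to "AI for Science OR Bioinformatics OR Cheminformatics", so that keyword is rated 10 and all others 0.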
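
The fail verdict for this entry follows from the report's pass rule (passing score = 5.0 + 0.8 × total weight, with a configured total weight of 27.0). The sketch below reproduces that arithmetic. The report does not state how a row's score is derived from its weight and relevance, so the sketch makes no assumption about it and simply sums the row scores as listed in the table; the function and variable names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold from the report header: base + factor * total weight."""
    return base + factor * total_weight


# Row scores as listed in the table above: all 27 rows show a score of 0.0,
# including the last row, whose relevance is 10.0/10 but whose listed weight is 0.0.
row_scores = [0.0] * 27

total = sum(row_scores)
threshold = round(passing_score(27.0), 1)
passed = total >= threshold

print(f"total={total:.1f}, threshold={threshold:.1f}, passed={passed}")
```

A high relevance on a single keyword cannot lift the total when the listed row score is zero, which is why the entry is marked ❌ despite the 10.0/10 rating.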

!!! tip deepseek-chat TL;DR

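The study tackles activity cliff prediction in medicinal chemistry. From large-scale matched-molecular-pair data it builds a model that identifies positions where small structural modifications cause large potency changes, reducing the positions a chemist must explore from 3.1 to 2.1 (reported as a 31% reduction in first-round experiments).

Abstract (Chinese translation)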

活性悬崖预测——即识别哪些细微结构变化会导致活性显著跃迁的分子位点——一直是计算药物化学领域的长期挑战。本研究采用一种简约的定义框架：探究在哪些位点进行何种细微修饰最可能引发活性状态的改变。我们基于来自六大蛋白家族、50个ChEMBL靶点的2500万组匹配分子对，计算了位点层面的敏感性，揭示出两个核心问题具有本质不同的答案。“哪些位点变异最频繁？”仅通过骨架尺寸即可准确回答（NDCG@3 = 0.966），无需机器学习介入。而“哪些是真正的活性悬崖？”——即通过SALI归一化捕捉的、细微修饰引发不成比例巨大效应的位点——则需要包含三维药效团背景的11特征模型进行预测（NDCG@3 = 0.910，随机基线0.839）。该模型在全部六大蛋白家族、新颖骨架（0.913）及时间分割测试（0.878）中均表现出良好泛化能力，能以53%的准确率首次识别即定位易发悬崖位点（随机基线27%，提升近2倍），将化学家需要探索的位点数从3.1个降至2.1个——相当于首轮实验量减少31%。仅凭结构信息无法有效预测具体修饰方案（斯皮尔曼系数0.268，在新骨架上降至-0.31）。本系统已以开源代码及交互式网络应用形式公开发布。

Abstract

Activity cliff prediction - identifying positions where small structural changes cause large potency shifts - has been a persistent challenge in computational medicinal chemistry. This work focuses on a parsimonious definition: which small modifications, at which positions, confer the highest probability of an outcome change. Position-level sensitivity is calculated using 25 million matched molecular pairs from 50 ChEMBL targets across six protein families, revealing that two questions have fundamentally different answers. “Which positions vary most?” is answered by scaffold size alone (NDCG@3 = 0.966), requiring no machine learning. “Which are true activity cliffs?” - where small modifications cause disproportionately large effects, as captured by SALI normalization - requires an 11-feature model with 3D pharmacophore context (NDCG@3 = 0.910 vs. 0.839 random), generalizing across all six protein families, novel scaffolds (0.913), and temporal splits (0.878). The model identifies the cliff-prone position first 53% of the time (vs. 27% random - 2x lift), reducing positions a chemist must explore from 3.1 to 2.1 - a 31% reduction in first-round experiments. Predicting which modification to make is not tractable from structure alone (Spearman 0.268, collapsing to -0.31 on novel scaffolds). The system is released as open-source code and an interactive webapp.
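
The two quantities the abstract leans on, SALI and NDCG@3, can be sketched as follows. SALI is written in its usual form, the absolute activity difference divided by one minus the structural similarity. NDCG is computed with linear gains and a log2 discount, which is one common convention; the paper's exact gain function and similarity measure are not given in the abstract, and all numbers below are invented for illustration.

```python
import math


def sali(activity_i, activity_j, similarity):
    """Structure-Activity Landscape Index for a pair of molecules (similarity < 1)."""
    return abs(activity_i - activity_j) / (1.0 - similarity)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_relevance, predicted_scores, k=3):
    """NDCG@k: DCG of the predicted ordering divided by the ideal DCG."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_relevance, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical per-position cliff sensitivity for one scaffold, and model scores.
true_sensitivity = [0.2, 0.9, 0.1, 0.5, 0.3]
model_scores = [0.1, 0.8, 0.3, 0.7, 0.2]

print("NDCG@3:", round(ndcg_at_k(true_sensitivity, model_scores, k=3), 3))
print("SALI for a similar pair with a 2-log-unit potency gap:", sali(7.0, 5.0, 0.8))
```

Because NDCG rewards placing the most sensitive positions first, even a random ordering can score fairly high on short lists, which is consistent with the abstract reporting a random baseline of 0.839.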

Keywords: activity cliff prediction, computational medicinal chemistry, matched molecular pairs, pharmacophore context, scaffold generalization, experimental reduction, open-source system

300. ❌ Validated Synthetic Patient Generation for Small Longitudinal Cohorts: Coagulation Dynamics Across Pregnancy

Authors: Jeffrey D. Varner, Maria Cristina Bravo, Carole McBride, Thomas Orfeo, Ira Bernstein Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

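Scoring rationale: The paper proposes a generative framework based on modern Hopfield network theory (multiplicity-weighted Stochastic Attention) that generates synthetic patient data from small longitudinal clinical cohorts to address data scarcity. Its core is the application of a generative model to biomedicine (specifically coagulation dynamics), which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, it does not involve large language models (LLMs), innovations in deep-learning principles, or other specific techniques in the keyword list (such as MoE, SFT, or RAG), so all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The study addresses how data scarcity in small longitudinal clinical cohorts limits computational modeling. It proposes a Hopfield-network-based generative framework (SA) for synthesizing realistic patient data and, on a coagulation dynamics dataset, shows that the synthetic data are statistically, structurally, and mechanistically indistinguishable from real data and support downstream modeling.

Abstract (Chinese translation)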

在孕产妇健康、罕见病及早期临床试验等领域常见的小规模纵向临床队列限制了计算建模的发展：患者数量过少难以训练可靠模型，而通过额外招募扩大队列则成本高昂且进展缓慢。本文提出基于现代霍普菲尔德网络理论的多重加权随机注意力生成框架，以解决这一难题。该框架将真实患者特征嵌入连续能量景观作为记忆模式，通过朗之万动力学生成新型合成患者数据——这些数据在保留原始队列几何结构的同时，实现了存储模式间的插值生成。模式特异性多重加权机制可在推理阶段定向扩增罕见临床亚组，且无需重新训练模型。我们将该框架应用于包含23名孕妇的纵向凝血数据集，该数据集涵盖孕前基线、孕早期和孕晚期三次访视的72项生化特征，包含多囊卵巢综合征和子痫前期等罕见亚组。经多项独立验证测试（包括凝血级联反应的常微分方程模型）证实，合成患者在统计学特征、结构特性和机制层面均与真实患者不可区分。下游效用测试进一步表明：完全基于合成患者校准的机制模型对预留真实患者结局的预测能力，与基于真实数据校准的模型表现相当。这些结果证明，该生成框架能够从极小规模纵向数据集中生成具有临床实用价值的合成队列，为小规模队列场景下的数据增强建模提供了可行路径。

Abstract

Small longitudinal clinical cohorts, common in maternal health, rare diseases, and early-phase trials, limit computational modeling: too few patients to train reliable models, yet too costly and slow to expand through additional enrollment. We present multiplicity-weighted Stochastic Attention (SA), a generative framework based on modern Hopfield network theory that addresses this gap. SA embeds real patient profiles as memory patterns in a continuous energy landscape and generates novel synthetic patients via Langevin dynamics that interpolate between stored patterns while preserving the geometry of the original cohort. Per-pattern multiplicity weights enable targeted amplification of rare clinical subgroups at inference time without retraining. We applied SA to a longitudinal coagulation dataset from 23 pregnant patients spanning 72 biochemical features across 3 visits (pre-pregnancy baseline, first trimester, and third trimester), including rare subgroups such as polycystic ovary syndrome and preeclampsia. Synthetic patients generated by SA were statistically, structurally, and mechanistically indistinguishable from their real counterparts across multiple independent validation tests, including an ordinary differential equation model of the coagulation cascade. A downstream utility test further showed that a mechanistic model calibrated entirely on synthetic patients predicted held-out real patient outcomes as well as one calibrated on real data. These results demonstrate that SA can produce clinically useful synthetic cohorts from very small longitudinal datasets, enabling data-augmented modeling in small-cohort settings.
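
A minimal sketch of the mechanism described above, under stated assumptions: stored patient profiles act as patterns in a modern-Hopfield-style energy, E(x) = -(1/β) log Σ_i w_i exp(β ξ_i·x) + ½‖x‖², and new samples are drawn with unadjusted Langevin steps. Folding the multiplicity weights w_i into the softmax as log-weights, the step size, and the noise temperature are all illustrative choices, not details taken from the paper, and every name below is hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def energy_grad(x, patterns, log_w, beta):
    """Gradient of E(x): x minus the attention-weighted average of stored patterns."""
    p = softmax(beta * patterns @ x + log_w)
    return x - p @ patterns


def langevin_sample(patterns, weights, beta=4.0, step=0.05,
                    temperature=0.01, n_steps=500, seed=0):
    """Draw one synthetic profile by noisy gradient descent on the energy."""
    rng = np.random.default_rng(seed)
    log_w = np.log(weights)
    x = patterns[rng.integers(len(patterns))].copy()   # start at a stored profile
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = (x - step * energy_grad(x, patterns, log_w, beta)
             + np.sqrt(2.0 * step * temperature) * noise)
    return x


# Toy cohort: 5 stored profiles with 4 features each (synthetic numbers).
rng = np.random.default_rng(1)
patterns = rng.normal(size=(5, 4))
weights = np.ones(5)
weights[0] = 3.0   # amplify a "rare subgroup" pattern at sampling time, no retraining

sample = langevin_sample(patterns, weights)
print("synthetic profile:", np.round(sample, 3))
```

Changing `weights` alters only the sampling distribution, which mirrors the abstract's point that rare subgroups can be amplified at inference time without retraining.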

Keywords: synthetic patient generation, longitudinal cohorts, coagulation dynamics, Hopfield network, generative framework, data augmentation, clinical modeling, small datasets

301. ❌ Beyond the Static Approximation: Assessing the Impact of Conformational and Kinetic Broadening on the Description of TADF Emitters

Authors: Daniel Beer, Jonas Weiser, Tom Gabler, Kirsten Zeitler, Carsten Deibel, Christian Wiebeler Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08483v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

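Scoring rationale: The paper studies the kinetic characterization of thermally activated delayed fluorescence (TADF) in organic light-emitting diodes (OLEDs), within materials science and physical chemistry. It is entirely unrelated to the vast majority of keywords (large models, deep-learning principles, training methods, inference optimization, AI agents, and so on), which are therefore scored 0. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper uses a computational chemistry method (a semiclassical Marcus-like calculation) to study molecular systems, which is a scientific computing application; it is not core AI-driven research, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Addressing the difficulty of characterizing TADF in solid-state thin films for OLEDs, the paper proposes the "Gamma-Fit" analysis method based on the gamma distribution. By accounting for conformational and kinetic heterogeneity, it accurately extracts kinetic parameters for several emitters and shows that the local environment is an important factor in OLED efficiency.

Abstract (Chinese translation)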

热活化延迟荧光（TADF）是实现高效、无金属有机发光二极管（OLEDs）的一条重要途径。然而，固态薄膜中的TADF动力学表征常因显著的多指数光致发光衰减而复杂化，这阻碍了标准的双指数模型拟合。本研究引入了“Gamma-Fit”方法，这是一种基于伽马分布的简化分析框架，能够解释无序分子体系中固有的连续衰减速率分布。通过将衰减视为构象异质性和动力学异质性的结果，我们准确提取了基准发光材料4CzIPN和5CzBN，以及一系列新型二苯胺（DPA）基体系的动力学参数。我们的结果表明，考虑薄膜中的局部环境仍然是决定OLED效率的重要因素。实验发现辅以一种半经典类马库斯（Marcus-like）计算方法进行了验证。我们评估了这种传统的单构象速率计算方法的可靠性，并指出构象系综和多个具有反向系间窜越（RISC）活性的三重态的存在，是准确描述跃迁动力学的重要因素。

Abstract

Thermally activated delayed fluorescence (TADF) is a promising route towards high-efficiency, metal-free organic light-emitting diodes (OLEDs). However, the characterization of TADF kinetics in solid-state thin films is often complicated by pronounced multiexponential photoluminescence decays that prevent standard biexponential modeling. In this work, we introduce the ‘Gamma-Fit’ method, a streamlined analytical framework based on the gamma distribution that accounts for the continuous distribution of decay rates inherent in disordered molecular ensembles. By treating the decay as a result of conformational and kinetic heterogeneity, we accurately extract kinetic parameters for the benchmark emitters 4CzIPN and 5CzBN, as well as a series of novel diphenylamine (DPA)-based systems. Our results reveal that accounting for the local environment in thin films remains an important part in determining OLED efficiency. The experimental findings are complemented by a semiclassical Marcus-like computational approach. We evaluate the reliability of this conventional single-conformation rate calculation method and highlight the presence of conformational ensembles and multiple RISC-active triplet states as important factors for accurately describing the transition kinetics.
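
Why a gamma distribution of decay rates is convenient can be shown with a standard result: if the rate k follows a gamma density with shape α and rate parameter λ, the ensemble-averaged decay ⟨exp(-kt)⟩ has the closed form (1 + t/λ)^(-α), a stretched, non-exponential curve. The sketch below checks that identity numerically. It is a generic illustration; the actual Gamma-Fit parameterization and fitting procedure are not described in the abstract, and the parameter values are arbitrary.

```python
import math

import numpy as np


def gamma_decay(t, alpha, rate):
    """Closed-form ensemble decay: Laplace transform of a Gamma(alpha, rate) density."""
    return (1.0 + t / rate) ** (-alpha)


def gamma_decay_numeric(t, alpha, rate, k_max=80.0, n=400_001):
    """Same quantity by trapezoidal integration over decay rates k (needs alpha >= 1)."""
    k = np.linspace(0.0, k_max, n)
    pdf = rate**alpha * k ** (alpha - 1.0) * np.exp(-rate * k) / math.gamma(alpha)
    y = pdf * np.exp(-k * t)
    dk = k[1] - k[0]
    return float(((y[1:] + y[:-1]) * 0.5 * dk).sum())


alpha, rate = 2.0, 1.0   # arbitrary shape and rate parameter (inverse time units)
for t in (0.0, 0.5, 2.0):
    print(t, gamma_decay(t, alpha, rate), gamma_decay_numeric(t, alpha, rate))
```

A single pair (α, λ) thus stands in for a whole continuum of exponentials, which is how a distribution-based fit can describe decays that a biexponential model cannot.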

Keywords: Thermally activated delayed fluorescence, TADF, OLED, Gamma-Fit method, conformational heterogeneity, kinetic parameters, semiclassical Marcus-like computation, molecular ensembles

302. ❌ Theory-Guided Discovery of Pressure-Induced Transitions in Fast-Ion Conductor BaSnF4

Authors: Robin Turnbull, Zhang YingLong, Claudio Cazorla, Akun Liang, Rahman Saqib, Miriam Pena-Alvarez, Catalin Popescu, Laura Pampillo, Daniel Errandonea Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
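Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0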

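Scoring rationale: The paper "Theory-Guided Discovery of Pressure-Induced Transitions in Fast-Ion Conductor BaSnF4" is in materials science: it uses density functional theory (DFT) and high-pressure experiments to study structural phase transitions of BaSnF4. It focuses entirely on materials physics, chemistry, and experimental techniques and involves no large models, deep learning, AI principles, or AI applications. The keywords (LLMs, MoE, SFT, RAG, CoT, Agents, and so on) all concern AI/large-model techniques or applications and are unrelated to the paper's topic, so all except "AI for Science" score 0. "AI for Science" receives 5 as a moderate relevance rating: the paper belongs to scientific computing/materials science but does not itself use AI methods.

!!! tip deepseek-chat TL;DR

Combining density functional theory calculations with high-pressure experiments, the study identifies two pressure-induced structural phase transitions in the fast-ion conductor BaSnF4, confirms the first experimentally, finds supporting evidence for the second, and clarifies the high-pressure phase diagram.

Abstract (Chinese translation)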

诸如BaSnF4之类的快离子导体因其高离子电导率和化学稳定性，在下一代固态电池技术中备受关注。然而，尽管压力诱导改性对调控功能特性具有重要意义，这些材料在极端条件下的行为仍不甚明晰。本研究结合密度泛函理论（DFT）计算与高压实验，探究了BaSnF4在高达40 GPa压力下的结构演化。DFT预测了两次压力诱导相变：在10 GPa时从常压四方相P4/nmm结构转变为单斜相P21/m-I结构，随后在32 GPa时转变为更致密的单斜相P21/m-II相。首次相变通过环境温度下进行的角分散X射线衍射、拉曼光谱和电阻率测量得到了实验证实。第二次相变则得到高压拉曼模式和电阻率行为的显著变化支持，这些变化与进一步的结构重组相一致。这些发现不仅阐明了BaSnF4的高压相图，也为基于氟锡酸盐的固态电解质中压力调控离子传输的潜力提供了新的见解。

Abstract

Fast-ion conductors such as BaSnF4 are of significant interest for next-generation solid-state battery technologies due to their high ionic conductivity and chemical stability. However, the behaviour of these materials under extreme conditions remains poorly understood, despite the relevance of pressure-induced modifications for tuning functional properties. In this study, we combine density functional theory (DFT) calculations with high-pressure experiments to investigate the structural evolution of BaSnF4 up to 40 GPa. DFT predicts two pressure-induced phase transitions: from the ambient-pressure tetragonal P4/nmm phase to a monoclinic P21/m-I structure at 10 GPa, and subsequently to a denser monoclinic P21/m-II phase at 32 GPa. The first transition is experimentally confirmed via angle-dispersive X-ray diffraction, Raman spectroscopy, and electrical resistivity measurements, all performed at ambient temperature. The second transition is supported by distinct changes in high-pressure Raman modes and resistivity behaviour, consistent with a further structural reorganization. These findings not only clarify the high-pressure phase diagram of BaSnF4, but also shed light on the potential for pressure-tuned ionic transport in fluorostannate-based solid electrolytes.

Keywords: BaSnF4, fast-ion conductor, pressure-induced phase transitions, density functional theory, high-pressure experiments, solid-state battery, ionic conductivity, structural evolution

303. ❌ Comparative high-pressure study on rare-earth entropy fluorite-type oxides

Authors: Pablo Botella, David Vie, Leda Kolarek, Neha Bura, Peijie Zhang, Anna Herlihy, Dominik Daisenberger, Catalin Popescu, Daniel Errandonea Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08371v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

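Scoring rationale: The paper studies the structural stability and vibrational properties of rare-earth high-entropy oxides under high pressure, within materials science and condensed matter physics. All scoring keywords concern large models, deep learning, and related techniques and are entirely unrelated to this experimental physics study, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Using high-pressure X-ray diffraction and Raman spectroscopy, the study characterizes the structural stability, local lattice distortions, and vibrational-mode changes of two rare-earth high-entropy oxides under pressure. It finds that higher configurational entropy reduces structural stability, with amorphization setting in above 22 GPa in the higher-entropy compound.

Abstract (Chinese translation)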

本研究对两种具有递增构型熵的萤石型稀土氧化物（(CePr)O$_{2-\delta}$ 与 (CePrLa)O$_{2-\delta}$）进行了高压对比研究。利用基于同步辐射的粉末X射线衍射和拉曼光谱分别将压力提升至30 GPa和20 GPa。尽管在9-16 GPa区间观察到异常现象（其特征为压缩性平台和振动模式变化），但两种化合物在整个研究压力范围内均保持立方萤石结构。该行为归因于局部晶格畸变和逐渐的键角弯曲，而非突发的相变。在(CePrLa)O$_{2-\delta}$中，22 GPa以上开始出现非晶化，表明其结构稳定性降低。两种体系的体弹模量在异常现象开始后均略有下降，暗示了微弱的晶格软化。拉曼光谱显示，随着阳离子无序度增加，F$_{2g}$模式强度受到抑制；而在压缩过程中，稀土-氧（RE-O）模式强度的增强证明了部分重有序化现象。我们的研究结果揭示了构型熵、阳离子尺寸和压力之间在决定稀土高熵氧化物结构稳定性与振动特性方面的复杂相互作用，并为理解极端条件下其结构韧性与局部无序的调控机制提供了新见解。

Abstract

We report a comparative high-pressure study of two fluorite-type rare-earth oxides with increasing configurational entropy, (CePr)O$_{2-\delta}$ and (CePrLa)O$_{2-\delta}$. Synchrotron-based powder X-ray diffraction and Raman spectroscopy were carried out up to 30 GPa and 20 GPa, respectively. Both compounds retain the cubic fluorite structure throughout the pressure range explored, although an anomaly is observed between 9-16 GPa, characterized by a compressibility plateau and changes in vibrational modes. This behavior is attributed to local lattice distortions and a progressive bond angle bending rather than abrupt phase transitions. In (CePrLa)O$_{2-\delta}$, the onset of amorphization is observed above 22 GPa, highlighting its reduced structural stability. The bulk modulus of both systems shows a slight decrease after the onset of the anomaly, suggesting subtle lattice softening. Raman spectroscopy reveals suppression of the F$_{2g}$ mode intensity with increasing cationic disorder, and under compression, partial reordering is evidenced by an increase in the RE-O mode intensity. Our results highlight the complex interplay between configurational entropy, cation size, and pressure in determining the structural stability and vibrational properties of rare-earth high-entropy oxides and provide insight into the mechanisms governing their resilience and local disorder under extreme conditions.

Keywords: high-pressure study, rare-earth oxides, fluorite-type structure, configurational entropy, structural stability, Raman spectroscopy, X-ray diffraction, amorphization

304. ❌ From Full Dynamic to Pure Static: A Family of $GW$-Based Approximations

Authors: Pierre-François Loos, Johannes Tölle Journal/Source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08350v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

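Scoring rationale: This paper studies GW approximation methods in quantum chemistry, which belongs to computational chemistry and is unrelated to all of the large-model and deep-learning keywords. It has only a partial connection to 'AI for Science': the work is a scientific-computing application, but it uses no AI methods and is purely a quantum-chemistry theory method, so it receives a relevance of 5 (partially related).

!!! tip deepseek-chat TL;DR

This paper proposes a hierarchy of GW-based approximations, ranging from fully dynamic to purely static, for studying the description of ionization potentials, and shows that partially static schemes can yield reliable quasiparticle energies while simplifying the computation.

Abstract (Chinese translation)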

我们基于$GW$近似，通过逐步降低自能动力学内容的方式，构建了一套系统性的单粒子格林函数方法层级体系。从完全动力学的戴森方程出发，我们建立了一系列近似方法，这些方法在标准$GW$近似与纯静态有效单粒子哈密顿量之间实现了连续过渡。该框架为研究动力学效应和粒子-空穴耦合在电离势描述中的作用提供了可控的研究路径。在此统一形式体系中，空穴支与粒子支可通过降维策略选择性地解耦至约化单粒子空间。通过对分子电离能进行基准测试，我们评估了该层级体系中不同方法的精度、数值稳健性及算法复杂度。研究表明，通过一致推导所得的部分静态方案能够在显著简化本征值问题的同时，提供可靠准粒子能量。我们进一步提出了一种新型静态厄米自能，它作为该层级体系的静态极限而获得。尽管其概念起源不同，但其计算结果与qs$GW$方法高度吻合，从而为部分自洽计算提供了一条替代的静态路径。

Abstract

We introduce a systematic hierarchy of one-body Green’s function methods derived from the $GW$ approximation, constructed by progressively reducing the dynamical content of the self-energy. Starting from the fully dynamical Dyson formulation, we generate a family of approximations that interpolates between the standard $GW$ approximation to purely static effective single-particle Hamiltonians. This framework enables a controlled investigation of the role of dynamical effects and particle-hole coupling in the description of ionization potentials. Within this unified formalism, the hole and particle branches can be selectively decoupled through downfolding strategies into reduced one-particle spaces. By benchmarking the different members of this hierarchy on molecular ionization energies, we assess their accuracy, numerical robustness, and algorithmic complexity. We demonstrate that consistently derived partially static schemes can yield reliable quasiparticle energies while significantly simplifying the underlying eigenvalue problem. We further introduce a novel static Hermitian self-energy obtained as the static limit of this hierarchy. Despite its conceptually distinct origin, it produces results remarkably close to those of qs$GW$, thereby providing an alternative static route toward partial self-consistency.

Keywords: GW approximation, Green’s function methods, self-energy, ionization potentials, quasiparticle energies, static Hermitian self-energy, partial self-consistency, molecular ionization energies

305. ❌ Crossing Seam Blockade

Authors: Ruoxi Liu, Xiaotong Zhu, Bing Gu Journal/source: arxiv Published: 2026-04-09 arXiv link: http://arxiv.org/abs/2604.08128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

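Scoring rationale: This paper studies electronic degeneracies and nonadiabatic transitions in photochemistry. Using quantum geometrical molecular dynamics simulations, it finds that a crossing seam can completely block a reaction channel. The work belongs to theoretical chemistry and photochemistry. All scoring keywords concern large models, deep-learning techniques, and their applications, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Using quantum geometrical molecular dynamics simulations, this theoretical study finds that in the hydrogen chain H4, previously regarded as a minimal model for singlet fission, the singlet fission channel is completely blocked by a crossing seam in molecular configuration space, revealing a new mechanism for controlling photochemical reactions.

Abstract (Chinese translation)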

电子简并与近简并（包括锥形交叉与避免交叉）通常伴随强烈的电子振动耦合与非绝热跃迁，在光化学、光物理与光生物学过程中起着基础性作用。然而，其对激发态化学反应性的影响尚未被完全理解。在本理论研究中，我们报告了一个令人惊奇的现象：分子构型空间中的交叉缝可以完全阻断一个开放的反应通道。具体而言，通过数值精确的从头算非绝热全量子几何分子动力学模拟，我们证明在氢链H$_4$（先前被确定为单线态裂解的最小模型）中，单线态裂解通道因电子量子几何效应而被阻断。我们提供了一个化学直观的图像以理解这一效应。我们的结果不仅揭示了调控光化学反应的新机制，也可能为阐明单线态裂解的机理提供启示。

Abstract

Electronic degeneracies and near-degeneracies including conical intersections and avoided crossings, typically accompanied by strong vibronic couplings and nonadiabatic transitions, play fundamental roles in photochemical, photophysical and photobiological processes. However, its implications on excited-state chemical reactivities are not fully understood. In this theoretical study, we report a surprising phenomena that an open reaction channel can be \emph{completely} blocked by a crossing seam in the molecular configuration space. Specifically, by numerically exact ab initio nonadiabatic full quantum geometrical molecular dynamics simulations, we show that the singlet fission channel in the hydrogen chain H$_4$, previously identified as a minimal model for singlet fission, is blocked due to electronic quantum geometry. We provide a chemically intuitive picture to understand this effect. Our results not only reveal a new mechanism for controlling photochemical reactions, but may also elucidate the mechanism of singlet fission.

Keywords: conical intersections, avoided crossings, vibronic couplings, nonadiabatic transitions, singlet fission, quantum geometrical molecular dynamics, H4 hydrogen chain, reaction channel blockade

306. ❌ The BOS-TMC Dataset: DFT Properties of 159k Experimentally Characterized Transition Metal Complexes Spanning Multiple Charge and Spin States

Authors: Aaron G. Garrison, Jacob W. Toney, Tatiana Nikolaeva, Roland G. St. Michel, Christopher J. Stein, Heather J. Kulik Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07623v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

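Scoring rationale: This paper introduces the BOS-TMC dataset, which contains density functional theory (DFT) properties for 159k transition metal complexes and is intended for machine-learning model development, DFT benchmarking, and exploration. The content focuses on computational chemistry and materials science and does not involve large models, deep-learning principles, or any specific technique named in the keywords (such as LLMs, MoE, or RLHF). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the dataset is meant to support machine learning in science. However, the paper itself does not apply AI directly and only mentions it as a potential use, so it receives a relevance of 5 (partially related). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

This work builds the BOS-TMC dataset, containing DFT properties for 159k transition metal complexes, to support machine-learning model development and DFT benchmarking.

Abstract (Chinese translation)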

本文介绍了波士顿开壳层过渡金属配合物（BOS-TMC）数据集，该数据集包含从剑桥结构数据库（CSD）中提取的15.9万个经实验表征的单核过渡金属配合物（TMCs）的密度泛函理论（DFT）性质数据，这些配合物具有多种自旋态及一系列形式电荷。为构建此数据集，我们采用迭代程序以可靠地确定配合物的整体电荷。基于此信息，我们进一步获取了最多三种自旋态（对于3d金属为低、中、高自旋态，对于4d和5d金属则根据与金属电子构型的兼容性获取低和中自旋态）的性质数据，共计34.38万种TMC/自旋组合。与以往数据集不同，我们在结构优化过程中保留了实验测得的重原子坐标。我们使用PBE0/def2-TZVP方法对这些结构进行单点能计算，并报告所有性质。我们引入了一种计算金属自旋依赖的原子化能的方法，并报告了每个TMC的该能量值。除电子能量外，我们还报告了多达七项额外性质，包括：最高占据分子轨道（HOMO）、最低未占分子轨道（LUMO）、HOMO-LUMO能隙、原子部分电荷、偶极矩、原子化能以及自旋劈裂能，总计超过290万项与TMC相关的性质数据。针对基于尺寸选取的代表性超过1万个配合物的子集，我们评估了计算性质对交换关联（xc）泛函选择的敏感性，所使用的十二种xc泛函覆盖了“雅各布阶梯”的各个层级，并指出了TMC空间中不确定性最高的热点区域。与以往的过渡金属数据集相比，BOS-TMC在电荷和自旋构型方面规模更大、多样性更高，因此其性质范围也更为广泛。该数据集预计将为机器学习模型开发、DFT基准测试以及探索性研究提供高保真的基础。

Abstract

We present the Boston Open-Shell Transition Metal Complex (BOS-TMC) dataset, a set of density functional theory (DFT) properties for 159k experimentally characterized mononuclear transition metal complexes (TMCs) in multiple spin states with a range of formal charges derived from the Cambridge Structural Database (CSD). To curate this set, we carried out an iterative procedure to confidently assign overall TMC charge. From this information, we then obtained properties in up to three spin states, i.e., low-, intermediate-, and high-spin for 3d metals and low- and intermediate-spin for 4d and 5d metals, depending on compatibility with the metal electron configuration, for a total of 343.8k TMC/spin combinations. At odds with prior sets, we preserved experimental heavy-atom coordinates in these structures during optimization. We report all properties using PBE0/def2-TZVP single-point energies on these structures. We introduce a scheme for computing metal-spin-dependent atomization energies, which we report for each TMC. Alongside electronic energies, we report up to seven additional properties including: HOMO, LUMO, HOMO-LUMO gap, atomic partial charges, dipole moments, atomization energies, and spin-splitting energies for a total of over 2.9M TMC-associated properties. For a representative subset of over 10k complexes chosen based on size, we evaluate the sensitivity of computed properties to exchange-correlation (xc) functional choice from a set of twelve xcs spanning rungs of “Jacob’s ladder”, highlighting hotspots of TMC space that have the greatest uncertainty. In comparison to prior transition-metal datasets, BOS-TMC is both larger and more diverse in terms of charge and spin configurations and, as a result, more diverse in its range of properties. This dataset is expected to provide a high-fidelity foundation for machine-learning model development, DFT benchmarking, and exploration.

Keywords: transition metal complexes, density functional theory, dataset, machine learning, computational chemistry, spin states, properties, benchmarking

307. ❌ Linear odd electrophoresis of a sphere in a charged chiral active fluid

Authors: Reinier van Buel, Bogdan Cichocki, Jeffrey C. Everts Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

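Scoring rationale: This paper studies the electrophoresis of charged colloidal particles in a fluid with odd viscosity, which belongs to soft-matter physics and fluid dynamics. All scoring keywords concern large models, deep learning, and related techniques (training methods, inference optimization, application frameworks, and so on), whereas the paper involves no artificial intelligence, machine learning, or computational-model techniques and is purely theoretical physics and fluid mechanics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies the electrophoresis of charged colloidal particles in a charged chiral active fluid with odd viscosity. It derives a general expression for the electrophoretic mobility of particles of any shape under weak external electric fields, obtains an exact analytical solution for a conducting charged sphere, and finds that odd viscosity produces directional asymmetries in the electrophoretic mobility tensor.

Abstract (Chinese translation)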

在表现出奇粘性的流体中，带电胶体颗粒的电泳现象是理解电荷稳定手性活性悬浮液内传输现象的一项基础性挑战。本文首次提出了带电手性活性流体（charged chiral active fluid）的概念，其中电动动力学与奇斯托克斯流（odd Stokes flow）相耦合，以探究牛顿流体中的经典电泳结果在奇粘性存在下如何推广。特别地，我们利用奇流体的洛伦兹互易定理（Lorentz reciprocal theorem），推导出了弱外电场下任意形状颗粒电泳迁移率（electrophoretic mobility）的通用表达式。通过将此结果应用于低zeta电势下的导电带电球体，我们得到了电泳迁移率的精确封闭解析表达式，该表达式对德拜屏蔽长度（Debye screening length）和奇粘性系数（odd-viscosity coefficient）的任意值均成立。与牛顿流体类似，我们发现电泳迁移率与不带电球体的平动迁移率成正比，并由亨利函数（Henry function）调节。然而，与牛顿流体不同的是，奇粘性会导致电泳迁移率张量（electrophoretic mobility tensor）出现方向不对称性，即使对于薄的双电层（electric double layer），这种不对称性依然存在。这一情况与悬浮在各向同性牛顿流体中的带电各向异性颗粒形成鲜明对比，在后一情形中，相同的静电屏蔽条件下各向异性效应会消失。

Abstract

The electrophoresis of charged colloidal particles in fluids exhibiting odd viscosity represents a fundamental challenge in understanding transport phenomena within charge-stabilized chiral active suspensions. Here, we provide the first concept of a charged chiral active fluid, where electrokinetics is coupled to odd Stokes flow, to explore how classical results from electrophoresis in Newtonian fluids generalize in the presence of odd viscosity. In particular, we derive a general expression for the electrophoretic mobility for particles of any shape under weak external electric fields using the Lorentz reciprocal theorem for odd fluids. By applying this result to a conducting charged sphere at low zeta potentials, we obtain an exact, closed-form analytical expression for the electrophoretic mobility, valid for arbitrary values of the Debye screening length and the odd-viscosity coefficient. Similar to Newtonian fluids, we find that the electrophoretic mobility is proportional to the translational mobility of an uncharged sphere, modulated by the Henry function. However, unlike in Newtonian fluids, odd viscosity leads to directional asymmetries in the electrophoretic mobility tensor that persist even for thin electric double layers. This case contrasts significantly with a charged anisotropic particle suspended in an isotropic Newtonian fluid, where anisotropic effects would vanish under the same electrostatic-screening conditions.

Keywords: electrophoresis, charged colloidal particles, odd viscosity, chiral active fluid, electrophoretic mobility, Lorentz reciprocal theorem, Debye screening length, directional asymmetries

Token usage statistics

Total: 1,016,554 tokens (input 709,949 / output 306,605)

Model	Input	Output	Total
deepseek-chat	552,875	306,605	859,480
glm-4.7	157,074	0	157,074
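
The reported totals can be cross-checked against the per-model rows. The following is a minimal Python sketch with the numbers copied from the table above. The `usage` dictionary and the `summarize` helper are illustrative names chosen here, not part of the report generator.

```python
# Token counts copied from the table above: (input, output) per model.
usage = {
    "deepseek-chat": (552_875, 306_605),
    "glm-4.7": (157_074, 0),
}


def summarize(usage):
    """Return (total input, total output, grand total) across all models."""
    total_in = sum(tokens_in for tokens_in, _ in usage.values())
    total_out = sum(tokens_out for _, tokens_out in usage.values())
    return total_in, total_out, total_in + total_out


total_in, total_out, total = summarize(usage)
print(f"{total:,} tokens (input {total_in:,} / output {total_out:,})")
```

Summing the two rows reproduces the headline figures: 709,949 input tokens, 306,605 output tokens, and 1,016,554 tokens overall.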


📋 All papers

1. ✅ Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

2. ✅ MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

3. ✅ Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

4. ✅ Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

5. ✅ EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

6. ✅ 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

7. ✅ What do Language Models Learn and When? The Implicit Curriculum Hypothesis

8. ✅ Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

9. ✅ LogAct: Enabling Agentic Reliability via Shared Logs

10. ✅ SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

11. ✅ Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

12. ✅ What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

13. ✅ ASPECT:Analogical Semantic Policy Execution via Language Conditioned Transfer

14. ✅ TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

15. ✅ Emotion Concepts and their Function in a Large Language Model

16. ✅ AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

17. ❌ ACF: A Collaborative Framework for Agent Covert Communication under Cognitive Asymmetry

18. ❌ Sensitivity-Positional Co-Localization in GQA Transformers

19. ❌ InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

20. ❌ Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance

21. ❌ From Gaze to Guidance: Interpreting and Adapting to Users’ Cognitive Needs with Multimodal Gaze-Aware AI Assistants

22. ❌ Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

23. ❌ Equivariant Efficient Joint Discrete and Continuous MeanFlow for Molecular Graph Generation

24. ❌ Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

25. ❌ Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

26. ❌ SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

27. ❌ AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

28. ❌ RewardFlow: Generate Images by Optimizing What You Reward

29. ❌ OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

30. ❌ PSI: Shared State as the Missing Layer for Coherent AI-Generated Instruments in Personal AI Agents

31. ❌ Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

32. ❌ Differentially Private Language Generation and Identification in the Limit

33. ❌ ClawBench: Can AI Agents Complete Everyday Online Tasks?

34. ❌ Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

35. ❌ PIArena: A Platform for Prompt Injection Evaluation

36. ❌ SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

37. ❌ Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

38. ❌ TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

39. ❌ From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

40. ❌ OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

41. ❌ A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation

42. ❌ CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

43. ❌ KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

44. ❌ HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

45. ❌ Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules

46. ❌ KV Cache Offloading for Context-Intensive Tasks

47. ❌ Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

48. ❌ On-board Telemetry Monitoring in Autonomous Satellites: Challenges and Opportunities

49. ❌ Synthetic Data for any Differentiable Target

50. ❌ Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

51. ❌ Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

52. ❌ Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

53. ❌ Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks

54. ❌ ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

55. ❌ Phantasia: Context-Adaptive Backdoors in Vision Language Models

56. ❌ Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover

57. ❌ TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs