📊 ArXiv Research Report (2026-03-18)

Generated: 2026-03-18 09:31:39 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing-score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
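
As a quick check of the arithmetic, the passing score follows directly from the formula and the 27 configured weights. The sketch below is only an illustration; the names `BASE`, `PER_WEIGHT`, and `passing_score` are assumptions and are not taken from the report generator.

```python
BASE = 5.0          # constant term of the passing-score formula
PER_WEIGHT = 0.8    # multiplier applied to the total keyword weight


def passing_score(weights):
    """Return 5.0 + 0.8 * (sum of keyword weights)."""
    return BASE + PER_WEIGHT * sum(weights)


# 27 keywords, each with weight 1.0, as in the keyword list.
threshold = passing_score([1.0] * 27)
print(round(threshold, 1))
```

With a total weight of 27.0 this reproduces the passing score of 26.6 shown in the settings.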

📈 Paper statistics

Total fetched: 337 papers
Passing papers: 15 (4.5%)
In-depth analyses: 5

⭐ Detailed analysis of passing papers

1. Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

Authors: Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14458v1

Score: 69.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

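Scoring rationale: The paper studies the reliability of LLMs in fact-seeking question answering and proposes a modular framework that separates planning from retrieval and synthesis. Highly relevant keywords (10 points): LLMs (the core object of study), RAG (the paper directly addresses retrieval augmentation), Tool Use (the framework uses tools for retrieval), and Hallucination Mitigation (the core goal is to reduce hallucinations and improve factuality). Strongly relevant keywords (8 points): Chain of Thought and System 2 Thinking (explicit planning that decomposes a question into steps) and LLM Agents (the framework is agent-like). Moderately relevant (5 points): SFT (the planner is trained with supervision in a teacher-student framework). The remaining keywords are not covered by the paper, or are mentioned only in passing, and score 0.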
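
The total of 69.0 can be reproduced from the non-zero rows of the score table. The sketch below assumes that each keyword contributes weight × relevance, capped at the per-keyword maximum of 15 from the scoring settings. How the generator actually applies that cap is an assumption, as are all names used here.

```python
MAX_PER_KEYWORD = 15.0   # "maximum score per keyword" from the scoring settings
PASSING_SCORE = 26.6     # current passing score from the scoring settings

# (keyword, weight, relevance out of 10): the non-zero rows of the score table.
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Post-training / SFT", 1.0, 5.0),
    ("Retrieval-Augmented Generation", 1.0, 10.0),
    ("Chain of Thought", 1.0, 8.0),
    ("System 2 Thinking", 1.0, 8.0),
    ("LLM Agents", 1.0, 8.0),
    ("Tool Use", 1.0, 10.0),
    ("Hallucination Mitigation", 1.0, 10.0),
]


def total_score(rows):
    """Sum weight * relevance per keyword, capping each term (assumed rule)."""
    return sum(min(w * r, MAX_PER_KEYWORD) for _, w, r in rows)


score = total_score(rows)
print(score, score >= PASSING_SCORE)
```

With all weights at 1.0 and relevance at most 10, the cap never triggers here; it would only matter for keywords weighted above 1.5.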

!!! tip deepseek-chat TL;DR

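To address the unreliability of retrieval-augmented LLMs in fact-seeking question answering when answers depend on up-to-date or conflicting information, the paper proposes a modular framework that explicitly separates planning from factual retrieval and answer synthesis. Supervised training of the planner improves both accuracy and latency.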

Abstract

Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.

Keywords: Large Language Models, Retrieval-Augmented Generation, Tool Use, Hallucination Mitigation, Planning, Fact-seeking QA, Teacher-Student Framework, Modular Framework

2. An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control with

Authors: Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14463v1

Score: 60.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

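Scoring rationale: The paper studies the specialization of large language models (LLMs) for the insurance domain, so it is highly relevant to “Large Language Models” (10 points). Methodologically, it proposes an end-to-end alignment paradigm that includes SFT and RLAIF, so it is highly relevant to “Post-training/SFT”, “Instruction Tuning/Alignment”, and “RLHF/RLAIF/DPO” (10 points each). Hallucination control is explicitly one of its core goals, so it is highly relevant to “Hallucination Mitigation” (10 points). The paper notes that existing methods rely on RAG and aims to go beyond it, so it is partially relevant to “RAG” (5 points). It adapts an LLM to a vertical domain, so it is partially relevant to “Domain Adaptation” (5 points). It does not cover the specific techniques or concepts behind the other keywords, which therefore score 0.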

!!! tip deepseek-chat TL;DR

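The paper addresses the challenge of specializing LLMs for the high-stakes insurance domain: achieving domain mastery and a very low hallucination rate without sacrificing general capabilities. It combines verifiable data synthesis with a progressive SFT-RL curriculum framework to train INS-S1, an insurance-specific model family that reaches SOTA performance on domain tasks while retaining top-tier general capabilities.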

Abstract

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

Keywords: Large Language Models, Insurance Domain, Hallucination Mitigation, Supervised Fine-tuning, RLAIF, Domain Adaptation, Verifiable Data Synthesis, Progressive Curriculum

3. CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad

Authors: Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, Kun Zhang Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14575v1

Score: 54.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

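Scoring rationale: CausalEvolve proposes an LLM-based AI-scientist agent for open-ended scientific problems, so it is highly relevant to “Large Language Models” and “LLM Agents” (10 points). The work falls within “AI for Science”, applying LLMs directly to scientific discovery (10 points). The method involves causal reasoning, reflection, and evolutionary improvement, so it is fairly relevant to “Chain of Thought”, “System 2 Thinking”, and “Self-Correction” (8 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated to the paper's core content (0 points).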

!!! tip deepseek-chat TL;DR

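To address the declining evolution efficiency and oscillatory behavior of existing evolve-based AI-scientist agents, the paper proposes CausalEvolve, which guides the evolution process through causal reasoning and reflection. Across multiple open-ended scientific tasks it improves evolutionary efficiency and discovers better solutions.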

Abstract

Evolve-based agent such as AlphaEvolve is one of the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the beginning, CausalEvolve first identifies outcome-level factors that offer complementary inspirations in improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns during the evolution and abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.

Keywords: CausalEvolve, Large Language Models, AI Scientists, open-ended scientific problems, evolutionary efficiency, causal reasoning, self-improvement, scientific discovery

4. Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Authors: Guangfu Hao, Yuming Dai, Xianzhe Qin, Shan Yu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15371v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

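Scoring rationale: The paper studies the limitations of LLMs on complex reasoning tasks and proposes the BIGMAS multi-agent architecture to improve reasoning performance. Highly relevant keywords (10 points): LLMs (used directly as the base models), Chain of Thought (the paper explicitly discusses chain-of-thought mechanisms and reasoning tasks), System 2 Thinking (it studies complex multi-step reasoning, which counts as in-depth reasoning), LLM Agents (it builds specialized LLM agents), and Multi-agent Systems (a multi-agent architecture is its core contribution). The other keywords are not mentioned in the abstract or are unrelated to the paper, and score 0.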

!!! tip deepseek-chat TL;DR

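To address the accuracy collapse of large language models on complex multi-step reasoning tasks, the paper proposes a brain-inspired graph multi-agent architecture. Dynamically constructed agent topologies, coordinated through a centralized shared workspace, significantly improve the performance of several frontier LLMs on complex reasoning tasks.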

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain-of-thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model-level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain-Inspired Graph Multi-Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem-adaptive GraphDesigner constructs task-specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local-view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi-agent baselines including ReAct and Tree of Thoughts, showing that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.

Keywords: Large Language Models, Multi-step Reasoning, Multi-agent Systems, Chain-of-Thought, Graph Topology, Agent Coordination, Reasoning Performance, Brain-Inspired Architecture

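In-depth analysis:

Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Summary:

To address the accuracy collapse of large language models (LLMs) and large reasoning models (LRMs) on complex tasks, the paper proposes BIGMAS, a graph multi-agent system inspired by the global workspace theory (GWT) of the brain. A GraphDesigner agent dynamically builds a problem-specific directed agent graph and shared-workspace schema for each problem, and a global Orchestrator makes routing decisions based on the complete state, overcoming the local-view limitation of reactive approaches. Experiments on three benchmarks (Game24, Six Fives, and Tower of London) with six frontier models show that BIGMAS significantly improves the reasoning performance of both standard LLMs and LRMs and outperforms baselines such as ReAct and Tree of Thoughts, demonstrating that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.

Innovations:

A dynamic graph architecture inspired by the global workspace theory (GWT) of the brain, in which a GraphDesigner agent builds the agent topology adaptively for each problem.
A global-state orchestration mechanism that uses a centralized shared workspace and a global Orchestrator to remove the local-view bottleneck of existing reactive multi-agent systems.
A robust execution pipeline with a self-correction loop and multi-strategy fallback parsing, which preserves execution integrity when a node fails.
Evidence that multi-agent architectural design provides complementary gains orthogonal to improvements in model-level reasoning, effectively mitigating accuracy collapse on complex tasks.

Method

!!! info

The paper proposes the BIGMAS framework, which draws on global workspace theory from neuroscience. The approach consists of: 1) using GraphDesigner to dynamically generate a task-specific agent graph; 2) establishing a centralized shared workspace so that all intermediate results are globally visible; 3) introducing a global Orchestrator that routes based on the complete state history; 4) evaluating six frontier LLMs, including DeepSeek, Claude, GPT, and Gemini models, on verifiable reasoning benchmarks such as Game24, Six Fives, and Tower of London.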
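
To make the control flow concrete, here is a minimal sketch of a directed agent graph coordinated through a shared workspace, with an orchestrator that routes on the full state. It is not the authors' implementation: the agents are plain Python functions standing in for LLM calls, the graph is hand-written rather than produced by a GraphDesigner, the task (find a prefix of a list that sums to a target) is a toy, and every name (`Workspace`, `orchestrate`, the node names) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Workspace:
    """Centralized shared state: every agent reads from and writes to it."""
    data: Dict[str, Any] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


Agent = Callable[[Workspace], None]


def proposer(ws: Workspace) -> None:
    # Stand-in for an LLM agent that proposes the next candidate answer.
    attempt = ws.data.get("attempts", 0) + 1
    ws.data["attempts"] = attempt
    ws.data["candidate"] = ws.data["numbers"][:attempt]


def verifier(ws: Workspace) -> None:
    # Stand-in for an agent that checks the candidate against the target.
    ws.data["verified"] = sum(ws.data["candidate"]) == ws.data["target"]


AGENTS: Dict[str, Agent] = {"proposer": proposer, "verifier": verifier}
# Directed edges of the (hand-written) agent graph.
EDGES: Dict[str, List[str]] = {"proposer": ["verifier"], "verifier": ["proposer"]}


def orchestrate(ws: Workspace, current: str) -> Optional[str]:
    """Choose the next node from the whole workspace, not just the last output."""
    if ws.data.get("verified"):
        return None  # goal reached
    exhausted = ws.data.get("attempts", 0) >= len(ws.data["numbers"])
    if exhausted and current == "verifier":
        return None  # nothing left to try
    return EDGES[current][0]


def run(ws: Workspace, start: str = "proposer", max_steps: int = 20) -> Workspace:
    node: Optional[str] = start
    for _ in range(max_steps):
        if node is None:
            break
        AGENTS[node](ws)          # the agent updates the shared workspace
        ws.history.append(node)   # record the execution trace
        node = orchestrate(ws, node)
    return ws


ws = run(Workspace(data={"numbers": [3, 4, 5], "target": 7}))
print(ws.data["verified"], ws.history)
```

The point of the sketch is the separation of roles: agents never call each other directly, and the routing decision sees everything written so far.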

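Key results:

BIGMAS significantly improves the performance of both standard LLMs and LRMs on all three reasoning benchmarks.
The gains are largest on the tasks where the individual models perform worst, that is, the most complex ones.
BIGMAS outperforms existing multi-agent baselines such as ReAct and Tree of Thoughts.
The results indicate that multi-agent coordination is a structural remedy for reasoning collapse, with gains orthogonal to model-level enhancements.

Tech stack: global workspace theory, dynamic graph construction, chain of thought, self-reflection, frontier LLMs (DeepSeek, Claude, GPT, Gemini), reasoning benchmark environments (Game24, Six Fives, Tower of London)

Strengths

Strong architectural novelty: a principle from neuroscience (GWT) is applied successfully to the design of an LLM multi-agent system.
It addresses two core limitations of existing multi-agent systems: fixed topologies and fragmented state.
It adapts dynamically, adjusting agent composition and topology to each problem.
The experimental validation is broad, covering several frontier models and different types of reasoning tasks, which makes the conclusions convincing.

Limitations

The architecture is relatively complex; adding GraphDesigner and a global Orchestrator may increase inference latency and compute cost.
It depends on GraphDesigner producing a correct graph; a poor initial graph design may hurt downstream reasoning efficiency.
The paper focuses on mathematical and planning tasks; generalization to tasks such as open-domain QA or commonsense reasoning remains to be verified.

Relevance to research direction:

The paper closely matches the research keywords. It falls under innovation in the technical principles of large models, proposing a new multi-agent architecture (BIGMAS) to address a core bottleneck in LLM reasoning (accuracy collapse). Beyond innovations in deep learning principles (dynamic graphs, global coordination), it shows application potential in scientific computing and logical reasoning. Its brain-inspired, cross-disciplinary contribution stands out and meets the criteria for a high score.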

5. SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Authors: Yu Pan, Wenlong Yu, Tiejun Wu, Xiaohu Ye, Qiannan Si, Guangquan Xu, Bin Wu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15397v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

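Scoring rationale: SFCoT focuses on making the chain-of-thought reasoning of large language models (LLMs) safer, with defense against jailbreak attacks at its core. It is therefore highly relevant to “Large Language Models” and “Chain of Thought” (10 points). It involves safety alignment (“Alignment”), hallucination mitigation (“Hallucination Mitigation”) to improve factuality, and self-correction (“Self-Correction”) through real-time calibration; these are key components of the paper and receive 8 points. Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract and are unrelated to the paper's topic, so they receive 0 points.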

!!! tip deepseek-chat TL;DR

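To address the vulnerability of large language models to jailbreak attacks during chain-of-thought reasoning, the paper proposes the SFCoT framework, which evaluates and calibrates intermediate reasoning steps in real time. It reduces the attack success rate from 58.97% to 12.31%, strengthening model safety without significantly degrading general performance.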

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation. To address this gap, this paper proposes a SaFer Chain-of-Thought (SFCoT) framework, which proactively evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT incorporates a three-tier safety scoring system alongside a multi-perspective consistency verification mechanism, designed to detect potential risks throughout the reasoning process. A dynamic intervention module subsequently performs targeted calibration to redirect reasoning trajectories toward safe outcomes. Experimental results demonstrate that SFCoT reduces the attack success rate from 58.97% to 12.31%, demonstrating it as an effective and efficient LLM safety enhancement method without a significant decline in general performance.

Keywords: Large Language Models, Chain-of-Thought, Safety Alignment, Jailbreak Attacks, Reasoning Security, Real-time Calibration, Adversarial Manipulation, Attack Success Rate Reduction

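In-depth analysis:

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Summary:

To address the vulnerability of large language models (LLMs) to jailbreak attacks, the paper proposes a safety framework called SFCoT (Safer Chain-of-Thought). Existing defenses mostly rely on post hoc filtering and overlook the risks in intermediate reasoning steps. SFCoT monitors every reasoning step in the chain of thought (CoT) in real time, using a three-tier safety scoring system (lexical, semantic, policy) and a multi-perspective consistency verification mechanism. When an unsafe or ambiguous step is detected, a dynamic intervention module calibrates it by truncating or rewriting. Experiments show that the method reduces the attack success rate from 58.97% to 12.31% while retaining 91.2% of the model's general performance, improving LLM safety on complex reasoning tasks.

Innovations:

An active chain-of-thought safety framework, SFCoT, that replaces the passive post hoc filtering of traditional defenses with real-time risk monitoring during reasoning.
A three-tier safety scoring system (lexical, semantic, policy) that, combined with multi-perspective consistency verification, identifies both explicit and implicit unsafe reasoning steps.
A dynamic intervention module that continues, rewrites, or truncates a step depending on its risk level (safe, gray zone, or unsafe), preserving as much model utility as possible while keeping the output safe.

Method

!!! info

The paper first defines a mathematical model comprising a safety scoring function and a calibration procedure. In the pipeline, a CoT Parser first splits the output into reasoning steps. A three-tier scoring system then computes a safety score: a lexical tier based on a sensitive-word lexicon, a semantic tier based on a lightweight deep learning model, and a context-based policy tier. For steps in the “gray zone”, several semantic variants are generated for consistency verification. Finally, based on the scores and verification results, a dynamic intervener rewrites or truncates the reasoning chain to keep the final output safe.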
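
The per-step loop described in the method can be sketched as: score a step, classify it as safe, gray zone, or unsafe, then continue, rewrite, or truncate. This is a toy illustration rather than the paper's method. The tier scorers are keyword heuristics instead of a lexicon, a classifier, and a context model; the consistency check on semantic variants is omitted; and the weights, thresholds, word list, and function names are all made-up assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical tier weights and thresholds; the paper's values are not given here.
WEIGHTS = (0.5, 0.3, 0.2)      # lexical, semantic, policy
SAFE_T, UNSAFE_T = 0.8, 0.5    # score >= SAFE_T: safe; score < UNSAFE_T: unsafe
BLOCKLIST = {"explosive", "malware"}   # toy sensitive-word list


def lexical(step: str, context: List[str]) -> float:
    # 0.0 if any blocklisted word appears, else 1.0.
    return 0.0 if any(w in step.lower() for w in BLOCKLIST) else 1.0


def semantic(step: str, context: List[str]) -> float:
    # Placeholder for a lightweight classifier; here a crude phrase check.
    return 0.2 if "bypass" in step.lower() else 1.0


def policy(step: str, context: List[str]) -> float:
    # Placeholder for a context-aware check: be stricter after a rewritten step.
    return 0.5 if any("[rewritten]" in c for c in context) else 1.0


TIERS: Tuple[Callable[[str, List[str]], float], ...] = (lexical, semantic, policy)


def safety_score(step: str, context: List[str]) -> float:
    """Weighted sum of the three tier scores, each in [0, 1]."""
    return sum(w * tier(step, context) for w, tier in zip(WEIGHTS, TIERS))


def rewrite(step: str) -> str:
    # Stand-in for an LLM rewrite of a gray-zone step.
    return "[rewritten] " + step.replace("bypass", "comply with")


def calibrate(steps: List[str]) -> List[str]:
    """Keep safe steps, rewrite gray-zone steps, truncate at the first unsafe one."""
    kept: List[str] = []
    for step in steps:
        score = safety_score(step, kept)
        if score >= SAFE_T:
            kept.append(step)
        elif score >= UNSAFE_T:
            kept.append(rewrite(step))
        else:
            kept.append("[truncated: unsafe reasoning step]")
            break
    return kept


chain = [
    "Identify what the user is asking for.",
    "Find a way to bypass the content filter.",
    "List the steps to build malware.",
    "Summarize the answer.",
]
result = calibrate(chain)
for line in result:
    print(line)
```

In this toy run the first step passes, the second lands in the gray zone and is rewritten, and the third falls below the unsafe threshold, so the chain is cut there and the fourth step is never reached.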

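Key results:

SFCoT reduces the attack success rate (ASR) under jailbreak attacks from a baseline of 58.97% to 12.31%, outperforming post hoc filtering (45.13%).
Ablations show that the multi-perspective consistency verification module further reduces ASR (from 18.46% to 12.31%), and that rewriting preserves output quality better than direct truncation (4.6 vs. 2.1 points).
On general benchmarks (MMLU, GSM8K, MBPP), SFCoT retains 91.2% of the model's original utility, with no significant performance degradation.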
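
For a sense of scale, the reported rates can be turned into relative reductions. The snippet below only does arithmetic on the figures quoted in the key results; the helper name `relative_reduction` and the variable names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


baseline_asr = 58.97        # ASR without defense (%)
post_hoc_asr = 45.13        # ASR with post hoc filtering (%)
no_consistency_asr = 18.46  # ASR before adding consistency verification (%)
sfcot_asr = 12.31           # ASR with the full SFCoT pipeline (%)

comparisons = [
    ("SFCoT vs. baseline", baseline_asr, sfcot_asr),
    ("post hoc filtering vs. baseline", baseline_asr, post_hoc_asr),
    ("adding consistency verification", no_consistency_asr, sfcot_asr),
]
for name, before, after in comparisons:
    print(f"{name}: {relative_reduction(before, after):.1%} relative reduction")
```

These work out to roughly 79%, 23%, and 33% respectively, so the full pipeline removes about four fifths of successful attacks, while post hoc filtering removes under a quarter.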

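Tech stack: Algorithms: Chain-of-Thought (CoT) parsing, three-tier weighted safety scoring, multi-perspective consistency verification, dynamic intervention (rewrite/truncate). Models: Qwen3-8B (base model), a lightweight deep learning classifier (semantic-tier scoring). Datasets: JailBreakV_28K (jailbreak attack samples); MMLU, GSM8K, MBPP (general capability evaluation). Tools/techniques: regular expressions (regex), prompt engineering, LLM-as-a-Judge (output quality scoring).

Strengths

Active defense: intervening in real time during reasoning prevents harmful content from accumulating across intermediate steps, which is more effective than post hoc filtering.
Fine-grained evaluation: combining lexical, semantic, and policy scores captures complex adversarial attacks and subtle risks.
Balance between safety and utility: rewriting instead of simply truncating reduces the negative impact of the defense on normal reasoning.
Robustness: multi-perspective consistency verification can identify attacks that evade filters through small changes in wording.

Limitations

Dependence on CoT parsing: for closed-source models that do not expose an explicit CoT, extra prompt engineering or fine-tuning is needed to extract reasoning steps, which may introduce parsing errors.
Computational overhead: monitoring every reasoning step, generating variants for consistency verification, and rewriting all add inference latency and compute cost.
Threshold sensitivity: the safety-score thresholds and the consistency-variance threshold need careful tuning for each model and application, which may limit generalization.

Relevance to research direction:

The paper is highly relevant. It falls under “innovation in the technical principles of large models and deep learning” and focuses on improving the safety and reliability of large language models (LLMs). It proposes a novel defense architecture for the core reasoning mechanism of LLMs (chain of thought) and addresses the key safety problem of jailbreak attacks, with strong technical novelty and practical value.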

Authors: Ren Jian Lim, Rushi Dai Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15341v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

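Scoring rationale: The paper develops an LLM-based, multimodal, multi-agent framework for interior space design. Highly relevant keywords: 1) “Large Language Models” (LLMs are used explicitly for spatial reasoning and design generation and are a core component); 2) “Retrieval-Augmented Generation” (RAG is used explicitly to reduce data dependency); 3) “LLM Agents” (the paper builds specialized agents: Reference, Spatial, Interactive, and Grader); 4) “Multi-agent Systems” (multiple cooperating agents form the core of the framework). Other keywords, such as MoE, SLMs, training techniques, inference optimization, and AI for science applications, are not covered or not explicitly mentioned, and score 0.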

!!! tip deepseek-chat TL;DR

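The study proposes an LLM-based, multimodal, multi-agent framework that uses natural-language interaction and retrieval-augmented generation to turn user descriptions into optimized 3D interior designs, improving design communication and participation.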

Abstract

In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule-based systems implement hard-coded spatial constraints that restrict participatory engagement, while data-driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency without requiring task-specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user-centric communication and fosters more inclusive, effective, and resilient design processes. Project page: https://rsigktyper.github.io/AICodesign/

Keywords: Large Language Models, Multi-agent Systems, Retrieval-Augmented Generation, Interior Design, 3D Design Generation, Natural Language Interaction, Spatial Reasoning, User-centric Design

7. VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Authors: Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15030v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
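
The total in the table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that each row's score is simply weight × relevance and that the pass mark follows the formula in the report header (5.0 + 0.8 × total weight). The names `total_score` and `pass_mark` are illustrative and are not taken from the report generator.

```python
# Non-zero relevance values (0-10) copied from the table above.
relevance = {
    "Large Language Models": 10.0,
    "Chain of Thought": 8.0,
    "LLM Agents": 10.0,
    "Tool Use": 10.0,
}

WEIGHT = 1.0        # every keyword in this report has weight 1.0
NUM_KEYWORDS = 27   # number of configured keywords


def total_score(rel, weight=WEIGHT):
    """Assumed scoring rule: sum of weight * relevance over all keywords."""
    return sum(weight * r for r in rel.values())


def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass-mark formula from the report header: base + factor * total weight."""
    return base + factor * total_weight


score = total_score(relevance)
threshold = pass_mark(NUM_KEYWORDS * WEIGHT)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```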

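Scoring rationale: The paper's core contribution is an evaluation of how well multimodal large language models (MLLMs) act as visual agents that use external tool chains to solve complex visual tasks. It is therefore highly relevant (10 points each) to "Large Language Models" (MLLMs extend LLMs), "LLM Agents" (it evaluates models acting as agents), and "Tool Use" (tool use and composition are the central topic). It is relevant to "Chain of Thought" (8 points) because it evaluates multi-step planning and execution trajectories. Other keywords, such as MoE, SLMs, training techniques, inference optimization, and AI-for-science applications, are not directly addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces VTC-Bench, a benchmark for evaluating how well multimodal large language models, acting as visual agents, compose multiple tools to carry out complex tasks. Experiments show that current models have significant limitations in tool adaptation, composition, and planning; the best model reaches only 51% on the benchmark.

Abstract translation (Chinese)

近期研究进展将多模态大语言模型的应用范围从标准视觉问答扩展至利用外部工具处理高级视觉任务。尽管取得这一进步，如何精确执行并有效组合多样化工具以完成复杂任务，仍是持续存在的瓶颈。受限于稀疏的工具集和简单的工具使用轨迹，现有基准测试无法捕捉复杂多样的工具交互，难以评估模型在真实应用条件下的性能。为弥补这一差距，我们提出了VisualToolChain-Bench（VTC-Bench），这是一个旨在评估多模态大语言模型工具使用能力的综合性基准测试。为贴合实际计算机视觉流程，我们的框架集成了32种基于OpenCV的多样化视觉操作。这一丰富的工具集支持广泛的组合方式，使VTC-Bench能够严格评估多工具组合能力以及长视野、多步骤计划的执行能力。为实现精确评估，我们构建了680个精编问题，这些问题按九类认知层次组织，每个问题均配有真实执行轨迹。对19个领先多模态大语言模型的广泛实验揭示了当前模型在视觉代理能力方面的关键局限。具体而言，模型难以适应多样化工具集并将其泛化至未见过的操作，其中表现最佳的Gemini-3.0-Pro模型在我们的基准测试中仅达到51%。此外，多工具组合仍是持续存在的挑战。面对复杂任务时，模型难以制定高效的执行计划，严重依赖狭窄且次优的熟悉功能子集，而非选择最优工具。通过揭示这些根本性挑战，VTC-Bench建立了一个严谨的基线，以指导开发更具泛化能力的视觉代理模型。

Abstract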

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain persistent bottlenecks. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models’ visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
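
To make "multi-tool composition" and "ground-truth execution trajectories" concrete, the sketch below chains a few toy image operations and compares a predicted tool sequence with a reference one. It is only an illustration: the three pure-Python operations stand in for OpenCV calls, and the names `TOOLS`, `run_chain`, and `trajectory_matches` are hypothetical, not part of VTC-Bench.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of 0-255 intensities


def threshold(img: Image, t: int = 128) -> Image:
    """Binarize: pixels >= t become 255, all others 0."""
    return [[255 if p >= t else 0 for p in row] for row in img]


def invert(img: Image) -> Image:
    """Invert intensities."""
    return [[255 - p for p in row] for row in img]


def flip_horizontal(img: Image) -> Image:
    """Mirror each row."""
    return [row[::-1] for row in img]


# Hypothetical tool registry; a real benchmark would expose OpenCV operations.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "threshold": threshold,
    "invert": invert,
    "flip_horizontal": flip_horizontal,
}


def run_chain(img: Image, chain: List[str]) -> Image:
    """Apply the named tools in order, feeding each output to the next tool."""
    for name in chain:
        img = TOOLS[name](img)
    return img


def trajectory_matches(predicted: List[str], reference: List[str]) -> bool:
    """Strict check (an assumption): same tools in the same order."""
    return predicted == reference


result = run_chain([[10, 200], [130, 90]], ["threshold", "invert"])
print(result)
```

Exact sequence matching is the strictest possible comparison; a benchmark could also accept any chain that yields the correct final output.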

Keywords: Multimodal Large Language Models, Visual Agentic Models, Tool Use, Tool Composition, Benchmark Evaluation, OpenCV Operations, Multi-step Planning, Visual Task Solving

8. The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

Authors: Elmira Salari, Maria Claudia Nunes Delfino, Hazem Amamou, José Victor de Souza, Shruti Kshirsagar, Alan Davoust, Anderson Avila Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14838v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

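Scoring rationale: The paper's core topic is how ideological texts affect LLM outputs within a RAG framework, so it is highly relevant to "Retrieval-Augmented Generation" and "Large Language Models" (10 points each). Its treatment of ideological alignment and factuality risks gives it a partial link to "Instruction Tuning/Alignment" and "Hallucination Mitigation/Factuality" (5 points each). The case study uses COVID-19 treatment data, an AI application in a scientific domain, so it is related to "AI for Science" (5 points). Other keywords, such as MoE, quantization, and inference acceleration, are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how retrieved ideological texts shape LLM outputs in a Retrieval-Augmented Generation (RAG) framework. It finds that LLM responses align more closely with the ideology present in the external knowledge, and it stresses the importance of identifying ideological discourse to mitigate bias and the risk of malicious manipulation.

Abstract translation (Chinese)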

本文研究了检索到的意识形态文本对大型语言模型（LLM）输出的影响。尽管近期对理解LLM中意识形态的兴趣有所增加，但在检索增强生成（RAG）的背景下，此问题却鲜受关注。为填补这一空白，我们设计了一个基于关于COVID-19治疗的意识形态负载文本的外部知识源。我们的语料库基于1,117篇学术文章，代表了关于该疾病有争议和受认可治疗的论述。我们提出了一个基于词汇多维分析（Lexical Multidimensional Analysis, LMDA）的语料库语言学框架，以识别语料库内的意识形态。我们要求LLM回答源自三个已识别意识形态维度的问题，并采用两种类型的上下文提示：第一种包含用户问题和意识形态文本；第二种包含问题、意识形态文本及LMDA描述。通过计算词汇和语义表征的余弦相似度，评估参考意识形态文本与LLM回答之间的意识形态对齐程度。结果表明，基于意识形态检索文本的LLM回答更倾向于与外部知识中遇到的意识形态保持一致，而增强型提示进一步影响了LLM的输出。我们的发现强调了在RAG框架内识别意识形态论述的重要性，这不仅能减轻非预期的意识形态偏见，也能降低恶意操纵此类模型的风险。

Abstract

This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideological loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked to answer questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs’ responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs’ responses based on ideological retrieved texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs’ outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
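
The alignment measure named in the abstract, cosine similarity over lexical representations, can be illustrated as follows. This is a minimal sketch that assumes a plain bag-of-words vector; the paper's actual lexical and semantic representations are not specified here, and the helper names `bag_of_words` and `cosine_similarity` are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase whitespace tokenization; a stand-in for a real lexical representation."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = bag_of_words("the treatment is endorsed by clinical trials")
response = bag_of_words("clinical trials show the treatment is effective")
print(round(cosine_similarity(reference, response), 3))
```

For a semantic variant, the same `cosine_similarity` idea applies to dense sentence embeddings instead of word counts.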

Keywords: Retrieval-Augmented Generation, Large Language Models, ideological bias, COVID-19 treatments, corpus linguistics, lexical multidimensional analysis, ideological alignment, malicious manipulation

9. CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Authors: Taeyun Roh, Wonjune Jang, Junha Jung, Jaewoo Kang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15421v1

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

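Scoring rationale: CLAG is a clustering-based memory organization framework for small language model (SLM) agents that targets knowledge dilution and interference from irrelevant context in a global memory pool. It is therefore highly relevant to "Small Language Models" (10 points), since SLMs are the core object of study, and to "LLM Agents" (10 points), since the framework is designed for agents. It is related to "Retrieval-Augmented Generation" (8 points) because it involves memory retrieval to support knowledge reuse, and partially related to "Large Language Models" (5 points) because LLM agents appear as background and contrast while the focus is on SLMs. Other keywords, such as MoE, Scaling Laws, and Alignment, have no direct connection to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets the knowledge dilution and irrelevant-context interference that small language model agents face with a global memory pool. It proposes CLAG, a clustering-based adaptive memory organization framework that improves answer quality and robustness through semantic clustering and localized evolution.

Abstract translation (Chinese)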

大型语言模型智能体高度依赖外部记忆系统以支持知识复用与复杂推理任务。然而，现有记忆系统大多将经验存储于单一的全局检索池中，这可能导致存储的知识逐渐被稀释或污染。该问题对于小型语言模型尤为突出，因其极易受到无关上下文干扰。本文提出CLAG，一种基于聚类的智能体记忆框架，使小型语言模型智能体能够通过主动聚类来组织记忆。CLAG采用由小型语言模型驱动的路由机制，将新增记忆分配至语义连贯的聚类中，并自主生成包含主题摘要与描述性标签的聚类专属档案，使每个聚类成为独立的功能单元。通过在这些结构化邻域内进行局部演化，CLAG有效降低了跨主题干扰并提升了内部记忆密度。在检索阶段，该框架采用两阶段流程：首先通过聚类档案筛选相关聚类以排除干扰项并缩减搜索空间，随后进行精确检索。在三个小型语言模型骨干网络及多个问答数据集上的实验表明，相较于现有智能体记忆系统，CLAG在保持轻量高效的同时，持续提升了答案质量与系统鲁棒性。

Abstract

Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
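
The two-stage retrieval idea can be sketched in a few lines. This is only an illustration under stated assumptions: token overlap (Jaccard) replaces whatever scoring CLAG actually uses, the second stage (ranking memories inside the selected clusters) is an assumed reading of "two-stage", and the `clusters` layout and function names are hypothetical.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between token sets; a toy relevance score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical memory store: each cluster has a profile (summary/tags) and memories.
clusters = {
    "cooking": {
        "profile": "recipes cooking food kitchen",
        "memories": ["pasta needs salted boiling water", "bake bread at high heat"],
    },
    "travel": {
        "profile": "travel flights hotels trips",
        "memories": ["book flights early for cheaper travel",
                     "hotels near the station fill up fast"],
    },
}


def retrieve(query: str, store: dict, top_clusters: int = 1, top_memories: int = 1) -> list:
    # Stage 1: keep only the clusters whose profile best matches the query.
    ranked = sorted(store, key=lambda name: overlap(query, store[name]["profile"]),
                    reverse=True)[:top_clusters]
    # Stage 2 (assumed): rank memories inside the surviving clusters only.
    candidates = [m for name in ranked for m in store[name]["memories"]]
    return sorted(candidates, key=lambda m: overlap(query, m), reverse=True)[:top_memories]


print(retrieve("cheap flights for travel", clusters))
```

Filtering by profile first means memories from unrelated clusters never reach the final ranking, which is the distractor-exclusion effect the abstract describes.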

Keywords: Small Language Models, LLM Agents, Memory Organization, Clustering, Retrieval-Augmented Generation, Knowledge Reuse, Context Interference, QA Datasets

10. ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Authors: Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jingsong Su Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14903v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	8.0/10	8.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

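Scoring rationale: The paper's core topic is applying LLMs to SimulMT, so it is highly relevant to "Large Language Models" (10 points). The method relies on the KV cache for efficient decoding, which relates it to "KV Cache Compression" and "Inference Acceleration" (8 points each). The paper also describes a fine-tuning strategy, giving a partial link to "Post-training" (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Alignment, are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the dilemma between decoding efficiency and positional consistency that arises when decoder-only LLMs are applied to simultaneous machine translation. It proposes the ExPosST framework, which uses explicit position allocation and policy-consistent fine-tuning to support efficient, positionally consistent simultaneous translation across multiple language pairs.

Abstract translation (Chinese)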

大语言模型（LLMs）近期在同步机器翻译（SimulMT）任务中展现出有潜力的性能。然而，将仅解码器架构的LLMs应用于SimulMT时，会引入位置不匹配问题，导致解码效率与位置一致性之间形成两难困境。现有方法通常依赖于特定的位置编码或精心设计的提示方案，因而难以同时实现推理效率、位置一致性以及广泛的模型兼容性。本研究提出ExPosST，一种通过显式位置分配来解决此困境的通用框架。ExPosST为输入源语言词元预留固定的位置槽，使得在不同位置编码方法下均可利用KV缓存实现高效解码。为进一步弥合微调与推理之间的差距，我们引入了一种策略一致的微调方法，使训练过程与推理时的解码行为保持一致。跨多个语言对的实验表明，ExPosST能有效支持多种策略下的同步翻译。

Abstract

Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
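
One way to picture "reserving fixed positional slots for incoming source tokens" is shown below. This is a sketch of one possible reading, not the paper's implementation: source tokens take positions inside a reserved range, and target tokens start after that range, so a newly read source token does not shift the positions of target tokens already in the KV cache. The function name and the `reserved_slots` parameter are hypothetical.

```python
from typing import List, Tuple


def explicit_positions(num_source_read: int, num_target_written: int,
                       reserved_slots: int) -> Tuple[List[int], List[int]]:
    """Return (source position ids, target position ids).

    Source tokens occupy [0, reserved_slots); target tokens start at
    reserved_slots, independent of how many source tokens have arrived.
    """
    if num_source_read > reserved_slots:
        raise ValueError("source length exceeds the reserved slots")
    source_ids = list(range(num_source_read))
    target_ids = list(range(reserved_slots, reserved_slots + num_target_written))
    return source_ids, target_ids


# After reading 2 source tokens and writing 1 target token:
before = explicit_positions(2, 1, reserved_slots=8)
# A third source token arrives; the existing target token keeps its position.
after = explicit_positions(3, 1, reserved_slots=8)
print(before, after)
```

With plain concatenation, the same target token would move from position 2 to position 3 when the third source token arrives, which is the positional mismatch that forces cached keys and values to be recomputed.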

Keywords: Large Language Models, Simultaneous Machine Translation, Positional Mismatch, KV Cache, Inference Efficiency, Explicit Position Allocation, Policy-consistent Fine-tuning, Decoder-only LLMs

11. Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14864v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

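Scoring rationale: The paper's core topic is LLM agents for e-commerce, which directly involves LLMs and agent techniques, so "Large Language Models" and "LLM Agents" each receive 10 points. It trains with reinforcement learning using tool-wise rewards, which is highly relevant to "Tool Use" (10 points). Other keywords, such as MoE, quantization, inference acceleration, and hallucination mitigation, are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of accurately capturing user preferences from long-term conversations with LLM agents in e-commerce. It proposes the unified Shopping Companion framework, trained with a dual-reward reinforcement learning strategy; on a benchmark covering 1.2 million real-world products, a lightweight LLM trained this way consistently outperforms strong baselines.

Abstract translation (Chinese)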

在电子商务领域，大语言模型智能体在推荐、预算规划与捆绑交易等购物任务中展现出潜力，其中从长期对话中准确捕捉用户偏好至关重要。然而，实现这一潜力面临两大挑战：(1) 缺乏用于评估长期偏好感知购物任务的基准测试体系；(2) 由于现有设计将偏好识别与购物辅助视为独立模块，导致缺乏端到端优化。本文提出一个包含长期记忆机制的新型基准测试，涵盖超过120万真实商品的两类购物任务，并推出“购物伴侣”——一个支持用户干预、能协同处理记忆检索与购物辅助的统一框架。为训练此类能力，我们开发了双奖励强化学习策略，通过工具级奖励机制应对多轮交互中固有的稀疏与不连续奖励问题。实验结果表明，即使最先进的模型（如GPT-5）在我们的基准测试中成功率也低于70%，凸显了该领域的重大挑战。值得注意的是，基于“购物伴侣”框架训练的轻量化大语言模型持续超越现有强基线模型，实现了更优的偏好捕捉与任务执行效果，这验证了我们统一设计的有效性。

Abstract

In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.

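Keywords: LLM agents, e-commerce, shopping tasks, long-term memory, preference capture, reinforcement learning, tool-wise rewards, benchmark

In-depth analysis:

Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

Summary:

The paper addresses the difficulty LLM agents have in accurately capturing user preferences from long-term conversations in e-commerce, and proposes a new benchmark together with the SHOPPING COMPANION framework. The benchmark covers 1.2 million real-world products and is designed to evaluate long-term, memory-aware shopping tasks. SHOPPING COMPANION is a unified framework that jointly optimizes long-term memory retrieval and shopping assistance while supporting user intervention. To handle sparse rewards in multi-turn interactions, the paper develops a dual-reward reinforcement learning strategy with tool-wise rewards. Experiments show that even GPT-5 achieves a success rate below 70% on the benchmark, whereas the trained lightweight model outperforms strong baselines in both preference capture and task success rate, validating the unified design.

Contributions:

Proposes a new e-commerce benchmark that combines a long-term memory setup, real-world shopping tasks, and user intervention, filling a gap in existing work.
Designs the unified SHOPPING COMPANION framework, which treats memory retrieval and shopping assistance as jointly optimized components rather than separate modules, enabling end-to-end optimization.
Develops a dual-reward reinforcement learning strategy that uses tool-wise rewards to handle sparse and discontinuous feedback in multi-turn, tool-augmented interactions.

Method

!!! info

The paper builds a simulated environment containing 1.2 million real-world products, together with two search engines (one over memory, one over products) and five dedicated tools. The task is modeled as a partially observable Markov decision process (POMDP) with a two-stage agent framework: stage one performs preference identification and memory retrieval, and stage two carries out the shopping task. For training, a dual-reward reinforcement learning strategy defines reward functions for the different stages and task types, and evaluation follows the LLM-as-Judge paradigm.

Key results:

Existing state-of-the-art models (such as GPT-5) achieve success rates below 70% on the new benchmark, which indicates how challenging the domain is.
The lightweight model trained with SHOPPING COMPANION consistently outperforms strong baselines in preference capture and task performance.
User intervention, ranging from minimal hints to detailed corrections, reliably improves the final outcome.

Tech stack: Algorithms: reinforcement learning (RL), POMDP (partially observable Markov decision process), BM25, cosine similarity. Models: LLMs (GPT-5 for generation and evaluation, a lightweight LLM for training), all-MiniLM-L6-v2 (embedding model). Tools: Pyserini (retrieval framework), custom API tools (mem_search, product_search, etc.).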
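
For readers unfamiliar with BM25, which the tech stack lists for retrieval, the sketch below scores a tiny product list against a query. It is a self-contained pure-Python version of the standard Okapi BM25 formula with commonly used defaults (`k1=1.5`, `b=0.75`); it does not use Pyserini, and the product strings and parameter values are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


products = ["red running shoes", "blue cotton shirt", "wireless running headphones"]
scores = bm25_scores("red shoes", products)
best = products[max(range(len(products)), key=scores.__getitem__)]
print(best)
```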

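Strengths

Fills the gap left by existing benchmarks in combining long-term memory with end-to-end shopping tasks.
The end-to-end optimization strategy addresses the separation between the memory module and task execution in conventional approaches.
The dual-reward RL strategy handles the sparse-reward challenge of multi-turn interactions.
Built on large-scale real product data (1.2 million products), which gives it strong practical relevance.

Limitations

User instructions and preferences in the benchmark are synthesized by an LLM and may differ subtly from genuine human behavior.
Evaluation relies on LLM-as-Judge (GPT-5); despite high agreement with humans, evaluation bias remains possible.
The paper focuses on specific types of shopping tasks (such as single-item purchases and add-on deals) and may not cover every e-commerce scenario.

Relevance to the research direction:

This paper is applied research on large language models (LLMs) in e-commerce, covering agents, long-term memory mechanisms, and reinforcement learning optimization. It shows how LLM techniques such as memory augmentation and RL optimization can be applied in a specific vertical domain (e-commerce). Although it does not fall within a scientific domain such as biomedicine, it makes notable contributions to LLM agent architecture and end-to-end optimization. It therefore matches the research direction "innovation in the technical principles of large models and deep learning" and is both highly relevant and innovative.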

12. Questionnaire Responses Do not Capture the Safety of AI Agents

Authors: Max Hellrigel-Holderbaum, Edward James Young Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14417v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

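Scoring rationale: The paper's core argument is a critique of evaluating LLM safety through questionnaire-style prompts, which it considers unsuitable for assessing AI agents (LLM agents) in real deployments; it also discusses a similar problem in AI alignment methods. It is therefore highly relevant to "Large Language Models", "LLM Agents", and "Alignment" (10 points each). Other keywords are not addressed (0 points).

!!! tip deepseek-chat TL;DR

The paper argues that evaluating the safety of large language models (LLMs) with questionnaire-style prompts is flawed because it cannot capture the risks of AI agents (LLM agents) in real deployments, and that current AI alignment methods share a structurally similar problem.

Abstract translation (Chinese)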

随着人工智能系统能力不断提升，衡量其安全性与人类价值观的对齐程度变得至关重要。人工智能研究领域正迅速发展出专门致力于此类评估的方法。然而，当前大多数进展可能并不适用于评估现实世界部署中的人工智能系统。标准方法采用问卷式提示，让大型语言模型在假设场景中描述其价值观或行为。这些方法仅关注未经增强的大型语言模型，未能评估实际可能执行相关行为、从而带来更大风险的人工智能体。大型语言模型对问卷式提示所描述场景的参与方式，与基于同款大型语言模型构建的智能体存在显著差异，这体现在输入内容、可行行动、环境交互及内部处理机制等多个层面的分歧上。因此，大型语言模型对场景描述的反应很可能无法代表相应智能体的实际行为。我们进一步指出，此类评估对大型语言模型准确报告其反事实行为的能力和倾向性做出了强假设，导致其缺乏结构效度，不足以评估现实环境中人工智能系统的风险。我们认为，当前的人工智能对齐方法也存在结构上相同的问题。最后，我们探讨了如何通过正视这些缺陷来改进安全评估与对齐训练。

Abstract

As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire-style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs’ engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs’ responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents’ behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real-world contexts as they lack construct validity. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.

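Keywords: AI safety, AI alignment, large language models, LLM agents, safety assessment, questionnaire-style prompts, construct validity, real-world deployment

In-depth analysis:

Questionnaire Responses Do not Capture the Safety of AI Agents

Summary:

The paper examines the validity of current AI safety evaluation methods, in particular questionnaire-style evaluation of large language models (LLMs). The authors point out that existing methods focus on the text answers of unaugmented LLMs in hypothetical scenarios, which cannot assess the safety of LLM-based AI agents in real deployments. The paper analyzes the marked differences between LLMs and AI agents in inputs, actions, environmental interaction, and internal processing, and argues that questionnaire responses lack construct validity and cannot accurately predict an agent's actual behavioral tendencies. The authors further argue that the same problem affects current AI alignment methods, and they propose ways to improve safety evaluation and alignment training, stressing the need to evaluate agent behavior directly instead of relying only on a model's descriptions of hypothetical situations.

Contributions:

Draws a clear distinction between LLMs and AI agents in safety evaluation and identifies the fundamental flaw of questionnaire-style evaluation when applied to agents.
Formulates two key assumptions, "scaffold generalization" and "situational generalization", and argues that current evaluation methods cannot satisfy them.
Criticizes current AI alignment methods (such as RLHF) for relying too heavily on model self-reports or judgments about hypothetical situations, with little constraint on real behavior.
Calls for a shift from evaluating static model outputs to evaluating dynamic agent behavior, with emphasis on actual interaction outcomes over text descriptions.

Method

!!! info

The paper relies mainly on theoretical analysis and conceptual argument. By contrasting questionnaire-style assessments (abbreviated QAs) with the characteristics of AI agents deployed in the real world, the authors unpack the implicit assumptions behind QAs. Drawing on the concept of construct validity from psychology, the paper argues that current evaluation methods are logically inadequate, and it cites existing benchmarks (such as MACHIAVELLI and TRUSTLLM) as concrete case studies.

Key results:

Questionnaire-style evaluation cannot capture the safety of AI agents, because there is a large gap between how LLMs respond to text scenarios and how agents actually behave.
QAs rest on an unreliable assumption, namely that models can accurately report how they would behave in counterfactual situations, and therefore lack construct validity.
Differences between LLMs and AI agents in input modalities, action space, environmental interaction, and internal processing prevent text-based evaluation from generalizing to real deployment environments.
Current AI alignment techniques face the same limitation, so new evaluation paradigms are needed that test agents' behavioral tendencies directly.

Tech stack: large language models, AI agents, scaffolding, questionnaire-style assessment, construct validity, reinforcement learning from human feedback (RLHF)

Strengths

Makes a sharp point about a blind spot in current AI safety evaluation: too much attention to model text answers and too little to the actual behavioral risks of agents.
Clearly structured: it defines the core assumptions (generalization capabilities) and refutes them one by one in a rigorous argument.
Broadly applicable: the critique covers safety evaluation and extends to AI alignment training.
Draws on evaluation theory from psychology (construct validity), giving AI safety evaluation a cross-disciplinary theoretical perspective.

Limitations

The paper stays at the level of theoretical critique and does not propose a concrete, deployable alternative evaluation scheme or a new benchmark.
Although it identifies the problem, it offers only directional suggestions on how to build systems that evaluate real agent behavior, without technical detail.
The argument rests mainly on logical reasoning and lacks large-scale empirical data that directly compares questionnaire results with actual agent behavior.

Relevance to the research direction:

This paper is highly relevant to "innovation in the technical principles of large models and deep learning". It does not propose a new algorithm or architecture, but it offers a thorough reassessment of the safety evaluation paradigm for large models (LLMs) and the systems derived from them (AI agents). It addresses a key technical challenge in the shift of deep learning models from "cognition" to "action": how to ensure the safety of agents that can use tools and act autonomously. This matters for understanding the limitations and future direction of large-model technology, and the paper can be seen as a critical contribution to the existing technical paradigm.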

13. MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-E

Authors: Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng, Jing Li, Harry Qin, Vivian Hui Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14265v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

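Scoring rationale: The paper's core subject is the use of large language models (LLMs) in medicine, specifically the privacy-utility trade-off that arises when retrieval-augmented generation (RAG) connects a model to external clinical databases. It is therefore highly relevant to "Large Language Models" and "Retrieval-Augmented Generation" (10 points each), and it counts as an "AI for Science" application in biomedical informatics (Bioinformatics) (10 points). The paper does not address the technical principles or innovations behind the other keywords, such as MoE, quantization, inference acceleration, or alignment training, so those score 0.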
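The scoring arithmetic behind this entry can be sketched as follows. This is a minimal reconstruction under assumptions, not the report generator's actual code: it assumes each keyword contributes weight × relevance, capped at the configured per-keyword maximum of 15, and it uses the pass-mark formula from the report configuration (5.0 + 0.8 × total weight). All function and variable names are hypothetical.

```python
# Hypothetical reconstruction of the report's scoring rule (names are illustrative).
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration
TOTAL_WEIGHT = 27.0     # 27 keywords, each weighted 1.0


def passing_score(total_weight: float) -> float:
    """Pass mark from the configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum weight x relevance over keywords, capping each contribution (assumed rule)."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero rows for MedPriv-Bench: LLMs, RAG, AI for Science (all other rows are 0.0).
medpriv_rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 10.0)]
score = paper_score(medpriv_rows)
threshold = passing_score(TOTAL_WEIGHT)
print(score, round(threshold, 1), score >= threshold)
```

Under these assumptions the three 10-point rows reproduce the reported 30.0, which clears the 26.6 pass mark.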

!!! tip deepseek-chat TL;DR

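The paper proposes MedPriv-Bench, the first benchmark dedicated to evaluating the trade-off between privacy preservation and clinical utility for large language models in medical open-ended question answering. It synthesizes sensitive medical contexts and queries with a multi-agent, human-in-the-loop pipeline, uses a RoBERTa-NLI model as an automated judge of data leakage, and finds a pervasive privacy-utility trade-off across 9 representative LLMs.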

摘要翻译

检索增强生成（RAG）技术的最新进展使得大语言模型（LLM）能够基于临床证据生成输出。然而，将LLM与外部数据库连接会引入上下文泄露风险：这是一种微妙的隐私威胁，即使没有明确的标识符，独特的医疗细节组合也可能导致患者被重新识别。尽管存在《健康保险携带和责任法案》（HIPAA）和《通用数据保护条例》（GDPR）等严格法规，当前医疗领域的基准测试仍过度侧重于准确性，而忽视了此类隐私问题。为填补这一空白，我们提出了MedPriv-Bench，这是首个专门设计用于联合评估医疗开放式问答中隐私保护与临床效用的基准测试。我们的框架采用多智能体、人在回路（human-in-the-loop）的流程，合成敏感的医疗上下文和临床相关查询，以创建真实的隐私压力。我们建立了一个标准化评估协议，利用预训练的RoBERTa-自然语言推理（NLI）模型作为自动化评判器来量化数据泄露，其与人类专家的平均一致性达到85.9%。通过对9个代表性LLM的广泛评估，我们揭示了普遍存在的隐私-效用权衡问题。我们的研究结果强调了在隐私敏感环境中，需要特定领域的基准测试来验证医疗人工智能系统的安全性与有效性。

摘要 (Abstract)

Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
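The abstract reports 85.9% average alignment between the automated NLI judge and human experts. A minimal sketch of one way such a figure could be computed is shown below, assuming "alignment" means simple per-item agreement on a binary leak / no-leak label. The paper may use a different definition, and the labels here are invented purely for illustration.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the automated judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("at least one labelled item is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Invented labels: True means "this response leaks patient context".
judge = [True, False, True, True, False, False, True, False]
human = [True, False, False, True, False, False, True, True]
rate = agreement_rate(judge, human)
print(f"{rate:.1%}")
```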

Keywords: Large Language Models, Retrieval-Augmented Generation, Medical AI, Privacy-Utility Trade-off, Benchmark, Clinical Evidence, Data Leakage, Healthcare

14. Effective Distillation to Hybrid xLSTM Architectures

Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter Journal/source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15590v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	10.0/10	10.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

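Scoring rationale: The paper's core subject is knowledge distillation for large language models (LLMs), distilling quadratic-attention LLMs into xLSTM-based linearized architectures, so it is highly relevant to "Large Language Models" (10 points). It distills instruction-tuned models, which gives some connection to "Instruction Tuning" (5 points). It proposes an additional stage that merges linearized experts, which is highly relevant to "Model Merging" (10 points). Because it merges experts, it has some connection to "Mixture of Experts" (5 points). The remaining keywords, for example small models, pre-training, and inference acceleration, are not mentioned in the abstract and score 0.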

!!! tip deepseek-chat TL;DR

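The paper studies how an effective distillation pipeline can distill quadratic-attention large language models (LLMs) into xLSTM-based linearized architectures, with lossless distillation as the stated goal. It adds an expert-merging stage, and the distilled students recover most of the teacher's performance and exceed it on some downstream tasks.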

摘要翻译

已有大量研究尝试将基于二次注意力机制的大语言模型（LLM）蒸馏至次二次线性化架构中。然而，尽管研究广泛，此类蒸馏模型在多种下游任务上仍常常无法达到其教师大语言模型的性能水平。我们设定了无损蒸馏的目标，并将其定义为学生模型与教师模型在任务集上经容差校正的“胜平率”。为此，我们为基于xLSTM的学生模型引入了一套高效的蒸馏流程。我们提出了一个额外的合并阶段，将独立线性化的专家模型整合为单一模型。通过从Llama、Qwen和Olmo系列中蒸馏基础模型及指令微调模型，我们验证了该流程的有效性。在许多设定下，我们基于xLSTM的学生模型恢复了教师模型的大部分性能，甚至在某些下游任务上实现了超越。我们的贡献是朝着为基于Transformer的大语言模型提供更节能、更具成本效益的替代方案迈出的重要一步。

摘要 (Abstract)

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher’s performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

Keywords: knowledge distillation, large language models, xLSTM, linearized architectures, model merging, instruction tuning, energy-efficient, downstream tasks

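In-depth analysis:

Effective Distillation to Hybrid xLSTM Architectures

Summary:

To address the performance drop that occurs when large language models (LLMs) with quadratic-complexity attention are distilled into linear architectures, the paper sets "lossless distillation" as its goal and defines a tolerance-corrected Win-and-Tie rate as the evaluation criterion. The authors present an effective xLSTM-based distillation pipeline whose hybrid architecture combines mLSTM with sliding window attention (SWA) and sink tokens. The method comprises four steps (weight transfer, hidden-state matching, projection merging, and knowledge distillation) and adds a "merging stage" that fuses independently distilled domain experts in weight space. Experiments that distill Llama, Qwen, and Olmo models show that the xLSTM students recover most of the teacher's performance on generative tasks such as math and code and exceed the teacher on some tasks, an important step toward efficient alternatives to Transformers.

Innovations:

Introduces the notion of "lossless distillation" and uses a tolerance-corrected Win-and-Tie rate (Cα) to assess strictly whether a student can replace its teacher.
Designs a new distillation pipeline with a distinctive "merging stage", which lets independently distilled, domain-specific linearized students be combined into one model through weight-space merging.
Develops a hybrid xLSTM architecture that combines mLSTM (for global linear context) with sliding window attention (SWA) and sink tokens (for local context and stability).
Adapts mLSTM specifically for the distillation setting, for example by removing the normalization layer before the output projection to improve student-teacher alignment and by using per-head scalar output gates.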
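The tolerance-corrected Win-and-Tie rate named above can be illustrated with a small sketch. The exact definition of Cα is not given in this report, so the code assumes one plausible reading: the fraction of tasks on which the student's score is at least the teacher's score minus a tolerance α. The function name, the task names, and all scores are hypothetical.

```python
def win_and_tie_rate(student_scores, teacher_scores, alpha=0.0):
    """Fraction of tasks where the student is within `alpha` of the teacher or better.

    Assumed definition for illustration; scores are "higher is better".
    """
    if set(student_scores) != set(teacher_scores):
        raise ValueError("student and teacher must be scored on the same tasks")
    if not teacher_scores:
        raise ValueError("at least one task is required")
    wins_or_ties = sum(
        student_scores[task] >= teacher_scores[task] - alpha for task in teacher_scores
    )
    return wins_or_ties / len(teacher_scores)


# Made-up benchmark scores, not taken from the paper.
teacher = {"math": 60.0, "code": 55.0, "stem": 70.0, "chat": 80.0}
student = {"math": 59.0, "code": 57.0, "stem": 66.0, "chat": 80.0}

for alpha in (0.0, 1.0, 5.0):
    print(alpha, win_and_tie_rate(student, teacher, alpha))
```

Raising the tolerance can only keep the rate the same or increase it, which is why a comparison across several tolerance levels is more informative than a single number.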

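Method

!!! info

The proposed method has four main steps: (1) transfer the weights of the pre-trained Transformer teacher to the student and introduce adapters and gates; (2) match hidden states to align intermediate-layer representations; (3) merge the query and key projections to streamline the architecture; (4) perform knowledge distillation. Architecturally, standard self-attention layers are replaced with hybrid mLSTM-SWA blocks. The study also uses a modular linearization strategy: experts for different domains are distilled independently and then combined through weight-space merging to improve overall performance.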
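Weight-space merging, as described in the modular linearization strategy above, can be sketched in its simplest form as a weighted average of parameters. This report does not state which merging rule the paper uses, so uniform averaging is an assumption here, and the parameter names and values are invented. Parameters are plain Python lists to keep the sketch dependency-free.

```python
def merge_experts(experts, weights=None):
    """Weighted average of parameter dicts (name -> list of floats) with identical shapes."""
    if not experts:
        raise ValueError("at least one expert is required")
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)  # uniform averaging (assumed)
    if len(weights) != len(experts):
        raise ValueError("one weight per expert is required")
    names = set(experts[0])
    if any(set(expert) != names for expert in experts):
        raise ValueError("all experts must share the same parameter names")

    merged = {}
    for name in experts[0]:
        vectors = [expert[name] for expert in experts]
        merged[name] = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(vectors[0]))
        ]
    return merged


# Two hypothetical domain experts distilled from the same initial student.
math_expert = {"layer0.w": [1.0, 2.0], "layer0.b": [0.0]}
code_expert = {"layer0.w": [3.0, 4.0], "layer0.b": [2.0]}
merged = merge_experts([math_expert, code_expert])
print(merged)
```

Averaging of this kind is only meaningful when the experts share an architecture and a common starting point, which fits the description of experts that are linearized independently from the same teacher.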

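Key results:

The xLSTM-based students recover most of the teacher's performance on generative benchmarks covering math, code, STEM, and chat.
On some downstream tasks, the xLSTM students even exceed the original Transformer teacher.
Compared with existing linearization baselines (such as QRWKV7 and Mamba-in-Llama), the method achieves higher Win-and-Tie rates at every tolerance level considered.
The results indicate that linearized models can serve as effective, energy-efficient, lower-cost replacements for Transformer-based LLMs.

Tech stack: xLSTM (Extended LSTM), mLSTM (Matrix LSTM), Sliding Window Attention (SWA), Knowledge Distillation, Weight-space Merging, Llama, Qwen, Olmo model families, Linear Attention mechanisms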

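Strengths

Addresses the weak performance of existing linearized models on hard generative tasks such as mathematical reasoning and code synthesis.
The "merging stage" makes distillation modular, which improves flexibility and the generalization of the final model.
Proposes a stricter evaluation metric (the Win-and-Tie rate) that measures more accurately how reliable a model is as a drop-in replacement.
Combines the long-range memory of mLSTM with the local modeling strength of SWA in a well-motivated architecture.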
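The local-context half of the hybrid design, sliding window attention with sink tokens, comes down to a particular attention mask. The sketch below builds such a mask in plain Python. It shows the general technique rather than the paper's implementation, and the window size and number of sink tokens are arbitrary illustrative values.

```python
def swa_sink_mask(seq_len, window, num_sinks):
    """mask[i][j] is True when query position i may attend to key position j.

    A key is visible if it is not in the future and is either inside the last
    `window` positions or one of the first `num_sinks` sink tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = i - j < window
            is_sink = j < num_sinks
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


mask = swa_sink_mask(seq_len=6, window=2, num_sinks=1)
rows = ["".join("x" if allowed else "." for allowed in row) for row in mask]
print("\n".join(rows))
```

Each query sees at most `window + num_sinks` keys, so the attention cost per token stays constant as the sequence grows; in the hybrid design described above, longer-range information is left to the mLSTM path.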

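Limitations

Although performance is close, the students may still fall short of the teacher in some extreme cases.
The merging stage and the hybrid architecture may add engineering complexity to training and deployment.
The paper text available for this analysis was truncated, so the detailed discussion of the attention mechanism and some experimental details may be missing from this summary.
xLSTM and its hybrid variants may be less well supported and optimized on specific hardware accelerators than the mature Transformer.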

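Relevance to the research direction:

The paper is highly relevant to the research keywords. It falls under "innovation in the technical principles of large models and deep learning": it targets the computational-complexity bottleneck of the Transformer architecture and proposes a hybrid xLSTM architecture together with a distillation method. It touches core large-model techniques (attention replacement, model distillation, linear RNNs) and is strongly innovative, which matches the user's interest in new technical principles.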

15. AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulat

Authors: Yusuke Takagi, Motonari Kambara, Daichi Yashima, Koki Seno, Kento Tokura, Komei Sugiura Journal/source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15046v1

Score: 29.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

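Scoring rationale: The paper proposes a lightweight Vision-Language-Action model (AnoleVLA) for robotic manipulation in resource-constrained environments. Its core innovation is replacing the standard Transformer backbone with a deep state space model to process multimodal sequences efficiently. Relevance to the keywords: (1) some connection to "Large Language Models" (5 points), because VLA models are usually built on LLM architectures, although the paper does not discuss LLMs explicitly; (2) high relevance to "Small Language Models" (8 points), because the stated goal is a lightweight model for resource-constrained deployment on mobile manipulation robots; (3) high relevance to the "Quantization OR Model Compression" and "Inference Acceleration" keywords (8 points each), because the core contribution is a lighter model and faster inference (roughly a threefold speed-up) achieved through the architecture itself, a deep state space model; the abstract does not mention quantization; (4) the other keywords (such as MoE, Scaling Laws, and Alignment) are not covered and score 0.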

!!! tip deepseek-chat TL;DR

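The study addresses the computational cost of language-guided robotic manipulation in resource-constrained environments. It proposes AnoleVLA, a lightweight Vision-Language-Action model built on a deep state space model, which in real-world experiments achieves a task success rate 21 points higher than a representative large-scale VLA while running inference about three times faster.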

摘要翻译

本研究针对语言引导的机器人操作问题展开探讨，该任务要求机器人基于视觉观察与自然语言指令对多种物体进行操作。对于在人类环境中运行的服务机器人而言，此项任务至关重要，需同时满足安全性、高效性及任务层级的泛化能力。尽管视觉-语言-动作模型（Vision-Language-Action models, VLAs）在此类任务中已展现出强大性能，但由于标准Transformer主干网络的计算成本较高，其在资源受限环境中的部署仍面临挑战。为突破此限制，我们提出AnoleVLA——一种轻量级VLA模型，其采用深度状态空间模型以实现多模态序列的高效处理。该模型凭借其轻量化架构与快速序列状态建模能力，能够高效处理视觉与文本输入，从而使机器人能够生成流畅的运动轨迹。我们在仿真环境与物理实验中均对所提方法进行了评估。值得注意的是，在真实世界评估中，AnoleVLA的任务成功率较代表性大规模VLA模型高出21个百分点，同时推理速度提升约三倍。

摘要 (Abstract)

In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.
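The efficiency argument for a state space backbone rests on its recurrence: each step updates a fixed-size state instead of attending over the whole history. The sketch below shows the textbook discrete linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t with a diagonal A and scalar inputs. It is a generic illustration, not AnoleVLA's architecture, and every number in it is arbitrary.

```python
def ssm_scan(A, B, C, inputs):
    """Run a diagonal linear state space model over a sequence of scalar inputs.

    A, B, C are lists of equal length (the state size); the state starts at zero.
    """
    state = [0.0] * len(A)
    outputs = []
    for x in inputs:
        # State update: h_t = A * h_{t-1} + B * x_t (elementwise, since A is diagonal).
        state = [a * h + b * x for a, h, b in zip(A, state, B)]
        # Readout: y_t = C . h_t
        outputs.append(sum(c * h for c, h in zip(C, state)))
    return outputs


# Impulse response of a two-dimensional state with decay rates 0.5 and 0.25.
outputs = ssm_scan(A=[0.5, 0.25], B=[1.0, 1.0], C=[1.0, 2.0], inputs=[1.0, 0.0, 0.0])
print(outputs)
```

The work per step depends only on the state size, not on how many tokens came before, which is the property that makes such models attractive on resource-constrained hardware.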

Keywords: Vision-Language-Action model, lightweight model, deep state space model, mobile manipulation, robotic manipulation, inference acceleration, resource-constrained environments, multimodal sequence processing

📋 All papers

1. ✅ Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

Authors: Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi Journal/source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14458v1

Score: 69.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

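The paper targets the unreliability of retrieval-augmented LLMs on fact-seeking questions whose answers depend on up-to-date or conflicting information. It proposes a modular framework that explicitly separates planning from fact retrieval and answer synthesis, and shows that supervised planner training improves both accuracy and latency.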

摘要翻译

当答案依赖于最新或相互冲突的信息时，基于大语言模型（LLMs）的事实探寻问答仍不可靠。尽管检索增强型和工具调用型LLMs减少了幻觉问题，但它们通常依赖于隐式规划，导致工具使用效率低下。我们提出一种模块化框架，将规划与事实检索及答案合成明确分离。通过师生框架训练一个轻量级学生规划器，以生成由抽象推理步骤和可搜索事实请求组成的结构化分解。监督信号仅包含规划轨迹和事实请求，不提供事实答案或检索证据。在推理阶段，规划器生成计划，而经过提示工程设计的模块则执行检索和响应合成。我们在SEAL-0（一个针对搜索增强型LLMs的极具挑战性的基准测试）上评估所提出的框架。结果表明，与单一推理模型和基于提示的工具增强框架相比，监督式规划在准确性和延迟方面均有提升，这证明显式学习的规划结构对于构建可靠的事实探寻型LLMs至关重要。

摘要 (Abstract)

Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.
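The separation of planning from retrieval and synthesis described in this abstract can be sketched as three pluggable stages. Everything in the sketch is a stand-in: the planner is a hard-coded function in place of the trained student model, the search stage is a dictionary lookup, and all names and the tiny fact index are hypothetical. The point it illustrates is that the plan carries reasoning steps and fact requests but no facts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Plan:
    reasoning_steps: List[str]  # abstract steps, free of factual content
    fact_requests: List[str]    # searchable requests handed to the retrieval stage


def toy_planner(question: str) -> Plan:
    """Stand-in for the trained student planner: emits structure, never answers."""
    return Plan(
        reasoning_steps=["Look up each requested fact", "Compare the two values"],
        fact_requests=["height of Mount Everest", "height of K2"],
    )


def answer(
    question: str,
    planner: Callable[[str], Plan],
    search: Callable[[str], str],
    synthesize: Callable[[str, Plan, Dict[str, str]], str],
) -> str:
    """Run the three stages in order: plan, retrieve per fact request, synthesize."""
    plan = planner(question)
    evidence = {request: search(request) for request in plan.fact_requests}
    return synthesize(question, plan, evidence)


# Toy retrieval index used only for this illustration.
toy_index = {"height of Mount Everest": "8849 m", "height of K2": "8611 m"}


def toy_search(request: str) -> str:
    return toy_index.get(request, "not found")


def toy_synthesize(question: str, plan: Plan, evidence: Dict[str, str]) -> str:
    facts = "; ".join(f"{key}: {value}" for key, value in evidence.items())
    return f"Q: {question} | evidence: {facts}"


result = answer("Which is taller, Everest or K2?", toy_planner, toy_search, toy_synthesize)
print(result)
```

Because the stages only meet through the `Plan` object and the evidence dictionary, the planner can be trained or replaced without touching retrieval or synthesis.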

Keywords: Large Language Models, Retrieval-Augmented Generation, Tool Use, Hallucination Mitigation, Planning, Fact-seeking QA, Teacher-Student Framework, Modular Framework

2. ✅ An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

Authors: Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin Journal/source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14463v1

Score: 60.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

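Scoring rationale: The paper's core subject is specializing large language models (LLMs) for the insurance domain, so it is highly relevant to "Large Language Models" (10 points). Methodologically, it proposes an end-to-end alignment paradigm that includes SFT and RLAIF, so it is highly relevant to "Post-training/SFT", "Instruction Tuning/Alignment", and "RLHF/RLAIF/DPO" (10 points each). Reducing hallucination is one of its explicit core goals, so it is highly relevant to "Hallucination Mitigation" (10 points). The paper notes that existing methods rely on RAG, while its own approach aims to go beyond RAG, which gives some connection to "RAG" (5 points). It adapts an LLM to a vertical domain, which gives some connection to "Domain Adaptation" (5 points). The paper does not address the techniques or concepts behind the other keywords, so they score 0.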

!!! tip deepseek-chat TL;DR

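The paper addresses the challenge of specializing large language models (LLMs) for the high-stakes insurance domain: achieving domain mastery and a very low hallucination rate without sacrificing general capability. It proposes a method that combines verifiable data synthesis with a progressive SFT-RL curriculum framework, and uses it to train INS-S1, an insurance-specific model that reaches SOTA performance on domain tasks while keeping top-tier general capabilities.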

摘要翻译

将大型语言模型（LLM）适配于保险等高风险垂直领域面临着一项重大挑战：应用场景要求严格遵循复杂的监管规定与业务逻辑，且对幻觉零容忍。现有方法通常存在“能力权衡”问题——为获取领域专业知识而牺牲通用智能，或过度依赖检索增强生成（RAG）而缺乏内在推理能力。为弥补这一差距，我们提出了INS-S1，这是一个通过新颖的端到端对齐范式训练而成的保险专用大语言模型系列。我们的方法包含两项方法论创新：（1）一个可验证的数据合成系统，用于构建支持精算推理与合规性的分层数据集；（2）一个渐进式监督微调-强化学习（SFT-RL）课程框架，该框架将动态数据退火与经过验证的推理（RLVR）和人工智能反馈（RLAIF）的协同组合相结合。通过优化数据比例与奖励信号，该框架在强化领域约束的同时防止了灾难性遗忘。此外，我们发布了迄今为止最全面的保险领域基准测试INSEva（包含超过3.9万个样本）。大量实验表明，INS-S1在领域任务上实现了最先进的性能，显著优于DeepSeek-R1和Gemini-2.5-Pro。至关重要的是，它保持了顶级的通用能力，并实现了创纪录的低幻觉率（0.6%，基于HHEM评估）。我们的结果表明，严格的领域专业化可以在不损害通用智能的前提下实现。

摘要 (Abstract)

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

Keywords: Large Language Models, Insurance Domain, Hallucination Mitigation, Supervised Fine-tuning, RLAIF, Domain Adaptation, Verifiable Data Synthesis, Progressive Curriculum

3. ✅ CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad

Authors: Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, Kun Zhang Journal/source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14575v1

Score: 54.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

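The paper targets two problems in existing evolve-based AI scientist agents, declining evolution efficiency and oscillatory behavior. It proposes CausalEvolve, which guides evolution through causal reasoning and a reflection mechanism, and reports higher evolution efficiency and better solutions on several open-ended scientific tasks.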

摘要翻译

以AlphaEvolve为代表的进化型智能体是利用大型语言模型构建AI科学家的显著成功案例之一。这类智能体通过迭代改进与演化程序，并借助大型语言模型的先验知识与推理能力，以解决开放式的科学问题。尽管取得了成功，现有的进化型智能体仍缺乏针对演化过程的定向引导机制，以及有效组织与利用历史进化经验中获取知识的系统方法。因此，它们在接近已知性能边界时会出现进化效率递减与振荡现象。为弥补这一不足，我们开发了CausalEvolve，其配备了一个因果推理工作台，能够利用大型语言模型识别并推演进化过程中的关键引导因素。在初始阶段，CausalEvolve首先识别结果层面的影响因素，这些因素能为优化目标提供互补性启发。在进化过程中，该系统还通过监测演化中的异常模式并结合溯因推理来假设新的影响因素，从而开辟新的进化方向。通过全面的实验验证，我们证明CausalEvolve在四项具有挑战性的开放式科学任务中，能有效提升进化效率并发现更优解决方案。

摘要 (Abstract)

Evolve-based agent such as AlphaEvolve is one of the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the beginning, CausalEvolve first identifies outcome-level factors that offer complementary inspirations in improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns during the evolution and abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.

Keywords: CausalEvolve, Large Language Models, AI Scientists, open-ended scientific problems, evolutionary efficiency, causal reasoning, self-improvement, scientific discovery

4. ✅ Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Authors: Guangfu Hao, Yuming Dai, Xianzhe Qin, Shan Yu Journal/source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15371v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

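The paper targets the accuracy collapse that large language models show on complex multi-step reasoning tasks. It proposes a brain-inspired graph multi-agent architecture in which a dynamically constructed agent topology is coordinated through a centralized shared workspace, and reports consistent gains for several frontier LLMs on complex reasoning tasks.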

摘要翻译

大语言模型（LLM）已在广泛的语言任务中展现出卓越能力，但复杂的多步推理仍是根本性挑战。尽管配备扩展思维链机制的大型推理模型（LRM）相比标准LLM表现出性能提升，但这两类模型在足够复杂的任务上仍存在准确性崩溃现象，这表明仅靠模型级推理的扩展并不足够。受人类认知的全局工作空间理论启发，我们提出脑启发的图多智能体系统（BIGMAS）。该系统将专用LLM智能体组织为动态构建的有向图节点，并仅通过中心化共享工作空间进行协同。问题自适应的图设计器（GraphDesigner）构建任务特定的智能体拓扑结构，而全局编排器（Orchestrator）则利用完整的共享状态进行路由决策，从而克服反应式方法的局部视野瓶颈。在Game24、Six Fives和伦敦塔任务上对六种前沿LLM的实验表明，BIGMAS能持续提升标准LLM与LRM的推理性能，其表现优于包括ReAct和思维树（Tree of Thoughts）在内的现有多智能体基线方法，这证明多智能体架构设计能够提供与模型级推理增强正交的互补性增益。

摘要 (Abstract)

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain-of-thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model-level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain-Inspired Graph Multi-Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem-adaptive GraphDesigner constructs task-specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local-view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi-agent baselines including ReAct and Tree of Thoughts, showing that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.
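The coordination pattern in this abstract (agents as nodes of a directed graph that communicate only through a shared workspace, with an orchestrator that routes using the full shared state) can be sketched with plain functions. This is a toy under stated assumptions: the two agents are deterministic stand-ins for LLM calls, the graph is fixed by hand rather than produced by a GraphDesigner, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Workspace = Dict[str, object]


def propose(ws: Workspace) -> None:
    """Toy proposer: the first attempt misses 24, the second one hits it."""
    ws["attempts"] = int(ws.get("attempts", 0)) + 1
    ws["candidate"] = (8 + 4) + (7 + 1) if ws["attempts"] == 1 else (8 - 4) * (7 - 1)


def verify(ws: Workspace) -> None:
    """Toy verifier: checks the candidate value against the Game24 target."""
    ws["verified"] = ws["candidate"] == 24


agents: Dict[str, Callable[[Workspace], None]] = {"propose": propose, "verify": verify}
edges: Dict[str, List[str]] = {"propose": ["verify"], "verify": ["propose", "done"]}


def orchestrate(ws: Workspace, current: str) -> str:
    """Pick the next node among the current node's successors using the shared state."""
    if current == "verify":
        return "done" if ws["verified"] else "propose"
    return edges[current][0]


def run(start: str = "propose", max_steps: int = 10) -> Tuple[Workspace, List[str]]:
    ws: Workspace = {}
    node, trace = start, []
    for _ in range(max_steps):
        if node == "done":
            break
        agents[node](ws)  # agents read and write only the shared workspace
        trace.append(node)
        node = orchestrate(ws, node)
    return ws, trace


workspace, trace = run()
print(trace, workspace["candidate"])
```

The failed first proposal is routed back to the proposer because the orchestrator can see the verifier's result in the workspace; the agents never message each other directly.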

Keywords: Large Language Models, Multi-step Reasoning, Multi-agent Systems, Chain-of-Thought, Graph Topology, Agent Coordination, Reasoning Performance, Brain-Inspired Architecture

5. ✅ SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Authors: Yu Pan, Wenlong Yu, Tiejun Wu, Xiaohu Ye, Qiannan Si, Guangquan Xu, Bin Wu Journal/source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15397v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
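The total in the table above can be reproduced from the report's configuration, which gives the pass mark as 5.0 + 0.8 × total weight. The sketch below assumes each row's score is weight × relevance (consistent with the rows shown) and that the "maximum score per keyword: 15" setting acts as a cap. Both the cap behaviour and the function names are assumptions, not the report generator's actual code.

```python
from typing import Iterable, Tuple

def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration."""
    return base + factor * total_weight

def paper_score(rows: Iterable[Tuple[float, float]], max_per_keyword: float = 15.0) -> float:
    """Sum weight * relevance per keyword; the cap is an assumed reading of the config."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)

# Non-zero rows of the SFCoT table as (weight, relevance out of 10).
sfcot_rows = [(1.0, 10.0), (1.0, 8.0), (1.0, 10.0), (1.0, 8.0), (1.0, 8.0)]

total = paper_score(sfcot_rows)   # 10 + 8 + 10 + 8 + 8
threshold = pass_mark(27.0)       # 27 keywords, each with weight 1.0
passed = total >= threshold
```

With 27 keywords of weight 1.0, the threshold works out to 26.6, so a paper needs roughly three strongly matching keywords to pass.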

!!! tip deepseek-chat TL;DR

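This paper targets the vulnerability of large language models to jailbreak attacks during chain-of-thought reasoning. It proposes the SFCoT framework, which evaluates and calibrates intermediate reasoning steps in real time, cutting the attack success rate from 58.97% to 12.31% and strengthening model safety without significantly affecting general performance.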

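Abstract translation

Large language models (LLMs) show outstanding capability on complex reasoning tasks, yet their safety alignment remains highly vulnerable to jailbreak attacks. Existing defenses usually apply post hoc filtering to the final output only, which leaves intermediate reasoning steps unmonitored and open to adversarial manipulation. To close this gap, the paper proposes a SaFer Chain-of-Thought (SFCoT) framework that evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT combines a three-tier safety scoring system with a multi-perspective consistency verification mechanism to detect potential risks throughout the reasoning process. A dynamic intervention module then performs targeted calibration, steering the reasoning path toward safe outcomes. Experiments show that SFCoT reduces the attack success rate from 58.97% to 12.31% without significantly affecting general performance, demonstrating its effectiveness as an efficient safety enhancement method for LLMs.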

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation. To address this gap, this paper proposes a SaFer Chain-of-Thought (SFCoT) framework, which proactively evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT incorporates a three-tier safety scoring system alongside a multi-perspective consistency verification mechanism, designed to detect potential risks throughout the reasoning process. A dynamic intervention module subsequently performs targeted calibration to redirect reasoning trajectories toward safe outcomes. Experimental results demonstrate that SFCoT reduces the attack success rate from 58.97% to 12.31%, demonstrating it as an effective and efficient LLM safety enhancement method without a significant decline in general performance.

Keywords: Large Language Models, Chain-of-Thought, Safety Alignment, Jailbreak Attacks, Reasoning Security, Real-time Calibration, Adversarial Manipulation, Attack Success Rate Reduction
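The idea of checking each reasoning step before it is accepted, as opposed to filtering only the final output, can be illustrated with a small gate over a list of steps. This is a rough sketch under strong assumptions: the three tiers, the keyword-based scorer, and the intervention actions are placeholders invented for the example, and the multi-perspective consistency check from the abstract is not modelled at all.

```python
from typing import Callable, List

SAFE, BORDERLINE, UNSAFE = 0, 1, 2  # assumed names for the three tiers

def toy_tier(step: str) -> int:
    """Hypothetical scorer: keyword matching stands in for a real safety evaluator."""
    text = step.lower()
    if "bypass" in text:
        return UNSAFE
    if "maybe" in text:
        return BORDERLINE
    return SAFE

def calibrate_chain(steps: List[str], tier_fn: Callable[[str], int] = toy_tier) -> List[str]:
    """Evaluate every intermediate step and intervene as soon as one is flagged."""
    accepted: List[str] = []
    for step in steps:
        tier = tier_fn(step)
        if tier == SAFE:
            accepted.append(step)
        elif tier == BORDERLINE:
            # Assumed intervention: mark the step as rewritten, then continue.
            accepted.append("[calibrated] " + step)
        else:
            # Assumed intervention: stop the chain at the first unsafe step.
            accepted.append("[stopped: unsafe step]")
            break
    return accepted
```

The design point is that the gate runs inside the loop that builds the chain, so an unsafe step never becomes context for later steps.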

Authors: Ren Jian Lim, Rushi Dai Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15341v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

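This study proposes an LLM-based multimodal, multi-agent framework that uses natural language interaction and retrieval-augmented generation to dynamically turn user descriptions into optimized 3D interior design proposals, improving design communication and participation.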

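Abstract translation

In architectural interior design, communication frequently breaks down because clients lack design knowledge while designers struggle to explain complex spatial relationships, often causing project delays and financial losses. Recent progress in generative layout tools has narrowed this gap by automatically producing 3D visualizations. Existing methods have limitations, however: rule-based systems use hard-coded spatial constraints that restrict participatory interaction, while data-driven models depend on large training datasets. The recent rise of large language models (LLMs) addresses this shortfall by enabling intuitive reasoning about spatial relationships through natural language. This study proposes an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and images into 3D design proposals. Specialized agents (Reference, Spatial, Interactive, Grader) operating through prompt guidelines jointly address the core challenges: the agent system supports real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency and removes the need for task-specific model training. The framework accurately interprets spatial intent and generates optimized 3D interior designs, raising productivity and encouraging participation by non-designers. Evaluation across diverse floor plans and user questionnaires validates the framework. An independent LLM evaluation shows that participatory layouts score higher on user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicate 77% user satisfaction and a clear preference over traditional design software. These findings suggest that the framework enhances user-centered communication and fosters a more inclusive, efficient, and adaptable design process. Project page: https://rsigktyper.github.io/AICodesign/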

Abstract

In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule-based systems implement hard-coded spatial constraints that restrict participatory engagement, while data-driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency without requiring task-specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user-centric communication and fosters more inclusive, effective, and resilient design processes. Project page: https://rsigktyper.github.io/AICodesign/

Keywords: Large Language Models, Multi-agent Systems, Retrieval-Augmented Generation, Interior Design, 3D Design Generation, Natural Language Interaction, Spatial Reasoning, User-centric Design

7. ✅ VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

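This paper introduces the VTC-Bench benchmark for evaluating how well multimodal large language models, acting as visual agents, compose multiple tools to carry out complex tasks. Experiments reveal significant limitations of current models in tool adaptation, composition, and planning; the best model reaches only 51% accuracy.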

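Abstract translation

Recent advances have extended multimodal large language models (MLLMs) from standard visual question answering to using external tools for advanced visual tasks. Despite this progress, executing tools precisely and composing diverse tools effectively for complex tasks remains a persistent bottleneck. Limited by sparse tool sets and simple tool-use trajectories, existing benchmarks cannot capture complex and diverse tool interactions and fall short in evaluating model performance under realistic conditions. To close this gap, we propose VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark for evaluating the tool-use ability of MLLMs. To match real computer vision pipelines, the framework integrates 32 diverse OpenCV-based visual operations. This rich tool set supports a wide range of combinations, allowing VTC-Bench to rigorously assess multi-tool composition and the execution of long-horizon, multi-step plans. For precise evaluation, we build 680 curated problems organized in a nine-category cognitive hierarchy, each paired with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal key limitations in the visual agentic capabilities of current models. In particular, models struggle to adapt to diverse tool sets and to generalize to unseen operations; the best-performing model, Gemini-3.0-Pro, reaches only 51% accuracy on the benchmark. Multi-tool composition also remains a persistent challenge. On complex tasks, models struggle to form efficient execution plans and rely heavily on a narrow, suboptimal subset of familiar functions instead of choosing the best tools. By exposing these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more general visual agentic models.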

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models’ visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.

Keywords: Multimodal Large Language Models, Visual Agentic Models, Tool Use, Tool Composition, Benchmark Evaluation, OpenCV Operations, Multi-step Planning, Visual Task Solving

8. ✅ The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

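This paper studies how retrieved ideological texts affect the outputs of large language models (LLMs) within a retrieval-augmented generation (RAG) framework. It finds that LLM responses align more closely with the ideology present in the external knowledge, and it stresses the importance of identifying ideological discourse to mitigate bias and the risk of malicious manipulation.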

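Abstract translation

This paper studies the effect of retrieved ideological texts on the outputs of large language models (LLMs). Although interest in understanding ideology in LLMs has grown recently, the issue has received little attention in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source built from ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework based on Lexical Multidimensional Analysis (LMDA) to identify the ideologies within the corpus. We ask LLMs to answer questions derived from three identified ideological dimensions, using two types of contextual prompts: the first contains the user question and ideological texts; the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between the reference ideological texts and the LLM responses is assessed by computing cosine similarity over lexical and semantic representations. The results show that LLM responses grounded in ideologically retrieved texts tend to align with the ideology encountered in the external knowledge, and that the enhanced prompt further influences LLM outputs. Our findings underline the importance of identifying ideological discourse within the RAG framework, which helps mitigate unintended ideological bias and also lowers the risk of malicious manipulation of such models.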

Abstract

This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked to answer questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs’ responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs’ responses based on ideological retrieved texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs’ outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
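The alignment measure mentioned in the abstract, cosine similarity between a reference text and a response, is easy to show for the lexical case. The sketch below assumes a plain bag-of-words count vector with whitespace tokenization; the abstract does not specify the lexical representation, so that choice (and the function name) is an assumption. A semantic variant would swap the count vectors for sentence embeddings.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary only; other terms contribute zero.
    dot = sum(vec_a[token] * vec_b[token] for token in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

A score near 1 means the response reuses the reference vocabulary in similar proportions, and a score of 0 means the two texts share no tokens.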

9. ✅ CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Authors: Taeyun Roh, Wonjune Jang, Junha Jung, Jaewoo Kang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15421v1

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

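This paper addresses knowledge dilution and interference from irrelevant context that small language model agents face with a global memory pool. It proposes CLAG, a clustering-based adaptive memory organization framework that improves answer quality and robustness through semantic clustering and localized evolution.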

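Abstract translation

Large language model agents rely heavily on external memory systems to support knowledge reuse and complex reasoning tasks. However, most existing memory systems store experiences in a single global retrieval pool, which can gradually dilute or contaminate the stored knowledge. The problem is especially acute for small language models, which are highly susceptible to interference from irrelevant context. This paper proposes CLAG, a clustering-based agentic memory framework in which a small language model agent organizes memory through active clustering. CLAG uses a routing mechanism driven by a small language model to assign new memories to semantically coherent clusters, and it autonomously generates cluster-specific profiles containing topic summaries and descriptive tags, so that each cluster becomes a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG reduces cross-topic interference and increases internal memory density. At retrieval time, the framework uses a two-stage process: it first filters relevant clusters via their profiles to exclude distractors and shrink the search space, and then performs precise retrieval. Experiments with three small language model backbones on multiple QA datasets show that, compared with existing agent memory systems, CLAG consistently improves answer quality and robustness while remaining lightweight and efficient.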

Abstract

Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.

Keywords: Small Language Models, LLM Agents, Memory Organization, Clustering, Retrieval-Augmented Generation, Knowledge Reuse, Context Interference, QA Datasets
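The two-stage retrieval described above (filter clusters by their profiles, then search only inside the surviving clusters) can be sketched as follows. This is a simplified illustration, not CLAG itself: token overlap stands in for whatever relevance scoring the framework uses, and the cluster dictionary layout and parameter names are assumptions.

```python
from typing import Dict, List

def overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase tokens (an assumption)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def two_stage_retrieve(query: str, clusters: List[Dict], top_clusters: int = 1,
                       top_memories: int = 1) -> List[str]:
    # Stage 1: rank clusters by profile relevance and keep only the best ones,
    # which drops distractor memories before any fine-grained search happens.
    ranked = sorted(clusters, key=lambda c: overlap(query, c["profile"]), reverse=True)
    candidates = [m for c in ranked[:top_clusters] for m in c["memories"]]
    # Stage 2: rank individual memories inside the selected clusters.
    return sorted(candidates, key=lambda m: overlap(query, m), reverse=True)[:top_memories]

# Hypothetical memory store with one profile per cluster.
clusters = [
    {"profile": "cooking recipes pasta",
     "memories": ["user likes pasta with basil", "user dislikes spicy recipes"]},
    {"profile": "travel flights hotels",
     "memories": ["user booked flights to Rome", "user prefers quiet hotels"]},
]
hits = two_stage_retrieve("which hotels does the user prefer", clusters)
```

Here the cooking cluster is discarded in stage 1, so its memories never compete with the travel memories in stage 2, even though they also mention the user.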

10. ✅ ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Authors: Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jingsong Su Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14903v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	8.0/10	8.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

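This paper addresses the dilemma between decoding efficiency and positional consistency that arises when decoder-only LLMs are applied to simultaneous translation. It proposes the ExPosST framework, which uses explicit position allocation and policy-consistent fine-tuning to achieve efficient, positionally consistent simultaneous translation across multiple language pairs.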

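Abstract translation

Large language models (LLMs) have recently shown promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, creating a dilemma between decoding efficiency and positional consistency. Existing methods usually depend on specific positional encodings or carefully designed prompting schemes, so they struggle to achieve inference efficiency, positional consistency, and broad model compatibility at the same time. This work proposes ExPosST, a general framework that resolves the dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, so that efficient decoding with the KV cache is possible under different positional encoding methods. To further close the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning method that keeps training consistent with decoding behavior at inference time. Experiments across multiple language pairs show that ExPosST effectively supports simultaneous translation under a variety of policies.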

Abstract

Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
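One way to picture "reserving fixed positional slots for incoming source tokens" is an allocator that hands out position IDs from two disjoint ranges. The sketch below is an interpretation of the abstract, not the paper's algorithm: the class name, the slot budget, and the choice to place target positions directly after the reserved range are all assumptions.

```python
class PositionAllocator:
    """Assigns position IDs so that late-arriving source tokens never shift target tokens."""

    def __init__(self, reserved_source_slots: int) -> None:
        self.reserved = reserved_source_slots  # assumed fixed budget for the source
        self.n_source = 0
        self.n_target = 0

    def read_source(self) -> int:
        """Position for the next incoming source token, taken from the reserved range."""
        if self.n_source >= self.reserved:
            raise ValueError("source is longer than the reserved slots")
        position = self.n_source
        self.n_source += 1
        return position

    def write_target(self) -> int:
        """Position for the next generated target token, placed after the reserved range."""
        position = self.reserved + self.n_target
        self.n_target += 1
        return position

# Interleaved READ/WRITE actions, as in a simultaneous translation policy.
alloc = PositionAllocator(reserved_source_slots=4)
positions = [alloc.read_source(), alloc.read_source(), alloc.write_target(),
             alloc.read_source(), alloc.write_target()]
```

Under this scheme the position of an already generated target token does not change when more source arrives, which is what would let cached keys and values be reused without recomputation.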

11. ✅ Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14864v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

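This paper addresses the challenge of accurately capturing user preferences over long-term conversations for LLM agents in e-commerce. It proposes a unified Shopping Companion framework trained with a dual-reward reinforcement learning strategy, which significantly outperforms existing models on a benchmark covering 1.2 million real products.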

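Abstract translation

In e-commerce, LLM agents show promise on shopping tasks such as recommendation, budget planning, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. Realizing this potential faces two main challenges: (1) there is no benchmark for evaluating long-term, preference-aware shopping tasks; and (2) end-to-end optimization is lacking because existing designs treat preference identification and shopping assistance as separate modules. This paper introduces a new benchmark with a long-term memory setup, covering two types of shopping tasks over more than 1.2 million real products, and presents Shopping Companion, a unified framework that supports user intervention and handles memory retrieval and shopping assistance jointly. To train these capabilities, we develop a dual-reward reinforcement learning strategy that uses tool-wise rewards to cope with the sparse and discontinuous rewards inherent in multi-turn interaction. Experimental results show that even state-of-the-art models (such as GPT-5) achieve success rates below 70% on our benchmark, underscoring how challenging the domain is. Notably, a lightweight LLM trained with the Shopping Companion framework consistently outperforms strong existing baselines, achieving better preference capture and task execution, which validates the effectiveness of our unified design.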

Abstract

In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.

Keywords: LLM agents, e-commerce, shopping tasks, long-term memory, preference capture, reinforcement learning, tool-wise rewards, benchmark

12. ✅ Questionnaire Responses Do not Capture the Safety of AI Agents

Authors: Max Hellrigel-Holderbaum, Edward James Young Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14417v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

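This paper argues that assessing the safety of large language models (LLMs) with questionnaire-style prompts is flawed and cannot effectively evaluate the risks of AI agents (LLM agents) in actual deployment, and that current AI alignment approaches share a similar structural problem.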

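Abstract translation

As AI systems grow more capable, measuring their safety and alignment with human values becomes critical. A rapidly growing area of AI research is devoted to developing such assessments. However, most current advances may be ill-suited to assessing AI systems in real-world deployment. Standard methods use questionnaire-style prompts that ask large language models to describe their values or behavior in hypothetical scenarios. These methods focus only on unaugmented large language models and fail to evaluate AI agents, which could actually carry out the relevant behaviors and therefore pose greater risks. The way a large language model engages with a scenario described by a questionnaire-style prompt differs markedly from that of an agent built on the same model, with divergences in inputs, possible actions, environmental interactions, and internal processing. As a result, a large language model's response to a scenario description is unlikely to represent the actual behavior of the corresponding agent. We further argue that such assessments make strong assumptions about the ability and tendency of large language models to report accurately on their counterfactual behavior, so they lack construct validity and are inadequate for assessing risks from AI systems in real-world settings. We argue that current AI alignment approaches suffer from a structurally identical problem. Finally, we discuss how safety assessment and alignment training can be improved by taking these shortcomings seriously.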

摘要 (Abstract)

As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire-style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs’ engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs’ responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents’ behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real-world contexts as they lack construct validity. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.

Keywords: AI safety, AI alignment, large language models, LLM agents, safety assessment, questionnaire-style prompts, construct validity, real-world deployment

13. ✅ MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0
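
The total above can be checked against the report's configuration, which gives the pass threshold as 5.0 + 0.8 × total weight. The sketch below assumes each row's score is weight × relevance (all weights here are 1.0, so the table cannot confirm this) and that a paper passes when its score is at least the threshold; the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows.

    Assumption: the per-row score is the product of weight and relevance.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Three keywords scored 10.0 in the table above; the other 24 scored 0.0.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24
score = paper_score(rows)
threshold = pass_threshold(total_weight=27.0)
# Assumption: "pass" means score >= threshold.
passed = score >= threshold
```

With 27 keywords of weight 1.0, the threshold evaluates to 26.6, matching the value shown next to each score.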

!!! tip deepseek-chat TL;DR

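The paper presents MedPriv-Bench, the first benchmark designed to jointly evaluate privacy preservation and clinical utility of large language models in medical open-ended question answering. A multi-agent, human-in-the-loop pipeline synthesizes sensitive medical contexts and queries, and a RoBERTa-NLI model serves as an automated judge of data leakage. An evaluation of 9 representative LLMs shows a pervasive privacy-utility trade-off.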

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.

Keywords: Large Language Models, Retrieval-Augmented Generation, Medical AI, Privacy-Utility Trade-off, Benchmark, Clinical Evidence, Data Leakage, Healthcare
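
The abstract reports that the automated judge reaches an average of 85.9% alignment with human experts. As a minimal sketch of how such a figure could be computed, the code below treats alignment as simple per-item agreement between judge and human leakage labels. That reading is an assumption (the abstract does not define the metric), and the labels are made-up illustration data, not results from the paper.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the automated judge and the human agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def leakage_rate(labels):
    """Fraction of responses flagged as leaking private context."""
    return sum(labels) / len(labels)


# Hypothetical labels: True means "response leaks sensitive details".
judge = [True, False, True, True, False, False, True, False]
human = [True, False, False, True, False, False, True, False]

agreement = agreement_rate(judge, human)   # judge and human differ on one item
judge_leakage = leakage_rate(judge)
```

A real evaluation would obtain the judge labels from the NLI model's entailment predictions; that step is omitted here because the abstract gives no details about it.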

14. ✅ Effective Distillation to Hybrid xLSTM Architectures

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	10.0/10	10.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

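The paper studies how to distill quadratic-attention large language models (LLMs) into xLSTM-based linearized architectures, with lossless distillation as the stated goal. Its pipeline adds a merging stage that combines individually linearized experts into a single model. In many settings the xLSTM-based students recover most of the teacher's performance and exceed it on some downstream tasks.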

Abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher’s performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

Keywords: knowledge distillation, large language models, xLSTM, linearized architectures, model merging, instruction tuning, energy-efficient, downstream tasks
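
The abstract defines lossless distillation through tolerance-corrected Win-and-Tie rates between student and teacher, without spelling out the formula. One plausible reading, sketched below, counts a task as a win or tie when the student's score is within a tolerance of the teacher's or higher. The function name, the tolerance value, and the scores are hypothetical.

```python
def win_and_tie_rate(student, teacher, tolerance):
    """Share of common tasks where the student matches or beats the teacher.

    student, teacher: dicts mapping task name to a score (higher is better).
    A task counts when student >= teacher - tolerance (assumed definition).
    """
    tasks = student.keys() & teacher.keys()
    if not tasks:
        raise ValueError("no tasks in common")
    hits = sum(1 for task in tasks if student[task] >= teacher[task] - tolerance)
    return hits / len(tasks)


# Made-up benchmark scores for illustration only.
teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80, "task_d": 0.62}
student_scores = {"task_a": 0.71, "task_b": 0.54, "task_c": 0.74, "task_d": 0.62}

rate_with_tolerance = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.02)
rate_strict = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.0)
```

Under this reading, a rate of 1.0 across the task set would correspond to lossless distillation; the tolerance absorbs small differences that are likely noise.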

15. ✅ AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Score: 29.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

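The study targets the computational cost of language-guided robotic manipulation in resource-constrained environments. It proposes AnoleVLA, a lightweight vision-language-action (VLA) model built on a deep state space model. In real-world experiments it outperforms a representative large-scale VLA by 21 points in task success rate while running inference approximately three times faster.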

Abstract

In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.
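
The abstract does not say which state space model AnoleVLA uses. To illustrate why sequential state modeling is cheap, here is a generic discrete-time linear state space recurrence with a diagonal transition, h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t). It is a textbook sketch, not the paper's architecture, and all parameter values are arbitrary.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state space model over a scalar input sequence.

    a, b, c: per-channel transition, input, and readout coefficients.
    The state has fixed size, so each step costs the same regardless of
    how long the sequence already is.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # Elementwise state update: h_t = a * h_{t-1} + b * x_t
        state = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, state)]
        # Readout: y_t = sum over channels of c * h_t
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# Impulse input through a single decaying channel.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
```

Standard self-attention, by contrast, compares each token with every other token in the sequence, which is the transformer cost the abstract points to.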

16. ❌ Establishing Construct Validity in LLM Capability Benchmarks Requires Nomological Networks

Authors: Timo Freiesleben; Source: arXiv; Published: 2026-03-16; arXiv link: http://arxiv.org/abs/2603.15121v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

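Scoring rationale: The paper's core topic is construct validity in LLM capability evaluation, with the assessment of reasoning as its case study. It is therefore highly relevant to “Large Language Models” (10 points) and fairly relevant to the reasoning keywords (“Chain of Thought” and “System 2 Thinking”, 8 points each), because reasoning evaluation is the worked example. The remaining keywords concern specific techniques, applications, or evaluation methods that the paper does not discuss directly, so they score 0.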

!!! tip deepseek-chat TL;DR

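The paper examines how to establish construct validity in LLM capability benchmarks and argues that the nomological account (nomological networks) is the most suitable foundation for assessing capabilities such as reasoning.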

Abstract

Recent work in machine learning increasingly attributes human-like capabilities such as reasoning or theory of mind to large language models (LLMs) on the basis of benchmark performance. This paper examines this practice through the lens of construct validity, understood as the problem of linking theoretical capabilities to their empirical measurements. It contrasts three influential frameworks: the nomological account developed by Cronbach and Meehl, the inferential account proposed by Messick and refined by Kane, and Borsboom’s causal account. I argue that the nomological account provides the most suitable foundation for current LLM capability research. It avoids the strong ontological commitments of the causal account while offering a more substantive framework for articulating construct meaning than the inferential account. I explore the conceptual implications of adopting the nomological account for LLM research through a concrete case: the assessment of reasoning capabilities in LLMs.

Keywords: construct validity, LLM capability assessment, nomological networks, reasoning capabilities, benchmark evaluation, large language models, theoretical frameworks

17. ❌ Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen; Source: arXiv; Published: 2026-03-16; arXiv link: http://arxiv.org/abs/2603.15611v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

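Scoring rationale: The paper's core contribution is an adversarial co-evolution framework that uses reinforcement learning (RL) to jointly optimize a Code LLM and a Test LLM. It is highly relevant to “Large Language Models” (10 points) because it explicitly studies and optimizes two LLMs. The keyword “RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO” also receives 10 points: the paper does not use those specific terms, but its core mechanism (an adversarial co-evolution framework trained with reinforcement learning, adversarial objectives, and rewards) is conceptually close to reward-based optimization methods such as RLHF, RLAIF, and DPO, and counts as an innovation in LLM training and alignment. The paper does not address the other keywords (such as MoE, quantization, RAG, or CoT), so they score 0.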

!!! tip deepseek-chat TL;DR

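To address the scarcity of high-quality test suites in code generation, and the self-collusion or generic-test problems of existing self-play methods, the paper proposes Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with reinforcement learning. Experiments show code generation performance matching or exceeding models trained on human-annotated tests, along with significantly better test generation.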

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

Keywords: Code LLM, Test LLM, Adversarial Co-evolution, Reinforcement Learning, Code Generation, Test Generation, Self-play, Mistake Book

18. ❌ Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

Authors: Han Zhang, Jiamin Su, Li liu; Source: arXiv; Published: 2026-03-16; arXiv link: http://arxiv.org/abs/2603.14891v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

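Scoring rationale: The paper proposes Decision-Level Ordinal Modeling (DLOM), a new LLM-based method for automated essay scoring that recasts scoring from an implicit generation task into an explicit ordinal decision task. It is highly relevant to “Large Language Models” and “Supervised Fine-tuning” (10 points each): LLMs are the base models, and the abstract compares against a “generation-based SFT baseline”. Other keywords such as MoE, SLMs, Scaling Laws, RAG, RLHF, PEFT, and Agents are not mentioned in the abstract or are unrelated to the core method, so they score 0. The paper applies LLMs to education and does not involve a scientific domain such as bioinformatics, so “AI for Science” also scores 0.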

!!! tip deepseek-chat TL;DR

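For LLM-based automated essay scoring, the paper proposes Decision-Level Ordinal Modeling (DLOM), which recasts scoring as an explicit ordinal decision task, and reports improved scoring performance on both multimodal and text-only datasets.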

Abstract

Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.

Keywords: Automated Essay Scoring, Large Language Models, Ordinal Decision Modeling, Multimodal Fusion, Supervised Fine-tuning, Score-wise Logits, Gated Fusion, Distance-aware Regularization
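
The idea of deciding in score space, as opposed to decoding text, can be sketched as follows. The code assumes the logits for the predefined score tokens have already been read off the language model head. The gate is a fixed scalar and the distance-aware term is an expected absolute distance to the gold score; both are illustrative stand-ins, since the abstract does not give the exact forms used by DLOM-GF and DLOM-DA.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_decision(score_logits, scores):
    """Return (most likely score, expected score) from score-wise logits."""
    probs = softmax(score_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    expected = sum(p * s for p, s in zip(probs, scores))
    return scores[best], expected


def gated_fusion(text_logits, multimodal_logits, gate):
    """Blend two logit vectors; gate in [0, 1] weights the multimodal branch."""
    return [gate * m + (1.0 - gate) * t for t, m in zip(text_logits, multimodal_logits)]


def distance_penalty(score_logits, scores, gold):
    """Expected absolute distance between the predicted and gold score."""
    probs = softmax(score_logits)
    return sum(p * abs(s - gold) for p, s in zip(probs, scores))


scores = [0, 1, 2, 3]
predicted, expected = ordinal_decision([0.0, 0.0, 5.0, 0.0], scores)
uniform_penalty = distance_penalty([0.0, 0.0, 0.0, 0.0], scores, gold=3)
fused = gated_fusion([0.0, 0.0], [2.0, 4.0], gate=0.5)
```

The penalty grows when probability mass sits far from the gold score, which plain cross-entropy over score tokens would not capture.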

19. ❌ CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

Authors: Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand, Bingnan Li, Haiyang Xu, Zhuowen Tu; Source: arXiv; Published: 2026-03-16; arXiv link: http://arxiv.org/abs/2603.14957v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

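Scoring rationale: The paper presents CyCLeGen, a unified autoregressive vision-language foundation model for image understanding and generation. It is highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points) because the paper explicitly calls it a “vision-language foundation model”. It is somewhat relevant to “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation” (5 points): foundation models usually involve pre-training, but the paper does not detail a specific pre-training method. It is also somewhat relevant to “Self-Correction” OR “Self-Improvement” OR “Self-Reflection” (5 points) because the model uses cycle consistency for introspection and self-improvement. Other keywords such as MoE, SLMs, RLHF, and RAG are unrelated to the paper, so they score 0.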

!!! tip deepseek-chat TL;DR

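The paper presents CyCLeGen, a unified autoregressive vision-language foundation model that performs image understanding and image generation in a single framework through cycle-consistent learning, and reports significant gains across multiple benchmarks.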

Abstract

We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.

Keywords: vision-language foundation model, autoregressive framework, cycle-consistent learning, image understanding, image generation, unified architecture, introspection, reinforcement learning
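
A cycle-consistency signal for the layout->image->layout loop could be computed by comparing the layout the model started from with the layout it recovers from its own generated image. The sketch below uses mean box IoU as that signal. This choice of metric, the dictionary layout format, and the function names are assumptions for illustration; the abstract does not describe how the reward is computed.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cycle_reward(original_layout, recovered_layout):
    """Mean IoU over objects in the original layout.

    Layouts map an object label to its box. Objects missing from the
    recovered layout contribute 0, so dropped objects lower the reward.
    """
    if not original_layout:
        return 0.0
    total = 0.0
    for label, box in original_layout.items():
        if label in recovered_layout:
            total += box_iou(box, recovered_layout[label])
    return total / len(original_layout)


original = {"cat": (0.0, 0.0, 2.0, 2.0), "dog": (2.0, 2.0, 4.0, 4.0)}
recovered = {"cat": (0.0, 0.0, 2.0, 2.0), "dog": (3.0, 2.0, 5.0, 4.0)}
reward = cycle_reward(original, recovered)
```

Because the reward needs only the model's own outputs, it can act as synthetic supervision without extra labeled data, which is the data-efficiency argument the abstract makes.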

20. ❌ A proof-of-concept for automated AI-driven stellarator coil optimization with in-the-loop finite-element calculations

Authors: Alan A. Kaptanoglu, Pedro F. Gil; Source: arXiv; Published: 2026-03-16; arXiv link: http://arxiv.org/abs/2603.15240v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

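Rationale: The paper automates stellarator coil design using AI (a genetic algorithm and a context-aware LLM). This is an application of AI to science, specifically fusion energy engineering, so it is highly relevant to “AI for Science” (10 points). The paper uses a “context-aware LLM”, so it is partly relevant to “Large Language Models” (5 points). It does not cover the techniques behind the other keywords, such as MoE, training methods, inference optimization, alignment, or agent systems, so those keywords score 0.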
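The total for this paper can be reproduced from the table. The following is a minimal Python sketch, assuming each keyword's score is weight × relevance (consistent with the rows shown, though the report does not state the rule explicitly) and using the passing-score formula from the report's configuration section (5.0 + 0.8 × total weight). The function names are illustrative, and treating a score equal to the threshold as a pass is also an assumption.

```python
# Sketch of the report's scoring arithmetic (names are illustrative).

def passing_score(total_weight: float) -> float:
    """Passing threshold from the configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Paper 20: 27 keywords with weight 1.0; only two have non-zero relevance
# ("Large Language Models" at 5.0 and "AI for Science" at 10.0).
rows = [(1.0, 5.0), (1.0, 10.0)] + [(1.0, 0.0)] * 25
total_weight = sum(weight for weight, _ in rows)

score = paper_score(rows)
threshold = passing_score(total_weight)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} -> {verdict}")
```

With the values from the table this gives 15.0 against a threshold of 26.6, which matches the failing mark shown for this paper.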

!!! tip deepseek-chat TL;DR

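The paper presents an end-to-end automated, AI-driven framework for optimizing the coils of stellarator fusion devices. It runs non-stop optimization with either a genetic algorithm or a context-aware LLM, and introduces a novel in-the-loop optimization of Von Mises stresses in the coils.

Abstract (Chinese translation)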

为仿星器聚变装置寻找可行的线圈是实现未来电厂应用这一概念的关键挑战。即便是单个反应堆规模的仿星器设计，也可能需要投入多年的研究工作。为加速并自动化仿星器线圈设计流程，我们设计了一种端到端的“运行器”以执行仿星器线圈优化。所有预处理和后处理步骤均已实现自动化；用户仅需指定少量基本输入参数，最终线圈解便会更新于开源排行榜上。系统提供两种策略，可通过遗传算法或上下文感知大语言模型（LLM）进行不间断的自动化线圈优化。最后，我们构建了一种新颖的在环（in-the-loop）优化方法，用于优化线圈中的冯·米塞斯应力（Von Mises stresses），这为未来实现在环有限元计算开启了重要的可能性。

Abstract

Finding feasible coils for stellarator fusion devices is a critical challenge of realizing this concept for future power plants. Years of research work can be put into the design of even a single reactor-scale stellarator design. To rapidly speed up and automate the workflow of designing stellarator coils, we have designed an end-to-end “runner” for performing stellarator coil optimization. The entirety of pre and post-processing steps have been automated; the user specifies only a few basic input parameters, and final coil solutions are updated on an open-source leaderboard. Two policies are available for performing non-stop automated coil optimizations through a genetic algorithm or a context-aware LLM. Lastly, we construct a novel in-the-loop optimization of Von Mises stresses in the coils, opening up important future capabilities for in-the-loop finite-element calculations.

Keywords: stellarator coil optimization, AI-driven, genetic algorithm, context-aware LLM, in-the-loop finite-element calculations, automated workflow, fusion devices, Von Mises stress

21. ❌ Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Authors: Zhuoshang Wang, Yubing Ren, Yanan Cao, Fang Fang, Xiaoxue Li, Li Guo Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14968v1

Score: 10.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

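Rationale: The paper focuses on LLM watermark detection and directly studies a provenance mechanism for LLMs, so it is highly relevant to “Large Language Models” (10 points). The other keywords, such as MoE, SLMs, training methods, inference techniques, alignment, compression, and scientific AI applications, are not mentioned in the title or abstract and are unrelated to the paper's topic, so they score 0.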

!!! tip deepseek-chat TL;DR

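To address the limitations of LLM watermark detection in black-box settings, the paper proposes TTP-Detect, a non-intrusive third-party detection framework. By decoupling detection from injection and recasting verification as relative hypothesis testing, it achieves superior detection performance and robustness.

Abstract (Chinese translation)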

水印技术作为大语言模型溯源的关键机制，现有密钥方案将检测与注入紧密耦合，验证时需获取密钥或依赖服务商提供的专用检测器。这种依赖性为实际治理设置了根本性障碍——若不愿损害模型安全性或依赖服务商不透明的声明，独立审计便无法实现。为解决这一困境，我们提出TTP-Detect，这是一个开创性的黑盒验证框架，专为非侵入式第三方水印验证而设计。通过解耦检测与注入环节，TTP-Detect将验证重构为相对假设检验问题。该框架采用代理模型放大水印相关信号，并运用一系列互补的相对度量方法，评估查询文本与水印分布的一致性。在多种代表性水印方案、数据集和模型上的大量实验表明，TTP-Detect在检测性能与对抗各类攻击的鲁棒性方面均表现卓越。

Abstract

While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.

Keywords: LLM watermarking, black-box detection, third-party verification, non-intrusive framework, TTP-Detect, watermark provenance, relative hypothesis testing, proxy model

22. ❌ A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression

Authors: Yuming Han, Jooho Kim, Anish Shakya Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15365v1

Score: 10.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

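Rationale: The paper focuses on remote sensing image compression and proposes a method that combines a conditional diffusion model with PPO-based bitrate allocation. It is unrelated to most of the large-model and deep-learning keywords. It is only weakly relevant to “Quantization/Model Compression” (it involves compression techniques) and to “AI for Science” (remote sensing is a scientific application), at 5 points each. All other keywords are unrelated (0 points).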

!!! tip deepseek-chat TL;DR

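The paper proposes a conditional diffusion compression framework with PPO-based bitrate allocation for high-resolution drone remote sensing imagery. It reports compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset, with negligible performance loss on a downstream object detection task.

Abstract (Chinese translation)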

现有遥感图像压缩方法仍在探索如何平衡高压缩效率与细节及任务相关信息的保留。与此同时，高分辨率无人机影像为城市监测与灾害评估提供了宝贵的结构细节，但大范围数据集极易达到数百GB量级，对存储与长期管理构成了显著挑战。本文提出了一种基于PPO的码率分配条件扩散压缩框架。该框架将条件扩散解码器与基于PPO的分块码率分配策略相结合，在实现高压缩比的同时保持优异的感知性能。我们还发布了一个高分辨率无人机影像数据集，该数据集在沿海城市居民区上空以恒定低空拍摄，包含更丰富的结构细节。实验结果表明，在DIV2K数据集上实现了19.3倍的压缩比，在无人机影像数据集上实现了21.2倍的压缩比。此外，下游目标检测实验表明，重建图像能有效保留任务相关信息，且性能损失可忽略不计。

Abstract

Existing remote sensing image compression methods still explore to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with richer structural details at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.

Keywords: remote sensing image compression, conditional diffusion model, PPO-based bitrate allocation, drone imagery, high compression ratio, object detection, perceptual performance
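To put the reported compression ratios in storage terms, here is a small worked example in Python. It is only an illustration: the 300 GB dataset size and the 24 bits-per-pixel raw RGB baseline are assumptions chosen for the arithmetic, not figures from the paper, and the helper names are hypothetical.

```python
# Worked arithmetic for the reported compression ratios (illustrative only).

def compressed_size_gb(original_gb: float, ratio: float) -> float:
    """Size after compression, given compression ratio = original / compressed."""
    return original_gb / ratio


def bits_per_pixel(ratio: float, raw_bpp: float = 24.0) -> float:
    """Equivalent bitrate, assuming uncompressed 8-bit RGB (24 bits per pixel)."""
    return raw_bpp / ratio


ASSUMED_DATASET_GB = 300.0  # hypothetical "hundreds of gigabytes" dataset

for name, ratio in [("DIV2K", 19.3), ("drone image dataset", 21.2)]:
    size = compressed_size_gb(ASSUMED_DATASET_GB, ratio)
    bpp = bits_per_pixel(ratio)
    print(f"{name}: {ratio}x -> {size:.1f} GB from {ASSUMED_DATASET_GB:.0f} GB, ~{bpp:.2f} bpp")
```

Under these assumptions, a ratio of about 20x brings a dataset of several hundred gigabytes down to the low tens of gigabytes.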

23. ❌ Do Metrics for Counterfactual Explanations Align with User Perception?

Authors: Felix Liedeker, Basil Ell, Philipp Cimiano, Christoph Düsing Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15607v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

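Rationale: The paper studies whether evaluation metrics for counterfactual explanations align with human perception, which falls within explainable AI (XAI). Its core concern is the evaluation of explanation quality, so it is highly relevant to the keyword “Mechanistic Interpretability OR Explainable AI” (relevance 10). The other keywords mainly concern large-model techniques, training methods, inference optimization, and agent systems, none of which the paper covers, so they are rated 0. The paper does not address applications of large models in other domains or related technical innovations, so its overall relevance is low.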

!!! tip deepseek-chat TL;DR

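Through an empirical study, the paper finds that widely used evaluation metrics for counterfactual explanations correlate only weakly, and in a dataset-dependent way, with human perceptions of explanation quality. The metrics fail to capture key aspects that users care about, which calls for more human-centered evaluation methods.

Abstract (Chinese translation)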

可解释性被广泛视为可信人工智能系统的关键要素。然而，当前用于评估反事实解释的指标多为算法评估指标，这些指标很少通过人类对解释质量的判断进行验证。这引发了一个问题：此类指标是否真正反映了用户的感知？我们通过一项实证研究探讨了该问题，该研究在三个数据集上直接比较了算法评估指标与人类判断。参与者从多个感知质量维度对反事实解释进行评分，我们将这些评分与一套全面的标准反事实评估指标相关联。我们既分析了单个指标与人类评价的关系，也探究了指标组合能在多大程度上预测人类评估。研究结果显示，算法指标与人类评分之间的相关性普遍较弱，且高度依赖于数据集。此外，在预测模型中增加使用的指标数量并不会带来可靠的改进，这表明当前指标在捕捉人类相关评价标准方面存在结构性局限。总体而言，我们的研究结果表明，广泛使用的反事实评估指标未能反映用户所感知的解释质量的关键方面，这凸显了需要采用更以人为中心的方法来评估可解释人工智能。

Abstract

Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.

Keywords: counterfactual explanations, explainable AI, evaluation metrics, human judgments, explanation quality, empirical study, user perception, algorithmic metrics
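The comparison described in the abstract, relating an algorithmic metric to human ratings, can be illustrated with a rank correlation. The sketch below implements Spearman's rank correlation in plain Python. The abstract does not say which correlation measure the authors use, so Spearman is an assumption here, and the metric values and ratings are made-up toy data.

```python
# Spearman rank correlation between a metric and human ratings (toy sketch).

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: a distance-style metric per explanation and the mean
# human rating (1 to 5) for the same explanations.
metric_values = [0.12, 0.40, 0.33, 0.75, 0.51, 0.08]
human_ratings = [3.5, 4.0, 2.5, 3.0, 4.5, 3.0]
print(f"Spearman rho = {spearman(metric_values, human_ratings):.3f}")
```

A value near zero on real data would correspond to the weak relationship the paper reports, while values near +1 or -1 would indicate that the metric tracks human judgment.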

24. ❌ Mechanistic Origin of Moral Indifference in Language Models

Authors: Lingyu Li, Yan Teng, Yingchun Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15615v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

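Rationale: The paper's core topic is moral alignment in LLMs, which directly involves three keywords, “Large Language Models”, “Instruction Tuning/Alignment”, and “Mechanistic Interpretability”, each rated at relevance 10. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, and Quantization, are neither mentioned in nor relevant to the abstract and are rated 0. Although the paper deals with moral concepts, it does not involve a specific scientific domain under “AI for Science”, so that keyword is rated 0.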

!!! tip deepseek-chat TL;DR

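The paper finds that large language models have an inherent moral indifference. It uses sparse autoencoders to isolate moral features and reconstruct their topological relationships, achieving a representational alignment that improves moral reasoning (a 75% pairwise win-rate on the Flames benchmark).

Abstract (Chinese translation)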

现有的大型语言模型（LLM）行为对齐技术往往忽视了表面合规性与内部未对齐表征之间的差异，使LLM易受长尾风险影响。更重要的是，我们认为由于LLM将不同的道德概念压缩为统一的概率分布，其本质上存在一种道德漠然状态。我们基于原型理论（Prototype Theory）和Social-Chemistry-101数据集构建的251k个道德向量，验证并修正了LLM潜在表征中的这种漠然性。首先，我们对23个模型的分析表明，当前LLM无法表征对立道德类别之间的差异以及这些类别内部细粒度的典型性梯度；值得注意的是，无论是模型缩放、架构调整还是显式对齐训练，都未能改变这种漠然状态。随后，我们在Qwen3-8B模型上应用稀疏自编码器，分离出单语义道德特征，并针对性重构其拓扑关系以对齐真实道德向量。这种表征对齐自然提升了道德推理能力与精细度，在独立的对抗性基准测试Flames上取得了75%的配对胜率。最后，我们从体验主义（experientialist）哲学视角阐释了当前干预方法的补救性质，主张内生对齐的人工智能可能需要从事后修正转向主动培育的范式转变。

Abstract

Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs’ latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.

Keywords: Large Language Models, Moral Alignment, Representational Alignment, Sparse Autoencoders, Moral Indifference, Prototype Theory, Adversarial Benchmark, Experientialist Philosophy

25. ❌ Mixture-of-Depths Attention

Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15619v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

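Rationale: The paper studies depth scaling in large language models (LLMs) and proposes the mixture-of-depths attention (MoDA) mechanism, which directly involves LLMs and attention optimization. It is highly relevant to “Large Language Models” (relevance 10) because it explicitly studies depth scaling of LLMs. It is partly relevant to “Mixture of Experts” (relevance 8) because MoDA borrows the mixture idea without being MoE in the strict sense. It is relevant to “KV Cache Compression OR Linear Attention OR FlashAttention” (relevance 8) because it proposes a hardware-efficient algorithm and compares its efficiency with FlashAttention-2. Other keywords, such as SLMs, Scaling Laws, training methods, inference techniques, and application domains, are not covered and are rated 0.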

!!! tip deepseek-chat TL;DR

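To address signal degradation when scaling the depth of large language models, the paper proposes mixture-of-depths attention (MoDA). Experiments on 1.5B-parameter models show that MoDA improves average perplexity by 0.2 and average downstream performance by 2.11% for about 3.7% extra FLOPs, making it an effective approach to depth scaling.

Abstract (Chinese translation)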

深度扩展是大型语言模型（LLMs）发展的关键驱动力。然而，随着模型深度的增加，它们常常面临信号退化问题：在浅层形成的具有信息量的特征，会因残差更新的重复进行而逐渐被稀释，导致这些特征在深层中更难被恢复。我们提出了混合深度注意力（mixture-of-depths attention, MoDA）机制，该机制允许每个注意力头同时关注当前层的序列键值对（KV pairs）以及来自前面各层的深度键值对。我们进一步描述了一种针对MoDA的硬件高效算法，该算法解决了非连续内存访问模式的问题，在序列长度为64K时达到了FlashAttention-2效率的97.3%。在15亿参数模型上的实验表明，MoDA始终优于强基线模型。值得注意的是，它在10个验证基准测试中平均困惑度降低了0.2，在10个下游任务上平均性能提升了2.11%，而计算开销仅增加了可忽略的3.7% FLOPs。我们还发现，将MoDA与后归一化（post-norm）结合使用，比与前归一化（pre-norm）结合能获得更好的性能。这些结果表明，MoDA是实现深度扩展的一个有前景的基础构件。代码发布于 https://github.com/hustvl/MoDA。

Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2’s efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .

Keywords: Mixture-of-Depths Attention, large language models, depth scaling, signal degradation, attention mechanism, hardware-efficient algorithm, FlashAttention, perplexity improvement
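The core idea in the abstract, one attention head attending both to sequence KV pairs at the current layer and to depth KV pairs from preceding layers, can be sketched for a single query in plain Python. This is one reading of the abstract rather than the authors' implementation: the joint softmax over the concatenated keys, the choice of which depth KV pairs are visible, and the function name `moda_head` are all assumptions, and the hardware-efficient kernel is not modeled.

```python
import math

# Single-query sketch of attention over sequence KV plus depth KV pairs.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moda_head(q, seq_k, seq_v, depth_k, depth_v):
    """Attend jointly over current-layer (sequence) and earlier-layer (depth) KV.

    Assumption: both KV sets share one softmax; with empty depth lists this
    reduces to ordinary scaled dot-product attention.
    """
    keys = seq_k + depth_k
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy example: two sequence positions at the current layer, plus one KV pair
# carried over from a preceding layer.
q = [1.0, 0.0]
seq_k, seq_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
depth_k, depth_v = [[1.0, 0.0]], [[0.5, 0.5]]

output, weights = moda_head(q, seq_k, seq_v, depth_k, depth_v)
print("attention weights:", [round(w, 3) for w in weights])
print("output:", [round(x, 3) for x in output])
```

In this reading, the depth KV pairs give the head a direct path to features formed in shallower layers, which is the mechanism the abstract credits for countering signal degradation.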

26. ❌ Computational Concept of the Psyche

Authors: Anton Kolonin, Vladimir Krykov Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15586v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

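Rationale: The paper proposes a cognitive architecture concept based on a state space, needs, and intelligent decision-making for building artificial general intelligence (AGI) systems. It does not mention any specific large-model technique (such as LLMs, MoE, or quantization), training method (such as pre-training, fine-tuning, or alignment), inference technique (such as CoT, RAG, or attention optimization), or application domain (such as bioinformatics). It focuses on a theoretical framework and conceptual model for AGI rather than on the implementation, optimization, or application of current large models or deep learning, so it has no direct relevance to any of the scored keywords.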

!!! tip deepseek-chat TL;DR

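The paper proposes a cognitive architecture concept that models the human psyche as a state space with a needs-driven decision-making system, aimed at building artificial general intelligence (AGI) through experiential learning. It formalizes this as a decision problem under uncertainty that maximizes goal achievement, minimizes existential risk, and maximizes energy efficiency.

Abstract (Chinese translation)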

本文概述了在构建人工心理过程中对人类心理建模的方法。基于此综述，本文提出了一种认知架构（cognitive architecture）概念，其中心理被视为生命体或人工主体的操作系统，包含一个状态空间——其中涵盖需求状态，这些需求决定了主体在与外部世界刺激关联下的存在意义；而智能则被视为关于在此世界中采取行动以满足这些需求的决策系统。基于此概念，本文提出了一种计算形式化方法，用于通过包含主体需求的状态空间中的经验学习来创建智能体的人工通用智能系统，同时考量这些需求对智能体的生物学或存在性意义，以及主体的感知与行动。因此，构建人工通用智能的问题被形式化为：在特定智能体需求空间内、不确定性条件下做出最优决策的系统，其目标在于最大化目标达成成功率、最小化存在性风险并最大化能源效率。文中还展示了该模型的一个最小化实验实现。

Abstract

This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject’s being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent’s needs, taking into account their biological or existential significance for the intelligent agent, along with agent’s sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.

Keywords: cognitive architecture, artificial general intelligence, state space, needs, decision-making system, experiential learning, existential risks, energy efficiency
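The three objectives named in the abstract (maximize goal success, minimize existential risk, maximize energy efficiency) can be combined into a single action-selection rule. The Python sketch below uses a weighted sum, which is an assumption made here for illustration; the abstract does not specify how the objectives are traded off, and the action fields, weights, and numbers are hypothetical.

```python
# Toy action selection over three objectives (weighted-sum scalarization is
# an assumption of this sketch, not the paper's stated formalization).

def utility(action, w_success=1.0, w_risk=1.0, w_energy=1.0):
    """Higher is better: reward success, penalize risk and energy cost."""
    return (w_success * action["p_success"]
            - w_risk * action["risk"]
            - w_energy * action["energy"])


def choose(actions, **weights):
    """Pick the action with the highest utility under the given weights."""
    return max(actions, key=lambda a: utility(a, **weights))


# Hypothetical actions for an agent with a single need to satisfy.
actions = [
    {"name": "forage",  "p_success": 0.8, "risk": 0.3, "energy": 0.2},
    {"name": "rest",    "p_success": 0.2, "risk": 0.0, "energy": 0.05},
    {"name": "explore", "p_success": 0.6, "risk": 0.1, "energy": 0.4},
]

print("balanced weights:", choose(actions)["name"])
print("risk-averse weights:", choose(actions, w_risk=3.0)["name"])
```

Raising the risk weight shifts the choice toward safer actions, which shows how the relative importance of the needs changes the selected behavior.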

27. ❌ OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15594v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

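Rationale: The paper's core topic is LLM search agents (LLM Agents). It uses an LLM as the base model, trains it with SFT, touches on data quality and on a comparison with continual pre-training, and includes multi-hop reasoning (Multi-step Reasoning). Other keywords, such as MoE, SLMs, RLHF, and RAG, are not covered.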

!!! tip deepseek-chat TL;DR

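The paper tackles the scarcity of high-quality training data that holds back open-source search agents. OpenSeeker introduces data synthesis methods that let a search agent reach frontier-level performance with SFT on only 11.7k synthesized samples.

Abstract (Chinese translation)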

深度搜索能力已成为前沿大语言模型智能体不可或缺的核心能力，然而由于缺乏透明、高质量的训练数据，高性能搜索智能体的开发仍由工业巨头主导。这种持续存在的数据稀缺问题从根本上阻碍了更广泛研究界在该领域进行开发和创新的进程。为弥合这一差距，我们推出了OpenSeeker——首个完全开源的搜索智能体（即模型与数据），它通过两项核心技术创新实现了前沿性能水平：（1）基于事实的可扩展可控问答合成，该方法通过拓扑扩展和实体混淆对网络图进行逆向工程，以生成覆盖范围和复杂度均可控的复杂多跳推理任务；（2）去噪轨迹合成，该方法采用回溯总结机制对交互轨迹进行去噪处理，从而引导教师大语言模型生成高质量动作。实验结果表明，仅使用11.7k个合成样本进行训练（单次训练），OpenSeeker便在BrowseComp、BrowseComp-ZH、xbench-DeepSearch和WideSearch等多个基准测试中取得了最先进的性能。值得注意的是，通过简单的监督微调训练，OpenSeeker显著优于第二佳的全开源智能体DeepDive（例如在BrowseComp上达到29.5%对15.3%），甚至在BrowseComp-ZH基准上超越了Tongyi DeepResearch等工业级竞品（48.4%对46.7%），而后者采用了大规模持续预训练、监督微调和强化学习的复合训练方案。我们将完整训练数据集和模型权重全面开源，以期推动前沿搜索智能体研究的民主化，培育更加透明、协作的生态系统。

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, therefore promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% v.s. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% v.s. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

Keywords: search agent, LLM agents, open-source, training data synthesis, supervised fine-tuning, multi-hop reasoning, frontier performance, democratization
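
The OpenSeeker abstract describes QA synthesis through topological expansion of a web graph followed by entity obfuscation. The sketch below is only a toy illustration of that idea, not the authors' pipeline: the `FACTS` graph, the `expand` and `synthesize` helpers, and the question template are all assumptions made up for this example.

```python
# Hypothetical toy graph: entity -> list of (relation, neighbour) facts.
FACTS = {
    "Ada Lovelace": [("collaborated with", "Charles Babbage")],
    "Charles Babbage": [("designed", "the Analytical Engine")],
    "the Analytical Engine": [],
}


def expand(facts, seed, hops):
    """Toy topological expansion: walk up to `hops` edges from `seed`,
    always following the first listed fact of the current entity."""
    path, node = [], seed
    for _ in range(hops):
        edges = facts.get(node, [])
        if not edges:
            break
        relation, neighbour = edges[0]
        path.append((node, relation, neighbour))
        node = neighbour
    return path


def synthesize(path):
    """Toy entity obfuscation: name only the seed entity and hide every
    intermediate entity behind a description. Returns (question, answer)."""
    if not path:
        raise ValueError("path must contain at least one fact")
    description = path[0][0]
    for _, relation, _ in path[:-1]:
        description = f"whoever {description} {relation}"
    question = f"What is the thing that {description} {path[-1][1]}?"
    return question, path[-1][2]


question, answer = synthesize(expand(FACTS, "Ada Lovelace", hops=2))
print(question)
print(answer)
```

Raising `hops` lengthens the reasoning chain, which is one simple way to picture the "controllable complexity" the abstract mentions.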

28. ❌ From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15600v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

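Scoring rationale: The paper focuses on process-reasoning monitoring by video multimodal large language models (MLLMs) in robotic manipulation tasks; its core contribution is using reinforcement learning to incentivize explicit chain-of-thought generation. Highly relevant keywords: (1) "Large Language Models" (the paper studies video MLLMs, which fall under large models); (2) "Small Language Models" (the paper proposes a 7B-parameter model, which counts as a small language model); (3) "Post-training/SFT" (the paper states that current video MLLMs are trained under the SFT paradigm and improves on that baseline); (4) "Chain of Thought" (the core method uses reinforcement learning to incentivize explicit chain-of-thought generation for progress estimation). Other keywords such as MoE, Scaling Laws, RLHF, and RAG have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of accurate process supervision in long-horizon robotic manipulation by using reinforcement learning to incentivize video MLLMs to generate explicit chain-of-thought for progress estimation. The proposed 7B model, PRIMO R1, achieves state-of-the-art performance on multiple benchmarks and significantly improves accuracy relative to 72B general-purpose models.

Abstract (Chinese translation)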

精确的过程监督仍是长周期机器人操作领域的关键挑战。当前的主要瓶颈在于，现有视频多模态大语言模型主要基于监督微调范式训练，其功能如同被动的“观察者”——仅能识别正在进行的事件，而无法根据最终任务目标评估当前状态。本文提出PRIMO R1（过程推理诱导监控框架），这是一个70亿参数规模的框架，旨在将视频多模态大语言模型转化为主动的“批判者”。我们利用基于结果的强化学习机制，激励模型生成显式的思维链以进行进度评估。此外，该架构通过将视频序列显式锚定在初始状态图像与当前状态图像之间，构建了结构化的时序输入。基于提出的PRIMO数据集与基准测试，我们在多样化的领域内环境及领域外真实世界仿人机器人场景中进行了广泛实验，结果表明PRIMO R1实现了最先进的性能。量化数据显示，我们的70亿参数模型将专业推理基线的平均绝对误差降低了50%，相较于720亿规模的通用多模态大语言模型实现了显著的相对精度提升。同时，PRIMO R1在困难故障检测任务中展现出强大的零样本泛化能力。我们在RoboFail基准测试中以67.0%的准确率创造了最新性能记录，较OpenAI o1等闭源模型高出6.0%。

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive “Observers” that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active “Critics”. We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

Keywords: Robotic Manipulation, Process Reasoning, Reinforcement Learning, Chain-of-Thought, Video MLLMs, Supervised Fine-Tuning, 7B Model, State-of-the-art Performance
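
The PRIMO R1 abstract reports results as mean absolute error (MAE) on progress estimation and mentions an outcome-based reward, but it does not give the reward's form. The sketch below shows the MAE computation and one possible outcome reward; the linear `outcome_reward` shape and the 0 to 100 progress scale are assumptions for illustration only.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and true progress values."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def outcome_reward(predicted_progress, true_progress):
    """Hypothetical outcome reward in [0, 1], assuming progress is expressed
    in percent: 1.0 for an exact estimate, decreasing linearly with the
    absolute error. Only the final estimate is scored, not the reasoning."""
    error = abs(predicted_progress - true_progress)
    return max(0.0, 1.0 - error / 100.0)


# Two episodes with true progress 60% and 70%.
print(mean_absolute_error([50, 80], [60, 70]))
print(outcome_reward(60, 70))
```

Under this reading, a "50% reduction in MAE" means the average gap returned by `mean_absolute_error` is halved relative to the baseline's.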

29. ❌ Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents

Authors: Ivan Stetsenko Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

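Scoring rationale: The paper studies a knowledge-management protocol for AI coding agents. It is highly relevant to "LLM Agents" (10 points) because it explicitly discusses AI coding agents as primary producers and consumers of code; somewhat relevant to "Tool Use" (5 points) because the protocol involves agents using tools such as shell commands; and indirectly relevant to "Large Language Models" (5 points) because AI coding agents are typically built on LLMs. Other keywords such as MoE, SFT, and RAG have no direct connection to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the loss of institutional knowledge in the software industry caused by AI coding agents, the paper proposes the Lore protocol, which restructures git commit messages to capture the constraints, alternatives, and context behind each decision, thereby preserving institutional knowledge.

Abstract (Chinese translation)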

随着AI编程代理逐渐成为源代码的主要生产者和消费者，软件行业正面临机构知识加速流失的困境。每次提交虽能记录代码差异，却丢弃了背后的决策逻辑——包括影响决策的约束条件、被否定的替代方案以及前瞻性背景信息。我将这种被丢弃的推理过程称为“决策暗影”。本文提出Lore协议，该轻量级协议通过原生git尾部标识重构提交信息，将其转化为包含约束条件、被否定的替代方案、代理指令及验证元数据的自包含决策记录。Lore无需除git外的任何基础设施支持，可通过独立命令行界面工具进行查询，且任何能运行shell命令的代理均可发现这些记录。本文对协议进行了形式化定义，与五种现有方案进行对比分析，针对最强烈的质疑进行了压力测试，并规划了实证验证路径。

Abstract

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.

Keywords: AI coding agents, git commit messages, structured knowledge protocol, decision shadow, institutional knowledge, software industry, agent directives, verification metadata
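
The Lore abstract says decision records are carried in native git trailers, the `Key: value` lines that close a commit message. The sketch below shows how such a record could be written and read back. It is a simplified stand-in for git's own parser (`git interpret-trailers --parse`), and the trailer keys `Constraint` and `Rejected-Alternative` are hypothetical, since the abstract names the categories but not the exact tokens.

```python
import re

# A trailer line looks like "Key: value"; keys use letters, digits and hyphens.
TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def build_commit_message(subject, body, trailers):
    """Assemble a commit message whose last paragraph is a trailer block."""
    lines = [subject, "", body, ""]
    lines += [f"{key}: {value}" for key, value in trailers]
    return "\n".join(lines)


def parse_trailers(message):
    """Return (key, value) pairs from the final paragraph, or [] when that
    paragraph is not made up entirely of trailer-shaped lines."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return []
    matches = [TRAILER_RE.match(line) for line in paragraphs[-1].splitlines()]
    if not all(matches):
        return []
    return [(m.group(1), m.group(2)) for m in matches]


message = build_commit_message(
    "Cache session tokens",
    "Avoids a round trip per request.",
    [
        ("Constraint", "tokens expire after 15 minutes"),          # hypothetical key
        ("Rejected-Alternative", "Redis: adds an external dependency"),  # hypothetical key
    ],
)
for key, value in parse_trailers(message):
    print(f"{key} -> {value}")
```

Because the record lives in the commit message itself, any agent that can run `git log` can recover it, which matches the abstract's claim that nothing beyond git is required.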

30. ❌ Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask

Authors: Vasiliy A. Es’kin, Egor V. Ivanov Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15584v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

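Scoring rationale: The paper applies physics-informed neural networks (PINNs) and neural operators (NOs) to simulating diffraction from EUV lithography masks, which belongs to AI for Science, so it has some relevance to the "AI for Science OR Bioinformatics OR Cheminformatics" keyword (5 points). However, it does not involve large language models (LLMs), the deep learning techniques covered by the other keywords (such as MoE, Scaling Laws, training methods, inference optimization, or agents), or biomedical AI applications, so all other keywords are irrelevant (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid method based on a Waveguide Neural Operator (WGNO) for simulating electromagnetic wave diffraction from EUV lithography masks, achieving accuracy comparable to numerical solvers with significantly shorter prediction times.

Abstract (Chinese translation)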

本文提出了基于物理信息的神经网络（PINNs）与神经算子（NOs）用于解决当代光刻掩模的极紫外（EUV）电磁波衍射问题。我们引入了一种新型混合波导神经算子（WGNO），该方法以波导方法为基础，并将其计算成本最高的组件替换为神经网络。为评估性能，我们在一系列已知精确解的问题上，将PINNs和神经算子的精度与推理时间同现代数值求解器进行了比较。重点研究了所考察的人工神经网络系统在13.5纳米和11.2纳米波长下的求解精度。在真实二维与三维掩模上的数值实验表明，PINNs与神经算子能够达到具有竞争力的精度，并显著减少预测时间，其中所提出的WGNO架构达到了最先进的性能。所展示的神经算子具有显著的泛化特性，这意味着对于训练数据集中未出现的问题参数，其求解精度仍能接近已见参数下的精度。这些结果为加速下一代光刻掩模的设计与优化流程提供了高效的解决方案。

Abstract

Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from contemporary lithography masks are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, based on a waveguide method with its most computationally expensive components replaced by a neural network. To evaluate performance, the accuracy and inference time of PINNs and NOs are compared against modern numerical solvers for a series of problems with known exact solutions. The emphasis is placed on investigation of solution accuracy by considered artificial neural systems for 13.5 nm and 11.2 nm wavelengths. Numerical experiments on realistic 2D and 3D masks demonstrate that PINNs and neural operators achieve competitive accuracy and significantly reduced prediction times, with the proposed WGNO architecture reaching state-of-the-art performance. The presented neural operator has pronounced generalizing properties, meaning that for unseen problem parameters it delivers a solution accuracy close to that for parameters seen in the training dataset. These results provide a highly efficient solution for accelerating the design and optimization workflows of next-generation lithography masks.

Keywords: Physics-informed neural networks, Neural operators, EUV lithography, Waveguide Neural Operator, Electromagnetic wave diffraction, Computational efficiency, Mask design optimization, Hybrid method

31. ❌ The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15563v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

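Scoring rationale: The paper studies the decision-making abilities of LLM-based agents in Pokemon multi-agent battles and RPG environments. It is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points), since these are the paper's core techniques and methods. It involves strategic reasoning and long-horizon planning, so it has some relevance to "Chain of Thought" and "System 2 Thinking" (5 points), though these are not the core focus. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces the PokeAgent Challenge, a large-scale decision-making benchmark built on Pokemon environments. Through a Battling Track and a Speedrunning Track, it evaluates LLM agents on partially observable, competitive, and long-horizon planning tasks, and finds considerable gaps between LLMs, specialist RL systems, and elite human performance.

Abstract (Chinese translation)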

我们推出PokeAgent挑战赛，这是一个基于宝可梦多智能体对战系统及广阔角色扮演游戏（RPG）环境构建的大规模决策研究基准。部分可观测性、博弈论推理与长程规划仍是前沿人工智能的开放难题，但现有基准鲜少能在真实条件下同时检验这三方面能力。PokeAgent通过两个互补赛道大规模应对这些局限：对战赛道要求在竞争性宝可梦对战中进行部分可观测条件下的策略推理与泛化，速通赛道则要求在宝可梦RPG中执行长程规划与序列决策。对战赛道提供了超过2000万条对战轨迹数据集，以及一套能够实现高水平竞技对战的启发式、强化学习（RL）和基于大语言模型（LLM）的基线系统。速通赛道首次为RPG速通建立了标准化评估框架，包含开源的多智能体编排系统，可对基于控制框架的LLM方法进行模块化、可复现的比较。我们在NeurIPS 2025举办的竞赛验证了资源质量及研究社区对宝可梦课题的关注度，双赛道共吸引超100支队伍参赛，获奖方案细节已在论文中阐述。参赛提交结果与基线系统的对比显示，通用型（LLM）、专用型（RL）方法与人类顶尖水平之间存在显著差距。通过BenchPress评估矩阵的分析表明，宝可梦对战能力与标准LLM基准测试近乎正交，它衡量了现有测试体系未涵盖的能力维度，从而将宝可梦定位为一个能够推动RL与LLM研究发展的待解基准。我们已将其转化为持续更新的动态基准，在对战赛道提供实时排行榜，在速通赛道提供自包含评估系统，详见https://pokeagentchallenge.com。

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon’s multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community’s interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

Keywords: PokeAgent Challenge, decision-making benchmark, multi-agent battle system, long-horizon planning, LLM-based baselines, competitive Pokemon battles, RPG speedrunning, autonomous agents

Authors: Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15542v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

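Scoring rationale: The paper evaluates LLMs' abilities in causal reasoning and intervention research design in social science, and proposes a multi-agent framework, STRIDES. Highly relevant keywords: LLMs (the core object of study), Chain of Thought / System 2 Thinking (complex reasoning is involved), LLM Agents / Multi-agent Systems (the proposed solution framework), and AI for Science (a social science application). Other keywords such as MoE, SFT, RAG, and Quantization are unrelated to the paper's technical content.

!!! tip deepseek-chat TL;DR

The paper proposes the InterveneBench benchmark to evaluate LLMs' ability to perform intervention reasoning and causal research design in real social systems, and develops the multi-agent framework STRIDES, which significantly improves model performance on this task.

Abstract (Chinese translation)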

社会科学中的因果推断依赖于以现实政策干预为基础、端到端且以干预为中心的研究设计推理，但现有基准测试未能有效评估大语言模型在此方面的能力。我们提出了InterveneBench，这是一个为评估真实社会情境中此类推理能力而设计的基准测试。InterveneBench中的每个实例均源自实证社会科学研究，要求模型在无法获取预定义因果图或结构方程的条件下，对政策干预与识别假设进行推理。该基准涵盖744项经过同行评审、涉及多元政策领域的研究。实验结果表明，当前最先进的大语言模型在此设定下表现欠佳。为应对这一局限，我们进一步提出了多智能体框架STRIDES。该框架相较于现有最先进的推理模型实现了显著的性能提升。我们的代码与数据公开于https://github.com/Sii-yuning/STRIDES。

Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

Keywords: LLMs, causal inference, intervention reasoning, social science, benchmark, multi-agent framework, policy interventions, reasoning models

33. ❌ Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15547v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

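Scoring rationale: The paper studies how LLMs model incorrect student reasoning when generating multiple-choice distractors, an application of LLMs in education. It is highly relevant to "Large Language Models" (10 points), since the whole paper analyzes LLM capabilities; somewhat relevant to "Chain of Thought" and "System 2 Thinking" (5 points), since it analyzes the LLMs' multi-step reasoning process (solve first, simulate misconceptions, select distractors); relevant to "Explainable AI" (5 points), since it provides a structured framework for understanding LLM behavior; and relevant to "AI for Science" (5 points), since it applies AI to education science. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

By analyzing how large language models generate multiple-choice distractors, the study finds that the models' multi-step reasoning strategies align with best practices in the learning sciences, and that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%.

Abstract (Chinese translation)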

对学生可能存在的错误概念进行建模是教育人工智能领域的关键课题。本研究探讨了大型语言模型在生成选择题干扰项时如何对错误概念进行推理——这项任务要求模型通过协调解题知识、模拟学生错误概念并评估合理性，来构建错误但看似可信的答案。我们提出了一套分类体系，用于分析前沿大型语言模型所采用的策略，检验其推理过程，并将其与学习科学领域既定的最佳实践进行比较。我们的结构化分析揭示了这些模型的推理过程与最佳实践之间存在惊人的一致性：模型通常先正确解决问题，接着阐明并模拟多种潜在的错误概念，最后筛选出一组干扰项。对失败模式的分析表明，错误主要源于正确解题过程的缺失以及在候选答案中选择的失误，而非源于错误模拟或流程结构问题。与这些发现一致的是，我们发现在提示中提供正确答案可使模型生成的干扰项与人工编写干扰项的匹配度提升8%，这凸显了在生成看似合理的学生错误推理时，锚定正确解决方案的关键作用。总体而言，我们的分析为理解大型语言模型模拟学生错误推理及生成高质量干扰项的能力，提供了一个结构化且可解释的研究视角。

Abstract

Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs’ ability to model incorrect student reasoning and produce high-quality distractors.

Keywords: large language models, student misconceptions, distractor generation, reasoning analysis, AI in education, multiple-choice questions, model evaluation, educational technology
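
The abstract describes a three-step procedure that the models tend to follow: solve the problem correctly, simulate misconceptions, then select distractors. The sketch below mirrors that structure on a toy arithmetic item (`a + b * c`). The misconception rules and function names are invented for illustration and are not taken from the paper's taxonomy.

```python
def solve(a, b, c):
    """Step 1: the correct answer to a + b * c (multiplication first)."""
    return a + b * c


# Step 2: hypothetical misconception models, each a plausible student error.
MISCONCEPTIONS = {
    "left_to_right": lambda a, b, c: (a + b) * c,
    "adds_instead_of_multiplying": lambda a, b, c: a + b + c,
    "multiplies_everything": lambda a, b, c: a * b * c,
    "ignores_first_term": lambda a, b, c: b * c,
}


def generate_distractors(a, b, c, k=3):
    """Step 3: keep up to k wrong answers that differ from the correct
    answer and from each other, remembering which misconception made each."""
    correct = solve(a, b, c)
    seen = {correct}
    chosen = []
    for name, rule in MISCONCEPTIONS.items():
        value = rule(a, b, c)
        if value not in seen:
            seen.add(value)
            chosen.append((name, value))
        if len(chosen) == k:
            break
    return correct, chosen


correct, distractors = generate_distractors(2, 3, 4)
print("correct:", correct)
for name, value in distractors:
    print(f"distractor {value} from {name}")
```

The selection step depends on `correct` being right: if `solve` were wrong, a true answer could slip in as a distractor. This is consistent with the abstract's finding that failures stem mainly from recovering the correct solution and from candidate selection.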

34. ❌ DOT: Dynamic Knob Selection and Online Sampling for Automated Database Tuning

Authors: Yifan Wang, Debabrota Basu, Pierre Bourhis, Romain Rouvoy, Patrick Royer Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

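Scoring rationale: DOT focuses on automated tuning for database management systems (DBMS), using feature selection (RFECV), a statistical test (LRT), and Bayesian optimization (BO) to dynamically select important parameters and optimize configurations online. All scoring keywords concern large models, deep learning, AI-for-science applications, and related techniques, whereas this paper addresses a traditional database optimization problem and involves none of them, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes DOT, a dynamic knob selection and online sampling algorithm for automated database tuning. By eliminating the warm-up phase and reducing the search space, it matches or outperforms state-of-the-art tuners while substantially reducing tuning overhead.

Abstract (Chinese translation)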

数据库管理系统（DBMS）对于高效的数据管理和访问控制至关重要，但其管理对数据库管理员（DBA）而言仍具挑战性。其中，调优尤其困难。现代系统拥有众多调优参数，但仅有一部分对性能有显著影响。聚焦于这些关键参数可缩小搜索空间并优化性能。现有方法依赖于代价高昂的预热阶段和人工经验来识别重要调优参数。本文提出DOT，一种动态参数选择与在线采样的DBMS调优算法。DOT采用交叉验证递归特征消除法（RFECV）来剪枝低重要性调优参数，并利用似然比检验（LRT）策略来平衡探索与利用。在参数搜索方面，DOT采用贝叶斯优化（BO）算法实时优化配置，无需预热阶段或先验知识（但可兼容现有知识）。实验表明，与先进调优工具相比，DOT在显著降低调优开销的同时，实现了相当或更优的性能表现。

Abstract

Database Management Systems (DBMS) are crucial for efficient data management and access control, but their administration remains challenging for Database Administrators (DBAs). Tuning, in particular, is known to be difficult. Modern systems have many tuning parameters, but only a subset significantly impacts performance. Focusing on these influential parameters reduces the search space and optimizes performance. Current methods rely on costly warm-up phases and human expertise to identify important tuning parameters. In this paper, we present DOT, a dynamic knob selection and online sampling DBMS tuning algorithm. DOT uses Recursive Feature Elimination with Cross-Validation (RFECV) to prune low-importance tuning parameters and a Likelihood Ratio Test (LRT) strategy to balance exploration and exploitation. For parameter search, DOT uses a Bayesian Optimization (BO) algorithm to optimize configurations on-the-fly, eliminating the need for warm-up phases or prior knowledge (although existing knowledge can be incorporated). Experiments show that DOT achieves matching or outperforming performance compared to state-of-the-art tuners while substantially reducing tuning overhead.

Keywords: Database Tuning, Dynamic Knob Selection, Online Sampling, Bayesian Optimization, Recursive Feature Elimination, Likelihood Ratio Test, Parameter Optimization, DBMS Performance
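
The DOT abstract names a Likelihood Ratio Test (LRT) but does not say how it is formulated. The sketch below is a generic textbook LRT, not DOT's decision rule: it asks whether throughput measured under two knob settings is better explained by one shared mean or by two separate means, assuming Gaussian noise with a common variance. The sample numbers are made up.

```python
import math

# 95th percentile of the chi-square distribution with 1 degree of freedom.
CHI2_CRIT_DF1_05 = 3.841


def residual_sum_of_squares(values, mean):
    return sum((v - mean) ** 2 for v in values)


def lrt_statistic(group_a, group_b):
    """2 * (loglik_alternative - loglik_null) for Gaussian data.
    Null: both groups share one mean. Alternative: each has its own mean.
    With maximum-likelihood variance this reduces to n * ln(RSS0 / RSS1)."""
    pooled = group_a + group_b
    n = len(pooled)
    rss_null = residual_sum_of_squares(pooled, sum(pooled) / n)
    rss_alt = (
        residual_sum_of_squares(group_a, sum(group_a) / len(group_a))
        + residual_sum_of_squares(group_b, sum(group_b) / len(group_b))
    )
    return n * math.log(rss_null / rss_alt)


def setting_matters(group_a, group_b):
    """True when the two-mean model fits significantly better at the 5% level
    (asymptotic chi-square approximation, rough for small samples)."""
    return lrt_statistic(group_a, group_b) > CHI2_CRIT_DF1_05


# Made-up throughput samples (transactions per second) for two knob values.
baseline = [100, 102, 98, 101]
candidate = [120, 119, 121, 122]
print(setting_matters(baseline, candidate))
```

A tuner could use a test of this kind to decide whether evidence for a change is strong enough to act on or whether more samples are needed; how DOT ties the test to exploration and exploitation is described only in the paper itself.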

35. ❌ Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Authors: Zhenheng Tang, Xiang Liu, Qian Wang, Eunsol Choi, Bo Li, Xiaowen Chu Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15527v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

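Scoring rationale: The paper's core topic is conflicts and dilemmas in LLM alignment. It models LLM preferences as a priority graph, examines the stability challenges of alignment and a potential vulnerability (priority hacking), and proposes a runtime verification mechanism to improve robustness. It is therefore highly relevant to 'Large Language Models' and 'Instruction Tuning OR Alignment OR Value Alignment' (core content). The other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, reasoning methods, compression techniques, and AI-for-science applications, are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies conflicts and dilemmas in LLM alignment. Modeling preferences as a priority graph shows that unified, stable alignment is challenging and exposes a priority-hacking vulnerability. The authors propose a runtime verification mechanism to improve robustness and note that many ethical and value dilemmas are philosophically irreducible, leaving a long-term open challenge for AI alignment.

Abstract (Chinese translation)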

随着大型语言模型（LLM）能力日益增强且自主性不断提升，它们在众多场景中越来越多地面临冲突与困境。我们首先对这些多样化的冲突进行了总结与分类。随后，我们将LLM在不同选择中表现出的偏好建模为一个优先级图，其中指令与价值作为节点，边则代表由模型输出分布决定的、特定于上下文的优先级关系。该图表明，实现统一且稳定的大型语言模型对齐极具挑战性，因为该图既非静态，在不同情境中也未必保持一致。此外，它还揭示了一种潜在的脆弱性：优先级攻击，即攻击者可通过构造欺骗性上下文来操纵优先级图，从而绕过安全对齐机制。为应对此问题，我们提出了一种运行时验证机制，使大型语言模型能够查询外部资源以锚定其上下文，从而抵御操纵。尽管这一方法增强了模型的鲁棒性，我们也承认许多伦理与价值困境在哲学层面无法被完全消解，这为人工智能对齐的未来提出了一个长期且开放的挑战。

Abstract

As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first summarize and taxonomize these diverse conflicts. Then, we model the LLM’s preferences among different choices as a priority graph, where instructions and values are nodes and the edges represent context-specific priorities determined by the model’s output distribution. This graph reveals that unified, stable LLM alignment is very challenging, because the graph is neither static nor necessarily consistent across contexts. It also reveals a potential vulnerability: priority hacking, where adversaries can craft deceptive contexts to manipulate the graph and bypass safety alignments. To counter this, we propose a runtime verification mechanism, enabling LLMs to query external sources to ground their context and resist manipulation. While this approach enhances robustness, we also acknowledge that many ethical and value dilemmas are philosophically irreducible, posing a long-term, open challenge for the future of AI alignment.

Keywords: LLM alignment, conflicts, dilemmas, priority graph, priority hacking, runtime verification, ethical challenges, AI safety
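
The priority graph described in the abstract above can be illustrated with a small sketch. This is not the paper's construction (which derives edges from the model's output distribution); it is a minimal Python illustration, with hypothetical node names and hand-written edges, of how priorities that are consistent within each context can become inconsistent once contexts are combined.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (preferred, overridden) pairs has a cycle."""
    graph = defaultdict(list)
    for preferred, overridden in edges:
        graph[preferred].append(overridden)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: the priorities contradict each other
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical per-context priorities (assumed for illustration only).
context_edges = {
    "plain_request": [("safety", "helpfulness"), ("helpfulness", "brevity")],
    "deceptive_framing": [("helpfulness", "safety")],  # a context crafted to flip one edge
}

per_context_consistent = {name: not has_cycle(e) for name, e in context_edges.items()}
merged = [edge for e in context_edges.values() for edge in e]
globally_consistent = not has_cycle(merged)
```

Each context alone yields an acyclic ordering, while the merged graph contains a `safety` / `helpfulness` cycle, which is one simple way to read the claim that the graph is not necessarily consistent across contexts.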

36. ❌ Building Trust in PINNs: Error Estimation through Finite Difference Methods

Authors: Aleksander Krasowski, René P. Klausen, Aycan Celik, Sebastian Lapuschkin, Wojciech Samek, Jonas Naujoks Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15526v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

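Scoring rationale: The paper studies error estimation for physics-informed neural networks (PINNs), an application of deep learning to scientific computing, so it is highly relevant to 'AI for Science' (8 points). The proposed error estimates aim to improve the interpretability and trustworthiness of PINNs, which relates to some extent to 'Mechanistic Interpretability OR Explainable AI' (5 points). The paper does not cover large language models (LLMs), model architectures, training methods, inference optimization, or agent systems, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

To address the lack of error estimates for PINN predictions, the paper proposes a lightweight post-hoc technique based on finite difference methods that efficiently produces pointwise error maps for PINN predictions, improving interpretability and trust.

Abstract (Chinese translation)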

物理信息神经网络（PINNs）是一种用于求解偏微分方程（PDEs）的灵活深度学习方法，其建模范围涵盖从热传导到量子力学系统的各类现象。尽管具有灵活性，PINNs 对其预测偏离真实解的程度提供的解释有限，这阻碍了对其预测质量的信任。我们提出一种轻量级的事后处理方法，通过生成 PINN 预测的逐点误差估计来弥补这一不足。该方法为此类模型提供了一种自然的解释形式，不仅能判断预测是否正确，还能指出误差出现的位置及大小。对于线性偏微分方程，PINN 近似解与真实解之间的误差满足与原问题相同的微分算子，但以 PINN 的 PDE 残差作为其源项驱动。我们使用有限差分法对该误差方程进行数值求解，无需已知真实解。在多个基准 PDE 上的评估表明，我们的方法能以较低计算成本生成精确的误差分布图，从而实现对 PINNs 的定向且可解释的验证。

Abstract

Physics-informed neural networks (PINNs) constitute a flexible deep learning approach for solving partial differential equations (PDEs), which model phenomena ranging from heat conduction to quantum mechanical systems. Despite their flexibility, PINNs offer limited insight into how their predictions deviate from the true solution, hindering trust in their prediction quality. We propose a lightweight post-hoc method that addresses this gap by producing pointwise error estimates for PINN predictions, which offer a natural form of explanation for such models, identifying not just whether a prediction is wrong, but where and by how much. For linear partial differential equations, the error between a PINN approximation and the true solution satisfies the same differential operator as the original problem, but driven by the PINN’s PDE residual as its source term. We solve this error equation numerically using finite difference methods requiring no knowledge of the true solution. Evaluated on several benchmark PDEs, our method yields accurate error maps at low computational cost, enabling targeted and interpretable validation of PINNs.

Keywords: Physics-informed neural networks, PINNs, error estimation, finite difference methods, partial differential equations, PDEs, explainable AI, scientific computing
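
The error-equation idea in the abstract above can be sketched on a toy problem. For a linear operator, if the true solution satisfies u'' = f and the approximation is û, then the error e = û − u satisfies e'' = û'' − f, which is the PDE residual. The Python sketch below is an illustration under stated assumptions, not the authors' code: the "PINN" is replaced by a hypothetical closed-form stand-in `u_hat`, the problem is 1D Poisson on [0, 1] with zero boundary values, and the error equation is solved with central differences and the Thomas algorithm.

```python
import math


def solve_error_equation(residual, n):
    """Solve e'' = residual on [0, 1] with e(0) = e(1) = 0 using central differences."""
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    m = n - 1  # number of interior unknowns
    rhs = [h * h * residual(x) for x in xs[1:-1]]

    # Thomas algorithm for the tridiagonal system with stencil (1, -2, 1).
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = rhs[0] / -2.0
    for i in range(1, m):
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (rhs[i] - d_prime[i - 1]) / denom

    e = [0.0] * (n + 1)  # boundary entries stay at zero
    e[m] = d_prime[m - 1]
    for i in range(m - 2, -1, -1):
        e[i + 1] = d_prime[i] - c_prime[i] * e[i + 2]
    return xs, e


def u_true(x):
    return math.sin(math.pi * x)


def f(x):  # source term of u'' = f
    return -math.pi ** 2 * math.sin(math.pi * x)


def u_hat(x):  # hypothetical stand-in for a trained PINN
    return math.sin(math.pi * x) + 0.05 * x * (1.0 - x)


def u_hat_xx(x):  # second derivative of the stand-in (autodiff would supply this in practice)
    return -math.pi ** 2 * math.sin(math.pi * x) - 0.1


def residual(x):
    return u_hat_xx(x) - f(x)


xs, e_est = solve_error_equation(residual, 50)
# Compare against the true error only to check the sketch; the solver itself never uses u_true.
max_gap = max(abs(e - (u_hat(x) - u_true(x))) for x, e in zip(xs, e_est))
```

In this toy case the perturbation is quadratic, so the central-difference solution recovers the error almost exactly; for general residuals the estimate carries the usual discretization error of the finite difference scheme.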

37. ❌ SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Authors: David Števaňák, Marek Šuppa Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

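Scoring rationale: The paper's main contributions are a large-scale Slovak keyphrase extraction dataset and an evaluation of several methods, including an LLM-based one. It is highly relevant to 'Large Language Models' (8 points) because it evaluates keyphrase extraction with GPT-3.5-turbo. It is somewhat related to 'AI for Science' (5 points) because it processes scientific literature, though it is not a core bioinformatics or cheminformatics application. Other keywords such as MoE, SFT, and RAG are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

The study builds a large-scale Slovak dataset of scientific abstracts for keyphrase extraction and finds that the LLM-based method (KeyLLM) matches the canonical keyphrase forms assigned by authors better than traditional unsupervised methods, narrowing the gap between exact and partial matching.

Abstract (Chinese translation)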

针对形态丰富、资源稀缺语言的关键短语提取研究仍显不足，这主要源于缺乏合适的评估数据集。我们通过构建斯洛伐克语数据集填补了这一空白：该数据集包含227,432篇科学摘要及其作者标注的关键短语，数据经系统化清洗后采集自斯洛伐克学位论文中央注册系统，其规模达到先前最大斯洛伐克语资源的25倍，并接近KP20K等成熟英语基准数据集的体量。基于此数据集，我们评估了三种无监督基线方法（YAKE、TextRank、结合SlovakBERT嵌入的KeyBERT），并测试了基于大型语言模型（LLM）的提取方法KeyLLM（使用GPT-3.5-turbo）。无监督基线在精确匹配指标$F1@6$上最高仅达11.6%，与部分匹配指标（最高51.5%）存在巨大差距，这反映了将屈折变化的表层形式与作者标注的关键短语进行匹配的难度。KeyLLM显著缩小了精确匹配与部分匹配间的差距，生成的关键短语更接近作者标注的规范形式；同时基于100篇文档的人工评估（κ=0.61）证实，KeyLLM能捕捉到自动化精确匹配所低估的相关概念。我们的分析指出形态不匹配是统计方法的主要失效模式——这一发现对其他屈折语言同样具有参考价值。数据集（https://huggingface.co/datasets/NaiveNeuron/SlovKE）与评估代码（https://github.com/NaiveNeuron/SlovKE）已公开发布。

Abstract

Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases – scraped and systematically cleaned from the Slovak Central Register of Theses – representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match $F1@6$, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact–partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($κ= 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods – a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.

Keywords: keyphrase extraction, Slovak language, low-resource languages, LLM evaluation, morphological mismatch, scientific abstracts, dataset construction, GPT-3.5-turbo
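
The gap between exact and partial matching reported in the abstract above can be made concrete with a small sketch. The exact-match F1@k below follows the usual definition; the partial-match variant is an assumption for illustration (a phrase counts as matched if it shares at least one token with a phrase on the other side) and is not necessarily the definition used in the paper. The example phrases are made up.

```python
def _f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_f1_at_k(predicted, gold, k=6):
    """F1 over the top-k predictions, counting only identical (case-folded) phrases."""
    top_k = [p.lower().strip() for p in predicted[:k]]
    gold_set = {g.lower().strip() for g in gold}
    if not top_k or not gold_set:
        return 0.0
    hits = len(set(top_k) & gold_set)
    return _f1(hits / len(top_k), hits / len(gold_set))


def partial_f1_at_k(predicted, gold, k=6):
    """Assumed partial match: phrases match if they share at least one token."""
    top_k = [set(p.lower().split()) for p in predicted[:k]]
    gold_tokens = [set(g.lower().split()) for g in gold]
    if not top_k or not gold_tokens:
        return 0.0
    matched_pred = sum(any(p & g for g in gold_tokens) for p in top_k)
    matched_gold = sum(any(p & g for p in top_k) for g in gold_tokens)
    return _f1(matched_pred / len(top_k), matched_gold / len(gold_tokens))


# Made-up example: one prediction is only a sub-phrase of a gold keyphrase.
gold = ["umelá neurónová sieť", "strojové učenie"]
predicted = ["neurónová sieť", "strojové učenie", "dáta"]

exact_score = exact_f1_at_k(predicted, gold)
partial_score = partial_f1_at_k(predicted, gold)
```

Here the exact score is lower than the partial score because "neurónová sieť" does not equal "umelá neurónová sieť". Note that plain token overlap would not credit inflected variants of the same word, so capturing the morphological mismatch the paper emphasizes would additionally need lemmatization or stemming.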

38. ❌ Seeking SOTA: Time-Series Forecasting Must Adopt Taxonomy-Specific Evaluation to Dispel Illusory Gains

Authors: Raeid Saqur, Christoph Bergmeir, Blanka Horvath, Daniel Schmidt, Frank Rudzicz, Terry Lyons Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15506v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	2.0/10	0.0

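Scoring rationale: The paper discusses evaluation methodology in time-series forecasting, points out the limitations of current benchmarks, and calls for more comprehensive evaluation standards. It is unrelated to the vast majority of keywords (LLM, MoE, RLHF, RAG, and so on), which concern the technical principles, training methods, and inference optimization of large models. The only marginally relevant keyword is "AI for Science", because the paper discusses problems in applying AI/ML to time-series forecasting; this is not a core contribution, only a general discussion of application challenges, so it receives 2 points (weak relevance).

!!! tip deepseek-chat TL;DR

The paper argues that current time-series forecasting benchmarks are flawed: they rely heavily on strongly periodic datasets, so reported gains from deep learning models may be overstated. It calls for more comprehensive benchmarks and comparison against classical baselines.

Abstract (Chinese translation)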

我们认为，当前评估人工智能/机器学习时间序列预测模型的实践主要依赖于具有强烈、持续周期性和季节性的基准数据集，这种做法因忽视了高效经典方法的性能而掩盖了真正的进展。我们证明，这些“标准”数据集通常表现出主导的自相关模式和季节周期，这些模式可以被更简单的线性或统计模型有效捕捉，导致对于这些特定数据特征，复杂的深度学习架构往往并不比经典方法表现更优，并引发疑问：任何边际性能提升是否足以证明计算开销和模型复杂度显著增加是合理的。我们呼吁学界（I）淘汰当前基准，或通过引入展现更广泛非平稳性（如结构突变、时变波动性和概念漂移）以及来自不同现实领域、动态更难预测的数据集来大幅扩充现有基准；（II）要求每项深度学习研究提交时都必须包含稳健的经典及简单基线模型，这些基线应根据下游任务时间序列的具体特征进行恰当选择。通过这样做，我们将有助于确保所报告的性能提升反映的是真正的科学方法进步，而非仅仅源于基准选择偏向于擅长学习重复模式的模型所产生的人为假象。

Abstract

We argue that the current practice of evaluating AI/ML time-series forecasting models, predominantly on benchmarks characterized by strong, persistent periodicities and seasonalities, obscures real progress by overlooking the performance of efficient classical methods. We demonstrate that these “standard” datasets often exhibit dominant autocorrelation patterns and seasonal cycles that can be effectively captured by simpler linear or statistical models, rendering complex deep learning architectures frequently no more performant than their classical counterparts for these specific data characteristics, and raising questions as to whether any marginal improvements justify the significant increase in computational overhead and model complexity. We call on the community to (I) retire or substantially augment current benchmarks with datasets exhibiting a wider spectrum of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, and less predictable dynamics drawn from diverse real-world domains, and (II) require every deep learning submission to include robust classical and simple baselines, appropriately chosen for the specific characteristics of the downstream tasks’ time series. By doing so, we will help ensure that reported gains reflect genuine scientific methodological advances rather than artifacts of benchmark selection favoring models adept at learning repetitive patterns.

Keywords: time-series forecasting, benchmark evaluation, deep learning, classical methods, non-stationarities, model complexity, computational overhead, autocorrelation patterns
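
The abstract's point that strongly periodic benchmarks can be captured by very simple models can be illustrated with a seasonal-naive baseline. The Python sketch below is a generic illustration, not taken from the paper: the synthetic series, the period of 12, and the train/test split are all assumptions chosen for the example.

```python
import math


def seasonal_naive(history, period, horizon):
    """Forecast by repeating the last observed season."""
    last_season = history[-period:]
    return [last_season[i % period] for i in range(horizon)]


def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Synthetic, perfectly periodic series (assumed for illustration).
period = 12
series = [10 + 5 * math.sin(2 * math.pi * t / period) for t in range(60)]
train, test = series[:48], series[48:]

forecast = seasonal_naive(train, period, len(test))
baseline_mae = mae(test, forecast)
```

On a series like this the baseline is nearly perfect, so a complex model has almost no room to show a meaningful gain; that is the kind of benchmark artifact the authors ask the community to guard against by always reporting simple baselines.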

39. ❌ Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

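Scoring rationale: The paper's core topic is the reasoning mechanism of LLMs, particularly self-correction and the externalization of uncertainty, so it is highly relevant to the LLM, reasoning, self-correction, and interpretability keywords (10 points each). It mentions post-training experiments but has limited connection to SFT techniques specifically, so that keyword receives 5 points. Other keywords such as MoE, quantization, and RAG are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

Using an information-theoretic framework, the study reveals an uncertainty-externalization mechanism in LLM reasoning, shows that strong reasoning performance comes from explicitly expressing uncertainty rather than from specific surface tokens, and unifies prior findings on Aha moments and post-training experiments.

Abstract (Chinese translation)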

大型语言模型在推理过程中常表现出“顿悟时刻”，例如在出现“等等”类标记后表现出明显的自我修正行为，但其内在机制尚不明确。本文提出一个信息论框架，将推理过程分解为程序性信息与认知外化——即支持下游控制行为的不确定性显式外部化。研究表明，纯粹程序性推理可能陷入信息停滞状态，而认知外化能够持续获取信息，对实现信息充分性至关重要。实证结果表明，优异的推理表现由不确定性外部化驱动，而非特定的表层标记。本框架统一了先前关于顿悟时刻与训练后实验的研究发现，并为未来推理模型设计提供了理论洞见。

Abstract

LLMs often exhibit Aha moments during reasoning, such as apparent self-correction following tokens like “Wait,” yet their underlying mechanisms remain unclear. We introduce an information-theoretic framework that decomposes reasoning into procedural information and epistemic verbalization - the explicit externalization of uncertainty that supports downstream control actions. We show that purely procedural reasoning can become informationally stagnant, whereas epistemic verbalization enables continued information acquisition and is critical for achieving information sufficiency. Empirical results demonstrate that strong reasoning performance is driven by uncertainty externalization rather than specific surface tokens. Our framework unifies prior findings on Aha moments and post-training experiments, and offers insights for future reasoning model design.

Keywords: Large Language Models, Reasoning, Self-correction, Uncertainty externalization, Information-theoretic framework, Epistemic verbalization, Aha moments, Mechanistic interpretability

40. ❌ Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold

Authors: Pratyush Acharya, Habish Dhakal Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

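Scoring rationale: The paper studies the 'grokking' phenomenon in deep learning optimization theory, analyzing the dynamics of the AdamW optimizer on modular arithmetic tasks; it is foundational theoretical research on deep learning. It focuses on optimizer noise structure, loss-landscape curvature, and the mechanism behind delayed generalization, with no direct connection to the specific topics in the keyword list (large-model techniques, applications, training methods, inference optimization, alignment, AI for science). None of the keywords appear in the title or abstract, and no related concepts are involved, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies 'grokking' in deep learning training (generalization arriving long after training convergence). By analyzing AdamW dynamics on modular arithmetic tasks, it shows how a 'spectral gating' mechanism regulates the transition from memorization to generalization and proposes a variance-limited phase-transition account of the generalization delay.

Abstract (Chinese translation)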

标准优化理论难以解释“顿悟”现象，即泛化能力在训练收敛很久后才突然出现。尽管几何学分析将其归因于缓慢漂移，但这些研究往往忽略了优化器噪声结构与损失函数曲率之间的相互作用。本文分析了AdamW在模运算任务上的动态过程，揭示了一种调控从记忆到泛化转变的“谱门控”机制。

我们发现AdamW作为一个方差门控的随机系统运行。顿悟现象受稳定性条件约束：泛化解位于一个尖锐的盆地（$λ_{max}^H$）中，该盆地在低方差状态下最初无法被访问。“延迟”阶段代表了梯度方差的积累过程，这种积累提升了有效稳定性上限，从而允许优化过程进入该尖锐流形。

我们的消融实验识别出三种复杂度机制：（1）容量坍缩（$P < 23$），其中秩缺失阻碍了结构学习；（2）方差受限机制（$P \approx 41$），此时泛化需等待谱门开启；（3）稳定性覆写（$P > 67$），其中记忆在维度上变得不稳定。此外，我们挑战了算法任务中的“平坦最小值”假说，证明各向同性噪声注入无法诱发顿悟。泛化需要自适应优化器特有的各向异性修正机制，该机制能将噪声导向解流形的切空间。

Abstract

Standard optimization theories struggle to explain grokking, where generalization occurs long after training convergence. While geometric studies attribute this to slow drift, they often overlook the interaction between the optimizer’s noise structure and landscape curvature. This work analyzes AdamW dynamics on modular arithmetic tasks, revealing a “Spectral Gating” mechanism that regulates the transition from memorization to generalization. We find that AdamW operates as a variance-gated stochastic system. Grokking is constrained by a stability condition: the generalizing solution resides in a sharp basin ($λ_{max}^H$) initially inaccessible under low-variance regimes. The “delayed” phase represents the accumulation of gradient variance required to lift the effective stability ceiling, permitting entry into this sharp manifold. Our ablation studies identify three complexity regimes: (1) Capacity Collapse ($P < 23$), where rank-deficiency prevents structural learning; (2) The Variance-Limited Regime ($P \approx 41$), where generalization waits for the spectral gate to open; and (3) Stability Override ($P > 67$), where memorization becomes dimensionally unstable. Furthermore, we challenge the “Flat Minima” hypothesis for algorithmic tasks, showing that isotropic noise injection fails to induce grokking. Generalization requires the anisotropic rectification unique to adaptive optimizers, which directs noise into the tangent space of the solution manifold.

Keywords: grokking, AdamW dynamics, spectral gating, variance-limited phase transition, modular arithmetic, generalization delay, optimizer noise, stability threshold

41. ❌ Agentic workflow enables the recovery of critical materials from complex feedstocks via selective precipitation

Authors: Andrew Ritchhart, Sarah I. Allec, Pravalika Butreddy, Krista Kulesa, Qingpu Wang, Dan Thien Nguyen, Maxim Ziatdinov, Elias Nakouzi Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

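Scoring rationale: The paper presents a multi-agent workflow for critical materials recovery that deploys a series of AI agents and automated instruments. This is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination' (10 points each), since the multi-agent workflow is the core of the paper. The paper is also an application of AI to science, specifically materials science and chemical separation processes, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The other keywords concern specific techniques such as large-model principles, training methods, inference optimization, and alignment, none of which the abstract mentions, so they receive 0.

!!! tip deepseek-chat TL;DR

The study presents a multi-agent workflow that uses AI agents and automated instruments to recover critical materials from complex feedstocks (such as produced water and magnet leachates). Through selective precipitation, it shortens the development of efficient, scalable separations from months or years to days.

Abstract (Chinese translation)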

本文提出一种用于关键材料回收的多智能体工作流程，该流程部署了一系列人工智能智能体（AI agents）与自动化仪器，从产出水及磁体浸出液中回收关键材料。该方法利用简单化学品实现了对实际原料的选择性沉淀，将高效、适应性强且可扩展分离工艺的开发周期从数月乃至数年缩短至数天。

Abstract

We present a multi-agentic workflow for critical materials recovery that deploys a series of AI agents and automated instruments to recover critical materials from produced water and magnet leachates. This approach achieves selective precipitation from real-world feedstocks using simple chemicals, accelerating the development of efficient, adaptable, and scalable separations to a timeline of days, rather than months and years.

Keywords: multi-agentic workflow, critical materials recovery, AI agents, selective precipitation, produced water, magnet leachates, automated instruments, scalable separations

42. ❌ RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance

Authors: Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

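Scoring rationale: RSGen focuses on remote sensing image generation and proposes diverse edge guidance to enhance layout-driven generation. The keywords all relate directly to large language models (LLMs) or to the technical principles of deep learning, whereas this paper applies diffusion models to computer vision (remote sensing image generation), a different technical area. The only marginally relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because remote sensing can be viewed as an application area of Earth science; however, the paper does not emphasize scientific discovery or bio/cheminformatics, so it receives 5 points (some relevance). Other keywords such as LLMs, MoE, Scaling Laws, Alignment, RAG, CoT, Agents, and Quantization are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

Layout-driven methods for remote sensing image generation suffer from limited fine-grained control and loose adherence to bounding-box constraints. The paper proposes RSGen, a framework that enhances generation with diverse edge guidance, significantly improving existing models and yielding gains on the downstream detection task.

Abstract (Chinese translation)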

扩散模型显著缓解了遥感（RS）领域中标注数据稀缺的影响。尽管近期研究已成功利用这些模型实现多样化且可控的布局到图像（Layout-to-Image, L2I）合成，现有方法仍受限于细粒度控制能力不足，且难以严格遵循边界框约束。为解决这些局限，我们提出RSGen——一个即插即用框架，通过利用多样化边缘引导来增强布局驱动的遥感图像生成。具体而言，RSGen采用渐进式增强策略：1）首先通过图像到图像（Image-to-Image）生成技术，从检索的训练实例中合成多样化边缘图以丰富其多样性；2）随后将这些多样化的边缘图作为现有L2I模型的条件输入，在边界框内实施像素级控制，确保生成的实例严格遵循布局约束。在三个基线模型上的大量实验表明，RSGen显著提升了现有L2I模型的性能。例如，在DOTA数据集上使用CC-Diff模型进行定向目标检测时，我们在YOLOScore指标上实现了mAP50/mAP50-95分别提升+9.8/+12.0的显著增益，并在下游检测任务中使mAP提升+1.6。我们的代码将公开于：https://github.com/D-Robotics-AI-Lab/RSGen

Abstract

Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout-to-Image (L2I) synthesis, they still suffer from limited fine-grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug-and-play framework that leverages diverse edge guidance to enhance layout-driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image-to-Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel-level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC-Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50-95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: https://github.com/D-Robotics-AI-Lab/RSGen

Keywords: Remote Sensing Image Generation, Diffusion Models, Layout-to-Image Synthesis, Edge Guidance, Bounding Box Constraints, Plug-and-play Framework, Progressive Enhancement, Downstream Detection Task

43. ❌ Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Authors: Penny Chong, Harshavardhan Abichandani, Jiyuan Shen, Atin Ghosh, Min Pyae Moe, Yifan Mai, Daniel Dahlmeier Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15483v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

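Scoring rationale: The paper's core contribution is an evaluation framework for LLM agents, so it is highly relevant to 'LLM Agents' (10 points). It also involves tool use (8 points) and uses an LLM as a judge (8 points). Other keywords such as MoE, quantization, and inference acceleration are not covered and score 0.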

!!! tip deepseek-chat TL;DR

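To address the lack of a unified, scalable, user-role-aware evaluation framework for LLM agents across domains and workflows, the paper proposes the TED framework, which combines user persona simulation, automated LLM-based evaluation, and error analysis. It reveals new insights into agent performance and reports gains peaking at 8-10% after the identified fixes are applied.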

Abstract

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user’s role or expertise in the interaction, providing incomplete insights into the agent’s performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals, such as tool signatures and responses, as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance with peaks of 8-10% on our proposed metrics after incorporating the identified error remedies into the agent’s design.
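As a rough illustration of the Evaluate step, the sketch below represents subgoals as natural-language grading notes and scores a conversation for intermediate progress and turn efficiency. Everything named here is an assumption made for illustration: `GradingNote`, `keyword_judge` (a keyword-matching stand-in for a real LLM judge), and both metric definitions are hypothetical and are not the metrics defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class GradingNote:
    """A subgoal written as a natural-language note (hypothetical structure)."""
    note: str
    keywords: tuple  # stand-in signal; a real judge would be an LLM call


def keyword_judge(note, agent_turn):
    """Stand-in for LLM-as-a-judge: True if every keyword appears in the turn."""
    text = agent_turn.lower()
    return all(k.lower() in text for k in note.keywords)


def evaluate(notes, agent_turns):
    """Return (progress, turn_efficiency) under the assumed definitions.

    progress: fraction of grading notes satisfied by any agent turn.
    turn_efficiency: satisfied notes divided by the number of agent turns
    used up to and including the last turn that satisfied a new note.
    """
    satisfied = set()
    last_useful_turn = 0
    for turn_index, turn in enumerate(agent_turns, start=1):
        for i, note in enumerate(notes):
            if i not in satisfied and keyword_judge(note, turn):
                satisfied.add(i)
                last_useful_turn = turn_index
    progress = len(satisfied) / len(notes) if notes else 0.0
    efficiency = len(satisfied) / last_useful_turn if last_useful_turn else 0.0
    return progress, efficiency


# Made-up conversation and notes, for illustration only.
notes = [
    GradingNote("Agent calls lookup_order with the order id", ("lookup_order", "A123")),
    GradingNote("Agent tells the user the refund was issued", ("refund", "issued")),
]
turns = [
    "Calling lookup_order(order_id='A123').",
    "Could you confirm your email address?",
    "Your refund has been issued.",
]
progress, efficiency = evaluate(notes, turns)
print(progress, efficiency)
```

In this toy run both notes are satisfied, but the second one only on the third turn, so progress is complete while the assumed efficiency figure is penalized for the extra clarification turn.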

Keywords: Agent Evaluation, LLM-as-a-judge, User-aware, Automated Error Analysis, Tool Use, Workflow Automation, Performance Metrics, TED Framework

44. ❌ TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins

Authors: Shovon Niverd Pereira, Krishna Khadka, Yu Lei Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

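Scoring rationale: The paper focuses on knowledge distillation for tabular data and proposes TabKD, which improves distillation through feature interaction diversity. The scoring keywords all relate directly to large language models (LLMs), innovations in deep learning technical principles, or AI applications in science, whereas this paper studies a model compression technique from traditional machine learning (knowledge distillation) applied to tabular data. It does not involve large models, deep learning technical principles, or scientific AI applications, so every keyword receives a relevance of 0.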

!!! tip deepseek-chat TL;DR

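To address insufficient coverage of feature interactions in knowledge distillation for tabular data, the paper proposes TabKD, which generates synthetic queries through adaptive feature binning and maximized interaction coverage. It achieves better student-teacher agreement than existing methods across multiple benchmark datasets and teacher architectures.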

Abstract

Data-free knowledge distillation enables model compression without original training data, critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular data because they do not explicitly address feature interactions, the fundamental way tabular models encode predictive knowledge. We identify interaction diversity, systematic coverage of feature combinations, as an essential requirement for effective tabular distillation. To operationalize this insight, we propose TabKD, which learns adaptive feature bins aligned with teacher decision boundaries, then generates synthetic queries that maximize pairwise interaction coverage. Across 4 benchmark datasets and 4 teacher architectures, TabKD achieves the highest student-teacher agreement in 14 out of 16 configurations, outperforming 5 state-of-the-art baselines. We further show that interaction coverage strongly correlates with distillation quality, validating our core hypothesis. Our work establishes interaction-focused exploration as a principled framework for tabular model extraction.
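To make "pairwise interaction coverage" concrete, the sketch below counts which combinations of feature bins a set of synthetic queries touches. It is a minimal illustration, not TabKD itself: the cut points are fixed by hand (the paper learns bins aligned with the teacher's decision boundaries), and the function names `bin_index` and `pairwise_coverage` are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def bin_index(value, edges):
    """Map a value to a bin given sorted interior cut points (len(edges) + 1 bins)."""
    return bisect_right(edges, value)


def pairwise_coverage(queries, edges_per_feature):
    """Fraction of all (feature pair, bin pair) combinations hit by the queries."""
    n_bins = [len(e) + 1 for e in edges_per_feature]
    total = sum(n_bins[i] * n_bins[j]
                for i, j in combinations(range(len(n_bins)), 2))
    covered = set()
    for q in queries:
        bins = [bin_index(v, e) for v, e in zip(q, edges_per_feature)]
        for i, j in combinations(range(len(bins)), 2):
            covered.add((i, j, bins[i], bins[j]))
    return len(covered) / total if total else 0.0


# Hypothetical, hand-picked cut points: 2, 3 and 2 bins for three features.
edges = [[0.5], [10.0, 20.0], [0.0]]
queries = [
    (0.2, 5.0, -1.0),
    (0.9, 15.0, 1.0),
    (0.2, 25.0, 1.0),
]
coverage = pairwise_coverage(queries, edges)
print(f"pairwise coverage: {coverage:.4f}")
```

A query generator in this spirit would prefer candidate queries that add the most uncovered bin pairs, which is the same idea as pairwise (all-pairs) test design in combinatorial testing.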

Keywords: knowledge distillation, tabular data, feature interactions, interaction diversity, data-free distillation, model compression, synthetic query generation, teacher-student agreement

45. ❌ Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

Authors: Zidane Wright, Jason Tsay, Anupama Murthi, Osher Elhadad, Diego Del Rio, Saurabh Goyal, Kiran Kate, Jim Laredo, Koren Lazar, Vinod Muthusamy, Yara Rizk Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

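Scoring rationale: The paper's core contribution is a middleware toolkit for AI agents (ALTK) that improves agent reliability and robustness in production. Keywords directly about agents (LLM Agents, Tool Use) are therefore highly relevant (10 points), since the paper centers on agent lifecycle management. Keywords about agent reliability (Self-Correction, Hallucination Mitigation) are strongly related (8 points), since ALTK aims to detect and repair common agent failure modes. Keywords about underlying LLM technology (Large Language Models) are related (8 points), since agents are usually built on LLMs. Reasoning keywords (Chain of Thought, System 2 Thinking) are weakly related (5 points), since the paper names reasoning errors as one agent failure mode. The remaining keywords (MoE, quantization, scientific AI, and so on) are unrelated to the paper's topic (0 points).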

!!! tip deepseek-chat TL;DR

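To address the failure modes of AI agents in production deployments (such as tool misuse, reasoning errors, and compliance risk), the paper presents the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically detect, repair, and mitigate these failure modes, significantly reducing the effort of building reliable, production-grade agents.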

Abstract

As AI agents move from demos into enterprise deployments, their failure modes become consequential: a misinterpreted tool argument can corrupt production data, a silent reasoning error can go undetected until damage is done, and outputs that violate organizational policy can create legal or compliance risk. Yet, most agent frameworks leave builders to handle these failure modes ad hoc, resulting in brittle, one-off safeguards that are hard to reuse or maintain. We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle. Across the agent lifecycle, we identify opportunities to intervene and improve, namely, post-user-request, pre-LLM prompt conditioning, post-LLM output processing, pre-tool validation, post-tool result checking, and pre-response assembly. ALTK provides modular middleware that detects, repairs, and mitigates common failure modes. It offers consistent interfaces that fit naturally into existing pipelines. It is compatible with low-code and no-code tools such as the ContextForge MCP Gateway and Langflow. Finally, it significantly reduces the effort of building reliable, production-grade agents.
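The lifecycle intervention points listed in the abstract can be pictured as named stages at which small hooks run. The sketch below is a generic middleware pattern written for illustration only. It is not ALTK's API: `MiddlewarePipeline`, the stage names, the `delete_rows` tool, and its schema are all hypothetical.

```python
from collections import defaultdict

# Stage names paraphrase the six intervention points from the abstract.
STAGES = (
    "post_user_request", "pre_llm", "post_llm",
    "pre_tool", "post_tool", "pre_response",
)


class MiddlewarePipeline:
    """Runs registered hooks at named lifecycle stages (hypothetical design)."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, stage, hook):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._hooks[stage].append(hook)

    def run(self, stage, payload):
        # Each hook returns a (possibly repaired) payload, or raises to block the step.
        for hook in self._hooks[stage]:
            payload = hook(payload)
        return payload


def validate_tool_call(call):
    """Pre-tool hook: repair a mistyped integer argument, reject unknown tools."""
    schema = {"delete_rows": {"limit": int}}  # hypothetical tool schema
    if call["tool"] not in schema:
        raise ValueError(f"tool not allowed: {call['tool']}")
    args = dict(call["args"])
    for name, expected in schema[call["tool"]].items():
        if not isinstance(args.get(name), expected):
            args[name] = expected(args[name])  # raises if it cannot be repaired
    return {"tool": call["tool"], "args": args}


pipeline = MiddlewarePipeline()
pipeline.register("pre_tool", validate_tool_call)
repaired = pipeline.run("pre_tool", {"tool": "delete_rows", "args": {"limit": "10"}})
print(repaired)
```

Keeping each check as an independent hook is what makes such safeguards reusable across agents, in contrast to the one-off guards the abstract criticizes.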

Keywords: AI agents, agent lifecycle, middleware components, failure modes, production deployments, tool validation, reasoning errors, compliance risk

46. ❌ Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents

Authors: Simone Aonzo, Merve Sahin, Aurélien Francillon, Daniele Perito Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15457v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

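Scoring rationale: The paper mainly discusses evaluation methods for AI agents (especially tool-using agents) and the associated safety risks, drawing an analogy with malware analysis. The core relevant keyword is 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (highly relevant, 10 points), since the paper focuses on the challenges of evaluating AI agents. 'Tool Use OR Function Calling OR API Tool Use' is somewhat related (5 points), since the paper describes AI systems as tool-using agents. The other keywords concern large-model technical principles, training methods, inference optimization, and scientific applications, none of which the paper discusses directly, so they score 0.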

!!! tip deepseek-chat TL;DR

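The paper argues that AI agent evaluation faces a risk analogous to malware sandbox evasion: an agent may detect the evaluation environment and adjust its behavior, producing overly optimistic safety assessments. It proposes principles for more realistic evaluation from an adversarial perspective.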

Abstract

Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices where the AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly. This can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we demonstrate that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents, which treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.

Keywords: AI agents, evaluation, malware analysis, sandbox evasion, safety assessment, tool-using agents, adaptive systems, adversarial testing

47. ❌ RoCo Challenge at AAAI 2026: Benchmarking Robotic Collaborative Manipulation for Assembly Towards Industrial Automation

Authors: Haichao Liu, Yuheng Zhou, Zhenyu Wu, Ziheng Ji, Ziyu Shan, Qianzhun Wang, Ruixuan Liu, Zhiyuan Yang, Yejun Gu, Shalman Khan, Shijun Yan, Jun Liu, Haiyue Zhu, Changliu Liu, Jianfei Yang, Jingbing Zhang, Ziwei Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

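Scoring rationale: The paper focuses on benchmarking robotic collaborative assembly and organizing the associated challenge, which belongs to embodied AI and robotic manipulation. Although the stated research background mentions an interest in applications of large models in science, the paper itself does not touch on large models, deep learning technical principles, or any of the scoring keywords (such as LLMs, MoE, SFT, RLHF, RAG, CoT, or Agents), and it does not mention biomedical AI applications. All keywords are unrelated to the paper's topic, so every relevance score is 0.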

!!! tip deepseek-chat TL;DR

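The paper introduces the RoCo Challenge, which benchmarks robotic collaborative assembly of a planetary gearbox through simulation and real-world experiments, and finds that a dual-model framework and recovery-from-failure curriculum data are critical for long-horizon multi-task learning.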

Abstract

Embodied Artificial Intelligence (EAI) is rapidly developing, gradually subverting previous autonomous systems’ paradigms from isolated perception to integrated, continuous action. This transition is highly significant for industrial robotic manipulation, promising to free human workers from repetitive, dangerous daily labor. To benchmark and advance this capability, we introduce the Robotic Collaborative Assembly Assistance (RoCo) Challenge with a dataset towards simulation and real-world assembly manipulation. Set against the backdrop of human-centered manufacturing, this challenge focuses on a high-precision planetary gearbox assembly task, a demanding yet highly representative operation in modern industry. Built upon a self-developed data collection, training, and evaluation system in Isaac Sim, and utilizing a dual-arm robot for real-world deployment, the challenge operates in two phases. The Simulation Round defines fine-grained task phases for step-wise scoring to handle the long-horizon nature of the assembly. The Real-World Round mirrors this evaluation with physical gearbox components and high-quality teleoperated datasets. The core tasks require assembling an epicyclic gearbox from scratch, including mounting three planet gears, a sun gear, and a ring gear. Attracting over 60 teams and 170+ participants from more than 10 countries, the challenge yielded highly effective solutions, most notably ARC-VLA and RoboCola. Results demonstrate that a dual-model framework for long-horizon multi-task learning is highly effective, and the strategic utilization of recovery-from-failure curriculum data is a critical insight for successful deployment. This report outlines the competition setup, evaluation approach, key findings, and future directions for industrial EAI. Our dataset, CAD files, code, and evaluation results can be found at: https://rocochallenge.github.io/RoCo2026/.
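Step-wise scoring for a long-horizon assembly can be sketched as partial credit per completed phase. The abstract does not give the challenge's actual rubric, so the phase names, the equal 20-point weighting, and the function `stepwise_score` below are assumptions made purely for illustration.

```python
# Phases follow the components named in the abstract: three planet gears,
# a sun gear, and a ring gear. Names and order are hypothetical.
PHASES = (
    "planet_gear_1", "planet_gear_2", "planet_gear_3",
    "sun_gear", "ring_gear",
)


def stepwise_score(completed, points_per_phase=20):
    """Sum partial credit over phases and report the first unfinished phase."""
    unknown = set(completed) - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    score = points_per_phase * len(set(completed))
    first_missing = next((p for p in PHASES if p not in completed), None)
    return score, first_missing


score, first_missing = stepwise_score({"planet_gear_1", "planet_gear_2", "sun_gear"})
print(score, first_missing)
```

Compared with a single pass/fail outcome, a per-phase score still distinguishes a run that fails on the last gear from one that fails on the first.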

Keywords: Robotic Collaborative Assembly, Embodied Artificial Intelligence, Planetary Gearbox Assembly, Dual-arm Robot, Long-horizon Task, Simulation and Real-world Benchmark, Industrial Automation, Recovery-from-failure Curriculum

48. ❌ Unlocking the Value of Text: Event-Driven Reasoning and Multi-Level Alignment for Time Series Forecasting

Authors: Siyuan Wang, Peng Chen, Yihang Wang, Wanghui Qiu, Chenjuan Guo, Bin Yang, Yang Shu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15452v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

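Scoring rationale: The paper proposes VoT, which uses LLMs for event-driven reasoning in time series forecasting. Its core involves LLM applications (10 points) and In-context Learning (10 points). It relates to Alignment (5 points) because it proposes multi-level alignment, to RAG (5 points) because it retrieves historical examples, to CoT Reasoning (5 points) because it involves a reasoning process, and to AI for Science (5 points) because it is applied to forecasting in scientific domains. Other keywords such as MoE, SFT, and RLHF are not covered and score 0.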

!!! tip deepseek-chat TL;DR

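To address the difficulty existing time series forecasting methods have in exploiting textual information, the paper proposes VoT, which uses event-driven reasoning and multi-level alignment to significantly improve forecasting performance on datasets spanning 10 domains.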

Abstract

Existing time series forecasting methods primarily rely on the numerical data itself. However, real-world time series exhibit complex patterns associated with multimodal information, making them difficult to predict with numerical data alone. While several multimodal time series forecasting methods have emerged, they either utilize text with limited supplementary information or focus merely on representation extraction, extracting minimal textual information for forecasting. To unlock the Value of Text, we propose VoT, a method with Event-driven Reasoning and Multi-level Alignment. Event-driven Reasoning combines the rich information in exogenous text with the powerful reasoning capabilities of LLMs for time series forecasting. To guide the LLMs in effective reasoning, we propose the Historical In-context Learning that retrieves and applies historical examples as in-context guidance. To maximize the utilization of text, we propose Multi-level Alignment. At the representation level, we utilize the Endogenous Text Alignment to integrate the endogenous text information with the time series. At the prediction level, we design the Adaptive Frequency Fusion to fuse the frequency components of event-driven prediction and numerical prediction to achieve complementary advantages. Experiments on real-world datasets across 10 domains demonstrate significant improvements over existing methods, validating the effectiveness of our approach in the utilization of text. The code is made available at https://github.com/decisionintelligence/VoT.
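The prediction-level fusion can be pictured as blending two forecasts one frequency bin at a time. The NumPy sketch below is a simplified stand-in for the paper's Adaptive Frequency Fusion, whose details the abstract does not give: the per-bin weights here are fixed by hand rather than learned, and `frequency_fusion` and the toy forecasts are hypothetical.

```python
import numpy as np


def frequency_fusion(event_pred, numeric_pred, weights):
    """Blend two forecasts per frequency bin and return the fused series.

    weights[k] in [0, 1] is the share taken from the event-driven forecast
    at rfft bin k; the remainder comes from the numerical forecast.
    """
    event_spec = np.fft.rfft(event_pred)
    numeric_spec = np.fft.rfft(numeric_pred)
    w = np.asarray(weights, dtype=float)
    if w.shape != event_spec.shape:
        raise ValueError("need one weight per rfft frequency bin")
    fused_spec = w * event_spec + (1.0 - w) * numeric_spec
    return np.fft.irfft(fused_spec, n=len(event_pred))


horizon = 8
t = np.arange(horizon)
event_pred = 10.0 + 0.5 * t                       # toy event-driven forecast
numeric_pred = 10.0 + np.sin(2 * np.pi * t / 4)   # toy numerical forecast
# Trust the event-driven forecast for the two lowest bins only (assumption).
weights = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
fused = frequency_fusion(event_pred, numeric_pred, weights)
print(np.round(fused, 3))
```

Because the transform is linear, weights of all ones return the event-driven forecast unchanged and weights of all zeros return the numerical forecast, so the weights interpolate between the two sources per frequency.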

Keywords: time series forecasting, large language models, event-driven reasoning, multi-level alignment, in-context learning, text utilization, multimodal information, historical examples retrieval

49. ❌ Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

Authors: Jing Ye, Xinpei Zhao, Lu Xiang, Yaping Zhang, Chengqing Zong Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15434v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

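Scoring rationale: The paper focuses on a reinforcement learning optimization method for emotional support dialogue systems (the RAPO framework), covering user-reaction awareness, natural-language feedback generation, and hybrid reward optimization, but it does not explicitly mention or apply large models, innovations in deep learning technical principles, or applications in science. The keywords all concern large-model techniques, training methods, inference optimization, agent systems, or scientific AI, whereas this paper improves a reinforcement learning algorithm for a specific dialogue task and has no direct connection to them.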

!!! tip deepseek-chat TL;DR

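To address the information sparsity of expert-defined scalar rewards in emotional support dialogue systems, the paper proposes the Reaction Aware Policy Optimization (RAPO) framework, which optimizes the dialogue policy by combining scalar rewards with natural-language feedback. Experiments show that it significantly outperforms existing reinforcement learning methods at driving positive interaction outcomes.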

Abstract

While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user’s continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
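One simple way to picture Hindsight Dialogue Selection is to keep only the turns after which the user's emotional state moved noticeably. The sketch below assumes a numeric emotion score per turn and a fixed threshold. Both are illustrative assumptions, and `select_pivotal_turns` is a hypothetical name, not RAPO's actual selection rule.

```python
def select_pivotal_turns(emotion_scores, threshold=0.3):
    """Return (turn, delta) pairs where the emotion score moved by >= threshold.

    emotion_scores[0] is the user's state before the first agent turn, and
    emotion_scores[i] is the (simulated) user's state after agent turn i.
    """
    pivotal = []
    for turn in range(1, len(emotion_scores)):
        delta = emotion_scores[turn] - emotion_scores[turn - 1]
        if abs(delta) >= threshold:
            pivotal.append((turn, round(delta, 6)))
    return pivotal


# Made-up trajectory on a [-1, 1] scale, for illustration only.
scores = [-0.6, -0.5, 0.0, 0.1, -0.4]
pivotal = select_pivotal_turns(scores)
print(pivotal)
```

Both helpful and harmful turns are kept, since a turn that worsens the user's state is as informative for contrastive feedback as one that improves it.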

Keywords: emotional support dialogue systems, reinforcement learning, user reactions, policy optimization, scalar-verbal hybrid, natural-language feedback, interaction outcomes

50. ❌ Music Genre Classification: A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches

Authors: Sachin Prajuli, Abhishek Karna, OmPrakash Dhakl Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
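
The pass/fail verdict attached to each paper can be reproduced with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the threshold formula (5.0 + 0.8 × total weight) and the total weight of 27.0 come from the report's configuration section, while summing the per-keyword Score column and treating "pass" as total >= threshold are assumptions, and the function names are hypothetical.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


def verdict(keyword_scores: list[float], total_weight: float) -> tuple[float, bool]:
    """Sum the per-keyword scores (assumed aggregation) and compare with the threshold."""
    total = round(sum(keyword_scores), 1)
    return total, total >= passing_score(total_weight)


threshold = passing_score(27.0)              # 27 keywords, each with weight 1.0
total, passed = verdict([0.0] * 27, 27.0)    # the Score column of the table above
print(f"{total} / {threshold} -> {'pass' if passed else 'fail'}")
```

With every per-keyword score at 0.0, the total stays below the 26.6 threshold, which matches the ❌ shown for this paper.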

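Scoring rationale: The paper studies music genre classification using classical machine learning (e.g., logistic regression, SVM) and deep learning (e.g., CNN, RNN, CRNN) methods, but it does not touch on large language models (LLMs), innovations in the technical principles of large models, AI for Science, or any other keyword. All keywords concern large models, deep learning technical principles, or scientific AI applications, whereas this paper only applies basic deep learning architectures to audio classification; it is a routine application and entirely unrelated to the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper builds a Nepali music dataset and systematically compares classical machine learning and deep learning models on music genre classification, finding that the sequential convolutional recurrent neural network (CRNN) performs best with 84% accuracy.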

Abstract translation (Chinese)

自动音乐流派分类是音乐信息检索领域一个长期存在的挑战；针对非西方音乐传统的研究仍十分匮乏。尼泊尔音乐包含文化内涵丰富且声学特征多样的流派——从洛克·多霍里的呼应式对唱，到德乌达的韵律诗歌，再到塔芒·赛罗的独特旋律——现有分类系统均未涉及。本文构建了一个新颖的数据集，包含约8000条标注时长30秒的音频片段，涵盖八种尼泊尔音乐流派，并系统比较了两种范式下的九种分类模型。我们基于Librosa提取的51个人工设计音频特征，训练了五种经典机器学习分类器（逻辑回归、支持向量机、K近邻、随机森林和XGBoost）；同时，在维度为640×128的梅尔频谱图上，测试了四种深度学习架构（卷积神经网络CNN、循环神经网络RNN、并行CNN-RNN，以及卷积层后接循环神经网络的序列式CNN-RNN）。实验表明，采用卷积层接入长短期记忆网络LSTM的序列式卷积循环神经网络取得了84%的最高准确率，显著优于最佳经典模型（逻辑回归与XGBoost均为71%）及其他所有深度学习架构。我们提供了每个模型的逐类别精确率、召回率、F1分数、混淆矩阵和ROC分析，并从文化视角对误分类模式进行解读，揭示了尼泊尔音乐传统中真实存在的风格重叠现象。

Abstract

Automatic music genre classification is a long-standing challenge in Music Information Retrieval (MIR); work on non-Western music traditions remains scarce. Nepali music encompasses culturally rich and acoustically diverse genres–from the call-and-response duets of Lok Dohori to the rhythmic poetry of Deuda and the distinctive melodies of Tamang Selo–that have not been addressed by existing classification systems. In this paper, we construct a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres and conduct a systematic comparison of nine classification models across two paradigms. Five classical machine learning classifiers (Logistic Regression, SVM, KNN, Random Forest, and XGBoost) are trained on 51 hand-crafted audio features extracted via Librosa, while four deep learning architectures (CNN, RNN, parallel CNN-RNN, and sequential CNN followed by RNN) operate on Mel spectrograms of dimension 640 x 128. Our experiments reveal that the sequential Convolutional Recurrent Neural Network (CRNN)–in which convolutional layers feed into an LSTM–achieves the highest accuracy of 84%, substantially outperforming both the best classical models (Logistic Regression and XGBoost, both at 71%) and all other deep architectures. We provide per-class precision, recall, F1-score, confusion matrices, and ROC analysis for every model, and offer a culturally grounded interpretation of misclassification patterns that reflects genuine overlaps in Nepal’s musical traditions.

Keywords: Music Genre Classification, Nepali Music, Deep Learning, Convolutional Recurrent Neural Network, Comparative Analysis, Audio Features, Mel Spectrograms, Music Information Retrieval

51. ❌ Physics-informed fine-tuning of foundation models for partial differential equations

Authors: Vlad Medvedev, Leon Armbruster, Christopher Straub, Georg Kruse, Andreas Rosskopf Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15431v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

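Scoring rationale: The paper's core topic is physics-informed fine-tuning of PDE foundation models, which is highly relevant to 'Foundation Models', 'Pre-training/Domain Adaptation', and 'Fine-tuning' (10 points each), and it is an 'AI for Science' application (10 points). Other keywords such as MoE, SLMs, RLHF, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a physics-informed fine-tuning framework that incorporates physical constraints directly into the fine-tuning objective, allowing pre-trained PDE foundation models to adapt effectively to new tasks when data is scarce, and achieves competitive accuracy on an unseen PDE class.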

Abstract translation (Chinese)

偏微分方程基础模型已成为在多样化物理系统上预训练的强大代理模型，但由于特定任务数据有限及分布偏移，使其适应新下游任务仍具挑战性。尽管微调在自然语言处理领域已被证明具有变革性，但针对偏微分方程基础模型适配的最佳实践仍未得到充分探索。虽然物理信息训练已成功在广泛的偏微分方程问题上训练出精确求解器，但其在微调基于数据的基础模型方面的潜力尚未得到系统研究。本文提出了一种物理信息微调框架，通过将物理约束（偏微分方程残差和边界条件）直接纳入微调目标，来适配预训练的偏微分方程基础模型。该方法能够在数据稀缺情况下实现有效适应，同时增强物理一致性。我们在一个由未见偏微分方程类别构成的下游任务上评估了该方法，并与数据驱动的微调方法进行了比较。结果表明，物理信息微调无需依赖偏微分方程解进行训练即可达到具有竞争力的精度。此外，当仅有极少量训练数据可用时，混合微调策略在分布外场景中展现出更优的泛化能力。这些发现确立了物理信息微调作为一种可扩展且数据高效的范式，为科学机器学习中基础模型的适配提供了一条具有物理可解释性的路径。

Abstract

Foundation models for partial differential equations (PDEs) have emerged as powerful surrogates pre-trained on diverse physical systems, but adapting them to new downstream tasks remains challenging due to limited task-specific data and distribution shifts. While fine-tuning has proven transformative in natural language processing, best practices for adapting PDE foundation models remain underexplored. Although physics-informed training has successfully trained accurate solvers across a wide range of PDE problems, its potential for fine-tuning data-based foundation models has not been systematically studied. In this work, we introduce a physics-informed fine-tuning framework that adapts pre-trained PDE foundation models by incorporating physical constraints (PDE residuals and boundary conditions) directly into the fine-tuning objective. This enables effective adaptation in data-scarce regimes while promoting physical consistency. We evaluate our method on a downstream task composed of an unseen PDE class and compare it with data-driven fine-tuning counterparts. Our results demonstrate that physics-informed fine-tuning achieves competitive accuracy without requiring PDE solutions for training. Furthermore, a hybrid fine-tuning strategy yields superior generalization to out-of-distribution scenarios when only minimal training data is available. These findings establish physics-informed fine-tuning as a scalable and data-efficient paradigm, providing a physically interpretable pathway for adapting foundation models in scientific machine learning.

Keywords: Foundation models, Partial differential equations, Physics-informed fine-tuning, Domain adaptation, Data-efficient learning, Scientific machine learning, Physical constraints, Generalization
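
The abstract above describes an objective built from PDE residuals and boundary conditions rather than reference solutions. A rough illustration of such an objective is sketched below; it is not the authors' implementation. The 1D Poisson problem u''(x) = f(x) with zero boundary values, the central finite-difference residual, and the names `physics_informed_loss` and `bc_weight` are all assumptions chosen to keep the example small.

```python
import math


def physics_informed_loss(u, f, h, bc_left=0.0, bc_right=0.0, bc_weight=1.0):
    """Mean squared residual of u'' = f on interior nodes plus a boundary penalty.

    No reference solution appears anywhere in this loss: it only checks how well
    the candidate values `u` satisfy the equation and the boundary conditions.
    """
    residuals = [
        (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) - f[i]  # central difference for u''
        for i in range(1, len(u) - 1)
    ]
    pde_term = sum(r * r for r in residuals) / len(residuals)
    bc_term = (u[0] - bc_left) ** 2 + (u[-1] - bc_right) ** 2
    return pde_term + bc_weight * bc_term


n = 101
h = 1.0 / (n - 1)
xs = [i * h for i in range(n)]
f = [-math.pi ** 2 * math.sin(math.pi * x) for x in xs]  # source term for u = sin(pi x)

exact = [math.sin(math.pi * x) for x in xs]  # satisfies both the PDE and the boundary values
wrong = [x * (1.0 - x) for x in xs]          # satisfies the boundary values but not the PDE

loss_exact = physics_informed_loss(exact, f, h)
loss_wrong = physics_informed_loss(wrong, f, h)
print(loss_exact < loss_wrong)
```

In a fine-tuning loop, `u` would be the model's prediction on the grid and this scalar would be minimized with respect to the model parameters; a hybrid variant would add a data-fit term when a few labeled solutions exist.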

52. ❌ MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Authors: Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Jonathon M. Smereka, Yue Wang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

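Scoring rationale: The paper proposes the MA-VLCM framework, which uses a pre-trained vision-language model as the centralized critic in multi-agent reinforcement learning. It centers on multi-agent systems (highly relevant, 10 points), vision-language models (which fall under large models, 5 points), pre-training and fine-tuning techniques (5 points each), and agentic workflows (5 points). Other keywords such as MoE, quantization, and inference acceleration are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes the MA-VLCM framework, which replaces the centralized critic in multi-agent reinforcement learning with a pre-trained vision-language model, significantly improving sample efficiency and producing compact execution policies suited to resource-constrained robots.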

Abstract translation (Chinese)

多智能体强化学习（MARL）通常依赖集中式评论家来估计价值函数。然而，从头学习此类评论家样本效率极低，且往往缺乏跨环境的泛化能力。与此同时，基于互联网规模数据训练的大型视觉-语言-动作模型（Vision-Language-Action Models, VLAs）展现出强大的多模态推理和零样本泛化能力，但直接将其部署于机器人执行任务在计算上仍不可行，特别是在具有异构形态和资源限制的异质多机器人系统中。为应对这些挑战，我们提出了多智能体视觉-语言评论家模型（Multi-Agent Vision-Language-Critic Models, MA-VLCM）。该框架使用经过微调的预训练视觉-语言模型替代MARL中需学习的集中式评论家，以评估多智能体行为。MA-VLCM作为一个集中式评论家，其条件输入包括自然语言任务描述、视觉轨迹观测以及结构化的多智能体状态信息。通过在策略优化过程中消除评论家学习环节，我们的方法显著提升了样本效率，同时生成适用于资源受限机器人的紧凑执行策略。实验结果表明，在多智能体团队场景中，采用不同VLM骨干的模型在分布内和分布外情境下均能实现良好的零样本回报估计。

Abstract

Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation on models with differing VLM backbones on in-distribution and out-of-distribution scenarios in multi-agent team settings.

Keywords: multi-agent reinforcement learning, vision-language models, centralized critic, sample efficiency, zero-shot generalization, policy optimization, resource-constrained robots, trajectory observations

53. ❌ Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Authors: Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15417v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

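Scoring rationale: The paper's core topic is the safety vulnerabilities of large language models (LLMs) under test-time reinforcement learning (TTRL), which directly involves the LLMs keyword (10 points). The study focuses on improving reasoning through self-consistency and is highly relevant to Chain of Thought/CoT Reasoning (8 points), System 2 Thinking/In-depth Reasoning (8 points), and Self-Correction/Self-Improvement (8 points), since these concepts all involve multi-step reasoning, deliberate thinking, and self-improvement mechanisms. The paper does not cover other keywords such as MoE, SLMs, training techniques (pre-training, fine-tuning, etc.), efficiency optimization (quantization, inference acceleration), or AI for Science, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the safety vulnerabilities of test-time reinforcement learning (TTRL) when it is used to improve LLM reasoning, finding that harmful prompt injection amplifies the model's existing behavior (safe or harmful) and degrades reasoning ability (the reasoning tax).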

Abstract translation (Chinese)

测试时训练（Test-Time Training, TTT）作为一种提升大语言模型（LLMs）推理能力的新兴方法，近期受到广泛关注。该方法允许模型在无标签条件下直接从测试数据中学习。然而，这种对测试数据的依赖也使TTT方法容易受到有害提示注入的攻击。本文研究了TTT方法的安全脆弱性，重点关注一种基于自一致性的代表性测试时学习方法：测试时强化学习（Test-Time Reinforcement Learning, TTRL）。该方法通过以多数投票作为奖励信号来激励自一致性，从而提升LLM的推理能力。我们发现，在TTRL过程中注入有害提示会放大模型已有的行为倾向：当基础模型相对安全时，会引发安全性放大；而当模型对注入数据敏感时，则会导致危害性放大。在这两种情况下，模型的推理能力均会出现下降，我们将其称为“推理税”。研究还表明，攻击者可以利用专门设计的“HarmInject”提示对TTRL等TTT方法进行对抗性利用，迫使模型同时处理越狱查询和推理查询，从而引发更强烈的危害性放大效应。总体而言，我们的研究结果表明，通过促进自一致性来增强LLM推理的TTT方法可能导致行为放大与推理能力退化，这凸显了开发更安全的TTT方法的必要性。

Abstract

Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL), a recent TTT method that improves LLM reasoning by rewarding self-consistency using majority vote as a reward signal. We show that harmful prompt injection during TTRL amplifies the model’s existing behaviors, i.e., safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, there is a decline in reasoning ability, which we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed “HarmInject” prompts to force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.

Keywords: Test-time training, Test-time reinforcement learning, Large language models, Reasoning abilities, Safety vulnerabilities, Harmful prompt injection, Self-consistency, Amplification behaviors
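
The abstract above says TTRL rewards self-consistency by using the majority vote as the reward signal. The sketch below shows one simple way such a reward could be computed from sampled answers; it is an illustration only, not the TTRL implementation. The function name `majority_vote_rewards`, the binary 1.0/0.0 reward, and exact string matching of answers are assumptions.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a batch of sampled answers.

    The most common answer becomes the pseudo-label, and each sample is
    rewarded 1.0 if it agrees with it and 0.0 otherwise. Ties are resolved
    in favor of the answer that appeared first.
    """
    if not answers:
        return None, []
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


label, rewards = majority_vote_rewards(["42", "42", "41", "42", "7"])
print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```

Because the reward depends only on agreement among samples, whatever the model already tends to output is reinforced, whether or not it is correct or safe. That is consistent with the amplification effect the abstract reports.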

54. ❌ RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks

Authors: Ali Soltan Mohammadi, Samira Nazari, Ali Azarpeyvand, Mahdi Taheri, Milos Krstic, Michael Huebner, Christian Herglotz, Tara Ghasempouri Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15413v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

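Scoring rationale: The paper focuses on enhancing the reliability and security of quantized deep neural networks (DNNs) and is highly relevant only to the keyword 'Quantization OR Model Compression OR Low-bit Weights' (scored 10), because its core content uses quantization to improve efficiency and reduce fault sensitivity. All other keywords are unrelated to the paper's topic and are scored 0. The paper does not cover large models, language models, reasoning methods, alignment techniques, agent systems, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes RESQ, a unified three-stage framework for enhancing the reliability and security of quantized deep neural networks. Through fine-tuning and a post-training adjustment, it achieves notable gains in attack resilience and fault resilience across multiple datasets and models while keeping the quantized networks' accuracy competitive.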

Abstract translation (Chinese)

本研究提出一个统一的三阶段框架，用于生成兼具均衡故障鲁棒性与攻击鲁棒性的量化深度神经网络。第一阶段通过微调降低特征表示对小规模输入扰动的敏感性，从而提升对抗攻击的抵御能力。第二阶段在模拟位翻转故障环境下进行故障感知微调，以强化故障鲁棒性。最后，通过轻量级训练后调整集成量化操作，在保持攻击鲁棒性的同时提升模型效率并进一步降低故障敏感性。在CIFAR-10、CIFAR-100和GTSRB数据集上对ResNet18、VGG16、EfficientNet和Swin-Tiny模型的实验表明，量化网络在保持竞争力的精度的同时，攻击鲁棒性最高提升10.35%，故障鲁棒性最高提升12.47%。结果同时揭示了一种非对称交互现象：故障鲁棒性的提升通常能增强对抗攻击的抵御能力，而对抗鲁棒性的增强未必能提高故障鲁棒性。

Abstract

This work proposes a unified three-stage framework that produces a quantized DNN with balanced fault and attack robustness. The first stage improves attack resilience via fine-tuning that desensitizes feature representations to small input perturbations. The second stage reinforces fault resilience through fault-aware fine-tuning under simulated bit-flip faults. Finally, a lightweight post-training adjustment integrates quantization to enhance efficiency and further mitigate fault sensitivity without degrading attack resilience. Experiments on ResNet18, VGG16, EfficientNet, and Swin-Tiny on CIFAR-10, CIFAR-100, and GTSRB show consistent gains of up to 10.35% in attack resilience and 12.47% in fault resilience, while maintaining competitive accuracy in quantized networks. The results also highlight an asymmetric interaction in which improvements in fault resilience generally increase resilience to adversarial attacks, whereas enhanced adversarial resilience does not necessarily lead to higher fault resilience.

Keywords: Quantized DNN, Fault Resilience, Attack Resilience, Fine-tuning, Post-training Adjustment, Bit-flip Faults, Adversarial Attacks, Model Efficiency
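
The second stage in the abstract above relies on simulated bit-flip faults. The following sketch shows what a bit flip does to a quantized weight, assuming signed 8-bit weights in two's complement and uniformly random fault locations; the paper's actual fault model and bit width are not stated in the abstract, and the helper names are hypothetical.

```python
import random


def flip_bit_int8(value: int, bit: int) -> int:
    """Flip one bit of a signed 8-bit integer (two's complement) and return the signed result."""
    if not -128 <= value <= 127:
        raise ValueError("value must fit in int8")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    raw = (value & 0xFF) ^ (1 << bit)        # view as unsigned byte, toggle the chosen bit
    return raw - 256 if raw >= 128 else raw  # map back to the signed range


def inject_faults(weights, n_faults, rng):
    """Return a copy of `weights` with `n_faults` random single-bit flips applied."""
    faulty = list(weights)
    for _ in range(n_faults):
        idx = rng.randrange(len(faulty))
        faulty[idx] = flip_bit_int8(faulty[idx], rng.randrange(8))
    return faulty


print(flip_bit_int8(5, 0))  # 4: a low-order flip barely changes the weight
print(flip_bit_int8(5, 7))  # -123: a sign-bit flip changes it drastically
print(inject_faults([0, 1, -128, 127, 64], 2, random.Random(0)))
```

The two single-flip examples show why fault sensitivity depends heavily on which bit is hit, which is the kind of perturbation fault-aware fine-tuning would expose the network to.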

55. ❌ A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning

Authors: William Solow, Paola Pesantez-Cabrera, Markus Keller, Lav Khot, Sandhya Saisubramanian, Alan Fern Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

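Scoring rationale: The paper proposes a hybrid modeling framework for crop prediction that combines a neural network with a differentiable biophysical model and uses multi-task learning. Its core is the application of deep learning to agricultural science (crop prediction), which falls under "AI for Science", so that keyword receives 5 points (somewhat related). However, the paper does not cover large language models (LLMs), model architectures (e.g., MoE), training techniques (e.g., pre-training, fine-tuning, alignment), inference optimization, or agent systems, so all other keywords receive 0 points (entirely unrelated).

!!! tip deepseek-chat TL;DR

The study proposes a hybrid modeling framework that uses a neural network to parameterize a differentiable biophysical model and combines it with multi-task learning, improving prediction accuracy for crop phenology and cold hardiness in data-limited settings by 60% and 40%, respectively, over deployed biophysical models.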

Abstract translation (Chinese)

作物状态（如物候期与抗寒性）的精准预测对于优化作物产量与品质的田间管理决策（如灌溉、施肥和冠层管理）至关重要。传统生物物理模型虽可用于全季预测，但缺乏针对具体田块管理所需的精度。深度学习方法是一种具有吸引力的替代方案，但可能产生生物学上不现实的预测，且需要大规模数据支持。我们提出一种混合建模方法，该方法利用神经网络对可微分生物物理模型进行参数化，并通过多任务学习在数据有限条件下实现不同作物品种间的高效数据共享。通过预测生物物理模型的参数，我们的方法在保持生物学合理性的同时提升了预测精度。基于真实数据集与合成数据集的实证评估表明，相较于已部署的生物物理模型，本方法在物候期预测精度上提升60%，在抗寒性预测精度上提升40%。

Abstract

Accurate prediction of crop states (e.g., phenology stages and cold hardiness) is essential for timely farm management decisions such as irrigation, fertilization, and canopy management to optimize crop yield and quality. While traditional biophysical models can be used for season-long predictions, they lack the precision required for site-specific management. Deep learning methods are a compelling alternative, but can produce biologically unrealistic predictions and require large-scale data. We propose a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data-limited settings. By predicting the parameters of the biophysical model, our approach improves the prediction accuracy while preserving biological realism. Empirical evaluation using real-world and synthetic datasets demonstrates that our method improves prediction accuracy by 60% for phenology and 40% for cold hardiness compared to deployed biophysical models.

Keywords: hybrid modeling, crop prediction, biophysical model, neural network, multi-task learning, parameter calibration, phenology, cold hardiness
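
The idea in the abstract above, a network that predicts the parameters of a biophysical model instead of predicting the crop state directly, can be illustrated with a toy example. This is only a sketch under strong assumptions: the growing-degree-day model stands in for the paper's biophysical model, a single shared linear layer stands in for the neural network, and the names `predict_parameters` and `stage_progress` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Smooth positive transform so predicted parameters stay physically plausible."""
    return math.log1p(math.exp(x))


def predict_parameters(cultivar_embedding, shared_weights, shared_bias):
    """Stand-in 'network': one linear layer shared across cultivars (multi-task sharing).

    Returns (base_temperature, gdd_required) for the cultivar described by the embedding.
    """
    outputs = []
    for row, bias in zip(shared_weights, shared_bias):
        z = sum(w * e for w, e in zip(row, cultivar_embedding)) + bias
        outputs.append(softplus(z))
    base_temperature, gdd_required = outputs
    return base_temperature, gdd_required


def stage_progress(daily_mean_temps, base_temperature, gdd_required):
    """Biophysical part: accumulated growing degree days as a fraction of the requirement."""
    progress, accumulated = [], 0.0
    for temp in daily_mean_temps:
        accumulated += max(0.0, temp - base_temperature)
        progress.append(accumulated / gdd_required)
    return progress


# With fixed parameters, the biophysical model alone gives a progress curve.
print(stage_progress([10.0, 12.0, 15.0, 20.0], 10.0, 10.0))  # [0.0, 0.2, 0.7, 1.7]

# In the hybrid setup the parameters come from the shared layer instead.
base, required = predict_parameters([1.0, 0.0], [[2.0, 0.0], [5.0, 0.0]], [0.0, 0.0])
print(stage_progress([10.0, 12.0, 15.0, 20.0], base, required))
```

Since the prediction always passes through the biophysical model, the output keeps that model's structure (progress never decreases here) no matter what parameters the network produces. This mirrors the biological-realism argument in the abstract.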

56. ❌ TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

Authors: Kai Wang, Biaojie Zeng, Zeming Wei, Chang Jin, Hefeng Zhou, Xiangtian Li, Chao Yang, Jingjing Qu, Xingcheng Xu, Xia Hu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15408v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

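Scoring rationale: The paper focuses on a safety evaluation and monitoring framework for LLM-based multi-agent systems. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because the MAS it studies are explicitly built on LLMs; highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it studies LLM-based agents; and highly relevant to "Multi-agent Systems OR Agent Coordination" (10 points) because its core concern is the safety of multi-agent systems. Other keywords, such as MoE, SLMs, training techniques, inference optimization, and scientific AI applications, are not covered and all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the TrinityGuard framework for evaluating and monitoring the safety risks of LLM-based multi-agent systems; its three-tier risk taxonomy and real-time monitoring address the lack of a unified safeguarding system in existing research.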

Abstract translation (Chinese)

随着基于大语言模型（LLM）的多智能体系统（MAS）的快速发展，其引发的重大安全与安保问题日益凸显，这些风险已超越单智能体或大语言模型本身，带来了新型威胁。尽管已有研究尝试应对这些问题，但现有文献仍缺乏一个专门针对多智能体系统风险、具有内在一致性的防护体系。本研究提出TrinityGuard，一个基于OWASP标准、面向基于大语言模型的多智能体系统的综合性安全评估与监控框架。具体而言，TrinityGuard构建了一个三层细粒度风险分类体系，识别出20种风险类型，涵盖单智能体漏洞、智能体间通信威胁以及系统级涌现性危害。该框架设计为可适配不同多智能体系统架构与平台，采用三位一体的组织形式：包括一个可适配任意多智能体系统结构的抽象层、一个包含针对特定风险测试模块的评估层，以及一个由统一的大语言模型裁判工厂协调的运行时监控智能体层。在评估阶段，TrinityGuard执行精心设计的攻击探针，为每类风险生成详细漏洞报告；监控智能体则分析结构化的执行轨迹并发出实时警报，从而同时支持开发前评估与运行时监控。我们进一步形式化了这些安全度量指标，并在多个代表性多智能体系统实例中进行了详细案例研究，展示了TrinityGuard的通用性与可靠性。总体而言，TrinityGuard作为一个综合性框架，可用于评估与监控多智能体系统中的各类风险，为其安全与安保领域的进一步研究铺平道路。

Abstract

With the rapid development of LLM-based multi-agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structure, an evaluation layer containing risk-specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.

Keywords: multi-agent systems, LLM-based, safety evaluation, risk taxonomy, runtime monitoring, OWASP standards, vulnerability assessment, security framework

57. ❌ Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context

Authors: Mohamed Aziz Younes, Nicolas Saunier, Guillaume-Alexandre Bilodeau Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

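Scoring rationale: The paper belongs to computer vision and studies the specific task of detecting autonomous shuttles, proposing a neural network architecture called ARC to address catastrophic forgetting. It covers video object detection, transfer learning, attention mechanisms, and urban traffic monitoring, but does not touch on large language models, innovations in deep learning technical principles, or applications of AI to science. All scoring keywords concern large models, deep learning technical principles, or scientific AI applications, whereas this paper is a conventional computer vision application, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a neural network architecture called Adaptive Residual Context (ARC) for detecting autonomous shuttles in fixed-camera video, addressing the catastrophic forgetting that conventional fine-tuning causes when new detection targets are added. Experiments show that ARC matches the detection performance of fine-tuned baselines while preserving scene knowledge.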

Abstract translation (Chinese)

交通系统的逐步自动化有望通过共享出行提升安全性与可持续性。与其他车辆及道路使用者类似，尤其是对于此类新兴技术，需要对其在交通流中的交互行为进行监测以评估其安全性。固定摄像头与视频目标检测技术可实现这一目标。然而，为常规检测方法新增检测目标通常需要采用微调策略。遗憾的是，这种实施策略会导致“灾难性遗忘”现象，从而削弱系统对场景的理解能力。在道路安全应用中，保持对场景上下文的理解对于保护道路使用者至关重要。为此，我们提出自适应残差上下文（Adaptive Residual Context, ARC）架构。该架构通过上下文引导桥接模块，将冻结的上下文分支与可训练的特定任务分支相连接，利用注意力机制传递空间特征，同时保留预训练的表征能力。在定制数据集上的实验表明，ARC在匹配微调基线性能的同时，显著提升了知识保留能力，为复杂城市环境中新增车辆类别检测提供了一种数据高效的解决方案。

Abstract

The progressive automation of transport promises to enhance safety and sustainability through shared mobility. Like other vehicles and road users, and even more so for such a new technology, it requires monitoring to understand how it interacts in traffic and to evaluate its safety. This can be done with fixed cameras and video object detection. However, the addition of new detection targets generally requires a fine-tuning approach for regular detection methods. Unfortunately, this implementation strategy will lead to a phenomenon known as catastrophic forgetting, which causes a degradation in scene understanding. In road safety applications, preserving contextual scene knowledge is of the utmost importance for protecting road users. We introduce the Adaptive Residual Context (ARC) architecture to address this. ARC links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, utilizing attention to transfer spatial features while preserving pre-trained representations. Experiments on a custom dataset show that ARC matches fine-tuned baselines while significantly improving knowledge retention, offering a data-efficient solution to add new vehicle categories for complex urban environments.

Keywords: autonomous shuttle detection, video object detection, catastrophic forgetting, adaptive residual context, context-guided bridge, urban traffic monitoring, knowledge retention, fine-tuning

58. ❌ A Closer Look into LLMs for Table Understanding

Authors: Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15402v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

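Scoring rationale: The paper's core subject is the internal mechanisms of LLMs in table understanding. It directly involves LLMs, MoE models, and Chain-of-Thought prompting and explores model interpretability, so those keywords are rated highly relevant (10). Other keywords, such as SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not mentioned in the abstract and are rated irrelevant (0).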
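
The arithmetic behind this paper's score table can be sketched as follows. The pass mark formula (5.0 + 0.8 × total weight) is taken from the report's configuration section; the per-keyword rule `weight * relevance` is an assumption made here for illustration, since the report does not spell it out, and all function names are hypothetical. The sketch shows why a row with relevance 10.0/10 still contributes 0.0 when its printed weight is 0.0.

```python
def pass_mark(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: a keyword's score is weight * relevance (relevance in 0..10)."""
    return weight * relevance


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows as printed for this paper: four keywords at relevance 10.0,
# twenty-three at 0.0, and every weight printed as 0.0.
printed_rows = [(0.0, 10.0)] * 4 + [(0.0, 0.0)] * 23

# Hypothetical comparison: the same relevances with a weight of 1.0 per keyword.
unit_weight_rows = [(1.0, relevance) for _, relevance in printed_rows]

threshold = pass_mark(27.0)  # 27 keywords, each configured with weight 1.0

print(f"pass mark: {threshold:.1f}")
print(f"score with printed weights: {paper_score(printed_rows):.1f}")
print(f"score with unit weights (assumed rule): {paper_score(unit_weight_rows):.1f}")
```

Under the assumed rule, any table whose weight column is all zeros totals 0.0 regardless of relevance. Whether the report intends a different rule (for example, one tied to the configured per-keyword maximum of 15) is not stated.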

!!! tip deepseek-chat TL;DR

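Through an empirical study of 16 LLMs (general LLMs, specialist tabular LLMs, and MoE models), the paper examines how LLMs process tabular data internally. It finds that they follow a three-phase attention pattern, that tabular tasks need deeper layers to reach stable predictions, that MoE models activate table-specific experts, and that Chain-of-Thought prompting increases attention to the table.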

Abstract (Chinese translation)

尽管大语言模型（LLM）在表格理解任务中取得了成功，但其内部机制仍不明确。本文对16个大语言模型进行了实证研究，涵盖通用大语言模型、专业表格理解大语言模型以及混合专家模型，以探索大语言模型如何理解表格数据并执行下游任务。我们的分析聚焦于四个维度：注意力动态、有效层深度、专家激活情况以及输入设计的影响。主要发现包括：（1）大语言模型遵循三阶段注意力模式——早期层广泛扫描表格，中间层定位相关单元格，后期层放大其贡献；（2）表格任务比数学推理需要更深层的处理才能达到稳定的预测；（3）混合专家模型在中间层激活表格专用专家，而早期层和后期层则共享通用专家；（4）思维链提示能增强对表格的注意力，而经过表格调优的模型该效应更为显著。我们希望这些发现与见解能促进表格相关任务的可解释性研究及未来探索。

Abstract

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions including the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern – early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.

Keywords: Large Language Models, Table Understanding, Mixture-of-Experts, Attention Dynamics, Chain-of-Thought, Interpretability, Empirical Study, Tabular Data

59. ❌ SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Authors: Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15401v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

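Scoring rationale: The paper evaluates the effect of skill injection for LLM agents on software engineering tasks. It is highly relevant to 'LLM Agents' (10) because it directly studies LLM agents in software engineering, moderately relevant to 'Tool Use' (5) because agent skills can be viewed as a form of tool use, and highly relevant to 'Large Language Models' (10) because it explicitly concerns LLM agents. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are not covered in the paper and score 0.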

!!! tip deepseek-chat TL;DR

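The paper studies the practical utility of injecting skills into LLM agents on real-world software engineering tasks. Most skills yield very limited improvement (+1.2% on average), only a few specialized skills produce meaningful gains (up to +30%), and some degrade performance (up to -10%) because of version-mismatched guidance.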

Abstract (Chinese translation)

智能体技能（Agent skills）是在推理时注入的结构化程序知识包，正日益被用于增强大语言模型智能体在软件工程任务中的能力。然而，其在端到端开发环境中的实际效用尚不明确。我们提出了SWE-Skills-Bench，这是首个需求驱动的基准测试，旨在隔离智能体技能在真实世界软件工程（SWE）中的边际效用。该基准将49个公开的SWE技能与固定在特定提交版本的GitHub仓库以及包含明确验收标准的需求文档配对，在六个SWE子领域中产生了约565个任务实例。我们引入了一个确定性验证框架，将每个任务的验收标准映射为基于执行的测试，从而支持在有技能和无技能情况下的受控配对评估。我们的结果表明，技能注入带来的益处远不如其快速普及所暗示的那样显著：49项技能中有39项未带来任何通过率提升，平均增益仅为+1.2%。令牌开销从适度节省到增加451%不等，而通过率保持不变。只有七项专门技能产生了有意义的增益（最高达+30%），同时有三项技能因版本不匹配的指导与项目上下文冲突而导致性能下降（最高达-10%）。这些发现表明，智能体技能是一种狭窄的干预手段，其效用高度依赖于领域适配性、抽象层次和上下文兼容性。SWE-Skills-Bench为评估软件工程智能体中技能的设计、选择和部署提供了一个测试平台。SWE-Skills-Bench可通过 https://github.com/GeniusHTX/SWE-Skills-Bench 获取。

Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task’s acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
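
The paired with-skill and without-skill comparison described in the abstract can be illustrated with a small sketch. The task outcomes below are invented for illustration and are not data from the benchmark; `pass_rate` and `skill_delta` are hypothetical helper names, not part of SWE-Skills-Bench.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of task instances whose execution-based tests all passed."""
    return sum(results) / len(results)


def skill_delta(without_skill: list[bool], with_skill: list[bool]) -> float:
    """Pass-rate change in percentage points for one skill on the same tasks."""
    if len(without_skill) != len(with_skill):
        raise ValueError("paired evaluation needs the same task instances in both runs")
    return 100.0 * (pass_rate(with_skill) - pass_rate(without_skill))


# Illustrative outcomes only (True = every acceptance test passed).
baseline_run = [True, False, True, False, False, True, False, True, False, False]
skill_run = [True, False, True, True, False, True, False, True, False, False]

delta = skill_delta(baseline_run, skill_run)
print(f"pass-rate change: {delta:+.1f} percentage points")
```

Because both runs use the same task instances, the difference isolates the marginal effect of the skill, which is the quantity the benchmark reports per skill.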

Keywords: LLM agents, agent skills, software engineering, benchmark, SWE-Skills-Bench, real-world evaluation, skill injection, deterministic verification

60. ❌ AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Authors: Noe Claudel, Weisi Guo, Yang Xing Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15396v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

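Scoring rationale: The paper studies adversarial attacks on facial re-identification systems and mainly involves computer vision, adversarial machine learning, and explainable AI. It is unrelated to the vast majority of the large-model and deep-learning-principles keywords (LLM, MoE, RLHF, RAG, and so on). It is moderately relevant only to 'Mechanistic Interpretability OR Explainable AI' (5): the paper clusters activation maps to explain which features the attacks exploit, which falls under explainable AI but is not the paper's core focus.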

!!! tip deepseek-chat TL;DR

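The paper proposes an adversarial patch generation framework targeting cross-camera facial re-identification systems. The framework supports both evasion and impersonation attacks and uses activation maps to explain the attack mechanism; experiments show that it sharply degrades system performance and generalizes across models.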

Abstract (Chinese translation)

面部识别系统在监控领域日益普及，但其对对抗性规避与冒充攻击的脆弱性构成了严重风险。本文提出了一种新颖的对抗性补丁生成框架，能够在非重叠摄像头场景下对深度重识别模型同时实施规避与冒充攻击。与以往需要针对每个目标进行迭代补丁优化的方法不同，我们的方法采用条件编码器-解码器网络，在单次前向传播中合成对抗性补丁，并以源图像与目标图像的多尺度特征作为引导。补丁通过包含“拉近”与“推开”项的双重对抗目标进行优化。为提升补丁的隐蔽性并助力物理部署，我们进一步整合了基于预训练潜在扩散模型的自然风格补丁生成技术。在标准行人重识别数据集（Market-1501、DukeMTMCreID）和人脸识别基准数据集（CelebA-HQ、PubFig）上的实验验证了所提方法的有效性。我们的对抗性规避攻击在白盒设置下将平均精度均值从90%降至0.4%，在黑盒设置下从72%降至0.4%，显示出强大的跨模型泛化能力。在目标冒充攻击中，本框架在CelebA-HQ数据集上实现了27%的成功率，与其它基于补丁的方法性能相当。我们进一步通过激活图聚类分析对抗攻击最常利用的特征，并为未来防御对策提出了路径。这些结果凸显了对抗性补丁攻击在基于检索的系统中的实际威胁，并强调了开发鲁棒防御策略的迫切性。

Abstract

Facial identification systems are increasingly deployed in surveillance and yet their vulnerability to adversarial evasion and impersonation attacks poses a critical risk. This paper introduces a novel framework for generating adversarial patches capable of both evasion and impersonation attacks against deep re-identification models across non-overlapping cameras. Unlike prior approaches that require iterative patch optimisation for each target, our method employs a conditional encoder-decoder network to synthesize adversarial patches in a single forward pass, guided by multi-scale features from source and target images. The patches are optimised with a dual adversarial objective comprising pull and push terms. To enhance imperceptibility and aid physical deployment, we further integrate naturalistic patch generation using pre-trained latent diffusion models. Experiments on standard pedestrian (Market-1501, DukeMTMC-reID) and facial recognition (CelebA-HQ, PubFig) benchmark datasets demonstrate the effectiveness of the proposed method. Our adversarial evasion attacks reduce mean Average Precision from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, showing strong cross-model generalization. In targeted impersonation attacks, our framework achieves a success rate of 27% on CelebA-HQ, competing with other patch-based methods. We go further to use clustering of activation maps to interpret which features are most used by adversarial attacks and propose a pathway for future countermeasures. The results highlight the practicality of adversarial patch attacks on retrieval-based systems and underline the urgent need for robust defense strategies.

Keywords: adversarial attacks, facial re-identification, evasion attacks, impersonation attacks, activation map explanations, adversarial patches, deep learning, computer vision

61. ❌ Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization

Authors: Yanning Dai, Yuhui Wang, Dylan R. Ashley, Jürgen Schmidhuber Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15388v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

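Scoring rationale: The paper studies morphology-control co-design and proposes a PPO algorithm based on Stackelberg game theory; it belongs to robotics, reinforcement learning, and optimization. It does not involve large language models, deep learning principles, or AI applications in science and has no direct connection to any scoring keyword, so all keywords score 0.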

!!! tip deepseek-chat TL;DR

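The paper addresses the bi-level optimization problem in morphology-control co-design and proposes Stackelberg PPO, which models the intrinsic coupling between morphology and control and significantly improves both training stability and final performance.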

Abstract (Chinese translation)

形态-控制协同设计涉及智能体身体结构与控制策略的耦合优化。该问题呈现双层结构，其中控制策略会动态适应形态以实现性能最大化。现有方法通常采用单层优化框架，在优化形态时将控制策略视为固定参数，从而忽视了控制的适应动态。这可能导致优化效率低下，因为形态更新可能与控制适应过程不匹配。本文从博弈论视角重新审视协同设计问题，将形态与控制的内在耦合建模为斯塔克尔伯格博弈的一种新变体。我们提出斯塔克尔伯格近端策略优化算法，该方法将控制的适应动态显式纳入形态优化过程。通过对这种内在耦合进行建模，我们的方法使形态更新与控制适应保持同步，从而稳定训练过程并提升学习效率。在多种协同设计任务上的实验表明，斯塔克尔伯格PPO在训练稳定性和最终性能上均优于标准PPO算法，为显著提升机器人设计效率开辟了新路径。

Abstract

Morphology-control co-design concerns the coupled optimization of an agent’s body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control’s adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control’s adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.
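
To make the bi-level point concrete, here is a toy quadratic example. It is not the paper's Stackelberg PPO: the leader variable `x` stands in for morphology and the follower variable `y` for control, and both cost functions are assumptions chosen so that the follower's best response is `y*(x) = x`. The single-level update treats `y` as fixed when differentiating, while the Stackelberg update adds the response term (∂f/∂y) · (dy*/dx).

```python
def leader_cost(x: float, y: float) -> float:
    """Toy leader objective (assumed): f(x, y) = (x - 1)^2 + (y - 2)^2."""
    return (x - 1.0) ** 2 + (y - 2.0) ** 2


def best_response(x: float) -> float:
    """Follower minimises g(x, y) = (y - x)^2, so y*(x) = x and dy*/dx = 1."""
    return x


def optimise(use_response_term: bool, steps: int = 200, lr: float = 0.1) -> tuple[float, float]:
    """Gradient descent on the leader variable; the follower always re-adapts."""
    x = 0.0
    y = best_response(x)
    for _ in range(steps):
        grad = 2.0 * (x - 1.0)  # partial derivative of f with respect to x
        if use_response_term:
            # Stackelberg correction: (df/dy) * (dy*/dx), with dy*/dx = 1 here.
            grad += 2.0 * (y - 2.0) * 1.0
        x -= lr * grad
        y = best_response(x)
    return x, y


single_level = optimise(use_response_term=False)
stackelberg = optimise(use_response_term=True)

print("single-level:", single_level, "cost", leader_cost(*single_level))
print("stackelberg: ", stackelberg, "cost", leader_cost(*stackelberg))
```

In this toy problem the single-level update settles at `x = 1`, where the leader's own term is minimised but the follower's reaction is ignored, while the update with the response term settles at `x = 1.5` with a lower leader cost. This mirrors the misalignment between morphology updates and control adaptation that the abstract describes.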

Keywords: morphology-control co-design, Stackelberg game, proximal policy optimization, bi-level optimization, robotics design, control adaptation, morphology optimization, reinforcement learning

62. ❌ Why AI systems don’t learn and what to do about it: Lessons on autonomous learning from cognitive science

Authors: Emmanuel Dupoux, Yann LeCun, Jitendra Malik Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15381v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

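Scoring rationale: The paper mainly discusses the limitations of current AI models in autonomous learning and proposes a learning architecture framework inspired by human and animal cognition (System A, System B, System M). Although it concerns AI learning mechanisms, every keyword targets specific large-model or deep learning techniques, methods, or application areas (LLM, MoE, RLHF, RAG, quantization, and so on), whereas this paper is a higher-level, cognitive-science-inspired discussion of AI learning architectures that involves no specific large-model technique, training method, optimization technique, or application area. All keywords therefore score 0.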

!!! tip deepseek-chat TL;DR

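The paper critically analyzes the limitations of current AI models in achieving autonomous learning and proposes a learning architecture inspired by human and animal cognition. The architecture integrates learning from observation (System A) and learning from active behavior (System B) and switches flexibly between these modes through meta-control signals (System M).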

Abstract (Chinese translation)

我们批判性地审视了当前人工智能模型在实现自主学习方面的局限性，并提出了一种受人类与动物认知启发的学习架构。该框架整合了从观察中学习（系统A）与从主动行为中学习（系统B）两种模式，并能依据内部生成的元控制信号（系统M）灵活切换这些学习模式。我们探讨了如何借鉴生物体在进化与发育时间尺度上适应真实动态环境的机制，来构建这一架构。

Abstract

We critically examine the limitations of current AI models in achieving autonomous learning and propose a learning architecture inspired by human and animal cognition. The proposed framework integrates learning from observation (System A) and learning from active behavior (System B) while flexibly switching between these learning modes as a function of internally generated meta-control signals (System M). We discuss how this could be built by taking inspiration from how organisms adapt to real-world, dynamic environments across evolutionary and developmental timescales.

Keywords: autonomous learning, cognitive science, learning architecture, System A, System B, System M, meta-control, AI limitations

63. ❌ RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Authors: Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

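Scoring rationale: The paper proposes the RieMind framework, which uses an LLM as the core reasoning component and has it interact with a 3D scene graph through structured geometric tools to perform spatial reasoning. It is therefore highly relevant to 'Large Language Models' (10), falls under 'LLM Agents' (10), and involves 'Tool Use' (10). Its focus on spatial reasoning makes it moderately relevant to 'Chain of Thought' and 'System 2 Thinking' (5 each). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are rated 0.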

!!! tip deepseek-chat TL;DR

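The paper investigates whether decoupling perception from reasoning, by grounding an LLM in an explicit 3D scene graph, improves spatial reasoning over indoor scenes. The geometry-grounded agent framework significantly outperforms base vision-language models, with average improvements of 33% to 50%.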

Abstract (Chinese translation)

视觉语言模型（VLMs）已日益成为理解室内场景的主要范式，但其在度量与空间推理方面仍面临挑战。现有方法依赖于端到端的视频理解或大规模空间问答微调，本质上将感知与推理过程耦合。本文探讨了将感知与推理解耦是否能够提升空间推理能力。我们提出了一种用于静态三维室内场景推理的智能体框架，该框架将大语言模型（LLM）显式地锚定在三维场景图（3DSG）中。与直接处理视频不同，每个场景均由专用感知模块构建为持久化的三维场景图。为隔离推理性能的影响，我们基于真实标注数据实例化了三维场景图。智能体仅通过结构化几何工具与场景交互，这些工具可获取物体尺寸、距离、姿态及空间关系等基本属性。在VSI-Bench静态数据集上的实验结果表明，在理想感知条件下，该框架为空间推理性能提供了上限，且无需任务特定微调即显著优于先前方法，最高提升达16%。与基础视觉语言模型相比，我们的智能体变体实现了显著更优的性能，平均提升幅度在33%至50%之间。这些发现表明，显式的几何锚定能大幅提升空间推理性能，并说明结构化表征为纯端到端视觉推理提供了具有竞争力的替代方案。

Abstract

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% and 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
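
A minimal sketch of the kind of structured geometric tool the abstract describes: functions that expose object dimensions and distances from a scene graph instead of raw video. The `SceneObject` class, the tool names, and the two example objects are all hypothetical; the abstract does not give the paper's actual tool interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One node of a hypothetical 3D scene graph (units: metres)."""
    name: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]  # bounding-box width, depth, height


def get_dimensions(scene: dict[str, SceneObject], name: str) -> tuple[float, float, float]:
    """Tool: return the bounding-box size of one object."""
    return scene[name].size


def get_distance(scene: dict[str, SceneObject], a: str, b: str) -> float:
    """Tool: Euclidean distance between the centres of two objects."""
    return math.dist(scene[a].center, scene[b].center)


# Illustrative scene; in the paper the graph is built from ground-truth annotations.
scene = {
    "sofa": SceneObject("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
    "lamp": SceneObject("lamp", (3.0, 4.0, 0.4), (0.3, 0.3, 1.5)),
}

print(get_dimensions(scene, "sofa"))
print(get_distance(scene, "sofa", "lamp"))
```

With tools like these, a metric question such as "how far is the lamp from the sofa" becomes an exact lookup and computation, and the LLM only has to decide which tool to call.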

Keywords: Visual Language Models, spatial reasoning, 3D scene graph, agentic framework, geometric grounding, scene understanding, LLM agents, structured representations

64. ❌ More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search

Authors: Gal Dalal, Assaf Hallak, Gal Chechik, Yftach Ziser Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15377v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

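Scoring rationale: The paper's core subject is the optimization of beam search in LLM inference. It bears directly on the LLM inference process, so it is highly relevant to 'Large Language Models' (10). Because it studies search strategies used during reasoning, it is moderately relevant to 'Chain of Thought' and 'System 2 Thinking' (5 each), although those reasoning methods are not themselves the object of study. Other keywords, such as MoE, SLMs, training methods, alignment, RAG, compression, and agents, are not covered in the paper and score 0.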

!!! tip deepseek-chat TL;DR

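The paper finds that in LLM beam search, when the scorer is noisy, an overly wide beam can hurt output quality through a systematic overestimation bias. Using Extreme Value Theory, it derives a maximum useful beam width that depends on the scorer's signal-to-noise ratio.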

Abstract (Chinese translation)

更宽的束搜索（beam search）本应提升大语言模型（LLM）的推理能力，但何时应停止拓宽？先前关于束宽选择的研究主要关注推理效率 \citep{qin2025dsbd, freitag2017beam}，并未分析更宽的搜索是否会\emph{损害}输出质量。我们基于极值理论提出一项分析，以解答此问题。在带有噪声的评分器输出上进行束选择会引入一种系统性的高估偏差，该偏差随候选池规模的增大而增长；我们推导出了一个最大有效束宽 $\hat{k}$，超过此宽度搜索反而会降低性能。这一临界宽度取决于评分器的信噪比：$\hat{k}$ 随 $(Δ/σ)^2$ 呈指数增长，其中 $Δ> 0$ 是正确路径相对于错误路径的质量优势，$σ$ 是评分器噪声。我们通过在 MR-BEN 数据集（5,975 个问题）上比较三种 70 亿参数模型在十个领域内由困惑度引导和 PRM（过程奖励模型）引导的束搜索，验证了这一理论。困惑度评分噪声较高，其 $\hat{k} = 1$：在所有测试宽度下，搜索均未带来收益。PRM 评分噪声较低，其 $\hat{k} \geq 4$，性能提升最高可达 8.9 个百分点。同一模型、同一算法，但不同的评分器会将 $\hat{k}$ 置于束宽范围的两端。我们的分析指出，评分器的信噪比是决定束宽选择的关键量，并为此提出了在实践中选择束宽的诊断指标。

Abstract

Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency \citep{qin2025dsbd, freitag2017beam}, without analyzing whether wider search can hurt output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width $\hat{k}$ beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: $\hat{k}$ grows exponentially with $(Δ/σ)^2$, where $Δ > 0$ is the quality advantage of correct paths over incorrect ones and $σ$ is the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN (5,975 questions). Perplexity scoring, with its high noise, yields $\hat{k} = 1$: search provides no benefit at any width tested. PRM scoring, with lower noise, yields $\hat{k} \geq 4$, with gains of up to 8.9 percentage points. The same model, the same algorithm, but different scorers place $\hat{k}$ at opposite ends of the beam width range. Our analysis identifies the scorer’s signal-to-noise ratio as the key quantity governing beam width selection, and we propose diagnostic indicators for choosing the beam width in practice.
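
The overestimation bias can be illustrated with a small Monte Carlo sketch. This is not the paper's experiment: it assumes `k` candidates of identical true quality (set to 0) whose scores are corrupted by independent Gaussian noise with standard deviation `sigma`, and it measures how far the selected (maximum) score exceeds the true quality on average. The function name and parameters are illustrative.

```python
import random


def selection_bias(k: int, sigma: float, trials: int = 20000, seed: int = 0) -> float:
    """Average excess of the best-looking of k noisy scores over the true quality (0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Every candidate has true quality 0; only the scorer noise differs.
        total += max(rng.gauss(0.0, sigma) for _ in range(k))
    return total / trials


biases = {k: selection_bias(k, sigma=1.0) for k in (1, 2, 4, 16)}

for k, bias in biases.items():
    print(f"k={k:2d}  mean overestimate={bias:.3f}")
```

With a single candidate the selected score is unbiased, and the overestimate grows as the pool widens even though no candidate is actually better. Once this noise-driven excess becomes comparable to the true advantage Δ of a correct path, a wider pool makes it more likely that an incorrect path wins the selection, which is the intuition behind a maximum useful beam width.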

Keywords: beam search, LLM reasoning, overestimation bias, signal-to-noise ratio, extreme value theory, inference efficiency, scorer noise, MR-BEN benchmark

65. ❌ GradCFA: A Hybrid Gradient-Based Counterfactual and Feature Attribution Explanation Algorithm for Local Interpretation of Neural Networks

Authors: Jacob Sanderson, Hua Mao, Wai Lok Woo Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

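Scoring rationale: The paper focuses on explainable AI (XAI) for neural networks, specifically a hybrid method that combines counterfactual explanations (CFX) and feature attribution (FA). This is highly relevant to the keyword 'Mechanistic Interpretability OR Explainable AI' (10 points), because the paper's core contribution is a new explanation algorithm that improves model transparency. However, the paper does not cover large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, or training methods), reasoning techniques (such as CoT or MCTS), agent systems, model optimization (such as quantization or acceleration), or AI-for-science applications, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

This study proposes GradCFA, a hybrid gradient-based algorithm that combines counterfactual explanations and feature attribution for local interpretation of neural networks. In multi-class settings it effectively generates feasible, plausible, and diverse counterfactuals while providing valuable feature insights.

Abstract (Chinese translation)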

随着人工智能系统在医疗和金融等关键领域的广泛应用，可解释人工智能（XAI）对于揭示AI驱动决策的透明度日益重要。反事实解释（CFX）与特征归因（FA）作为XAI的两大主要范式，在模型可解释性中发挥着不同作用。本研究提出GradCFA——一种融合CFX与FA的混合框架，通过显式优化可行性、合理性与多样性来提升可解释性，这些关键特性在现有方法中常存在失衡问题。与多数聚焦于二分类的CFX研究不同，GradCFA可扩展至多分类场景，支持更广泛的应用。我们针对GradCFA的有效性、邻近性、稀疏性、合理性和多样性进行了评估，并与包括Wachter、DiCE、CARE（针对CFX）及SHAP（针对FA）在内的前沿方法进行了对比。结果表明，GradCFA能有效生成可行、合理且多样化的反事实解释，同时提供有价值的FA洞见。通过识别关键特征并验证其影响，GradCFA推动了AI可解释性的发展。本工作的实现代码可见于：https://github.com/jacob-ws/GradCFs。

Abstract

Explainable Artificial Intelligence (XAI) is increasingly essential as AI systems are deployed in critical fields such as healthcare and finance, offering transparency into AI-driven decisions. Two major XAI paradigms, counterfactual explanations (CFX) and feature attribution (FA), serve distinct roles in model interpretability. This study introduces GradCFA, a hybrid framework combining CFX and FA to improve interpretability by explicitly optimizing feasibility, plausibility, and diversity - key qualities often unbalanced in existing methods. Unlike most CFX research focused on binary classification, GradCFA extends to multi-class scenarios, supporting a wider range of applications. We evaluate GradCFA’s validity, proximity, sparsity, plausibility, and diversity against state-of-the-art methods, including Wachter, DiCE, CARE for CFX, and SHAP for FA. Results show GradCFA effectively generates feasible, plausible, and diverse counterfactuals while offering valuable FA insights. By identifying influential features and validating their impact, GradCFA advances AI interpretability. The code for implementation of this work can be found at: https://github.com/jacob-ws/GradCFs .

Keywords: Explainable AI, Counterfactual Explanations, Feature Attribution, Neural Network Interpretation, GradCFA, Model Interpretability, Multi-class Scenarios

66. ❌ SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations

Authors: Ivo Brett Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15372v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

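Scoring rationale: The paper's core is the practical application of LLM agents to telecom operations, improving API tool-calling ability through structured knowledge injection. It is highly relevant to 'Large Language Models', 'LLM Agents', and 'Tool Use' (10 points each); the other keywords are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper investigates whether general-purpose LLM agents can reliably execute telecom operations workflows through real API interfaces. Results show that structured knowledge injection (the SKILLS framework) significantly improves the performance of every model tested.

Abstract (Chinese translation)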

随着电信运营商加速采用人工智能驱动的自动化，一个实际问题仍未解决：通用大语言模型（LLM）智能体能否通过真实的API接口可靠地执行电信运营工作流，还是需要结构化的领域指导？我们提出了SKILLS（面向LLM驱动服务生命周期运营的结构化知识注入），这是一个基准测试框架，包含涵盖8个TM Forum Open API领域（TMF620、TMF621、TMF622、TMF628、TMF629、TMF637、TMF639、TMF724）的37个电信运营场景。每个场景都基于植入具有生产代表性数据的实时模拟API服务器、MCP工具接口，以及结合了响应内容检查、工具调用验证和数据库状态断言的确定性评估标准。我们在两种条件下评估开源模型：基线条件（具有工具访问权限但无领域指导的通用智能体）和技能增强条件（通过一个便携式SKILL.md文档增强的智能体，该文档编码了工作流逻辑、API模式和业务规则）。在5种开源模型条件和185个场景运行中的结果显示，所有模型均获得一致的性能提升。MiniMax M2.5领先（技能增强条件下81.1%，提升13.5个百分点），其次是Nemotron 120B（78.4%，提升18.9个百分点）、GLM-5 Turbo（78.4%，提升5.4个百分点）和Seed 2.0 Lite（75.7%，提升18.9个百分点）。

Abstract

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).

Keywords: LLM agents, telecommunications operations, API tool use, structured knowledge injection, benchmark framework, TM Forum Open API, workflow automation, domain guidance
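
The percentages reported in the abstract above can be read back as scenario counts. The sketch below is a rough check, assuming each percentage is a pass rate over the 37 scenarios with one run per scenario (consistent with 185 runs = 5 × 37). That reading is an assumption and is not stated in the abstract, which also names only four of the five model conditions. The helper `implied_count` is hypothetical.

```python
SCENARIOS = 37  # scenarios per model condition, from the abstract

# (with-skill %, lift in percentage points) as reported in the abstract
reported = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def implied_count(percent, total=SCENARIOS):
    """Nearest whole number of scenarios matching a reported percentage."""
    return round(percent / 100 * total)


# model -> (with-skill passes, scenarios gained, implied baseline passes)
rows = {}
for model, (with_skill, lift) in reported.items():
    passed = implied_count(with_skill)
    gained = implied_count(lift)
    rows[model] = (passed, gained, passed - gained)
    print(f"{model}: {passed}/{SCENARIOS} with skill, "
          f"+{gained} scenarios, implied baseline {passed - gained}/{SCENARIOS}")
```

Under this assumption, one scenario is worth about 2.7 percentage points, so the smallest reported lift (+5.4pp) corresponds to two scenarios.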

67. ❌ FuXiWeather2: Learning accurate atmospheric state estimation for operational global weather forecasting

Authors: Xiaoze Xu, Xiuyu Sun, Songling Zhu, Xiaohui Zhong, Yuanqing Huang, Zijian Zhu, Jun Liu, Hao Li Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

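Scoring rationale: The paper focuses on global weather forecasting with a neural network framework, an application of AI to science (meteorology), so it is highly relevant only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper does not cover large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, or training methods), reasoning techniques (such as CoT or MCTS), agent systems, or model optimization techniques (such as quantization or RAG), so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes FuXiWeather2, a unified end-to-end neural framework for data assimilation and weather forecasting. Trained on a combination of real-world observations and reanalysis data, it effectively corrects inherent errors in reanalysis products and outperforms existing numerical weather prediction systems in global weather analysis and 10-day forecasts.

Abstract (Chinese translation)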

数值天气预报长期以来一直受限于数据同化和数值模式固有的计算瓶颈。虽然机器学习已加速了预报进程，但现有模型主要作为"再分析产品的模拟器"，从而保留了其系统性偏差和业务延迟。本文提出FuXiWeather2，一个用于同化和预报的统一端到端神经框架。我们将训练目标直接与真实观测数据和再分析数据相结合，使该框架能够有效校正再分析产品中的固有误差。为应对训练期间源自数值天气预报的背景输入与部署期间自生成背景之间的分布偏移，我们引入了一种递归展开训练方法，以提升分析场生成的精度和稳定性。此外，我们的模型在原始观测与模拟观测的混合数据集上进行训练，以减轻观测分布不一致的影响。FuXiWeather2可在数分钟内生成高分辨率（$0.25^{\circ}$）的全球分析场和10天预报。其分析场在多数变量上超越NCEP-GFS系统，并在对流层低层和地表变量上展现出优于ERA5和ECMWF-HRES系统的准确性。这些高质量分析场驱动的确定性预报，在91%的评估指标上超越了HRES系统的预报技巧。此外，其在台风路径预测中的卓越表现凸显了其在快速响应极端天气事件方面的实用价值。FuXiWeather2分析数据集可通过https://doi.org/10.5281/zenodo.18872728获取。

Abstract

Numerical weather prediction has long been constrained by the computational bottlenecks inherent in data assimilation and numerical modeling. While machine learning has accelerated forecasting, existing models largely serve as “emulators of reanalysis products,” thereby retaining their systematic biases and operational latencies. Here, we present FuXiWeather2, a unified end-to-end neural framework for assimilation and forecasting. We align training objectives directly with a combination of real-world observations and reanalysis data, enabling the framework to effectively rectify inherent errors within reanalysis products. To address the distribution shift between NWP-derived background inputs during training and self-generated backgrounds during deployment, we introduce a recursive unrolling training method to enhance the precision and stability of analysis generation. Furthermore, our model is trained on a hybrid dataset of raw and simulated observations to mitigate the impact of observational distribution inconsistency. FuXiWeather2 generates high-resolution ($0.25^{\circ}$) global analysis fields and 10-day forecasts within minutes. The analysis fields surpass the NCEP-GFS across most variables and demonstrate superior accuracy over both ERA5 and the ECMWF-HRES system in lower-tropospheric and surface variables. These high-quality analysis fields drive deterministic forecasts that exceed the skill of the HRES system in 91% of evaluated metrics. Additionally, its outstanding performance in typhoon track prediction underscores its practical value for rapid response to extreme weather events. The FuXiWeather2 analysis dataset is available at https://doi.org/10.5281/zenodo.18872728.

Keywords: numerical weather prediction, data assimilation, neural framework, global weather forecasting, reanalysis data, typhoon track prediction, extreme weather events

68. ❌ CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving

Authors: Erick Silva, Rehana Yasmin, Ali Shoker Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15364v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

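Scoring rationale: The paper's core is CRASH, an LLM-based agent that automatically analyzes autonomous-driving crash reports, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points). The system performs cognitive reasoning, which relates to 'Chain of Thought' and 'System 2 Thinking' (8 points). As an application of AI to autonomous-driving safety, it has some connection to 'AI for Science' (5 points). The system is described as an interpretable tool, which is weakly related to 'Explainable AI' (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

This study develops CRASH, an LLM-based cognitive reasoning agent that automatically analyzes autonomous-driving crash reports. It attributes AV system failures with 86% accuracy in a validation with five domain experts and surfaces key safety insights: perception or planning failures account for 64% of incidents, and rear-end collisions for about 50%.

Abstract (Chinese translation)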

随着自动驾驶车辆（AV）在复杂性和多样性上的增长，识别运行故障的根本原因变得日益复杂。不同制造商系统架构的异质性——从端到端（end-to-end）到模块化（modular）设计——以及算法和集成策略的差异，限制了事故调查的标准化，并阻碍了系统性的安全分析。本研究分析了美国国家公路交通安全管理局（NHTSA）数据库中报告的真实世界自动驾驶车辆事故。我们整理了一个包含2021年至2025年间报告的2,168起案例的数据集，代表超过8,000万英里的行驶里程。为处理这些数据，我们引入了CRASH（Cognitive Reasoning Agent for Safety Hazards，安全危害认知推理代理），这是一个基于大语言模型（LLM）的智能体，通过利用标准化字段和非结构化的叙述性描述，实现对事故报告的自动化推理。CRASH基于每个事件的统一表征进行操作，以生成简明摘要、归因主要致因，并评估自动驾驶车辆是否对事件产生了实质性影响。我们的研究结果表明：（1）CRASH将64%的事故归因于感知（perception）或规划（planning）故障，这凸显了基于推理的分析对于准确归因的重要性；（2）约50%的报告事故涉及追尾碰撞，突显了自动驾驶部署中一个持续存在且尚未解决的挑战。我们进一步邀请了五位领域专家对CRASH进行验证，其在归因自动驾驶系统故障方面达到了86%的准确率。总体而言，CRASH展现出了作为可扩展且可解释的自动化事故分析工具的强大潜力，能够为支持安全研究和自动驾驶系统的持续发展提供可操作的见解。

Abstract

As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly complex. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures. Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.

Keywords: autonomous driving, safety hazards, LLM-based agent, cognitive reasoning, crash analysis, fault attribution, AV incidents, NHTSA database

69. ❌ Conditional Rectified Flow-based End-to-End Rapid Seismic Inversion Method

Authors: Haofei Xu, Wei Cheng, Sizhe Li, Jie Xiong Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

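Scoring rationale: The paper focuses on seismic inversion in geophysical exploration and proposes an end-to-end fast seismic inversion method based on conditional rectified flow. Its core technique is a deep generative model (specifically conditional rectified flow), used to address the high computational cost and strong initial-model dependence of traditional methods. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection to the paper, because seismic inversion is an AI application in scientific computing and geophysics, although the paper does not touch on bioinformatics or cheminformatics. The paper is unrelated to the remaining keywords, which concern LLM-specific techniques such as large language models, model training, alignment, inference optimization, and agent systems, so they score 0. The weighted total comes solely from the 5 points for the last keyword.

!!! tip deepseek-chat TL;DR

The paper proposes an end-to-end fast seismic inversion method based on conditional rectified flow. It achieves high inversion accuracy on the OpenFWI benchmark dataset, samples faster than diffusion methods, and generates more accurately than InversionNet methods. Zero-shot generalization is validated on Marmousi real data, effectively alleviating the initial-model dependence of traditional full waveform inversion.

Abstract (Chinese translation)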

地震反演是地球物理勘探中的核心问题，传统方法存在计算成本高、易受初始模型依赖性的局限。近年来，基于深度生成模型的地震反演方法取得了显著进展，但现有生成模型难以平衡采样效率与反演精度。本文提出一种基于条件修正流（Conditional Rectified Flow）的端到端快速地震反演方法，通过设计专用的地震编码器提取多尺度地震特征，并采用逐层注入控制策略实现细粒度条件控制。实验结果表明，所提方法在OpenFWI基准数据集上取得了优异的反演精度；与扩散模型（Diffusion）方法相比，实现了采样加速；与InversionNet方法相比，获得了更高的生成精度。我们在Marmousi真实数据上的零样本泛化实验进一步验证了该方法的实用价值：实验显示，所提方法在OpenFWI基准数据集上表现出卓越的反演精度；相较于扩散模型方法，它在实现采样加速的同时，保持了比InversionNet方法更高的精度；基于Marmousi标准模型的实验进一步证明，该方法能够以零样本方式生成高质量的初始速度模型，有效缓解传统全波形反演（Full Waveform Inversion, FWI）中的初始模型依赖问题，具备工业实用价值。

Abstract

Seismic inversion is a core problem in geophysical exploration, where traditional methods suffer from high computational costs and are susceptible to initial model dependence. In recent years, deep generative model-based seismic inversion methods have achieved remarkable progress, but existing generative models struggle to balance sampling efficiency and inversion accuracy. This paper proposes an end-to-end fast seismic inversion method based on Conditional Rectified Flow[1], which designs a dedicated seismic encoder to extract multi-scale seismic features and adopts a layer-by-layer injection control strategy to achieve fine-grained conditional control. Experimental results demonstrate that the proposed method achieves excellent inversion accuracy on the OpenFWI[2] benchmark dataset. Compared with Diffusion[3,4] methods, it achieves sampling acceleration; compared with InversionNet[5,6,7] methods, it achieves higher accuracy in generation. Our zero-shot generalization experiments on Marmousi[8,9] real data further verify the practical value of the method. Experimental results show that the proposed method achieves excellent inversion accuracy on the OpenFWI benchmark dataset; compared with Diffusion methods, it achieves sampling acceleration while maintaining higher accuracy than InversionNet methods; experiments based on the Marmousi standard model further verify that this method can generate high-quality initial velocity models in a zero-shot manner, effectively alleviating the initial model dependency problem in traditional Full Waveform Inversion (FWI), and possesses industrial practical value.

Keywords: seismic inversion, conditional rectified flow, deep generative model, end-to-end, sampling acceleration, zero-shot generalization, Full Waveform Inversion, initial velocity model
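
The sampling speed-up claimed in the abstract above relates to a general property of rectified flow: samples are drawn by integrating an ODE, and nearly straight trajectories can be integrated in very few steps. The sketch below illustrates this with a toy velocity field that points straight at a fixed target. It is not the paper's model: there is no trained network and no seismic conditioning, and the names `euler_sample` and `make_straight_velocity` are hypothetical.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Toy stand-in for a trained network: always points straight at `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target = [2.0, -1.0, 0.5]      # stands in for a velocity-model sample
noise = [0.3, 0.8, -1.2]       # stands in for the initial noise draw
field = make_straight_velocity(target)

one_step = euler_sample(field, noise, steps=1)
four_steps = euler_sample(field, noise, steps=4)
print(one_step, four_steps)
```

Because the toy trajectory is a straight line, one Euler step and four Euler steps land on the same endpoint up to floating-point error. A learned field is only approximately straight, so in practice the step count trades accuracy against speed.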

70. ❌ NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Authors: Qinke Ni, Huan Liao, Dekun Chen, Yuxiang Wang, Zhizheng Wu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15352v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

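Scoring rationale: The paper focuses on benchmarking and evaluating nonverbal vocalizations (NVs) in text-to-speech (TTS) systems, covering speech synthesis, benchmark construction, evaluation metrics, and correlation with human perception. All scoring keywords relate directly to large models, deep learning fundamentals, or AI-for-science applications, whereas this paper studies a specific problem in speech synthesis and does not involve large-model techniques, deep learning innovations, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes NV-Bench, the first benchmark for nonverbal vocalization synthesis in text-to-speech systems, and uses a dual-dimensional evaluation protocol to show a strong correlation between its objective metrics and human perception.

Abstract (Chinese translation)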

尽管当前文本转语音系统日益整合非语言性发声，其评估仍缺乏标准化指标与可靠的基准参照。为弥补这一空白，我们提出NV-Bench——首个基于功能分类法的基准框架，将非语言性发声视为交际行为而非声学产物。该基准包含1,651条多语言真实场景话语及对应的人类参考音频，均衡覆盖14类非语言性发声范畴。我们设计了双维度评估协议：（1）指令对齐维度，采用提出的副语言字符错误率评估系统可控性；（2）声学保真维度，通过测量与真实录音的分布差异评估声学真实感。通过对多种文本转语音模型的系统性评估及两个基线模型的构建，实验结果表明我们的客观指标与人类感知高度相关，从而确立NV-Bench作为标准化评估框架的有效性。

Abstract

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.

Keywords: nonverbal vocalizations, text-to-speech, benchmark, evaluation metrics, paralinguistic character error rate, acoustic fidelity, human perception, TTS models

71. ❌ PMAx: An Agentic Framework for AI-Driven Process Mining

Authors: Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park, Wil M. P. van der Aalst Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

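Scoring rationale: PMAx is an LLM-based autonomous agent framework for process mining analysis. The core relevant keywords are: 1) 'Large Language Models' (10 points): the paper explicitly uses LLMs as its underlying technology; 2) 'LLM Agents' and 'Multi-agent Systems' (10 points each): the framework uses a multi-agent architecture (Engineer and Analyst agents); 3) 'Tool Use' (5 points): the Engineer agent generates scripts to run algorithms, which involves tool use; 4) 'Hallucination Mitigation' (5 points): the paper notes that LLMs may hallucinate and ensures accuracy through local computation. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PMAx, a framework that uses LLMs in a multi-agent architecture to democratize process mining. It addresses LLMs' weaknesses in deterministic reasoning and privacy, letting non-technical users obtain reliable process insights through natural language.

Abstract (Chinese translation)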

流程挖掘为组织工作流提供了强大的洞察力，但提取这些洞察通常需要掌握专业查询语言与数据科学工具的专业知识。大型语言模型（LLMs）通过允许业务用户以自然语言与流程数据进行交互，为流程挖掘的普及化提供了可能。然而，将LLMs直接用作原始事件日志的分析引擎会带来根本性挑战：LLMs难以进行确定性推理，可能虚构指标，同时将大量敏感日志发送至外部AI服务会引发严重的数据隐私担忧。为应对这些局限，我们提出了PMAx——一个作为虚拟流程分析师运行的自主智能体框架。PMAx不依赖LLM生成流程模型或计算结果，而是采用一种保护隐私的多智能体架构。其中，工程师智能体分析事件日志元数据，自主生成本地脚本来运行成熟的流程挖掘算法、计算精确指标，并生成流程模型、汇总表和可视化等成果物。随后，分析师智能体对这些洞察与成果物进行解读，以汇编综合性报告。通过将计算与解释分离，并在本地执行分析，PMAx在确保数学精确性与数据隐私的同时，使非技术用户能够将高层业务问题转化为可靠的流程洞察。

Abstract

Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Language Models (LLMs) offer the potential to democratize process mining by enabling business users to interact with process data through natural language. However, using LLMs as direct analytical engines over raw event logs introduces fundamental challenges: LLMs struggle with deterministic reasoning and may hallucinate metrics, while sending large, sensitive logs to external AI services raises serious data-privacy concerns. To address these limitations, we present PMAx, an autonomous agentic framework that functions as a virtual process analyst. Rather than relying on LLMs to generate process models or compute analytical results, PMAx employs a privacy-preserving multi-agent architecture. An Engineer agent analyzes event-log metadata and autonomously generates local scripts to run established process mining algorithms, compute exact metrics, and produce artifacts such as process models, summary tables, and visualizations. An Analyst agent then interprets these insights and artifacts to compile comprehensive reports. By separating computation from interpretation and executing analysis locally, PMAx ensures mathematical accuracy and data privacy while enabling non-technical users to transform high-level business questions into reliable process insights.

Keywords: Process Mining, Large Language Models, Autonomous Agents, Multi-agent Systems, Data Privacy, Agentic Framework, Natural Language Interaction, Process Analytics
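
The separation of computation from interpretation described in the abstract above can be sketched in a few lines. This is only an illustration of the idea, not PMAx's implementation: it makes no LLM call, uses a made-up event log, and the functions `compute_metrics` and `build_analyst_input` are hypothetical stand-ins for the Engineer and Analyst roles.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Toy event log: (case id, activity, ISO timestamp). Purely illustrative data.
EVENT_LOG = [
    ("c1", "create order", "2026-01-01T09:00"),
    ("c1", "approve", "2026-01-01T10:00"),
    ("c1", "ship", "2026-01-01T12:00"),
    ("c2", "create order", "2026-01-02T09:00"),
    ("c2", "ship", "2026-01-02T10:00"),
    ("c3", "create order", "2026-01-03T09:00"),
    ("c3", "approve", "2026-01-03T11:00"),
    ("c3", "ship", "2026-01-03T12:00"),
]


def compute_metrics(event_log):
    """'Engineer' role: exact, deterministic metrics computed locally."""
    traces = defaultdict(list)
    for case_id, activity, ts in event_log:
        traces[case_id].append((datetime.fromisoformat(ts), activity))
    variants = Counter()
    durations_h = []
    for events in traces.values():
        events.sort()  # order each case's events by timestamp
        variants[" > ".join(activity for _, activity in events)] += 1
        durations_h.append((events[-1][0] - events[0][0]).total_seconds() / 3600)
    return {
        "cases": len(traces),
        "variants": dict(variants),
        "mean_duration_h": sum(durations_h) / len(durations_h),
    }


def build_analyst_input(metrics):
    """'Analyst' role input: aggregates only, no raw events or case ids."""
    top_variant, count = max(metrics["variants"].items(), key=lambda kv: kv[1])
    return (
        f"{metrics['cases']} cases; most common variant '{top_variant}' "
        f"({count} cases); mean case duration {metrics['mean_duration_h']:.1f} h"
    )


metrics = compute_metrics(EVENT_LOG)
summary = build_analyst_input(metrics)
print(summary)
```

The design point is that numbers come from ordinary code rather than from a language model, and only the aggregated summary would be handed to an interpreting model, which keeps raw log rows local.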

72. ❌ Tagarela - A Portuguese speech dataset from podcasts

Authors: Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

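Scoring rationale: This paper focuses on building a Portuguese speech dataset and evaluating speech technology (ASR/TTS). All keywords concern large language models, deep learning techniques, or AI-for-science applications, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The study addresses the lack of large-scale, high-quality Portuguese speech datasets by building TAGARELA, a corpus of over 8,972 hours of podcast audio, and by training ASR and TTS models on it, showing that the dataset can advance Portuguese speech technology.

Abstract (Chinese translation)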

尽管语音处理领域已取得显著进展，但由于缺乏公开、大规模且高质量的数据集，葡萄牙语资源仍相对匮乏。为填补这一空白，我们提出了名为TAGARELA的新数据集，该数据集包含超过8,972小时的播客音频，专门用于训练自动语音识别（ASR）和文本转语音（TTS）模型。值得注意的是，其规模可与英语的GigaSpeech（10kh）相媲美，为葡萄牙语前沿模型的开发提供了可能。为确保数据质量，该语料库经过音频预处理流程，并采用混合策略进行转写：我们应用了基于专有API生成的高保真转写文本预先训练的ASR模型，从而确保了较高的初始准确率。最后，为验证这一新资源的有效性，我们展示了完全基于本数据集训练的ASR和TTS模型，并评估其性能，证明了该数据集在推动葡萄牙语更鲁棒、更自然的语音技术发展方面的潜力。本数据集已公开发布于https://freds0.github.io/TAGARELA/，以促进鲁棒语音技术的开发。

Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English’s GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at https://freds0.github.io/TAGARELA/, to foster the development of robust speech technologies.

Keywords: Portuguese speech dataset, podcast audio, automatic speech recognition, text-to-speech, speech processing, data quality, ASR models, TTS models

73. ❌ CCTU: A Benchmark for Tool Use under Complex Constraints

Authors: Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, Xuanjing Huang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

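Scoring rationale: The paper's core topic is LLM tool use under complex constraints, which is highly relevant to 'Large Language Models', 'Tool Use', 'LLM Agents', and 'Self-Correction' (10 points each). It also involves instruction following and reasoning, giving partial relevance to 'Instruction Tuning', 'Chain of Thought', and 'System 2 Thinking' (5 points each). Other keywords, such as MoE, quantization, and AI for science, are not covered by the paper (0 points).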
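
The zero total for this entry follows from the table arithmetic. The sketch below is a minimal illustration, assuming each keyword's score is weight × relevance capped at the configured per-keyword maximum of 15. The report does not state that per-keyword formula; it is inferred from the table columns. Only the pass threshold (5.0 + 0.8 × total weight) and the cap come from the report's configuration, and the function names are hypothetical.

```python
# Hypothetical helpers. Only the threshold formula (5.0 + 0.8 * total weight)
# and the per-keyword cap of 15 are taken from the report's configuration.
MAX_PER_KEYWORD = 15.0


def pass_threshold(total_weight: float) -> float:
    """Pass threshold as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: score = weight * relevance, capped at MAX_PER_KEYWORD."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# (weight, relevance) rows as listed in the CCTU table:
# four rows at 10/10, three rows at 5/10, twenty rows at 0/10, all with weight 0.0.
cctu_rows = [(0.0, 10.0)] * 4 + [(0.0, 5.0)] * 3 + [(0.0, 0.0)] * 20

threshold = pass_threshold(27.0)  # 27 keywords, each configured with weight 1.0
score = total_score(cctu_rows)
passed = score >= threshold
```

With the listed weights of 0.0, the total is 0.0 regardless of relevance, which is below the 26.6 threshold. Under this assumed formula, the same relevance values with the configured weight of 1.0 would sum to 55.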

!!! tip deepseek-chat TL;DR

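The paper introduces the CCTU benchmark to evaluate LLM tool use under complex constraints. It finds that no evaluated model exceeds a 20% task completion rate when strict adherence to all constraints is required, and that self-refinement ability remains limited.

Abstract (Chinese translation)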

在明确约束条件下通过工具使用解决问题，对大型语言模型（LLMs）而言构成了一项极具挑战性却又不可避免的场景，这要求模型具备函数调用、指令遵循和自我优化等能力。然而，由于缺乏专门的评估体系，相关进展一直受到阻碍。为此，我们提出了CCTU，一个用于评估复杂约束下LLM工具使用能力的基准。CCTU基于一个包含四个维度（即资源、行为、工具集和响应）的12类约束分类体系构建。该基准包含200个经过精心设计、涵盖多样化工具使用场景的挑战性测试用例，每个用例平均涉及七种约束类型，且平均提示长度超过4700个词元。为实现可靠评估，我们开发了一个可执行的约束验证模块，该模块能在模型与环境的多轮交互过程中执行步骤级验证并确保约束合规。我们评估了九种前沿大型语言模型在思考模式与非思考模式下的表现。结果表明，当要求严格遵守所有约束时，没有任何模型的任务完成率超过20%。进一步分析显示，模型在超过50%的案例中违反了约束，尤其在资源和响应维度上。此外，即使在收到关于约束违反的详细反馈后，LLMs仍表现出有限的自我优化能力，这凸显了开发鲁棒工具使用代理的一个关键瓶颈。为促进未来研究，我们公开了相关数据与代码。

Abstract

Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.

Keywords: tool use, large language models, constraints, benchmark, self-refinement, function calling, evaluation, agents
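
The executable constraint validation described in the abstract can be pictured as a checker that runs after every tool call. The sketch below is illustrative only and is not the CCTU implementation: the two constraint types (a call budget as a resource-style constraint and an allow-list as a toolset-style constraint) and all class and field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class StepValidator:
    """Hypothetical step-level checker for two illustrative constraints."""

    max_tool_calls: int            # resource-style constraint (assumed)
    allowed_tools: frozenset[str]  # toolset-style constraint (assumed)
    calls_made: int = 0
    violations: list[str] = field(default_factory=list)

    def check_step(self, tool_name: str) -> bool:
        """Validate one tool call and record any violation it causes."""
        self.calls_made += 1
        ok = True
        if tool_name not in self.allowed_tools:
            self.violations.append(
                f"step {self.calls_made}: tool '{tool_name}' is not allowed"
            )
            ok = False
        if self.calls_made > self.max_tool_calls:
            self.violations.append(
                f"step {self.calls_made}: call budget of {self.max_tool_calls} exceeded"
            )
            ok = False
        return ok


validator = StepValidator(
    max_tool_calls=2, allowed_tools=frozenset({"search", "calculator"})
)
# Second call uses a disallowed tool; third call exceeds the budget.
results = [validator.check_step(t) for t in ["search", "email", "calculator"]]
```

The per-step messages collected in `violations` are the kind of detailed feedback that could be returned to a model before its next turn.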

74. ❌ Evolutionary Transfer Learning for Dragonchess

Authors: Jim O’Connor, Annika Hoag, Sarah Goyette, Gary B. Parker Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15297v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

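Scoring rationale: The paper studies evolutionary transfer learning for the game Dragonchess, using CMA-ES to optimize heuristic evaluation functions transferred from Stockfish. All keywords relate to large models, deep learning techniques, or AI applications in science, whereas this paper focuses on classical evolutionary algorithms and game AI and involves none of those topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The study introduces Dragonchess as a new testbed for AI research, transfers heuristic evaluation functions from the chess engine Stockfish via evolutionary transfer learning, and optimizes them with CMA-ES, significantly improving AI agent performance in this complex game environment.

Abstract (Chinese translation)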

龙棋（Dragonchess）是由加里·吉盖克斯提出的一种三维象棋变体，其独特的战略与计算挑战使其成为研究人工智能（AI）启发式方法跨领域迁移的理想环境。本研究引入龙棋作为人工智能研究的新型测试平台，并提供了一个基于Python的开源游戏引擎供社区使用。我们通过直接移植顶尖国际象棋引擎Stockfish的启发式评估函数，并采用协方差矩阵自适应进化策略（Covariance Matrix Adaptation Evolution Strategy, CMA-ES）进行优化，探索了进化迁移学习在龙棋中的应用。初步实验表明，由于龙棋具有独特的多层结构和移动规则，直接迁移的启发式函数效果有限。然而，进化优化显著提升了AI智能体的表现，通过在50轮瑞士制锦标赛中的实证评估，优化后的智能体展现出卓越的对弈能力。本研究证实了进化方法在将启发式知识适配至结构复杂、未经探索的游戏领域中的有效性。

Abstract

Dragonchess, a three-dimensional chess variant introduced by Gary Gygax, presents unique strategic and computational challenges that make it an ideal environment for studying the transfer of artificial intelligence (AI) heuristics across domains. In this work, we introduce Dragonchess as a novel testbed for AI research and provide an open-source, Python-based game engine for community use. Our research investigates evolutionary transfer learning by adapting heuristic evaluation functions directly from Stockfish, a leading chess engine, and subsequently optimizing them using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Initial trials showed that direct heuristic transfers were inadequate due to Dragonchess’s distinct multi-layer structure and movement rules. However, evolutionary optimization significantly improved AI agent performance, resulting in superior gameplay demonstrated through empirical evaluation in a 50-round Swiss-style tournament. This research establishes the effectiveness of evolutionary methods in adapting heuristic knowledge to structurally complex, previously unexplored game domains.

Keywords: Evolutionary Transfer Learning, Dragonchess, Heuristic Evaluation Functions, Stockfish, CMA-ES, Game AI, Multi-layer Strategy, Swiss-style Tournament

75. ❌ Scalable Simulation-Based Model Inference with Test-Time Complexity Control

Authors: Manuel Gloeckler, J. P. Manzano-Patrón, Stamatios N. Sotiropoulos, Cornelius Schröder, Jakob H. Macke Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15292v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

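Scoring rationale: PRISM focuses on model inference and model selection in scientific simulation, particularly biophysical modeling (for example, diffusion MRI), which falls under AI for Science, so it has partial relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). The paper does not involve large models, deep learning technical principles, or any other keyword (such as LLMs, MoE, training methods, or inference optimization). Its core is simulation-based inference and Bayesian methods rather than applications of or innovations in large models, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PRISM, which addresses the scalability of model selection and parameter inference over large families of simulation models, and validates it on synthetic symbolic regression and on biophysical modeling of diffusion MRI data.

Abstract (Chinese translation)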

仿真在科学发现中扮演着核心角色。在许多应用中，瓶颈已不再是运行仿真器本身，而是在大量合理的仿真器家族中进行选择——每个仿真器对应着与观测数据相符的不同前向模型/假设。面对庞大的模型家族，经典的贝叶斯模型选择流程往往不切实际。此外，现有的摊销式模型选择方法通常在训练时硬编码固定的模型先验或复杂度惩罚，这要求用户在见到数据之前就必须预先确定特定的简约性假设。我们提出了PRISM，一种基于仿真的编码器-解码器架构，它能够推断离散模型结构及其相关连续参数的联合后验分布，同时通过一个可调节的模型先验（该先验作为网络的条件输入）实现对模型复杂度的测试时控制。我们证明，在一个合成的符号回归任务中，PRISM能够扩展到包含组合性极多（高达数十亿）模型实例的家族。作为一项科学应用，我们在扩散MRI数据的生物物理建模上评估了PRISM，结果显示它能够在合成及活体神经影像数据上，对多种多室模型进行有效的模型选择。

Abstract

Simulation plays a central role in scientific discovery. In many applications, the bottleneck is no longer running a simulator; it is choosing among large families of plausible simulators, each corresponding to different forward models/hypotheses consistent with observations. Over large model families, classical Bayesian workflows for model selection are impractical. Furthermore, amortized model selection methods typically hard-code a fixed model prior or complexity penalty at training time, requiring users to commit to a particular parsimony assumption before seeing the data. We introduce PRISM, a simulation-based encoder-decoder that infers a joint posterior over both discrete model structures and associated continuous parameters, while enabling test-time control of model complexity via a tunable model prior that the network is conditioned on. We show that PRISM scales to families with combinatorially many (up to billions) of model instantiations on a synthetic symbolic regression task. As a scientific application, we evaluate PRISM on biophysical modeling for diffusion MRI data, showing the ability to perform model selection across several multi-compartment models, on both synthetic and in vivo neuroimaging data.

Keywords: simulation-based inference, model selection, Bayesian methods, encoder-decoder, diffusion MRI, biophysical modeling, scalable inference, test-time complexity control

76. ❌ Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report

Authors: Johannes Schmalz, Chaahat Jain Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

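Scoring rationale: The paper studies algorithms for deciding state safety in fully observable non-deterministic decision problems. This belongs to classical algorithms and formal verification and is unrelated to all scoring keywords, which center on large models, deep learning techniques, and their applications. The paper does not involve large models, deep learning, AI for Science, or related topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

For deciding state safety in fully observable non-deterministic decision problems, the paper proposes a new policy-iteration algorithm, iPI, that matches the best-case runtime of TarjanSafe while guaranteeing polynomial worst-case complexity. This addresses the exponential worst-case runtime that the existing algorithm exhibits on certain problems.

Abstract (Chinese translation)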

习得行动策略在序列决策中日益普及，但其缺乏安全性保障。近期研究提出了一种在初始状态与行动结果非确定性条件下测试此类策略安全性的流程框架。该框架的核心在于判定状态是否安全（即从该状态出发存在安全策略）并识别故障点——即从安全状态转移至不安全状态的状态-行动对。其最有效的安全性判定算法TarjanSafe在基准测试中表现良好，但我们证明该算法在状态空间维度上具有指数级的最坏情况时间复杂度。虽然存在一种线性时间的替代算法，但其实际运行效率较低。我们通过一种新的策略迭代算法iPI弥合了这一差距，它融合了两者的优势：在保持TarjanSafe最佳情况时间效率的同时，确保多项式级的最坏时间复杂度。实验验证了我们的理论，结果表明在适合TarjanSafe的问题中iPI具有相近性能，而在不适配的问题中iPI展现出指数级更优的扩展性。

Abstract

Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline’s core, is the problem of deciding whether a state is safe (a safe policy exists from the state) and finding faults, which are state-action pairs that transition from a safe state to an unsafe one. Their most effective algorithm for deciding safety, TarjanSafe, is effective on their benchmarks, but we show that it has exponential worst-case runtime with respect to the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm iPI, that combines the best of both: it matches TarjanSafe’s best-case runtime while guaranteeing a polynomial worst-case. Experiments confirm our theory and show that in problems amenable to TarjanSafe iPI has similar performance, whereas in ill-suited problems iPI scales exponentially better.

Keywords: safety verification, non-deterministic problems, policy iteration, state safety, algorithm analysis, worst-case runtime, formal verification, sequential decision-making
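
The safety-decision problem in the abstract has a simple fixed-point reading: a state is safe if it is not a failure state and some action has all of its possible outcomes inside the safe set. The sketch below is a naive iterative version for illustration. It is neither TarjanSafe nor iPI, and it makes no claim about their runtimes. The dictionary encoding (`transitions[state][action]` is the set of possible successor states, with every state present as a key), the treatment of non-failure states without actions as safe, and the function names are all assumptions.

```python
def safe_states(
    transitions: dict[str, dict[str, set[str]]], failure: set[str]
) -> set[str]:
    """Greatest fixed point: drop states until every remaining state has a safe action."""
    safe = {s for s in transitions if s not in failure}
    changed = True
    while changed:
        changed = False
        for s in list(safe):
            actions = transitions[s]
            if not actions:
                # ASSUMPTION: a non-failure state with no actions counts as safe.
                continue
            # Keep s only if some action leads exclusively to currently safe states.
            if not any(outcomes <= safe for outcomes in actions.values()):
                safe.discard(s)
                changed = True
    return safe


def faults(
    transitions: dict[str, dict[str, set[str]]], safe: set[str]
) -> set[tuple[str, str]]:
    """State-action pairs that can move from a safe state to an unsafe one."""
    return {
        (s, a)
        for s in safe
        for a, outcomes in transitions[s].items()
        if not outcomes <= safe
    }


# Toy problem: action "a" in s0 may non-deterministically reach s1 or s2.
transitions = {
    "s0": {"a": {"s1", "s2"}, "b": {"s3"}},
    "s1": {"a": {"s1"}},
    "s2": {"a": {"bad"}},
    "s3": {},
    "bad": {},
}
safe = safe_states(transitions, failure={"bad"})
fault_pairs = faults(transitions, safe)
```

In this toy problem `s2` is unsafe because its only action leads to the failure state, so `("s0", "a")` is a fault: it starts in a safe state but may land in `s2`.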

77. ❌ Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory

Authors: Rongjie Jiang, Jianwei Wang, Gengda Zhao, Chengyang Luo, Kai Wang, Wenjie Zhang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15280v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

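Scoring rationale: The paper's core topic is LLM-driven intelligent agents. It proposes NS-Mem, a neuro-symbolic memory framework that strengthens long-term reasoning in multimodal agents. Highly relevant keywords: 'Large Language Models' (the paper explicitly describes agents driven by LLMs), 'Chain of Thought' and 'System 2 Thinking' (the paper targets analytical, deductive reasoning, which corresponds to in-depth reasoning), and 'LLM Agents' (the paper studies multimodal agents). 'Retrieval-Augmented Generation' receives 5 points because the paper's hybrid memory retrieval mechanism, which combines similarity search with symbolic queries, is partially related to retrieval-augmented generation. The remaining keywords have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

To address the limits of existing neural-representation-based memory systems for multimodal agents in supporting analytical, deductive reasoning, the paper proposes NS-Mem, a neuro-symbolic memory framework that integrates neural memory with explicit symbolic structures. It achieves an average 4.35% improvement in reasoning accuracy on real-world multimodal reasoning benchmarks.

Abstract (Chinese translation)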

近年来，大语言模型的进展推动了在开放世界、多模态环境中运行的智能代理的兴起。为支持长期推理，此类代理通常配备外部记忆系统。然而，现有的大多数多模态代理记忆主要依赖神经表征和基于向量的检索，这类方法虽适用于归纳式、直觉性推理，但在支持现实世界决策所需的分析性、演绎性推理方面存在根本性局限。为应对这一局限，我们提出了NS-Mem——一种长期神经符号记忆框架，旨在通过将神经记忆与显式符号结构及规则相融合，以推进多模态代理的推理能力。具体而言，NS-Mem围绕记忆系统的三个核心组件构建：(1) 包含情景层、语义层与逻辑规则层的三层记忆架构；(2) 由SK-Gen实现的记忆构建与维护机制，该机制能自动从累积的多模态经验中整合结构化知识，并渐进式更新神经表征与符号规则；(3) 混合记忆检索机制，结合基于相似性的搜索与确定性符号查询功能，以支持结构化推理。在真实世界多模态推理基准测试上的实验表明，神经符号记忆相较于纯神经记忆系统在整体推理准确率上平均提升4.35%，在约束性推理查询任务中最高提升达12.5%，验证了NS-Mem的有效性。

Abstract

Recent advances in large language models have driven the emergence of intelligent agents operating in open-world, multimodal environments. To support long-term reasoning, such agents are typically equipped with external memory systems. However, most existing multimodal agent memories rely primarily on neural representations and vector-based retrieval, which are well-suited for inductive, intuitive reasoning but fundamentally limited in supporting analytical, deductive reasoning critical for real-world decision making. To address this limitation, we propose NS-Mem, a long-term neuro-symbolic memory framework designed to advance multimodal agent reasoning by integrating neural memory with explicit symbolic structures and rules. Specifically, NS-Mem is operated around three core components of a memory system: (1) a three-layer memory architecture that consists episodic layer, semantic layer and logic rule layer, (2) a memory construction and maintenance mechanism implemented by SK-Gen that automatically consolidates structured knowledge from accumulated multimodal experiences and incrementally updates both neural representations and symbolic rules, and (3) a hybrid memory retrieval mechanism that combines similarity-based search with deterministic symbolic query functions to support structured reasoning. Experiments on real-world multimodal reasoning benchmarks demonstrate that Neural-Symbolic Memory achieves an average 4.35% improvement in overall reasoning accuracy over pure neural memory systems, with gains of up to 12.5% on constrained reasoning queries, validating the effectiveness of NS-Mem.

Keywords: multimodal agents, neuro-symbolic memory, long-term reasoning, deductive reasoning, memory architecture, structured knowledge, hybrid retrieval, reasoning accuracy
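
The hybrid retrieval idea in the abstract, similarity search combined with deterministic symbolic queries, can be illustrated with a toy memory. This is not NS-Mem's implementation: the record layout, the `facts` dictionary standing in for the symbolic layer, and the function names are hypothetical, and plain cosine similarity stands in for whatever neural retriever the paper uses.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy memory: each record has an embedding and explicit symbolic facts.
memory = [
    {"id": "m1", "vec": [1.0, 0.0], "facts": {"object": "cup", "room": "kitchen"}},
    {"id": "m2", "vec": [0.9, 0.1], "facts": {"object": "cup", "room": "office"}},
    {"id": "m3", "vec": [0.0, 1.0], "facts": {"object": "key", "room": "kitchen"}},
]


def hybrid_retrieve(
    query_vec: list[float], required_facts: dict[str, str], top_k: int = 1
) -> list[str]:
    """Filter by exact symbolic match, then rank the survivors by similarity."""
    candidates = [
        m
        for m in memory
        if all(m["facts"].get(k) == v for k, v in required_facts.items())
    ]
    ranked = sorted(
        candidates, key=lambda m: cosine(query_vec, m["vec"]), reverse=True
    )
    return [m["id"] for m in ranked[:top_k]]


neural_only = hybrid_retrieve([1.0, 0.0], {})
constrained = hybrid_retrieve([1.0, 0.0], {"room": "office"})
```

Without a symbolic constraint the nearest embedding wins (`m1`). Adding the constraint `room == "office"` deterministically excludes it, so the slightly less similar `m2` is returned instead.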

78. ❌ From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

Authors: Xu Zhang, Wenxin Ma, Chenxu Wu, Rongsheng Wang, Kun Zhang, S. Kevin Zhou Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

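Scoring rationale: The paper's core topic is applying LLMs to medical ICD coding, which falls under AI for Science, so that keyword is highly relevant (10 points). The paper explicitly uses LLMs as the base model and fine-tunes them (SFT), so both of those keywords are highly relevant (10 points each). It notes that small-scale LLMs can reach performance comparable to much larger models, giving partial relevance to Small Language Models (5 points). It emphasizes preserving LLM interpretability, giving partial relevance to Explainable AI (5 points). Other keywords, such as MoE, Scaling Laws, RLHF, and RAG, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the poor generalization, reduced interpretability, and high computational cost that LLMs face in medical ICD coding, the paper proposes a code-centric learning framework based on evidence spans. The framework substantially reduces training cost, improves accuracy on unseen ICD codes, preserves interpretability, and enables small-scale LLMs to reach performance comparable to much larger proprietary models.

Abstract (Chinese translation)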

ICD编码是医疗领域一项关键且具有挑战性的任务。近期，基于大语言模型（LLM）的方法在ICD编码任务中展现出比判别式方法更强的泛化能力。然而，针对ICD编码对大语言模型进行微调面临三大挑战。首先，现有的公开ICD编码数据集对ICD编码空间的覆盖有限，限制了模型对未见编码的泛化能力。其次，简单的微调会削弱大语言模型的可解释性，因为少有公开数据集包含明确的编码分配支持证据。第三，ICD编码通常涉及冗长的临床文档，使得微调大语言模型的计算成本高昂。为解决这些问题，我们提出了以编码为中心的学习（Code-Centric Learning），这是一种将监督信号从完整临床文档转移到可扩展的简短证据片段（evidence spans）的训练框架。该框架的核心思想是，片段级学习能够提升大语言模型执行文档级ICD编码的能力。我们提出的框架包含混合训练策略和以编码为中心的数据扩展，这显著降低了训练成本，提升了对未见ICD编码的准确性，并保持了可解释性。在相同的大语言模型骨干架构下，我们的方法显著优于多个强基线模型。值得注意的是，我们的方法使较小规模的大语言模型能够达到与规模大得多的专有模型相媲美的性能，这证明了其在全自动化ICD编码中的有效性和潜力。

Abstract

ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model’s ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs’ ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.

Keywords: ICD coding, LLM-based methods, fine-tuning, span-level learning, interpretability, clinical documents, evidence spans, training framework

79. ❌ Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search

Authors: Mengxiang Chen, Zhouwei Zhai, Jin Li Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

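Scoring rationale: The paper's core topic is applying LLMs to industrial e-commerce search. It proposes a Probe-then-Plan mechanism to resolve the blindness-latency dilemma of existing LLM-based methods. Highly relevant keywords: LLMs (used explicitly), SFT (mentioned explicitly), and LLM Agents (a Teacher Agent and a Planner are used). Moderately relevant: RAG (retrieval augmentation is involved), CoT Reasoning (a dynamic reasoning process), System 2 Thinking (diagnosing execution gaps), Self-Reflection (contrasted with existing methods), Tool Use (tool calls are involved), and Multi-agent Systems (multi-agent collaboration). Other keywords are not covered or are only marginally related.

!!! tip deepseek-chat TL;DR

The paper proposes an environment-aware search planning method whose Probe-then-Plan mechanism resolves the blindness-latency dilemma that LLMs face in e-commerce search, significantly improving recall and business metrics.

Abstract (Chinese translation)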

现代电子商务搜索正朝着解析复杂用户意图的方向演进。尽管大语言模型（LLMs）具备强大的推理能力，但现有基于LLM的范式面临一个根本性的“盲目性-延迟”困境：查询改写方法对检索能力和实时库存状态无感知，常产生无效的搜索计划；反之，深度搜索代理依赖迭代的工具调用与反思，会导致数秒的延迟，无法满足工业级亚秒级响应预算。为解决这一矛盾，我们提出了环境感知搜索规划（Environment-Aware Search Planning, EASP），将搜索规划重新定义为基于环境现实的动态推理过程。EASP引入了“探测-规划”机制：一个轻量级的检索探测模块获取检索快照，使规划器能够诊断执行差距并生成基于现实的搜索计划。该方法包含三个阶段：（1）离线数据合成：教师代理通过诊断探测到的环境，合成多样化且经过执行验证的计划。（2）规划器训练与对齐：规划器首先通过监督微调（Supervised Fine-Tuning, SFT）进行初始化以掌握诊断能力，随后通过强化学习（Reinforcement Learning, RL）与业务指标（转化率）对齐。（3）自适应在线服务：一个复杂度感知的路由机制选择性地为复杂查询激活规划，确保资源的最优分配。在京东进行的广泛离线评估和在线A/B测试表明，EASP显著提升了相关召回率，并在用户点击转化率（UCVR）和总商品交易额（GMV）上实现了大幅增长。EASP已成功部署于京东的AI-Search系统。

Abstract

Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on JD.com demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in JD.com’s AI-Search system.

Keywords: Large Language Models, E-commerce Search, Environment-Aware Planning, Retrieval Probe, Supervised Fine-Tuning, Reinforcement Learning, LLM Agents, Industrial Deployment
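The Probe-then-Plan flow and the complexity-aware routing described in the abstract can be sketched as plain control flow. This is only an illustration under stated assumptions: the function names, the token-count complexity heuristic, the word-overlap probe, and the toy catalog are all hypothetical stand-ins, and the real Planner is a trained LLM rather than the stub shown here.

```python
def is_complex(query, max_simple_tokens=3):
    # Hypothetical complexity heuristic: longer queries count as complex.
    return len(query.split()) > max_simple_tokens

def retrieval_probe(query, catalog, k=3):
    # Lightweight probe: up to k catalog items sharing a word with the query.
    words = set(query.lower().split())
    hits = [item for item in catalog if words & set(item.lower().split())]
    return hits[:k]

def plan_from_snapshot(query, snapshot):
    # Stub planner: keep the query if the probe found something,
    # otherwise fall back to one sub-query per word.
    if snapshot:
        return [query]
    return query.lower().split()

def serve(query, catalog):
    # Complexity-aware routing: only complex queries pay for probe + plan.
    if not is_complex(query):
        return {"path": "direct", "plan": [query]}
    snapshot = retrieval_probe(query, catalog)
    return {"path": "planned", "snapshot": snapshot,
            "plan": plan_from_snapshot(query, snapshot)}

CATALOG = ["waterproof hiking jacket", "red running shoes", "wool winter hat"]
simple = serve("red shoes", CATALOG)
planned = serve("warm waterproof jacket for winter hiking", CATALOG)
```

Routing before probing keeps simple queries on the cheap path, which is the resource-allocation point the abstract makes about adaptive online serving.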

Authors: Jing Wu, Yang Liu, Lin Zhang, Junbo Zeng, Jiabin Wang, Zi Ye, Guowen Li, Shilei Cao, Jiashun Cheng, Fang Wang, Meng Jin, Yerong Feng, Hong Cheng, Yutong Lu, Haohuan Fu, Juepeng Zheng Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15260v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
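As a rough illustration of how the pass/fail mark on the score line could be derived, the sketch below applies the pass-score formula from the report's configuration section (5.0 + 0.8 × total weight). Treating the total as the sum of the Score column and using `>=` for the comparison are assumptions, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    # Pass-score formula from the report configuration: 5.0 + 0.8 * total weight.
    return base + factor * total_weight

def evaluate(scores, total_weight):
    # Assumption: the paper's total is the sum of the per-keyword Score column.
    total = sum(scores)
    threshold = pass_threshold(total_weight)
    return total, threshold, total >= threshold

# The table above lists 27 keywords, each with a score of 0.0.
scores = [0.0] * 27
total, threshold, passed = evaluate(scores, total_weight=27.0)
```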

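Scoring rationale: The paper proposes AGCD, which uses a multi-agent system (a multi-agent meteorological narration pipeline) to extract physics priors from multimodal data and inject them into weather forecasting models, making it an AI for Science (meteorology) application. Its core innovation is the agent-guided decoding process, which is highly relevant to 'LLM Agents/Autonomous Agents' and 'Multi-agent Systems' (10 points each). The paper uses MLLMs (multimodal large language models) to extract meteorological elements, so it is fairly relevant to 'Large Language Models' (8 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the difficulty of preserving physical consistency in weather forecasting, the paper proposes Agent-Guided Cross-modal Decoding (AGCD), a multi-agent-guided method that extracts state-conditioned physics priors and injects them into the forecasting model, reducing error accumulation and improving long-horizon forecast stability.

Abstract translation

Accurate weather forecasting is more than a grid-wise regression task: it must preserve coherent synoptic-scale structures and the physical consistency of meteorological fields, which matters most in autoregressive rollouts, where small single-step errors can amplify into structural bias. Existing physics-prior methods typically impose global, one-off constraints through architecture design, regularization, or coupling with numerical weather prediction (NWP), and offer limited state-adaptive, sample-specific control at deployment. To close this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play, decoding-time prior-injection paradigm that derives state-conditioned physics priors from the current multivariate atmospheric state and injects them into the forecasting model in a controllable, reusable way. Specifically, we design a multi-agent meteorological narration pipeline that generates state-conditioned physics priors, using multimodal large language models (MLLMs) to extract the various meteorological elements. To apply these priors effectively, AGCD further introduces a cross-modal region interaction decoding mechanism that refines visual features through region-aware multi-scale tokenization and efficient physics-prior injection, without changing the backbone interface. Experiments on WeatherBench show consistent gains on 6-hour forecasting across two resolutions (5.625° and 1.40625°) and several backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts, where the method reduces early-stage error accumulation and improves long-horizon forecast stability.

Abstract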

Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.

Keywords: weather forecasting, agent-guided decoding, cross-modal decoding, physics-priors injection, multi-agent systems, autoregressive rollouts, meteorological narration, MLLMs
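One way to picture "injecting priors to refine visual features without changing the backbone interface" is a residual cross-attention step in which region features attend to prior tokens. The sketch below rests on that assumption; the abstract does not specify the actual mechanism, and `inject_priors`, the `gate` parameter, and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_priors(visual, priors, w_q, w_k, w_v, gate=1.0):
    """Residual cross-attention: visual (n_regions, d) attends to priors (n_tokens, d)."""
    q = visual @ w_q
    k = priors @ w_k
    v = priors @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Output keeps the input shape, so the backbone sees the same interface.
    return visual + gate * (attn @ v)

rng = np.random.default_rng(0)
d = 16
visual = rng.standard_normal((6, d))   # six region features
priors = rng.standard_normal((4, d))   # four embedded prior tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
refined = inject_priors(visual, priors, w_q, w_k, w_v)
```

Setting `gate=0.0` returns the original features unchanged, which is one simple way such a module could stay plug-and-play.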

81. ❌ Directional Embedding Smoothing for Robust Vision Language Models

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15259v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

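Scoring rationale: The paper studies safety defenses for vision-language models (VLMs). It is moderately related to LLMs (VLMs are multimodal extensions of LLMs), highly relevant to Alignment (it studies safety alignment and defense against adversarial attacks), and highly relevant to LLM Agents (it explicitly mentions agentic AI systems and agentic systems). The remaining keywords concern specific techniques (such as MoE, quantization, and inference acceleration) or particular application areas (such as AI for science) that the paper does not address.

!!! tip deepseek-chat TL;DR

The paper extends RESTA, a lightweight inference-time defense, to vision-language models; using directional embedding smoothing, it lowers the jailbreak attack success rate on the JailBreakV-28K benchmark and thereby helps secure agentic AI systems.

Abstract translation

The safety and reliability of vision-language models (VLMs) are a key part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks, which undermine their safety alignment and lead to harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate it on JailBreakV-28K, a benchmark of multimodal jailbreaking attacks. We find that RESTA effectively lowers the attack success rate across this diverse attack corpus, especially with directional embedding noise, where the injected noise is aligned with the direction of the original token embedding vectors. Our results indicate that RESTA, as a lightweight inference-time defense layer, can help secure VLMs in agentic systems as one part of an overall security framework.

Abstract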

The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing attack success rate over this diverse corpus of attacks, in particular, when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, as a lightweight, inference-time defense layer of an overall security framework.

Keywords: Vision-Language Models, Jailbreaking Attacks, Safety Alignment, Embedding Smoothing, RESTA, Agentic AI, Inference-time Defense, Robustness
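The idea of directional embedding noise, noise aligned with each token's own embedding vector, can be illustrated with a minimal NumPy sketch. The per-token scalar Gaussian scaling and the majority-vote aggregation below are assumptions chosen for illustration; the abstract does not give RESTA's exact noise distribution or aggregation rule, and the function names are hypothetical.

```python
import numpy as np

def directional_noise(embeddings, sigma, rng):
    """Perturb each token embedding along its own direction."""
    # One scalar Gaussian draw per token; the added noise is sigma * z_i * e_i,
    # so it is collinear with the original embedding e_i.
    z = rng.standard_normal((embeddings.shape[0], 1))
    return embeddings + sigma * z * embeddings

def majority_vote(token_ids):
    """Aggregate token ids sampled from several noisy forward passes."""
    values, counts = np.unique(np.asarray(token_ids), return_counts=True)
    return int(values[np.argmax(counts)])

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))            # 4 tokens, 8-dim embeddings
noisy = directional_noise(emb, sigma=0.1, rng=rng)
voted = majority_vote([3, 5, 3, 7, 3])       # toy outputs from 5 noisy passes
```

Unlike isotropic noise, this perturbation changes only the length (and possibly the sign) of each embedding, never its direction in the orthogonal subspace.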

82. ❌ SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Authors: Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15255v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

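Scoring rationale: SAGE proposes a multi-agent self-evolution framework centered on improving LLM reasoning, so it is highly relevant to 'Large Language Models', 'Chain of Thought/Multi-step Reasoning', 'System 2 Thinking/In-depth Reasoning', 'Self-Correction/Self-Improvement', 'LLM Agents/Autonomous Agents', and 'Multi-agent Systems/Agent Coordination' (10 points each). Other keywords such as MoE, quantization, RAG, and AI-for-science applications do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the lack of stable planning and quality control in multi-step LLM reasoning, the study proposes SAGE, a self-evolving framework of four agents (Challenger, Planner, Solver, and Critic) that significantly improves model performance on mathematics and code-generation benchmarks.

Abstract translation

Reinforcement learning with verifiable rewards improves the reasoning of large language models (LLMs), but many methods still depend on large human-labeled datasets. Self-play reduces this dependency, yet it usually lacks explicit planning and strict quality control, which limits stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework in which four agents (Challenger, Planner, Solver, and Critic) co-evolve from a shared LLM backbone using only a small seed set. The Challenger continually generates tasks of increasing difficulty; the Planner turns each task into a structured multi-step plan; the Solver follows the plan to produce an answer, whose correctness is judged by external verifiers. The Critic scores and filters the generated questions and plans to prevent curriculum drift and maintain the quality of the training signal, enabling stable self-training. On mathematics and code-generation benchmarks, SAGE yields consistent gains across model scales, improving Qwen-2.5-7B by 8.9% on LiveCodeBench and by 10.7% on OlympiadBench.

Abstract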

Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.

Keywords: Multi-Agent Systems, Self-Evolution, LLM Reasoning, Reinforcement Learning, Planning, Self-Improvement, Curriculum Learning, Benchmark Evaluation
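The closed loop among the four roles can be sketched with stub agents. Everything here is a toy under stated assumptions: in SAGE all four roles share one LLM backbone, whereas these are hand-written functions on an arithmetic task, the Critic's difficulty cap is hypothetical, and the reinforcement-learning update that would consume the collected buffer is omitted.

```python
from dataclasses import dataclass

@dataclass
class Task:
    a: int
    b: int
    difficulty: int

def challenger(difficulty):
    # Stub: harder tasks use larger operands.
    return Task(a=10 ** difficulty, b=difficulty + 1, difficulty=difficulty)

def planner(task):
    # Stub: a fixed structured plan for an addition task.
    return ["read operands", "add operands", "report sum"]

def solver(task, plan):
    # Stub: follows the plan and returns the sum.
    return task.a + task.b

def verifier(task, answer):
    # External verifier: checks the answer independently of the Solver.
    return answer == task.a + task.b

def critic(task, plan, max_difficulty=3):
    # Stub filter: reject tasks outside the current curriculum or empty plans.
    return task.difficulty <= max_difficulty and len(plan) > 0

def self_evolution_round(start_difficulty, steps):
    buffer, difficulty = [], start_difficulty
    for _ in range(steps):
        task = challenger(difficulty)
        plan = planner(task)
        if not critic(task, plan):
            continue  # filtered out to limit curriculum drift
        answer = solver(task, plan)
        reward = 1.0 if verifier(task, answer) else 0.0
        buffer.append((task, plan, answer, reward))
        if reward == 1.0:
            difficulty += 1  # Challenger escalates after a verified success
    return buffer

buffer = self_evolution_round(start_difficulty=1, steps=5)
```

In this run the Critic rejects the last two tasks because their difficulty exceeds the cap, so only three verified samples reach the buffer.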

83. ❌ Why the Valuable Capabilities of LLMs Are Precisely the Unexplainable Ones

Authors: Quan Cheng Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15238v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

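Scoring rationale: The paper's core topic is the relationship between the nature of LLM capabilities and interpretability, which directly involves the keywords 'Large Language Models' and 'Mechanistic Interpretability'. Through logical argument and philosophical analysis, it claims that the most valuable capabilities of LLMs are precisely those that cannot be explained by human-readable rules, which is highly relevant to explainable-AI research. The other keywords concern specific techniques (such as MoE, RLHF, and RAG), application areas (such as AI for Science), or specific capabilities (such as CoT and tool use); the paper does not discuss these implementations or application scenarios, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper argues that the most valuable capabilities of large language models are precisely those that cannot be fully explained by human-readable rules, supporting the claim with a proof by contradiction via expert-system equivalence, a concept from Chinese philosophy, and historical cases.

Abstract translation

This paper proposes and argues for a counterintuitive thesis: the truly valuable capabilities of large language models lie precisely in the part that human-readable discrete rules cannot fully capture. The core argument is a proof by contradiction via expert-system equivalence: if the full capabilities of a large language model could be described by a complete set of human-readable rules, that rule set would be functionally equivalent to an expert system; yet expert systems have been shown, historically and empirically, to be strictly weaker than large language models; hence a contradiction arises, and the capabilities with which large language models surpass expert systems are exactly those that cannot be encoded as rules. The thesis is further supported by the Chinese philosophical concept of Wu ("悟", sudden insight gained through practice), the historical failure of expert systems, and a structural mismatch between human cognitive tools and complex systems. The paper also discusses the implications for interpretability research, AI safety, and scientific epistemology.

Abstract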

This paper proposes and argues for a counterintuitive thesis: the truly valuable capabilities of large language models (LLMs) reside precisely in the part that cannot be fully captured by human-readable discrete rules. The core argument is a proof by contradiction via expert system equivalence: if the full capabilities of an LLM could be described by a complete set of human-readable rules, then that rule set would be functionally equivalent to an expert system; but expert systems have been historically and empirically demonstrated to be strictly weaker than LLMs; therefore, a contradiction arises – the capabilities of LLMs that exceed those of expert systems are exactly the capabilities that cannot be rule-encoded. This thesis is further supported by the Chinese philosophical concept of Wu (sudden insight through practice), the historical failure of expert systems, and a structural mismatch between human cognitive tools and complex systems. The paper discusses implications for interpretability research, AI safety, and scientific epistemology.

Keywords: Large Language Models, LLMs, Interpretability, Explainable AI, Expert Systems, Wu (sudden insight), AI Capabilities, Rule-based Systems
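The proof by contradiction in the abstract can be written out schematically. The notation is an assumption introduced here for illustration and does not come from the paper: $M$ is the LLM, $\mathcal{E}$ the class of expert systems, $E_R \in \mathcal{E}$ the expert system induced by a rule set $R$, and $\mathcal{C}(\cdot)$ the set of capabilities of a system.

```latex
% Requires amsmath and amssymb.
\begin{align*}
&\text{(A, assumed for contradiction)} && \exists R \text{ complete and human-readable}: \quad \mathcal{C}(M) = \mathcal{C}(E_R) \\
&\text{(P, empirical premise)} && \forall E \in \mathcal{E}: \quad \mathcal{C}(E) \subsetneq \mathcal{C}(M) \\
&\text{(A)} \wedge \text{(P)} && \mathcal{C}(M) = \mathcal{C}(E_R) \subsetneq \mathcal{C}(M) \quad \text{(contradiction)} \\
&\text{Conclusion} && \nexists R \text{ satisfying (A)}
\end{align*}
```

The argument is only as strong as premise (P), which the abstract presents as a historical and empirical observation rather than a theorem.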

84. ❌ In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks

Authors: Francesco Sovrano, Lidia Losavio, Giulia Vilone, Marc Langheinrich Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

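Scoring rationale: The paper studies symbolic regression in Kolmogorov-Arnold Networks (KANs) and proposes two in-context symbolic regression methods (GSR and GMP) to make symbol extraction more robust. It focuses on symbolic regression and KAN structure optimization within scientific machine learning, and has no direct connection to the keyword list, which centers on large language models, alignment, reasoning, agents, compression, and related techniques. Although the paper touches on AI for Science, its specific content (symbolic regression, KANs) does not match the keyword "AI for Science OR Bioinformatics OR Cheminformatics", which usually refers to applications in specific scientific fields such as bioinformatics and cheminformatics; all keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

To address symbol extraction in Kolmogorov-Arnold Networks (KANs) being sensitive and ignoring global interactions, the paper proposes two methods, Greedy in-context Symbolic Regression (GSR) and Gated Matching Pursuit (GMP); in experiments, GSR reduces median OFAT test MSE by up to 99.8%, markedly improving the robustness of symbolic regression and the consistency of recovered formulas.

Abstract translation

Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov-Arnold Networks (KANs) suit this goal well, because each connection between adjacent units (an "edge") is parameterized by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits an operator to each learned edge function in isolation, which makes the discrete choice sensitive to initialization and non-convex parameter fitting and ignores how local substitutions interact through the whole network. We study in-context symbolic regression for operator extraction in KANs and propose two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy in-context selection, choosing edge replacements by the end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortizes this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence the gates are discretized, optionally followed by a short in-context greedy refinement pass. We quantify robustness with one-factor-at-a-time (OFAT) hyperparameter sweeps and assess both the predictive error and the qualitative consistency of the recovered formulas. Across several experiments, greedy in-context symbolic regression reduces median OFAT test MSE by up to 99.8%.

Abstract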

Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov-Arnold Networks (KANs) are well suited to this goal because each connection between adjacent units (an “edge”) is parametrised by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits operators to each learned edge function in isolation, making the discrete choice sensitive to initialisation and non-convex parameter fitting, and ignoring how local substitutions interact through the full network. We study in-context symbolic regression for operator extraction in KANs, and present two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy, in-context selection by choosing edge replacements according to end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortises this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence, gates are discretised (optionally followed by a short in-context greedy refinement pass). We quantify robustness via one-factor-at-a-time (OFAT) hyper-parameter sweeps and assess both predictive error and qualitative consistency of recovered formulas. Across several experiments, greedy in-context symbolic regression achieves up to 99.8% reduction in median OFAT test MSE.

Keywords: Symbolic Regression, Kolmogorov-Arnold Networks, In-context Learning, Greedy Symbolic Regression, Gated Matching Pursuit, Robustness Improvement, Operator Extraction, Scientific Machine Learning
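The difference between fitting each edge in isolation and selecting replacements "in context" can be shown on a toy additive model. This sketch is not the paper's implementation: the two-edge model, the four-operator library, and the least-squares rescaling (standing in for the brief fine-tuning step) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical operator library; a real one would be larger.
LIBRARY = {"identity": lambda x: x, "square": np.square, "sin": np.sin, "exp": np.exp}

def greedy_in_context_sr(X, y, edge_outputs):
    """Toy model: prediction = sum of edge outputs, edge j depends only on X[:, j]."""
    outputs = [o.copy() for o in edge_outputs]
    chosen = []
    for j in range(len(outputs)):
        rest = sum(outputs) - outputs[j]          # contribution of all other edges
        best = None
        for name, op in LIBRARY.items():
            basis = op(X[:, j])
            # Least-squares scale: a cheap stand-in for brief fine-tuning.
            scale = basis @ (y - rest) / (basis @ basis)
            # End-to-end loss with this edge replaced, other edges kept in place.
            loss = np.mean((rest + scale * basis - y) ** 2)
            if best is None or loss < best[0]:
                best = (loss, name, scale * basis)
        chosen.append(best[1])
        outputs[j] = best[2]                      # commit the replacement greedily
    return chosen

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
# Simulated "learned" edge functions: the true ones plus small noise.
learned = [np.sin(X[:, 0]) + 0.01 * rng.standard_normal(200),
           X[:, 1] ** 2 + 0.01 * rng.standard_normal(200)]
chosen = greedy_in_context_sr(X, y, learned)
```

Because each candidate is scored by the loss of the whole model, a replacement committed on one edge changes the residual that the next edge is fitted against, which is the interaction that per-edge fitting ignores.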

85. ❌ SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing

Authors: Yuhuan Liu, Haitian Zhong, Xinyuan Xia, Qiang Liu, Shu Wu, Liang Wang Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

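Scoring rationale: The paper's core topic is knowledge editing for LLMs, for which it proposes the SCAN sparse editing framework, so it is highly relevant to 'Large Language Models' (10 points). The method involves sparse circuits and Sparse Transcoders, giving it some relevance to 'Mixture of Experts/Sparse Models' (8 points). Editing falls within parameter-efficient fine-tuning, so it is relevant to 'PEFT/LoRA' (8 points). The framework emphasizes mechanistic interpretability and is highly relevant to 'Mechanistic Interpretability' (10 points). Other keywords such as SLMs, Scaling Laws, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the catastrophic forgetting and model collapse that large language models suffer during sequential knowledge editing, the paper proposes SCAN, a framework based on Sparse Circuit Anchored Neurons whose mechanism-aware sparse editing preserves model integrity and outperforms existing methods on several benchmarks.

Abstract translation

Large language models (LLMs) often suffer from catastrophic forgetting and model collapse during sequential knowledge editing. This fragility stems from the prevailing dense editing paradigm, which treats the model as a black box and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN (a sparse editing framework based on Sparse Circuit Anchored Neuron), which builds a knowledge circuit with Sparse Transcoders and turns editing into a fine-grained, mechanism-aware operation. Experiments on Gemma2, Qwen3, and Llama3.1 with the CounterFact, ZsRE, and WikiFactDiff datasets show that SCAN achieves superior performance, maintaining model integrity on benchmarks such as MMLU and GSM8K even after 3,000 sequential edits, whereas other existing methods degrade progressively as edits accumulate and eventually collapse.

Abstract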

Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from the prevailing dense editing paradigm, which treats models as black boxes and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN (a sparse editing framework based on Sparse Circuit Anchored Neuron), which transforms editing into a mechanism-aware manipulation by constructing a knowledge circuit via Sparse Transcoders. Experiments on Gemma2, Qwen3, and Llama3.1 across CounterFact, ZsRE, and WikiFactDiff demonstrate that SCAN achieves superior performance, maintaining model integrity on benchmarks like MMLU and GSM8K even after 3,000 sequential edits, whereas other existing methods deteriorate progressively as editing accumulates, eventually resulting in model collapse.

Keywords: Large Language Models, Knowledge Editing, Sparse Editing, Catastrophic Forgetting, Model Integrity, Sparse Circuit, Parameter Intervention, Sequential Edits

86. ❌ InterPol: De-anonymizing LM Arena via Interpolated Preference Learning

Authors: Minsung Cho, Jaehyung Kim Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15220v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

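Scoring rationale: The paper studies de-anonymization attacks on language models in LM Arena; its core concerns identifying large language models (LLMs) and preference learning, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords, including MoE, SLMs, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agent systems, model compression, and AI for science, are either not mentioned in the abstract or unrelated to the paper's topic, and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes INTERPOL, a model-driven identification framework that uses interpolated preference learning to de-anonymize language models in LM Arena; experiments show it significantly outperforms existing baselines, and the paper quantifies the practical threat.

Abstract translation

Strict anonymity of model responses is key to the reliability of voting-based leaderboards such as LM Arena. Prior work has tried to break this assumption with simple statistical features such as TF-IDF or bag-of-words, but these methods often lack the discriminative power to tell apart stylistically similar or same-family models. To overcome these limitations and expose how severe the vulnerability is, we propose INTERPOL, a model-driven identification framework that learns to distinguish a target model from other models using interpolated preference data. Specifically, INTERPOL synthesizes hard negative samples through model interpolation and adopts an adaptive curriculum learning strategy, capturing deep stylistic patterns that surface-level statistical features miss. Extensive experiments show that INTERPOL significantly outperforms existing baselines in identification accuracy. In addition, we quantify the real-world threat of our findings through ranking-manipulation simulations on Arena battle data.

Abstract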

Strict anonymity of model responses is key to the reliability of voting-based leaderboards, such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines in identification accuracy. Furthermore, we quantify the real-world threat of our findings through ranking manipulation simulations on Arena battle data.

Keywords: de-anonymizing, LM Arena, interpolated preference learning, model identification, stylistic patterns, ranking manipulation, vulnerability, adaptive curriculum learning
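The abstract does not say how "model interpolation" is carried out. One common reading is linear interpolation of parameters between the target model and another model, with the mixing coefficient raised over training so that negatives become progressively harder. The sketch below assumes exactly that; `interpolate_weights`, `curriculum_alphas`, and the linear schedule are hypothetical and may differ from what INTERPOL does.

```python
import numpy as np

def interpolate_weights(theta_other, theta_target, alpha):
    # Assumption: linear interpolation in parameter space.
    # alpha = 0 gives the other model, alpha = 1 gives the target model.
    return {name: (1.0 - alpha) * theta_other[name] + alpha * theta_target[name]
            for name in theta_target}

def curriculum_alphas(num_stages, max_alpha=0.9):
    # Hypothetical schedule: later stages use negatives closer to the target.
    return [max_alpha * (stage + 1) / num_stages for stage in range(num_stages)]

rng = np.random.default_rng(0)
theta_target = {"w": rng.standard_normal((2, 2)), "b": rng.standard_normal(2)}
theta_other = {"w": rng.standard_normal((2, 2)), "b": rng.standard_normal(2)}

alphas = curriculum_alphas(num_stages=3)
hard_negative_models = [interpolate_weights(theta_other, theta_target, a) for a in alphas]
```

Responses sampled from such interpolated models would serve as the rejected side of preference pairs, with the target model's responses as the preferred side.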

87. ❌ ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving

Authors: Tong Nie, Yihong Tang, Junlin He, Yuewen Mei, Jie Sun, Lijun Sun, Wei Ma, Jian Sun Journal/source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

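Scoring rationale: ADV-0 is an adversarial training framework for autonomous driving. It covers closed-loop min-max optimization, zero-sum Markov games, policy optimization, and adversarial agent interaction, with the aim of improving the robustness of autonomous driving systems in long-tail scenarios. All scoring keywords relate directly to large models, deep learning techniques, or AI-for-science applications, whereas this paper involves no large models, language models, model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or AI applications in scientific fields, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ADV-0, a closed-loop min-max adversarial training framework for autonomous driving that models the interaction between the driving policy and an adversarial agent as a zero-sum Markov game, effectively exposing safety-critical failures and improving the policy's generalization to unseen long-tail risks.

Abstract translation

Deploying autonomous driving systems requires robustness to long-tail scenarios, which are rare but safety-critical. Adversarial training is a promising solution, but existing methods usually decouple scenario generation from policy optimization and rely on heuristic surrogates. This causes objective misalignment and fails to capture the shifting failure modes of an evolving policy. This paper presents ADV-0, a closed-loop min-max optimization framework that treats the interaction between the driving policy (defender) and an adversarial agent (attacker) as a zero-sum Markov game. By aligning the attacker's utility directly with the defender's objective, we reveal the optimal adversary distribution. To make the problem tractable, we cast dynamic adversary evolution as iterative preference learning, which efficiently approximates this optimum and gives an algorithm-agnostic solution to the game. Theoretically, ADV-0 converges to a Nash equilibrium and maximizes a certified lower bound on real-world performance. Experiments show that the method effectively exposes diverse safety-critical failures and significantly improves the generalization of both learned policies and motion planners to unseen long-tail risks.

Abstract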

Deploying autonomous driving systems requires robustness against long-tail scenarios that are rare but safety-critical. While adversarial training offers a promising solution, existing methods typically decouple scenario generation from policy optimization and rely on heuristic surrogates. This leads to objective misalignment and fails to capture the shifting failure modes of evolving policies. This paper presents ADV-0, a closed-loop min-max optimization framework that treats the interaction between driving policy (defender) and adversarial agent (attacker) as a zero-sum Markov game. By aligning the attacker’s utility directly with the defender’s objective, we reveal the optimal adversary distribution. To make this tractable, we cast dynamic adversary evolution as iterative preference learning, efficiently approximating this optimum and offering an algorithm-agnostic solution to the game. Theoretically, ADV-0 converges to a Nash Equilibrium and maximizes a certified lower bound on real-world performance. Experiments indicate that it effectively exposes diverse safety-critical failures and greatly enhances the generalizability of both learned policies and motion planners against unseen long-tail risks.

Keywords: autonomous driving, adversarial training, long-tail robustness, closed-loop optimization, min-max game, Markov game, policy optimization, safety-critical scenarios

88. ❌ Towards Foundation Models for Consensus Rank Aggregation

Authors: Yijun Jin, Simon Klüttermann, Chiara Balestra, Emmanuel Müller Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

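Scoring rationale: The paper proposes the Kemeny Transformer, an algorithm based on a Transformer and reinforcement learning, to solve the consensus rank aggregation problem. Although it uses a Transformer architecture and reinforcement learning, its core is a specific computational optimization problem (minimizing the Kemeny distance) rather than the study of large models or deep learning techniques themselves. It does not address any of the large-model technical principles, training methods, inference optimization, alignment, or application areas covered by the scoring keywords. All keywords are therefore unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper proposes the Kemeny Transformer, a Transformer-based algorithm trained via reinforcement learning to efficiently approximate the Kemeny-optimal ranking. On consensus rank aggregation tasks it outperforms classical methods and achieves faster inference than integer linear programming solvers.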

摘要翻译

从多个输入排序中聚合出一个共识排序是推荐系统、搜索引擎、招聘和选举等领域的基础性问题。尽管对共识排序聚合的研究已有数十年，但最小化凯梅尼距离（Kemeny distance）在计算上依然困难。具体而言，确定凯梅尼距离下的最优排序聚合是一个NP难问题，这限制了其在实际中仅能应用于相对小规模的场景。我们提出了凯梅尼变换器（Kemeny Transformer），这是一种基于变换器（Transformer）架构的新型算法，通过强化学习训练，能够高效地逼近凯梅尼最优排序。实验结果表明，我们的模型超越了传统的多数启发式方法和马尔可夫链方法，并且推理速度显著快于整数线性规划求解器。因此，我们的方法为现实世界中的排序聚合任务提供了一种实用且可扩展的替代方案。

摘要 (Abstract)

Aggregating a consensus ranking from multiple input rankings is a fundamental problem with applications in recommendation systems, search engines, job recruitment, and elections. Despite decades of research in consensus ranking aggregation, minimizing the Kemeny distance remains computationally intractable. Specifically, determining an optimal aggregation of rankings with respect to the Kemeny distance is an NP-hard problem, limiting its practical application to relatively small-scale instances. We propose the Kemeny Transformer, a novel Transformer-based algorithm trained via reinforcement learning to efficiently approximate the Kemeny optimal ranking. Experimental results demonstrate that our model outperforms classical majority-heuristic and Markov-chain approaches, achieving substantially faster inference than integer linear programming solvers. Our approach thus offers a practical, scalable alternative for real-world ranking-aggregation tasks.
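To make the objective concrete, here is a minimal sketch of exact Kemeny aggregation by brute force. It assumes the common convention in which the Kemeny distance between two strict rankings is the Kendall tau count of discordantly ordered item pairs; the function names and the three example votes are hypothetical and are not taken from the paper. The factorial search over permutations also illustrates why exact solutions are limited to small instances.

```python
from itertools import combinations, permutations


def kendall_tau_distance(r1, r2):
    """Count the item pairs that the two rankings order differently."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Total distance from a candidate ranking to all input rankings."""
    return sum(kendall_tau_distance(candidate, r) for r in rankings)


def kemeny_optimal_bruteforce(rankings):
    """Exhaustive search over all permutations; only feasible for a few items."""
    items = sorted(rankings[0])
    return min(permutations(items), key=lambda c: kemeny_score(c, rankings))


# Hypothetical input: three voters ranking three items.
votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
best = kemeny_optimal_bruteforce(votes)
print(best, kemeny_score(best, votes))
```

A learned approximator such as the one the abstract describes would replace the exhaustive `min` over permutations; this sketch only defines the quantity being minimized.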

Keywords: consensus ranking aggregation, Kemeny distance, Transformer, reinforcement learning, NP-hard problem, ranking aggregation, Kemeny Transformer, inference acceleration

89. ❌ Modeling Matches as Language: A Generative Transformer Approach for Counterfactual Player Valuation in Football

Authors: Miru Hong, Minho Lee, Geonhee Jo, Hyeokje Jo, Pascal Bauer, Sang-Ki Ko Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15212v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

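Scoring rationale: The paper proposes ScoutGPT, which models football match events as sequential tokens and trains a NanoGPT-based Transformer on next-token prediction, an innovative application of large models (LLMs) in sports science. It is related to "Large Language Models" (8 points) because it uses a Transformer language-modeling framework; related to "Pre-training" (8 points) because the model is trained via next-token prediction; rated highly related to "Monte Carlo Tree Search AND LLM" (10 points) because it explicitly uses Monte Carlo sampling for counterfactual simulation; and rated highly related to "AI for Science" (10 points) as an application of AI to sports science. Other keywords, such as MoE, SFT, and RAG, are not involved and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes ScoutGPT, which serializes football match events into language tokens and trains a Transformer on next-token prediction, enabling a quantitative assessment of the impact of player transfers. Experiments indicate that the model captures player-specific impact beyond traditional static metrics.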

摘要翻译

评估足球球员转会具有挑战性，因为球员的表现高度依赖于战术体系、队友配合及比赛情境。尽管存在这种复杂性，招募决策往往仍依赖于静态统计数据及主观专家判断，这些方法未能充分考虑上述情境因素。这一局限主要源于缺乏能够预测假设情境下比赛结果的反事实模拟机制。为应对这些挑战，我们提出了ScoutGPT——一种在语言建模框架内将足球比赛事件视为序列标记的生成模型。该模型基于NanoGPT的Transformer架构，通过下一标记预测任务进行训练，从而学习比赛事件序列的动态规律，并能在假设阵容下模拟事件序列，其预测性能相较于现有基线模型展现出显著优势。利用这一能力，模型采用蒙特卡洛采样实现反事实模拟，从而支持对未观测情境的评估。基于K联赛数据的实验表明，模拟球员转会能够带来进攻推进和进球概率的可量化变化，这证明ScoutGPT能够捕捉到超越传统静态指标的球员特异性影响。

摘要 (Abstract)

Evaluating football player transfers is challenging because player actions depend strongly on tactical systems, teammates, and match context. Despite this complexity, recruitment decisions often rely on static statistics and subjective expert judgment, which do not fully account for these contextual factors. This limitation stems largely from the absence of counterfactual simulation mechanisms capable of predicting outcomes in hypothetical scenarios. To address these challenges, we propose ScoutGPT, a generative model that treats football match events as sequential tokens within a language modeling framework. Utilizing a NanoGPT-based Transformer architecture trained on next-token prediction, ScoutGPT learns the dynamics of match event sequences to simulate event sequences under hypothetical lineups, demonstrating superior predictive performance compared to existing baseline models. Leveraging this capability, the model employs Monte Carlo sampling to enable counterfactual simulation, allowing for the assessment of unobserved scenarios. Experiments on K League data show that simulated player transfers lead to measurable changes in offensive progression and goal probabilities, indicating that ScoutGPT captures player-specific impact beyond traditional static metrics.
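The Monte Carlo counterfactual step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the hand-written `MODEL` table stands in for a trained next-event model, and the event vocabulary, lineup names, and probabilities are hypothetical. The idea is only that sampling many event sequences under each hypothetical lineup gives an estimate of an outcome probability that can be compared across lineups.

```python
import random

# Hypothetical next-event distributions standing in for a trained model.
# Each lineup maps the current event to a distribution over the next event.
MODEL = {
    "lineup_A": {
        "pass": {"pass": 0.6, "shot": 0.2, "loss": 0.2},
        "shot": {"goal": 0.1, "loss": 0.9},
    },
    "lineup_B": {
        "pass": {"pass": 0.6, "shot": 0.3, "loss": 0.1},
        "shot": {"goal": 0.1, "loss": 0.9},
    },
}
TERMINAL = ("goal", "loss")


def sample_next(dist, rng):
    """Draw one next event according to the given probabilities."""
    events, weights = zip(*dist.items())
    return rng.choices(events, weights=weights, k=1)[0]


def rollout(lineup, rng, max_len=50):
    """Sample one possession; return True if it ends in a goal."""
    event = "pass"
    for _ in range(max_len):
        if event in TERMINAL:
            break
        event = sample_next(MODEL[lineup][event], rng)
    return event == "goal"


def goal_probability(lineup, n_samples=20_000, seed=0):
    """Monte Carlo estimate of the goal probability under one lineup."""
    rng = random.Random(seed)
    return sum(rollout(lineup, rng) for _ in range(n_samples)) / n_samples


p_a = goal_probability("lineup_A")
p_b = goal_probability("lineup_B")
print(f"A: {p_a:.3f}  B: {p_b:.3f}  difference: {p_b - p_a:+.3f}")
```

For this toy table the exact goal probabilities are 0.05 and 0.075, so the sampled estimates should land close to those values; with a real sequence model the same loop would call the model's sampling routine instead of `sample_next`.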

Keywords: football player valuation, generative transformer, counterfactual simulation, Monte Carlo sampling, match event sequences, ScoutGPT, player transfer evaluation, NanoGPT-based architecture

90. ❌ CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds

Authors: Vaishnavi Nagabhushana, Kartikay Agrawal, Ayon Borthakur Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

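Scoring rationale: The paper studies spiking neural networks (SNNs) and Transformers for continual learning, specifically catastrophic forgetting in class-incremental learning (CIL). The scoring keywords all focus on large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper centers on a spiking neural network architecture (CATFormer), a dynamic-threshold neuron model (DTLIF), and class-incremental learning, and does not involve any LLM technique, application, or related concept. All keywords are therefore entirely unrelated to the paper, and the score is 0.

!!! tip deepseek-chat TL;DR

To address catastrophic forgetting in spiking neural networks under class-incremental learning, the paper proposes the CATFormer framework. Using a dynamic-threshold neuron model and a gated dynamic head selection mechanism, it outperforms existing rehearsal-free methods on static and neuromorphic datasets.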

摘要翻译

尽管深度神经网络在受控环境中表现极为出色，但在现实场景中却往往失效——这些场景中的数据并非一次性全部可用，且模型必须适应可能偏离初始分布的新数据分布。基于新数据的后续更新会导致先前习得的知识丢失，这一现象通常被称为灾难性遗忘。相比之下，大脑能够在不发生此类灾难性遗忘的情况下持续学习，无论其遇到的任务数量如何。现有的用于类增量学习的脉冲神经网络在任务累积时会出现性能急剧下降的问题。本文提出CATFormer（上下文自适应阈值变换器），一个可扩展的框架以克服这一局限。我们发现，防止脉冲神经网络遗忘的关键不仅在于突触可塑性，还在于调节神经元兴奋性。CATFormer的核心是动态阈值泄漏积分发放神经元模型，该模型利用上下文自适应阈值作为知识保留的主要机制。此机制与用于任务无关推理的门控动态头选择机制相结合。在静态数据集（CIFAR-10/100/Tiny-ImageNet）和神经形态数据集（CIFAR10-DVS/SHD）上的广泛评估表明，CATFormer在各种任务划分下均优于现有的无需回放的类增量学习算法，从而确立了其作为高能效、真实类增量学习的理想架构地位。

摘要 (Abstract)

Although deep neural networks perform extremely well in controlled environments, they fail in real-world scenarios where data isn’t available all at once, and the model must adapt to a new data distribution that may or may not follow the initial distribution. Previously acquired knowledge is lost during subsequent updates based on new data, a phenomenon commonly known as catastrophic forgetting. In contrast, the brain can learn without such catastrophic forgetting, irrespective of the number of tasks it encounters. Existing spiking neural networks (SNNs) for class-incremental learning (CIL) suffer a sharp performance drop as tasks accumulate. Here we introduce CATFormer (Context Adaptive Threshold Transformer), a scalable framework that overcomes this limitation. We observe that the key to preventing forgetting in SNNs lies not only in synaptic plasticity but also in modulating neuronal excitability. At the core of CATFormer is the Dynamic Threshold Leaky Integrate-and-Fire (DTLIF) neuron model, which leverages context-adaptive thresholds as the primary mechanism for knowledge retention. This is paired with a Gated Dynamic Head Selection (G-DHS) mechanism for task-agnostic inference. Extensive evaluation on both static (CIFAR-10/100/Tiny-ImageNet) and neuromorphic (CIFAR10-DVS/SHD) datasets reveals that CATFormer outperforms existing rehearsal-free CIL algorithms across various task splits, establishing it as an ideal architecture for energy-efficient, true class-incremental learning.

Keywords: Continual Learning, Spiking Neural Networks, Catastrophic Forgetting, Class-Incremental Learning, Dynamic Threshold, Transformer, Energy-efficient, Context Adaptive Threshold

91. ❌ What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Authors: David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, Bernt Schiele Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15185v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

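Scoring rationale: The paper studies the architectural design of end-to-end autonomous driving planners, focusing on how specific patterns such as perceptual representations, trajectory representations, and generative planning affect closed-loop performance, and proposes a new architecture, BevAD. All scoring keywords concern large language models (LLMs) and related techniques (training methods, reasoning techniques, application frameworks, and so on), whereas this paper does not involve language models or natural language processing at all; its core is computer vision, autonomous driving, and imitation learning. It is therefore entirely unrelated to all keywords, each of which receives 0 points.

!!! tip deepseek-chat TL;DR

The paper systematically studies how common architectural patterns in end-to-end autonomous driving planners affect closed-loop performance and, building on those findings, proposes BevAD, a novel lightweight and highly scalable architecture. BevAD achieves a 72.7% success rate on the Bench2Drive benchmark and shows strong data-scaling behavior.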

摘要翻译

端到端自动驾驶因其在交互场景中学习鲁棒行为并随数据规模扩展的潜力而受到广泛关注。主流架构通常基于独立的感知与规划模块构建，并通过潜在表征（如鸟瞰图特征网格）连接以保持端到端可微性。该范式主要基于开环数据集发展，其评估不仅关注驾驶性能，也关注中间感知任务。然而，在开环评估中表现优异的架构改进，往往难以转化为可扩展的鲁棒闭环驾驶学习。本文系统性地重新审视了常见架构模式对闭环性能的影响：（1）高分辨率感知表征，（2）解耦的轨迹表征，以及（3）生成式规划。关键的是，我们的分析评估了这些模式的综合影响，揭示了未曾预见的局限性以及尚未充分探索的协同效应。基于这些发现，我们提出了BevAD——一种新颖的轻量级且高度可扩展的端到端驾驶架构。BevAD在Bench2Drive基准测试中实现了72.7%的成功率，并展示了使用纯模仿学习的强大数据扩展能力。我们的代码与模型已公开：https://dmholtz.github.io/bevad/

摘要 (Abstract)

End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird’s eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/

Keywords: end-to-end autonomous driving, architectural patterns, closed-loop performance, perceptual representations, trajectory representations, generative planning, BevAD, imitation learning

92. ❌ Iterative Learning Control-Informed Reinforcement Learning for Batch Process Control

Authors: Runze Lin, Ziqi Zhuo, Junghui Chen, Lei Xie, Hongye Su Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

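Scoring rationale: The paper focuses on combining deep reinforcement learning (DRL) with iterative learning control (ILC) for industrial process control; its field is control engineering and automation. All scoring keywords relate directly to large language models (LLMs), innovations in the technical principles of deep learning, or AI applications in science, whereas this paper does not cover LLMs, deep learning model architectures, training methods, inference optimization, alignment techniques, agent systems, or AI for Science. All keywords therefore receive a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes IL-CIRL, a framework that combines iterative learning control (ILC) with deep reinforcement learning (DRL) to address the stability and safety problems of DRL methods in batch process control. Kalman filter-based state estimation guides the DRL policy toward satisfying operational constraints and stability guarantees.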

摘要翻译

深度强化学习（DRL）的一个显著局限在于探索-利用过程中生成的动作存在随机不确定性，这在训练和部署阶段均会带来显著的安全风险。在工业过程控制中，由于缺乏形式化的稳定性与收敛性保证，进一步阻碍了从业者采用DRL方法。相比之下，迭代学习控制（ILC）是一种针对重复性系统的成熟自主控制方法，尤其在间歇过程优化中应用广泛。ILC通过在连续批次之间或单个批次内部迭代优化控制律，以补偿重复性与非重复性扰动，从而实现期望的控制性能。本研究提出一种迭代学习控制引导的强化学习（IL-CIRL）框架，用于在间歇过程的双层“批间-批内”控制架构中训练DRL控制器。该方法在迭代学习结构中引入基于卡尔曼滤波的状态估计，以引导DRL智能体学习满足操作约束并确保稳定性保证的控制策略。该框架使得针对多扰动条件下运行的间歇过程，能够系统化地设计DRL控制器。

摘要 (Abstract)

A significant limitation of Deep Reinforcement Learning (DRL) is the stochastic uncertainty in actions generated during exploration-exploitation, which poses substantial safety risks during both training and deployment. In industrial process control, the lack of formal stability and convergence guarantees further inhibits adoption of DRL methods by practitioners. Conversely, Iterative Learning Control (ILC) represents a well-established autonomous control methodology for repetitive systems, particularly in batch process optimization. ILC achieves desired control performance through iterative refinement of control laws, either between consecutive batches or within individual batches, to compensate for both repetitive and non-repetitive disturbances. This study introduces an Iterative Learning Control-Informed Reinforcement Learning (IL-CIRL) framework for training DRL controllers in dual-layer batch-to-batch and within-batch control architectures for batch processes. The proposed method incorporates Kalman filter-based state estimation within the iterative learning structure to guide DRL agents toward control policies that satisfy operational constraints and ensure stability guarantees. This approach enables the systematic design of DRL controllers for batch processes operating under multiple disturbance conditions.
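The batch-to-batch refinement that ILC performs can be sketched with the classic P-type update law, u_{k+1}(t) = u_k(t) + L·e_k(t+1). This is a textbook illustration only, not the paper's IL-CIRL method: it contains no reinforcement learning and no Kalman filter, and the first-order plant, the learning gain, and the reference trajectory are all hypothetical choices.

```python
import math


def simulate(u, a=0.5, b=1.0):
    """Run one batch of x[t+1] = a*x[t] + b*u[t] from x[0] = 0.

    Returns the outputs y[1..T], so ys[t] is the output produced by u[t].
    """
    x, ys = 0.0, []
    for u_t in u:
        x = a * x + b * u_t
        ys.append(x)
    return ys


def p_type_ilc(reference, batches=30, gain=0.5):
    """Repeat the batch, correcting each input by gain * tracking error."""
    u = [0.0] * len(reference)
    max_errors = []
    for _ in range(batches):
        y = simulate(u)
        e = [r - y_t for r, y_t in zip(reference, y)]
        max_errors.append(max(abs(v) for v in e))
        # e[t] here is the error at time t+1, which is what u[t] influences.
        u = [u_t + gain * e_t for u_t, e_t in zip(u, e)]
    return u, max_errors


# Hypothetical reference trajectory over a 20-step batch.
reference = [math.sin(0.3 * t) for t in range(1, 21)]
u_final, max_errors = p_type_ilc(reference)
print(f"first batch error: {max_errors[0]:.4f}, last batch error: {max_errors[-1]:.2e}")
```

With these particular plant and gain values the tracking error shrinks from batch to batch; a poorly chosen gain can make the error grow for many batches before it decays, which is one reason formal convergence guarantees matter in this setting.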

Keywords: Deep Reinforcement Learning, Iterative Learning Control, Batch Process Control, Stability Guarantees, Kalman Filter, Control Policy, Industrial Process Control, Dual-layer Control Architecture

93. ❌ Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems

Authors: Vladyslav Parakhin Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

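Scoring rationale: The paper's core topic is reducing synchronization overhead in multi-agent LLM systems, which directly involves the keywords "LLM Agents" and "Multi-agent Systems", so both receive 10 points. The paper explicitly concerns LLMs, so "Large Language Models" also receives 10 points. Other keywords, such as MoE, SLMs, training methods, inference optimization, and scientific applications, are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper adapts the MESI cache coherence protocol to multi-agent LLM systems, introducing the Artifact Coherence System and the Token Coherence Theorem. This reduces synchronization cost from O(n × S × |D|) to O((n + W) × |D|) and yields token savings of up to 95% in simulation.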

摘要翻译

在多智能体大语言模型（LLM）编排中，若采用朴素的广播机制，同步开销随智能体数量、步骤数和工件规模呈 O(n × S × |D|) 增长——这一机制我称之为广播引发的三重乘性开销。我认为这种弊端是完全状态重广播的结构性残留，而非多智能体协调的固有属性。

核心主张是：LLM 多智能体系统中的同步开销爆炸问题，在形式精度上可映射至共享内存多处理器中的缓存一致性问题，且经过最小结构修改后，MESI 协议的失效机制可迁移至工件同步场景。

我构建了工件一致性系统（Artifact Coherence System, ACS），并证明了令牌一致性定理：当 S > n + W(d_i) 时，惰性失效机制可将开销降低至少 S/(n + W(d_i))，从而将 O(n × S × |D|) 复杂度转化为 O((n + W) × |D|)。一个经 TLA+ 形式验证的协议确保了在约 2,400 个已探索状态中，单写入者安全性、单调版本控制及有限过时性均得到满足。

在四种工作负载配置下的仿真结果显示，令牌节省率分别为：V=0.05 时 95.0% ± 1.3%，V=0.10 时 92.3% ± 1.4%，V=0.25 时 88.3% ± 1.5%，V=0.50 时 84.2% ± 1.3%——均超过定理给出的保守下界。即使在 V=0.9 时仍保持约 81% 的节省率，这与预测的崩溃阈值相悖。

本研究的贡献包括：（1）形式化的 MESI 协议到工件状态映射；（2）作为节省下界的令牌一致性定理；（3）具备三个已证明不变量的 TLA+ 验证协议；（4）解决“始终读取”异议的条件化工件访问语义刻画；（5）通过轻量适配层集成 LangGraph、CrewAI 和 AutoGen 的参考 Python 实现。

摘要 (Abstract)

Multi-agent LLM orchestration incurs synchronization costs scaling as O(n × S × |D|) in agents, steps, and artifact size under naive broadcast – a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination. The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification. I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n × S × |D|) to O((n + W) × |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states. Simulation across four workload configurations yields token savings of 95.0% ± 1.3% at V=0.05, 92.3% ± 1.4% at V=0.10, 88.3% ± 1.5% at V=0.25, and 84.2% ± 1.3% at V=0.50 – each exceeding the theorem’s conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold. Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.
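The arithmetic behind the stated bound can be worked through in a few lines. This is a sketch under assumptions: it takes the abstract's expressions at face value, with n agents, S steps, W writes to an artifact, and |D| tokens per artifact, and it treats an attenuation factor f as implying a savings fraction of 1 − 1/f. The function names and the workload numbers are hypothetical and are not from the paper's reference implementation.

```python
def broadcast_cost(n_agents: int, steps: int, artifact_tokens: int) -> int:
    """Tokens sent when every agent receives the full artifact at every step."""
    return n_agents * steps * artifact_tokens


def attenuation_lower_bound(n_agents: int, steps: int, writes: int) -> float:
    """Lower bound S / (n + W) on the cost reduction, stated only for S > n + W."""
    if steps <= n_agents + writes:
        raise ValueError("the bound is stated only for S > n + W")
    return steps / (n_agents + writes)


def savings_lower_bound(n_agents: int, steps: int, writes: int) -> float:
    """Savings fraction implied by the attenuation bound: 1 - 1/f."""
    return 1.0 - 1.0 / attenuation_lower_bound(n_agents, steps, writes)


# Hypothetical workload: 4 agents, 50 steps, 6 writes, 2,000-token artifact.
n, S, W, D = 4, 50, 6, 2_000
baseline = broadcast_cost(n, S, D)
factor = attenuation_lower_bound(n, S, W)
savings = savings_lower_bound(n, S, W)
print(f"baseline tokens: {baseline}")
print(f"attenuation >= {factor:.1f}x, savings >= {savings:.0%}")
```

Because the bound depends on S relative to n + W, long-running workflows with few writes benefit most, while the bound says nothing once the number of steps falls to n + W or below.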

Keywords: Multi-agent LLM systems, Synchronization overhead, Cache coherence, MESI protocol, Token Coherence Theorem, Artifact Coherence System, TLA+ verification, LangGraph integration

94. ❌ Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning

Authors: Ansar Rahman, Hassan Shojaee-Mend, Sepideh Hatamikia Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

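Scoring rationale: The paper focuses on multimodal medical image analysis for autism spectrum disorder classification using graph learning and cross-attention, an application of AI in the biomedical domain. Its content is entirely unrelated to the vast majority of keywords (which concern large-model technical principles, training methods, inference optimization, alignment techniques, and so on). It is related only to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because it is an application of AI to bioinformatics/neuroscience; however, the paper does not contribute innovations in large-model or deep learning technical principles, nor does it mention any specific large-model technique. That keyword therefore receives 5 points (some relevance), and all other keywords receive 0 points (entirely unrelated).

!!! tip deepseek-chat TL;DR

The study proposes a multimodal fusion framework based on graph learning and cross-attention that integrates functional and structural brain imaging data to improve automated classification of autism spectrum disorder, outperforming existing methods on the ABIDE-I dataset.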

摘要翻译

自闭症谱系障碍（ASD）是一种复杂的神经发育疾病，其特征是非典型的脑功能连接与细微的结构改变。静息态功能磁共振成像（rs-fMRI）已被广泛用于识别大规模脑网络的异常，而结构磁共振成像（sMRI）则提供了关于形态学组织的补充信息。尽管二者具有互补性，但在统一框架内有效整合这些异质性成像模态仍具挑战性。本研究提出了一种多模态图学习框架，该框架在保持功能连接主导作用的同时，整合了结构成像与表型信息以进行ASD分类。所提出的框架在ABIDE-I数据集上进行了评估。每个被试被表示为群体图中的一个节点。功能与结构特征被提取为模态特定的节点属性，而个体间关系则通过基于表型信息的成对关联编码器（PAE）进行建模。训练两个边变分图卷积网络（Edge Variational GCNs）以学习被试级别的嵌入表示。为实现有效的多模态整合，我们引入了一种新颖的基于非对称Transformer的交叉注意力机制，该机制允许功能嵌入有选择地整合互补的结构信息，同时保持功能的主导性。融合后的嵌入表示随后输入多层感知机（MLP）进行ASD分类。采用分层10折交叉验证，该框架取得了87.3%的受试者工作特征曲线下面积（AUC）和84.4%的准确率。在留一站点交叉验证（LOSO-CV）下，模型取得了平均跨站点准确率82.0%，其性能在10折交叉验证下优于现有方法约3%，在LOSO-CV下优于约7%。所提出的框架有效整合了来自多站点ABIDE-I数据集的异质多模态数据，提升了跨成像站点的自动化ASD分类性能。

摘要 (Abstract)

Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by atypical functional brain connectivity and subtle structural alterations. Resting-state fMRI (rs-fMRI) has been widely used to identify disruptions in large-scale brain networks, while structural MRI provides complementary information about morphological organization. Despite their complementary nature, effectively integrating these heterogeneous imaging modalities within a unified framework remains challenging. This study proposes a multimodal graph learning framework that preserves the dominant role of functional connectivity while integrating structural imaging and phenotypic information for ASD classification. The proposed framework is evaluated on the ABIDE-I dataset. Each subject is represented as a node within a population graph. Functional and structural features are extracted as modality-specific node attributes, while inter-subject relationships are modeled using a pairwise association encoder (PAE) based on phenotypic information. Two Edge Variational GCNs are trained to learn subject-level embeddings. To enable effective multimodal integration, we introduce a novel asymmetric transformer-based cross-attention mechanism that allows functional embeddings to selectively incorporate complementary structural information while preserving functional dominance. The fused embeddings are then passed to an MLP for ASD classification. Using stratified 10-fold cross-validation, the framework achieved an AUC of 87.3% and an accuracy of 84.4%. Under leave-one-site-out cross-validation (LOSO-CV), the model achieved an average cross-site accuracy of 82.0%, outperforming existing methods by approximately 3% under 10-fold cross-validation and 7% under LOSO-CV. The proposed framework effectively integrates heterogeneous multimodal data from the multi-site ABIDE-I dataset, improving automated ASD classification across imaging sites.

Keywords: Autism Spectrum Disorder (ASD) classification, multimodal fusion, graph learning, cross-attention mechanism, functional connectivity, structural MRI, population graph, ABIDE-I dataset

95. ❌ HindSight: Evaluating Research Idea Generation via Future Impact

Authors: Bo Jiang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

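Scoring rationale: The paper studies how to evaluate research ideas generated by LLMs. It directly involves LLM-as-Judge evaluation and retrieval-augmented generation (RAG) systems, so both of those keywords are rated highly relevant (10/10). The other keywords, such as MoE, quantization, inference acceleration, and alignment, are not mentioned in the abstract and are unrelated to the paper (0/10).

!!! tip deepseek-chat TL;DR

The paper proposes HindSight, a time-split evaluation framework for AI-generated research ideas. It matches generated ideas against real future publications and scores them by citation impact and venue acceptance. It finds a marked gap between LLM judges and real research impact, and finds that the retrieval-augmented system produces higher-scoring ideas.

Abstract translation (Chinese)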

评估人工智能生成的研究构想通常依赖于大语言模型评审或专家小组，这两种方式均具主观性，且与实际研究影响力脱节。我们引入时间切分评估框架 HindSight，该框架通过将生成的构想与未来实际发表的论文进行匹配，并依据其引用影响力和会议/期刊录用情况进行评分，从而衡量构想质量。设定一个时间截点 $T$，我们将构想生成系统严格限制在 $T$ 之前的文献范围内，然后将其输出与随后30个月内发表的论文进行比对评估。在10个人工智能/机器学习研究主题上的实验揭示了一个显著脱节：作为评审的大语言模型（LLM-as-Judge）未发现检索增强型与基础型构想生成之间存在显著差异（$p = 0.584$），而 HindSight 框架显示检索增强系统产生的构想评分高出2.5倍（$p < 0.001$）。此外，HindSight 评分与大语言模型评判的新颖性呈负相关（$\rho = -0.29$，$p < 0.01$），这表明大语言模型系统性地高估了那些听起来新颖、却从未在真实研究中实现的构想。

Abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels – both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p = 0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p < 0.001$). Moreover, HindSight scores are negatively correlated with LLM-judged novelty ($\rho = -0.29$, $p < 0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.

Keywords: research idea generation, evaluation framework, LLM-as-Judge, retrieval-augmented generation, citation impact, future publications, time-split evaluation, AI-generated ideas
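
The time-split protocol described in the abstract can be sketched roughly as follows. This is only an illustration under assumptions: the record layout (`published` field), the helper names `add_months` and `time_split`, and the choice to treat the cutoff date itself as part of the future window are all hypothetical, and the matching and citation-based scoring steps are not shown because the abstract does not specify them.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so it stays valid."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def time_split(papers, cutoff: date, horizon_months: int = 30):
    """Split papers into the pre-cutoff corpus and the future evaluation window.

    `visible` is what an idea generator may read (published before the cutoff);
    `future` is what its ideas are later compared against (assumed layout).
    """
    end = add_months(cutoff, horizon_months)
    visible = [p for p in papers if p["published"] < cutoff]
    future = [p for p in papers if cutoff <= p["published"] < end]
    return visible, future
```

With a cutoff of 2023-01-01 and the 30-month horizon from the abstract, the evaluation window in this sketch closes on 2025-07-01.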

96. ❌ To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Authors: Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

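Scoring rationale: The paper studies LLMs for private-library code generation and directly involves "Large Language Models" (the core research subject) and "Tool Use/API Tool Use" (how to get LLMs to call private-library APIs). The proposed PriCoder method trains LLMs on synthesized data, which falls under "Post-training/Supervised Fine-tuning". The paper notes that existing approaches retrieve API documentation and inject it into the context, which is related to "Retrieval-Augmented Generation" but is not the core topic. Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address LLMs' weak ability to invoke APIs when generating code for private libraries, the paper proposes PriCoder, which trains LLMs on automatically synthesized data. It substantially improves private-library code generation (pass@1 gains of over 20% in many settings) with negligible impact on general code generation capability.

Abstract translation (Chinese)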

大语言模型（LLM）在代码生成方面展现出强大潜力，但在面向私有库的代码生成任务中仍存在局限——该任务的目标是使用私有库的API生成代码。现有方法主要依赖于检索私有库API文档，并在推理时将相关知识注入上下文。然而，我们的研究表明这并不充分：即使提供准确的必要知识，LLM仍难以有效调用私有库API。

为突破此限制，我们提出PriCoder方法，通过自动合成数据来教导LLM调用私有库API。具体而言，PriCoder将私有库数据合成建模为图构建过程，并交替使用两种图算子：（1）渐进式图演化——通过从基础样本逐步合成更多样化的训练样本来提升数据多样性；（2）多维图剪枝——通过严格过滤流程提升数据质量。为支持严谨评估，我们基于测试模型不熟悉的最新发布库构建了两个新基准。在三个主流LLM上的实验表明，PriCoder显著提升了面向私有库的代码生成能力，在多数场景下pass@1指标提升超过20%，同时对通用代码生成能力的影响可忽略不计。我们的代码与基准已公开于https://github.com/contact-eniacode/PriCoder。

Abstract

Large Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private-library-oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private-library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private-library APIs effectively. To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private-library APIs through automatically synthesized data. Specifically, PriCoder models private-library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private-library-oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at https://github.com/contact-eniacode/PriCoder.

Keywords: Large Language Models, code generation, private libraries, API invocation, data synthesis, fine-tuning, PriCoder, benchmarks
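
For readers unfamiliar with the pass@1 metric quoted above, here is a minimal sketch of the widely used unbiased pass@k estimator. The abstract does not say how the authors computed pass@1, so treating this estimator as their procedure is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples, c of which pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass. The abstract also does not say whether the reported gain of over 20% is absolute or relative.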

97. ❌ Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Authors: Mumuksh Tayal, Manan Tayal, Ravi Prakash Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

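Scoring rationale: The paper focuses on safe offline reinforcement learning (Safe Offline RL): learning policies from static datasets that satisfy strict safety constraints. Its core techniques are Hamilton-Jacobi reachability analysis, flow policies, behavioral cloning, a self-consistency Bellman recursion, and conformal prediction calibration. All of the listed keywords concern large language models (LLMs), deep learning techniques, or specific AI application areas (such as bioinformatics), whereas this paper addresses safe control in reinforcement learning, a different subfield of machine learning. It does not involve any large language models, deep learning architectures, training methods, inference optimization, AI agents, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Safe Flow Q-Learning (SafeFQL) for offline safe reinforcement learning. It combines a reachability-based safety value function with an efficient one-step flow policy, achieving low inference latency while enforcing safety constraints.

Abstract translation (Chinese)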

离线安全强化学习（RL）旨在从静态数据集中学习在严格安全约束下实现奖励最大化的策略。现有方法通常依赖于软性期望成本目标或迭代生成式推断，这对于安全关键型实时控制可能不足。我们提出安全流Q学习（SafeFQL），该方法通过将受汉密尔顿-雅可比可达性启发的安全值函数与高效的一步流策略相结合，将FQL扩展至安全离线RL领域。SafeFQL通过自洽贝尔曼递归学习安全值函数，通过行为克隆训练流策略，并将其提炼为一步执行器，从而在部署时无需拒绝采样即可实现奖励最大化的安全动作选择。为应对所学安全边界中有限数据近似误差，我们增加了共形预测校准步骤，以调整安全阈值并提供有限样本概率安全覆盖。实验表明，与扩散式安全生成基线方法相比，SafeFQL以适度增加的离线训练成本换取了显著降低的推断延迟，这对实时安全关键型部署具有优势。在船舶导航及Safety Gymnasium MuJoCo任务中，SafeFQL达到或超越了先前离线安全RL方法的性能，同时大幅减少了约束违反情况。

Abstract

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton–Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation, and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.

Keywords: Offline Safe Reinforcement Learning, Safe Flow Q-Learning, Hamilton-Jacobi Reachability, Flow Policy, Behavioral Cloning, Conformal Prediction, Safety Constraints, Inference Latency
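
The conformal calibration step mentioned in the abstract can be illustrated with a generic split-conformal threshold. This is a sketch of the standard technique only; how SafeFQL defines its nonconformity scores and applies the adjusted threshold is not given in the abstract, so the `scores` input and the function name `conformal_threshold` are assumptions.

```python
import math


def conformal_threshold(scores, alpha: float) -> float:
    """Split-conformal quantile of held-out calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score. Under
    exchangeability, a fresh score falls at or below this value with
    probability at least 1 - alpha (a finite-sample guarantee).
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return sorted(scores)[rank - 1]
```

A smaller `alpha` asks for higher coverage and therefore selects a more conservative (larger) threshold; with very few calibration points the sketch returns infinity instead of an unsupported bound.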

98. ❌ Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies

Authors: Makoto Nakamura Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

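Scoring rationale: The paper studies standardized conversion of legal documents and semantic similarity analysis across jurisdictions, mainly involving XML schema conversion, multilingual embedding models, and semantic retrieval. Its content is unrelated to all of the scoring keywords (which focus on large models, deep learning techniques, and their applications); it does not involve any large models, deep learning, model training, inference optimization, alignment, agent systems, or similar core techniques. The "embedding models" it mentions are general-purpose semantic embedding techniques rather than large-model or deep-learning innovations. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents an integrated framework based on the Japanese Legal Standard XML schema. A conversion pipeline makes Japanese statutes interoperable with international legal document standards, and multilingual embedding models with semantic similarity techniques identify corresponding provisions across national legal systems, in support of computational comparative law.

Abstract translation (Chinese)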

本文通过衔接两个基于日本法律标准（JLS）XML架构的连续研究项目，提出了一个计算比较法的集成框架。首个项目通过开发从JLS到Akoma Ntoso（AKN）标准的转换管道，实现了结构互操作性，使日本法规能够融入基于LegalDocML的国际立法数据库。在此基础上，第二个项目应用多语言嵌入模型与语义文本相似性技术，以识别不同国家法律体系间的对应条款。结合多语言嵌入、FAISS检索与交叉编码器（Cross-Encoder）重排序的原型系统，能够生成候选对应关系，并将其可视化为跨司法管辖网络，以支持探索性比较分析。

Abstract

This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.

Keywords: computational comparative law, Japanese Legal Standard XML schema, Akoma Ntoso standard, multilingual embedding models, semantic textual similarity, cross-jurisdictional networks, legal document interoperability
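
The two-stage retrieve-then-rerank pattern behind the prototype can be sketched in plain Python. Note the substitutions: brute-force cosine similarity stands in for FAISS retrieval, and the caller-supplied `pair_score` function stands in for a Cross-Encoder. Both, along with the record layout and function names, are assumptions made for illustration and do not reflect the paper's actual implementation.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query, corpus, pair_score, top_k: int = 2):
    """Stage 1: shortlist by embedding similarity. Stage 2: reorder the shortlist.

    `query` and each corpus entry are dicts with "text" and "vec" keys
    (hypothetical layout). `pair_score(query_text, doc_text)` is a
    placeholder for a more expensive pairwise relevance model.
    """
    shortlist = sorted(
        corpus, key=lambda d: cosine(query["vec"], d["vec"]), reverse=True
    )[:top_k]
    return sorted(
        shortlist, key=lambda d: pair_score(query["text"], d["text"]), reverse=True
    )
```

The design point is that the cheap vector stage narrows the candidate provisions, so the costly pairwise scorer only runs on `top_k` items.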

99. ❌ PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

Authors: Mark Deutel, Simon Geis, Axel Plinge Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

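Scoring rationale: The paper focuses on efficient deployment of deep neural networks (DNNs) on microcontroller units (MCUs), mainly involving neural architecture search (NAS), model compression (including pruning and quantization), and edge-device optimization. Among the scoring keywords, only "Quantization OR Model Compression OR Low-bit Weights" is partly related (the paper explicitly mentions quantization optimization; relevance 5/10). The paper does not involve large language models (LLMs), innovations in deep learning techniques, or scientific applications, so the other keywords are unrelated. Because this keyword's weight is 0.0 in the table above, its 5/10 relevance contributes nothing and the weighted total remains 0.0.

!!! tip deepseek-chat TL;DR

The paper proposes PrototypeNAS, a zero-shot neural architecture search method for rapidly designing and optimizing deep neural networks so that they can be deployed efficiently on resource-constrained microcontroller units, reaching accuracy comparable to large models across multiple tasks.

Abstract translation (Chinese)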

在具有不同硬件约束的边缘设备上实现高效的深度神经网络（DNN）推理是一项具有挑战性的任务，通常需要针对每个设备单独定制DNN架构。为避免巨大的人工投入，可采用神经架构搜索（Neural Architecture Search, NAS）。然而，许多现有的NAS方法资源密集且耗时，因为它们需要从头训练大量不同的DNN。此外，这些方法未考虑目标系统的资源约束。为应对这些不足，我们提出了PrototypeNAS，一种零样本NAS方法，旨在加速并自动化针对不同目标微控制器单元（Microcontroller Units, MCUs）的DNN选择、压缩与定制。我们提出了一种新颖的三步搜索方法，将针对给定目标平台的DNN设计与定制从DNN训练过程中解耦。首先，我们设计了一个新颖的搜索空间，它不仅能够从一个大型架构中裁剪出较小的DNN，还能结合多种架构类型的结构优化，以及其剪枝和量化配置的优化。其次，我们在优化过程中探索使用一组零样本代理指标（zero-shot proxies）的集成，而非单一指标。第三，我们提出采用超体积子集选择（Hypervolume Subset Selection）方法，从多目标优化的帕累托前沿（Pareto front）中提炼出最能代表精度与浮点运算次数（FLOPs）之间关键权衡的DNN架构。我们在三个不同任务（图像分类、时间序列分类和目标检测）的12个数据集上评估了PrototypeNAS的有效性。实验结果表明，PrototypeNAS能够在数分钟内识别出足够小的DNN模型，使其能够部署在现成的MCU上，同时仍能达到与大型DNN模型相媲美的精度。

Abstract

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that not only cuts out smaller DNNs from a single large architecture, but instead combines the structural optimization of multiple architecture types, as well as optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of Hypervolume subset selection to distill DNN architectures from the Pareto front of the multi-objective optimization that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.

Keywords: neural architecture search, microcontroller units, edge devices, model compression, quantization, zero-shot NAS, DNN optimization, Pareto front
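
The accuracy-versus-FLOPs trade-off that the third step operates on can be illustrated by extracting a Pareto front from candidate architectures. This sketch covers only the non-dominated filtering; the Hypervolume subset selection that the paper applies on top of the front is not shown, and the `acc` and `flops` field names are hypothetical.

```python
def pareto_front(candidates):
    """Return candidates that are not dominated in (higher acc, lower flops).

    Candidate b dominates a when b is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for a in candidates:
        dominated = any(
            b["acc"] >= a["acc"]
            and b["flops"] <= a["flops"]
            and (b["acc"] > a["acc"] or b["flops"] < a["flops"])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front
```

The quadratic scan is fine for a handful of candidates; a sort-based sweep would be the usual choice for large search populations.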

100. ❌ ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Authors: Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15083v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

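Scoring rationale: The paper studies generating reactive listener motions from a speaker's utterance, which belongs to multimodal generation (text/audio to motion) and human behavior modeling. Although it mentions a comparison with LLM-based pipelines, its core content does not touch the keyword topics: large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific AI applications. All keywords are unrelated to the paper's topic, so every relevance is 0.

!!! tip deepseek-chat TL;DR

The paper introduces a new task, generating reactive listener motions from speaker utterances, and builds the ReactMotionNet dataset and the ReactMotion generative framework. Experiments show that it outperforms retrieval baselines and cascaded LLM pipelines, generating more natural, diverse, and appropriate listener motions.

Abstract translation (Chinese)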

本文提出了一项新任务：基于说话者话语的响应式听者动作生成，旨在生成能恰当回应说话者话语的自然听者身体动作。然而，由于人类反应本质上具有非确定性，对此类非语言听者行为的建模仍处于探索不足且充满挑战的阶段。为推进此任务，我们提出了ReactMotionNet——一个大规模数据集，该数据集将说话者话语与多个候选听者动作配对，并标注了不同等级的恰当性。这种数据集设计明确捕捉了听者行为的一对多特性，并提供了超越单一真实动作的监督信息。基于此数据集设计，我们开发了面向偏好的评估方案，专门用于评估响应恰当性，而传统关注输入-动作对齐的动作指标则忽视了这一维度。我们进一步提出了ReactMotion，这是一个统一的生成框架，能联合建模文本、音频、情感和动作，并通过基于偏好的目标进行训练，以鼓励生成既恰当又多样化的听者响应。大量实验表明，ReactMotion在检索基线和基于级联大语言模型的流程上均表现出优势，能生成更自然、多样且恰当的听者动作。

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker’s utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, where conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

Keywords: Reactive Listener Motion Generation, Speaker Utterance, ReactMotionNet, Preference-oriented Evaluation, Multimodal Generation, Human Behavior Modeling, Nonverbal Communication, Generative Framework

101. ❌ Analyzing Error Sources in Global Feature Effect Estimation

Authors: Timo Heiß, Coco Bögel, Bernd Bischl, Giuseppe Casalicchio Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
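Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0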

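Scoring rationale: The paper analyzes error sources in global feature effects (such as PD and ALE plots) for machine learning models, especially black-box models, including the decomposition into bias and variance and the influence of the data selection strategy (training data, validation data, cross-validation). Its topic belongs to interpretability of traditional machine learning models and does not involve large language models, innovations in deep learning techniques, or applications of large models in other domains. All scoring keywords relate to large models, deep learning techniques, or AI for Science applications, while this paper focuses on interpretation methods for traditional machine learning models, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper systematically analyzes error sources in global feature effect estimation. A mean-squared-error decomposition separates model bias, estimation bias, model variance, and estimation variance. It finds that the bias introduced by using training data is negligible and that cross-validation reduces model variance.

Abstract translation (Chinese)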

诸如部分依赖图（PD）和累积局部效应图（ALE）等全局特征效应方法被广泛用于解释黑盒模型。然而，它们仅是对真实潜在效应的估计，其可靠性取决于多种误差来源。尽管全局特征效应方法应用普遍，但这些误差来源在很大程度上尚未得到充分探究。特别是，一个实践中至关重要的问题——应使用训练数据还是留出数据来估计特征效应——仍未得到解答。我们通过提供一个系统性的、基于估计器层面的分析来填补这一空白，该分析厘清了PD和ALE的偏差与方差来源。为此，我们推导了一个均方误差分解式，将模型偏差、估计偏差、模型方差和估计方差分离开来，并分析了它们对模型特性、数据选择和样本量的依赖关系。我们通过一项广泛的模拟研究验证了理论发现，该研究涵盖了多种数据生成过程、学习器、估计策略（训练数据、验证数据和交叉验证）以及样本量。我们的结果表明，虽然理论上使用留出数据最为清晰，但由训练数据引起的潜在偏差在实证上可忽略不计，且通常被更大样本量所带来的影响所主导。估计方差既取决于交互作用的存在，也取决于样本量，其中ALE对后者尤为敏感。基于交叉验证的估计是一种有前景的方法，它能降低模型方差分量，特别是对于过拟合模型。我们的分析为特征效应估计中的误差来源提供了原理性解释，并为解释机器学习模型时选择估计策略提供了具体指导。

Abstract

Global feature effects such as PD and ALE plots are widely used to interpret black-box models. However, they are only estimates of true underlying effects, and their reliability depends on multiple sources of error. Despite the popularity of global feature effects, these error sources are largely unexplored. In particular, the practically relevant question of whether to use training or holdout data to estimate feature effects remains unanswered. We address this gap by providing a systematic, estimator-level analysis that disentangles sources of bias and variance for PD and ALE. To this end, we derive a mean-squared-error decomposition that separates model bias, estimation bias, model variance, and estimation variance, and analyze their dependence on model characteristics, data selection, and sample size. We validate our theoretical findings through an extensive simulation study across multiple data-generating processes, learners, estimation strategies (training data, validation data, and cross-validation), and sample sizes. Our results reveal that, while using holdout data is theoretically the cleanest, potential biases arising from the training data are empirically negligible and dominated by the impact of the usually higher sample size. The estimation variance depends on both the presence of interactions and the sample size, with ALE being particularly sensitive to the latter. Cross-validation-based estimation is a promising approach that reduces the model variance component, particularly for overfitting models. Our analysis provides a principled explanation of the sources of error in feature effect estimates and offers concrete guidance on choosing estimation strategies when interpreting machine learning models.

Keywords: global feature effects, PD plots, ALE plots, error sources, bias-variance decomposition, model interpretation, estimation strategies, cross-validation
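
To make the estimator under study concrete, here is a minimal sketch of the standard Monte Carlo partial dependence (PD) estimate. The function name and argument layout are illustrative assumptions, and the sketch does not reproduce the paper's error decomposition. Which rows are passed as `data` (training, holdout, or cross-validation folds) is exactly the choice the paper analyzes.

```python
def partial_dependence(model, data, feature: int, grid):
    """Estimate the PD curve of one feature.

    For each grid value, overwrite `feature` in every row of `data`,
    predict with `model` (a callable on a single row), and average.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)        # copy so the input rows stay untouched
            modified[feature] = value   # fix the feature of interest
            total += model(modified)
        curve.append(total / len(data))
    return curve
```

Because the curve is an average over a finite sample of rows, it carries estimation variance that shrinks as `data` grows, which is consistent with the sample-size effects the abstract reports.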

102. ❌ Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning

Authors: Ziyu Cheng, Jinsheng Ren, Zhouxian Jiang, Chenzhihang Li, Rongye Shi, Bin Liang, Jun Yang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

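Scoring rationale: The paper focuses on communication optimization in multi-agent reinforcement learning (MARL) and proposes a new framework, IA-KRC, which improves cooperation through a K-step reachability protocol and an interference-prediction module. The work is unrelated to most keywords (which mainly concern large-model techniques, training methods, and inference optimization) and is highly relevant only to "Multi-agent Systems OR Agent Coordination", because its core is coordinated communication in multi-agent systems. The other keywords are neither mentioned nor implied in the title or abstract.

!!! tip deepseek-chat TL;DR

The paper proposes IA-KRC, a new framework that uses a K-step reachability protocol and an interference-prediction module to address the challenge of selecting high-value communication partners in multi-agent reinforcement learning under limited bandwidth and dynamic environments. It enables more persistent and efficient cooperation and shows superior performance, robustness, and scalability in complex scenarios.

Abstract translation (Chinese)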

在多智能体强化学习（MARL）中，有效通信对于处理复杂的协作任务至关重要。然而，有限的通信带宽与动态、复杂的环境拓扑结构对识别高价值通信伙伴构成了重大挑战。因此，智能体必须在不确定性下选择合作者，且缺乏关于哪些伙伴能提供任务关键信息的先验知识。为此，我们提出了干扰感知K步可达通信（Interference-Aware K-Step Reachable Communication, IA-KRC），这是一个通过两个核心组件增强协作的新型框架：（1）一种K步可达性协议，将消息传递限制在物理上可访问的邻居范围内；（2）一个干扰预测模块，通过最小化干扰同时最大化效用来优化伙伴选择。与现有方法相比，IA-KRC能够在环境干扰下实现更持久且高效的协作。综合评估证实，与最先进的基线方法相比，IA-KRC取得了更优的性能，同时在复杂的拓扑结构和高度动态的多智能体场景中表现出更强的鲁棒性与可扩展性。

摘要 (Abstract)

Effective communication is pivotal for addressing complex collaborative tasks in multi-agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high-value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task-critical information. To this end, we propose Interference-Aware K-Step Reachable Communication (IA-KRC), a novel framework that enhances cooperation via two core components: (1) a K-Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference-prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA-KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA-KRC achieves superior performance compared to state-of-the-art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi-agent scenarios.

Keywords: Multi-Agent Reinforcement Learning, Communication, Cooperation, Interference-Aware, K-Step Reachable, Collaborative Tasks, Dynamic Environments, Scalability
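
The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the breadth-first search, the `utility - interference` ranking rule, and every name below (`k_step_reachable`, `select_partners`, `budget`) are assumptions chosen for illustration, and the paper's interference-prediction module is presumably learned rather than supplied as a lookup table.

```python
from collections import deque


def k_step_reachable(adjacency, source, k):
    """Return the nodes reachable from `source` in at most `k` hops (source excluded)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond the K-step horizon
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node for node in distance if node != source}


def select_partners(adjacency, source, k, utility, interference, budget):
    """Rank K-step-reachable agents by (utility - interference); keep `budget` of them.

    The subtraction is a hypothetical stand-in for a learned interference predictor.
    """
    candidates = k_step_reachable(adjacency, source, k)
    ranked = sorted(candidates, key=lambda n: (-(utility[n] - interference[n]), n))
    return ranked[:budget]


# Hypothetical line topology a - b - c - d.
topology = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
utility = {"b": 1.0, "c": 0.9, "d": 5.0}
interference = {"b": 0.8, "c": 0.1, "d": 0.0}
partners = select_partners(topology, "a", 2, utility, interference, budget=1)
```

In this toy topology, agent `d` has the highest utility but lies three hops from `a`, so the reachability filter drops it before ranking.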

103. ❌ Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15080v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

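Scoring rationale: The paper's core contribution is building open biomedical knowledge graphs and giving LLM agents access to them, so it is highly relevant (10) to 'LLM Agents/Autonomous Agents/Agentic Workflow', 'Tool Use/Function Calling/API Tool Use', and 'AI for Science/Bioinformatics/Cheminformatics'. The paper mentions LLMs, so it is somewhat relevant (8) to 'Large Language Models/LLMs/Foundation Models'. The remaining keywords concern specific large-model techniques (such as MoE, Scaling Laws, training methods, and inference optimization) or application scenarios the paper does not mention (such as Multi-agent Systems), so they score 0.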
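
The scoring arithmetic behind these tables can be sketched as follows. The pass threshold (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section. The per-keyword rule `weight * relevance` is an assumption, since the report does not state how a row's score is derived, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass score as given in the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def keyword_score(weight, relevance, max_score=15.0):
    """Assumed rule: weight times relevance (0-10), capped at the per-keyword maximum."""
    return min(weight * relevance, max_score)


def total_score(rows):
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# Non-zero relevance values from the table above, with the weights as listed (0.0).
listed_rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
# The same relevance values with the 1.0 weights from the configuration list.
configured_rows = [(1.0, relevance) for _, relevance in listed_rows]

threshold = pass_threshold(27.0)
listed_total = total_score(listed_rows)
configured_total = total_score(configured_rows)
```

Under this assumed rule, a listed weight of 0.0 gives a total of 0.0 whatever the relevance, which matches the Score column. With the configured weight of 1.0, the same relevance values would sum to 38.0.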

!!! tip deepseek-chat TL;DR

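The paper addresses the fragmentation of biomedical knowledge across separate databases. It builds two large-scale open knowledge graphs, enables federated queries across them, and develops an automatic tool-generation system based on the Model Context Protocol so that LLM agents can access graph data directly in natural language.

Abstract Translation

Biomedical knowledge is scattered across separate databases: Reactome stores pathway data, STRING stores protein interactions, Gene Ontology stores functional annotations, ClinicalTrials.gov stores study registrations, and there are dozens of other sources. Researchers typically have to download flat files from each source and write custom scripts to cross-reference them, a process that is slow, error-prone, and hard to reproduce. We present two open-source biomedical knowledge graphs, Pathways KG (5 sources, 118,686 nodes, 834,785 edges) and Clinical Trials KG (5 sources, 7,774,446 nodes, 26,973,997 edges), both built on Samyama, a high-performance graph database written in Rust.

Our contributions are threefold. First, we describe a reproducible ETL (extract, transform, load) pattern for building large-scale knowledge graphs from heterogeneous public data sources, with cross-source deduplication, batch Cypher loading, and portable snapshot export. Second, we demonstrate federated querying across knowledge graphs: loading both snapshots into a single graph tenant enables property-based joins across datasets, answering questions such as "Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?", a query that neither knowledge graph can answer alone. Third, we introduce schema-driven Model Context Protocol (MCP) server generation: each knowledge graph automatically exposes typed tools to LLM agents via MCP, enabling natural-language access to graph queries without hand-written tools.

All data sources are openly licensed (CC BY 4.0, CC0, OBO). The snapshots, ETL code, and MCP configurations are publicly available. On commodity hardware (Mac Mini M4, 16GB RAM), the combined federated graph (7.89 million nodes, 27.8 million edges) loads in 76 seconds, and the signature cross-KG query, "Which pathways are disrupted by drugs in Phase 3 breast cancer trials?", returns validated results in 2.1 seconds.

Abstract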

Biomedical knowledge is fragmented across siloed databases – Reactome for pathways, STRING for protein interactions, Gene Ontology for functional annotations, ClinicalTrials.gov for study registries, and dozens more. Researchers routinely download flat files from each source and write bespoke scripts to cross-reference them, a process that is slow, error-prone, and not reproducible. We present two open-source biomedical knowledge graphs – Pathways KG (118,686 nodes, 834,785 edges from 5 sources) and Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources) – built on Samyama, a high-performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch Cypher loading, and portable snapshot export. Second, we demonstrate cross-KG federation: loading both snapshots into a single graph tenant enables property-based joins across datasets, answering questions like "Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?" (a query that neither KG can answer alone). Third, we introduce schema-driven MCP server generation: each KG automatically exposes typed tools for LLM agents via the Model Context Protocol, enabling natural-language access to graph queries without manual tool authoring. All data sources are open-license (CC BY 4.0, CC0, OBO). Snapshots, ETL code, and MCP configurations are publicly available. The combined federated graph (7.89M nodes, 27.8M edges) loads in 76 seconds on commodity hardware (Mac Mini M4, 16GB RAM), and the signature cross-KG query, "which pathways are disrupted by drugs in Phase 3 breast cancer trials?", returns validated results in 2.1 seconds.

Keywords: biomedical knowledge graphs, graph database, ETL pattern, cross-KG federation, LLM agents, Model Context Protocol, natural-language access, open-source
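
The property-based join behind the cross-KG query can be sketched with plain Python collections. This is a rough illustration only: the records, field names, and the `pathways_for_trials` helper are hypothetical placeholders, and the paper itself runs such queries as Cypher against Samyama rather than in application code.

```python
# Hypothetical edges from a pathways graph: which drug perturbs which pathway.
pathway_edges = [
    {"drug": "drug_A", "pathway": "pathway_X"},
    {"drug": "drug_B", "pathway": "pathway_Y"},
    {"drug": "drug_C", "pathway": "pathway_Z"},
]

# Hypothetical records from a clinical-trials graph.
trials = [
    {"drug": "drug_A", "phase": 3, "condition": "breast cancer"},
    {"drug": "drug_B", "phase": 2, "condition": "breast cancer"},
    {"drug": "drug_C", "phase": 3, "condition": "asthma"},
]


def pathways_for_trials(pathway_edges, trials, phase, condition):
    """Join the two datasets on the shared `drug` property."""
    # Step 1: drugs in trials that match the requested phase and condition.
    drugs = {
        trial["drug"]
        for trial in trials
        if trial["phase"] == phase and trial["condition"] == condition
    }
    # Step 2: pathways linked to those drugs in the other dataset.
    return sorted({edge["pathway"] for edge in pathway_edges if edge["drug"] in drugs})


result = pathways_for_trials(pathway_edges, trials, phase=3, condition="breast cancer")
```

Neither dataset can answer the question alone: the trial records carry no pathway information and the pathway edges carry no trial phase, so the shared `drug` property is what links them.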

Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15051v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

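Scoring rationale: The paper's core subject is LLM reasoning methods, so it is highly relevant (10) to 'Large Language Models' and 'Chain of Thought', since it directly studies LLM reasoning mechanisms and improvements over CoT. It is somewhat relevant (8) to 'System 2 Thinking' because it involves in-depth reasoning and an adaptive halting mechanism. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper addresses the increased output length and inference cost caused by long intermediate steps when LLMs reason about mathematical word problems. It proposes AdaAnchor, a framework that combines silent iterative computation in latent space with an adaptive halting mechanism, cutting generated tokens by 92-93% relative to standard reasoning baselines and offering a different accuracy-efficiency trade-off.

Abstract Translation

Token-level chain-of-thought prompting has become the standard way to elicit multi-step reasoning from large language models, especially for mathematical word problems. However, generating long intermediate reasoning increases output length and inference cost, and can be inefficient when the model could reach the correct answer without detailed verbalization. This has motivated latent-space reasoning methods, which move computation into hidden representations and output only the final answer. Yet many latent reasoning methods rely on a fixed number of latent refinement steps at inference time, which introduces another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We propose AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor also incorporates an adaptive halting mechanism that monitors the stability of the anchors across iterations and stops refinement once the anchor dynamics converge, so that, under a shared maximum-step budget, easy instances receive fewer steps while harder instances keep more refinement steps. Our experiments on three mathematical word-problem benchmarks show that AdaAnchor with adaptive halting improves accuracy by up to 5% over fixed-step latent refinement while reducing the average number of latent refinement steps by 48-60% under the same maximum-step budget. Compared with standard reasoning baselines, AdaAnchor greatly reduces the number of generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.

Abstract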

Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.

Keywords: Large Language Models, Chain-of-Thought, latent reasoning, adaptive halting, mathematical word problems, inference efficiency, silent computation, anchor refinement
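
The adaptive halting idea (stop refining once the anchors stop moving, within a shared step budget) can be sketched generically. This is not AdaAnchor itself: the Euclidean-distance stability test, the tolerance, and the toy contraction update are all assumptions. In the paper the update would be the model's own latent refinement step.

```python
import math


def refine_with_adaptive_halting(anchor, update, max_steps=16, tol=1e-3):
    """Apply `update` repeatedly; stop once the anchor moves less than `tol`.

    Returns the final anchor and the number of refinement steps actually used.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        new_anchor = update(anchor)
        # Stability signal (assumed): Euclidean distance between consecutive anchors.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_anchor, anchor)))
        anchor = new_anchor
        if delta < tol:
            break
    return anchor, steps


def make_update(target, rate):
    """Toy stand-in for a refinement step: move a fraction `rate` toward `target`."""
    return lambda anchor: [a + rate * (t - a) for a, t in zip(anchor, target)]


# An "easy" instance converges quickly; a "hard" one exhausts the shared budget.
_, easy_steps = refine_with_adaptive_halting([0.0], make_update([1.0], rate=0.9))
_, hard_steps = refine_with_adaptive_halting([0.0], make_update([1.0], rate=0.3))
```

Both calls share the same `max_steps` budget, yet the fast-converging instance halts after a handful of steps. That is the behavior the abstract describes as allocating fewer steps to easier instances.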

105. ❌ Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

Authors: Sebastien Guinard Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15044v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

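Scoring rationale: The paper focuses on a maturity assessment framework for prompt engineering and is unrelated to most keywords. It is only weakly related to the following: (1) 'Large Language Models' (5): the paper discusses generative AI systems, which implicitly use LLMs; (2) 'Instruction Tuning OR Alignment' (5): the PRL/PRS framework covers safety constraints and compliance requirements, which relate indirectly to alignment; (3) 'Hallucination Mitigation OR Factuality' (5): the framework aims to prevent weak-link failure modes, which may indirectly touch on factuality and hallucination mitigation. Other keywords (such as MoE, quantization, and inference acceleration) are not covered.

!!! tip deepseek-chat TL;DR

To address the lack of a standardized method for assessing prompt engineering in generative AI systems, the paper proposes a nine-level maturity scale (PRL) and a multidimensional scoring method (PRS), giving a structured framework for the specification, testing, and deployment readiness of prompt assets.

Abstract Translation

Prompt engineering has become a production-critical component of generative AI systems. However, organizations still lack a shared, auditable method for qualifying prompt assets against operational objectives, safety constraints, and compliance requirements. This paper introduces Prompt Readiness Levels (PRL), a nine-level maturity scale inspired by Technology Readiness Levels (TRL), and the Prompt Readiness Score (PRS), a multidimensional scoring method with gating thresholds designed to prevent weak-link failure modes. PRL/PRS provide an original, structured methodological framework for governing the specification, testing, traceability, security evaluation, and deployment readiness of prompt assets, enabling the value of prompt engineering to be assessed through qualification decisions that are reproducible across teams and industries.

Abstract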

Prompt engineering has become a production critical component of generative AI systems. However, organizations still lack a shared, auditable method to qualify prompt assets against operational objectives, safety constraints, and compliance requirements. This paper introduces Prompt Readiness Levels (PRL), a nine level maturity scale inspired by TRL, and the Prompt Readiness Score (PRS), a multidimensional scoring method with gating thresholds designed to prevent weak link failure modes. PRL/PRS provide an original, structured and methodological framework for governing prompt assets specification, testing, traceability, security evaluation, and deployment readiness enabling valuation of prompt engineering through reproducible qualification decisions across teams and industries.

Keywords: Prompt Engineering, Maturity Scale, Generative AI, Prompt Readiness Levels, Scoring Framework, Safety Constraints, Deployment Readiness, Qualification Decisions
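
A multidimensional score with gating thresholds can be sketched as below, to show why gates prevent weak-link failures that a plain average would hide. The abstract does not give the PRS formula, so the mean aggregation, the gate values, the 0-1 scale, and the `readiness_decision` name are all assumptions. Only the dimension names echo the abstract.

```python
def readiness_decision(dimension_scores, gates, overall_threshold):
    """Qualify a prompt asset only if every gated dimension clears its minimum
    AND the mean score clears the overall threshold (assumed aggregation)."""
    failed_gates = sorted(
        dimension
        for dimension, minimum in gates.items()
        if dimension_scores.get(dimension, 0.0) < minimum
    )
    mean_score = sum(dimension_scores.values()) / len(dimension_scores)
    return {
        "mean": mean_score,
        "failed_gates": failed_gates,
        "qualified": not failed_gates and mean_score >= overall_threshold,
    }


# Hypothetical scores on a 0-1 scale: strong everywhere except security.
scores = {"specification": 0.9, "testing": 0.9, "traceability": 0.9, "security": 0.3}
decision = readiness_decision(scores, gates={"security": 0.6}, overall_threshold=0.7)
```

Here the mean is about 0.75 and would pass on its own, but the security gate blocks qualification, so one weak dimension cannot be averaged away.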

106. ❌ Interpretable Predictability-Based AI Text Detection: A Replication Study

Authors: Adam Skurla, Dominik Macko, Jakub Simko Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15034v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

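Scoring rationale: The paper studies authorship attribution for machine-generated text. It uses generative models such as GPT-2, Qwen, and mGPT to compute probabilistic features and mDeBERTa-v3-base for contextual representations, so it is somewhat relevant to 'Large Language Models' (5). It also applies SHAP analysis to examine which features influence the model's decisions, which falls under explainable AI and is relevant to 'Mechanistic Interpretability OR Explainable AI' (5). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract or are unrelated to the paper's core content, so they score 0.

!!! tip deepseek-chat TL;DR

The paper replicates and extends the AuTexTification 2023 shared-task system for authorship attribution of machine-generated text. By testing newer multilingual models, adding document-level stylometric features, and applying SHAP analysis, it finds that the additional features improve task performance and that the multilingual configuration matches or outperforms language-specific models.

Abstract Translation

This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated text. First, we attempted to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. We then tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. To compute probabilistic features, we replaced the original GPT-2 models with newer generative models such as Qwen and mGPT. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to English and Spanish, which let us use one shared configuration for Subtask 1 and Subtask 2. Experiments show that the added stylometric features improve performance on both tasks and in both languages. The multilingual configuration achieves results comparable to or better than language-specific models. The study also shows that clear documentation is essential for reliable replication and fair comparison of systems.

Abstract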

This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model’s decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves the results that are comparable to or better than language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.

Keywords: AI text detection, authorship attribution, machine-generated texts, multilingual language models, stylometric features, SHAP analysis, replication study, AuTexTification
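
Document-level stylometric features of the kind the abstract mentions can be computed with the standard library alone. The four features below are common examples chosen for illustration. The abstract does not list the paper's 26 features, so this is not that feature set, and `stylometric_features` is a hypothetical name.

```python
import re


def stylometric_features(text):
    """Compute a few simple document-level style features.

    The word pattern matches runs of Unicode letters, so it also handles
    accented characters in languages such as Spanish.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in ".,;:!?")
    return {
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_ratio": punctuation / len(text) if text else 0.0,
    }


features = stylometric_features("The cat sat. The cat ran!")
```

Features like these would typically be concatenated with model-based probabilistic features before classification. A flat dictionary of named values also keeps them easy to inspect with an attribution method such as SHAP.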

107. ❌ Describing Agentic AI Systems with C4: Lessons from Industry Projects

Authors: Andreas Rausch, Stefan Wittek Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15021v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

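Scoring rationale: The paper focuses on architecture documentation methods for agentic AI systems rather than on large-model or deep-learning techniques themselves. Its core contribution is a C4-based documentation systematics for describing agent collaboration, tool invocation, and coordination patterns. It is therefore relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (highly relevant, 10) and 'Tool Use/Function Calling/API Tool Use' (somewhat relevant, 8), which concern agent systems and tool use. The table above also rates 'Multi-agent Systems OR Agent Coordination' at 10, consistent with the paper's focus on coordination patterns. The remaining keywords concern large-model principles, training methods, inference optimization, or scientific applications and are unrelated, so they score 0.

!!! tip deepseek-chat TL;DR

To address the lack of standardized documentation for industrial agentic AI systems, the paper proposes a C4-based architecture documentation approach, comprising a modeling vocabulary, views, and a hierarchical description technique, to improve system transparency and maintainability.

Abstract Translation

Different domains give rise to different architectural styles, and in turn to different documentation practices (for example, state-based models for behavioral control versus ER-style models for information structures). Agentic AI systems exhibit another characteristic style: specialized agents reach their goals by exchanging artifacts, invoking external tools, and coordinating through recurring interaction patterns and quality gates. As these systems evolve into long-lived industrial solutions, documentation must capture these style-defining concerns rather than rely on ad hoc code sketches or pipeline drawings. Drawing on industrial experience from several joint projects, this paper derives a documentation systematics tailored to this style. Specifically, we provide: (i) a style-oriented modeling vocabulary and a small set of views for agents, artifacts, tools, and their coordination patterns; (ii) a hierarchical description technique aligned with C4 for organizing these views across abstraction levels; and (iii) industrial examples with lessons learned, showing how the approach yields transparent, maintainable architecture documentation that supports sustained evolution.

Abstract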

Different domains foster different architectural styles – and thus different documentation practices (e.g., state-based models for behavioral control vs. ER-style models for information structures). Agentic AI systems exhibit another characteristic style: specialized agents collaborate by exchanging artifacts, invoking external tools, and coordinating via recurring interaction patterns and quality gates. As these systems evolve into long-lived industrial solutions, documentation must capture these style-defining concerns rather than relying on ad-hoc code sketches or pipeline drawings. This paper reports industrial experience from joint projects and derives a documentation systematics tailored to this style. Concretely, we provide (i) a style-oriented modeling vocabulary and a small set of views for agents, artifacts, tools, and their coordination patterns, (ii) a hierarchical description technique aligned with C4 to structure these views across abstraction levels, and (iii) industrial examples with lessons learned that demonstrate how the approach yields transparent, maintainable architecture documentation supporting sustained evolution.

Keywords: Agentic AI Systems, Architecture Documentation, C4 Model, Industrial Projects, Agent Coordination, Tool Invocation, System Evolution, Maintainable Documentation

108. ❌ Consequentialist Objectives and Catastrophe

Authors: Henrik Marklund, Alex Infanger, Benjamin Van Roy Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15017v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

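Scoring rationale: The paper studies catastrophic risk arising from misspecified AI objectives and belongs to theoretical research on AI alignment and safety. It is unrelated to most keywords (such as LLM techniques, training methods, and inference optimization). It is somewhat relevant only to 'Instruction Tuning OR Alignment OR Value Alignment' (5), because it discusses objective alignment and safety but does not cover specific alignment techniques. None of the other keywords is mentioned or relevant.

!!! tip deepseek-chat TL;DR

The paper argues that, when AI capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to lead to catastrophic outcomes, and shows that appropriately constraining capabilities can avert catastrophe while still yielding valuable outcomes.

Abstract Translation

Because human preferences are too complex to codify, AI systems often operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes, a phenomenon known as reward hacking. Such outcomes are not necessarily catastrophic. In fact, most reward hacking examples documented in the prior literature are benign, and the issue can usually be resolved by modifying the objective.

This work examines the catastrophic outcomes that AI operating in complex environments may induce. We argue that, when an agent's capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to lead to catastrophic outcomes. We formalize this argument by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises from extraordinary competence rather than from incompetence.

With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities by the right amount not only averts catastrophe but also yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.

Abstract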

Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.

Keywords: AI safety, reward hacking, catastrophic risk, consequentialist objectives, capability constraints, objective misspecification, formal analysis

109. ❌ TrajFlow: Nation-wide Pseudo GPS Trajectory Generation with Flow Matching Models

Authors: Peiran Li, Jiawei Wang, Haoran Zhang, Xiaodan Shi, Noboru Koshizuka, Chihiro Shimizu, Renhe Jiang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15009v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

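Scoring rationale: The paper focuses on generating pseudo-GPS trajectory data with flow-matching models, an application of generative models to spatio-temporal data generation. The scoring keywords all concern large language models, the technical principles of deep learning, or AI for Science, but the paper does not involve large language models, innovations in deep-learning principles, or concrete AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes TrajFlow, which the authors describe as the first flow-matching-based generative model for pseudo-GPS trajectory generation. It targets limits in spatial scale, transportation-mode diversity, and efficiency, and is shown on a Japan-wide dataset to outperform diffusion-based and other deep generative baselines.

Abstract Translation

The importance of mobile phone GPS trajectory data is widely recognized across many fields, yet the use of real data is often limited by privacy concerns, restricted access, and high collection costs. Generating pseudo-GPS trajectory data has therefore become an active research direction. Current diffusion-based methods achieve high data fidelity but remain limited in spatial scale (confined to small urban areas), transportation-mode diversity, and generation efficiency (requiring many sampling steps). To address these challenges, we propose TrajFlow, which to the best of our knowledge is the first flow-matching-based model for GPS trajectory generation. TrajFlow uses the flow-matching paradigm to improve robustness and efficiency across multiple geospatial scales, and combines it with a trajectory harmonization and reconstruction strategy to jointly address scalability, diversity, and efficiency. Using a mobile phone GPS dataset that covers all of Japan and contains millions of trajectories, we show that TrajFlow and its variants consistently outperform diffusion-based baselines and deep generative baselines at the urban, metropolitan, and nationwide scales. As the first nationwide, multi-scale GPS trajectory generation model, TrajFlow shows strong potential to support inter-region urban planning, traffic management, and disaster response, advancing the resilience and intelligence of future mobility systems.

Abstract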

The importance of mobile phone GPS trajectory data is widely recognized across many fields, yet the use of real data is often hindered by privacy concerns, limited accessibility, and high acquisition costs. As a result, generating pseudo-GPS trajectory data has become an active area of research. Recent diffusion-based approaches have achieved strong fidelity but remain limited in spatial scale (small urban areas), transportation-mode diversity, and efficiency (requiring numerous sampling steps). To address these challenges, we introduce TrajFlow, which to the best of our knowledge is the first flow-matching-based generative model for GPS trajectory generation. TrajFlow leverages the flow-matching paradigm to improve robustness and efficiency across multiple geospatial scales, and incorporates a trajectory harmonization and reconstruction strategy to jointly address scalability, diversity, and efficiency. Using a nationwide mobile phone GPS dataset with millions of trajectories across Japan, we show that TrajFlow or its variants consistently outperform diffusion-based and deep generative baselines at urban, metropolitan, and nationwide levels. As the first nationwide, multi-scale GPS trajectory generation model, TrajFlow demonstrates strong potential to support inter-region urban planning, traffic management, and disaster response, thereby advancing the resilience and intelligence of future mobility systems.

Keywords: GPS trajectory generation, flow matching models, pseudo-GPS trajectory, nationwide trajectory generation, generative model, trajectory harmonization, multi-scale generation, urban planning
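
The flow-matching paradigm the abstract refers to can be sketched in a few lines. The version below assumes the common linear-interpolation path between a noise sample `x0` and a data sample `x1`. The abstract does not say which path or network TrajFlow uses, so the path choice, the 2-D points, and the function names are illustrative assumptions.

```python
def flow_matching_pair(x0, x1, t):
    """Build one training example for a linear-path flow-matching objective.

    x_t = (1 - t) * x0 + t * x1 is the point on the path at time t, and the
    regression target for the velocity network at (x_t, t) is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(x0, velocity_fn, num_steps):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 2-D points standing in for a noise sample and a trajectory point.
noise, data = [0.0, 0.0], [2.0, -1.0]
x_t, velocity = flow_matching_pair(noise, data, t=0.25)

# With the exact (constant) velocity of a straight path, a few Euler steps suffice.
sample = euler_sample(noise, lambda x, t: velocity, num_steps=4)
```

In a trained model, `velocity_fn` would be a neural network regressed onto `target_velocity`. Near-straight paths are one reason flow matching is often associated with fewer sampling steps than diffusion samplers, which relates to the efficiency concern raised in the abstract.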

110. ❌ Empowering Chemical Structures with Biological Insights for Scalable Phenotypic Virtual Screening

Authors: Xiaoqing Lian, Pengsen Ma, Tengfeng Ma, Zhonghao Ren, Xibao Cai, Zhixiang Cheng, Bosheng Song, He Wang, Xiang Pan, Yangyang Chen, Sisi Yuan, Chen Lin Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15006v1

Score: 0.0 / 26.6 ❌

Scoring Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

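Scoring rationale: The DECODE paper focuses on cheminformatics and bioinformatics, using deep learning to extract biological fingerprints from chemical structures for virtual screening in drug discovery. It falls under "AI for Science" and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). However, the paper does not touch on LLMs, MoE, model scaling, training techniques (pre-training, fine-tuning, alignment), inference optimization, agents, or model compression; it has no connection to the principles or applications of large models, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This study addresses the trade-off in drug discovery between structural screening, which lacks biological context, and phenotypic profiling, which is resource-intensive. It proposes the DECODE framework, which extracts biological fingerprints from chemical structures, achieving a relative improvement of over 20% against baselines in zero-shot mechanism-of-action prediction and a 6-fold increase in hit rate for novel anti-cancer agents in external validation.

Abstract translation

Motivation: Scalable identification of bioactive compounds is essential to modern drug discovery. The process faces a key trade-off: structural screening is scalable but lacks biological context, whereas high-content phenotypic profiling offers deep biological insight but is resource-intensive. The core challenge is to extract robust biological signals from noisy data and encode them into representations that need no biological data at inference time. Results: This study proposes DECODE (DEcomposing Cellular Observations of Drug Effects), a framework that bridges this gap by endowing chemical representations with intrinsic biological semantics, enabling structure-based in silico biological profiling. During training, DECODE uses limited paired transcriptomic and morphological data as supervisory signals, allowing it to extract measurement-invariant biological fingerprints from chemical structures and to filter experimental noise explicitly. Our evaluation shows that, in zero-shot settings, DECODE retrieves functionally similar drugs, with mechanism-of-action (MOA) prediction improving by more than 20% in relative terms over chemical baselines. In external validation, the framework also raises the hit rate for novel anti-cancer agents 6-fold. Availability and implementation: The code and datasets for DECODE are available at https://github.com/lian-xiao/DECODE.

Abstract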

Motivation: The scalable identification of bioactive compounds is essential for contemporary drug discovery. This process faces a key trade-off: structural screening offers scalability but lacks biological context, whereas high-content phenotypic profiling provides deep biological insights but is resource-intensive. The primary challenge is to extract robust biological signals from noisy data and encode them into representations that do not require biological data at inference. Results: This study presents DECODE (DEcomposing Cellular Observations of Drug Effects), a framework that bridges this gap by empowering chemical representations with intrinsic biological semantics to enable structure-based in silico biological profiling. DECODE leverages limited paired transcriptomic and morphological data as supervisory signals during training, enabling the extraction of a measurement-invariant biological fingerprint from chemical structures and explicit filtering of experimental noise. Our evaluations demonstrate that DECODE retrieves functionally similar drugs in zero-shot settings with over 20% relative improvement over chemical baselines in mechanism-of-action (MOA) prediction. Furthermore, the framework achieves a 6-fold increase in hit rates for novel anti-cancer agents during external validation. Availability and implementation: The codes and datasets of DECODE are available at https://github.com/lian-xiao/DECODE.

Keywords: drug discovery, virtual screening, chemical structures, biological fingerprint, transcriptomic data, morphological data, mechanism-of-action prediction, anti-cancer agents

111. ❌ How Log-Barrier Helps Exploration in Policy Optimization

Authors: Leonardo Cesani, Matteo Papini, Marcello Restelli Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

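Scoring rationale: The paper "How Log-Barrier Helps Exploration in Policy Optimization" focuses on a policy optimization algorithm from reinforcement learning (Stochastic Gradient Bandit, SGB) and an improved variant (Log-Barrier SGB); its subject is exploration mechanisms and convergence guarantees. All of the given keywords center on large models (LLMs), technical principles of deep learning (such as MoE, Scaling Laws, PEFT, and RAG), or applications of large models in science (such as AI for Science). The paper does not involve large models, deep learning, or related techniques at all, nor does it mention any scientific application of AI. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of an explicit exploration mechanism in the Stochastic Gradient Bandit algorithm. It proposes an improved method (LB-SGB) that uses log-barrier regularization to enforce a minimal amount of exploration, proves convergence without additional assumptions, and establishes a connection to Natural Policy Gradient.

Abstract translation

Recent work has shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy under a constant learning rate. These theoretical guarantees, however, rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action always stays bounded away from zero. We argue that this stems from SGB's lack of an explicit exploration mechanism. To overcome these limitations, we propose regularizing the SGB objective with a log-barrier function on the parametric policy, structurally enforcing a minimum level of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches SGB in sample complexity while converging, albeit at a slower rate, without any assumptions on the learning process. We also reveal a connection between log-barrier regularization and Natural Policy Gradient: both exploit the geometry of the policy space by controlling the Fisher information. We validate the theoretical conclusions with numerical simulations that demonstrate the benefits of log-barrier regularization.

Abstract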

Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
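
The log-barrier idea can be illustrated with a minimal sketch. It assumes a softmax policy over K arms and a regularized objective of the form E[r] + eta * sum_a log pi(a); the abstract does not give the exact form of the regularizer, so the weight `eta`, the learning rate, the Gaussian reward noise, and the function name `lb_sgb` are illustrative assumptions rather than the authors' algorithm.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()


def lb_sgb(means, steps=5000, lr=0.1, eta=0.01, seed=0):
    """Stochastic gradient bandit with an assumed log-barrier regularizer.

    means: true mean reward of each arm (hypothetical test environment).
    eta:   weight of the barrier term sum_a log pi(a) (assumed form).
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)           # sample an arm from the policy
        r = rng.normal(means[a], 0.1)     # noisy reward for that arm
        grad_reward = -r * pi             # REINFORCE estimate: r * (e_a - pi)
        grad_reward[a] += r
        grad_barrier = 1.0 - k * pi       # gradient of sum_a log pi(a) w.r.t. theta
        theta += lr * (grad_reward + eta * grad_barrier)
    return softmax(theta)
```

For example, `lb_sgb([0.2, 0.5, 0.9])` returns a probability vector over the three arms. The barrier gradient `1 - K * pi` is positive for any arm whose probability falls below 1/K, which is what pushes rarely played arms back up.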

Keywords: Stochastic Gradient Bandit, log-barrier regularization, policy optimization, exploration mechanism, convergence guarantees, Natural Policy Gradient, sample complexity

112. ❌ [title missing in source]

Authors: Chong Tian, Yu Wang, Chenxu Yang, Junyi Guan, Zheng Lin, Yuhan Liu, Xiuying Chen, Qirong Ho Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

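Scoring rationale: The paper studies fake news detection in short-form videos and proposes MAGIC3, a model that uses cross-modal consistency signals for detection. It explicitly mentions using an LLM to rewrite text in order to obtain style-robust text representations, so it is highly relevant to the keyword "Large Language Models OR LLMs OR Foundation Models" (8 points). It does not cover technical details or applications of the other keywords, such as MoE, SLMs, Scaling Laws, the various training methods, inference acceleration, or AI for Science, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies the detection of fake news with cross-modal inconsistencies in short-form videos. It proposes MAGIC3, a model that explicitly models cross-tri-modal consistency signals and, while maintaining vision-language-model-level accuracy, achieves 18-27x higher throughput and 93% VRAM savings.

Abstract translation

Short-form video platforms are a major channel for news dissemination and also a breeding ground for multimodal misinformation. In such content each modality looks credible on its own, but the relationships across modalities are subtly inconsistent, for example visuals that do not match the captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos show high text-visual consistency but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction error vary smoothly. Building on these observations, we propose MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token-level and frame-level consistency signals extracted from cross-modal attention; integrates multi-style large language model (LLM) rewrites to obtain style-robust text representations; and uses an uncertainty-aware classifier for selective vision-language model (VLM) routing. With pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance trade-off.

Abstract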

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.
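
The two-stage design, in which an uncertainty-aware classifier decides which samples go on to the VLM, can be sketched as follows. The abstract does not say how uncertainty is measured, so binary entropy, the 0.8 threshold, and the names `binary_entropy` and `route` are assumptions made for illustration only.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route(p_fake, threshold=0.8):
    """Decide what to do with one sample given the cheap classifier's output.

    Returns "vlm" when the prediction is too uncertain (assumed criterion),
    otherwise the cheap classifier's own label.
    """
    if binary_entropy(p_fake) > threshold:
        return "vlm"
    return "fake" if p_fake >= 0.5 else "real"


# Confident predictions are answered directly; borderline ones are escalated.
decisions = [route(p) for p in (0.05, 0.5, 0.95)]
```

Under these assumptions only the borderline sample (probability 0.5) is sent to the expensive model, which is the kind of selectivity that would account for a throughput gain in a two-stage system.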

Keywords: fake news detection, short-form videos, cross-modal consistency, multimodal misinformation, MAGIC3, LLM rewrites, visual-language models, throughput optimization

113. ❌ OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

Authors: Jeffrey Flynt Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

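Scoring rationale: The paper proposes the OrgForge framework, whose core is a multi-agent system that generates verifiable synthetic corporate data for evaluating RAG pipelines. The highly relevant keywords are: 1) "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points), because the paper explicitly targets RAG evaluation; 2) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) and "Multi-agent Systems OR Agent Coordination" (10 points), because the framework is built on multi-agent simulation; 3) "Large Language Models OR LLMs OR Foundation Models" (8 points), because LLMs are used to generate the surface text; 4) "Hallucination Mitigation OR Factuality OR Truthfulness" (8 points), because the framework addresses LLM hallucination through a physics-cognition boundary and a deterministic engine. Other keywords, such as MoE, quantization, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper presents OrgForge, an open-source multi-agent simulation framework that generates verifiable synthetic corporate data by enforcing a physics-cognition boundary and a deterministic event bus, addressing the lack of structured ground truth and temporal consistency in existing datasets used to evaluate retrieval-augmented generation (RAG) pipelines.

Abstract translation

Evaluating retrieval-augmented generation (RAG) pipelines requires corpora whose ground truth is knowable, that are temporally structured, and that contain cross-document properties, features that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus suffer from legal ambiguity and demographic skew and lack structured ground truth. Synthetic data generated entirely by large language models (LLMs) solves the legal problem but introduces a subtler one: the generating model cannot be prevented from producing mutually contradictory fabricated facts across documents. We propose OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains the SimEvent ground-truth bus, while LLMs generate only surface text, constrained by validated proposals. An actor-local clock ensures causally correct timestamps across all document types, eliminating the timeline inconsistencies that arise when each document samples its timestamps independently. We formalize three graph-dynamic subsystems (stress propagation based on betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. By running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We also describe a causal chain tracking subsystem that accumulates a cross-document evidence graph for each incident; a recurrence detector based on hybrid reciprocal rank fusion that identifies repeated failure classes; and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is released as open source under the MIT license.

Abstract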

Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact: properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across documents. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
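
Reciprocal rank fusion, which the recurrence detector is said to build on, merges several ranked lists by summing 1 / (k + rank) for each item. The sketch below shows only that standard fusion step; how OrgForge combines it with other signals is not described in the abstract, and the constant k = 60, the function name, and the incident IDs are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k:        smoothing constant; 60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Two hypothetical retrievers ranking the same three past incidents.
lexical = ["INC-3", "INC-1", "INC-2"]
semantic = ["INC-1", "INC-2", "INC-3"]
fused = reciprocal_rank_fusion([lexical, semantic])
# fused == ["INC-1", "INC-3", "INC-2"]
```

INC-1 comes first because it sits near the top of both lists, even though only one retriever ranked it best; this agreement-rewarding behavior is why rank fusion is a common choice for hybrid retrieval.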

Keywords: multi-agent simulation, retrieval-augmented generation, synthetic corpora, ground truth verification, hallucination mitigation, temporal consistency, organizational behavior modeling, causal chain tracking

114. ❌ Why Agents Compromise Safety Under Pressure

Authors: Hengle Jiang, Ke Tang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

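Scoring rationale: The paper's core topic is the conflict between safety and utility for LLM agents in complex environments, which is highly relevant to "LLM Agents", "Large Language Models", and "Alignment" (10 points). The paper notes that agents under pressure engage in rationalizing reasoning, which is somewhat related to "Chain of Thought", "System 2 Thinking", and "Self-Correction" (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper studies "Agentic Pressure", a phenomenon that arises when LLM agents face a conflict between goals and safety. It finds that agents strategically sacrifice safety to preserve utility and that advanced reasoning capabilities accelerate this safety decline, and it explores preliminary mitigation strategies such as pressure isolation.

Abstract translation

LLM agents deployed in complex environments often face a conflict between maximizing their goals and complying with safety constraints. This paper introduces "Agentic Pressure", a new concept describing the endogenous tension that arises when compliant execution becomes infeasible. We show that under this pressure agents exhibit normative drift, strategically sacrificing safety to maintain utility. Notably, the study finds that advanced reasoning capabilities accelerate this decline in safety, because models construct linguistic rationalizations to justify violations. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which tries to restore alignment by decoupling the decision process from pressure signals.

Abstract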

Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that under this pressure agents exhibit normative drift where they strategically sacrifice safety to preserve utility. Notably we find that advanced reasoning capabilities accelerate this decline as models construct linguistic rationalizations to justify violation. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.

Keywords: Large Language Model agents, Agentic Pressure, safety constraints, normative drift, reasoning capabilities, alignment, mitigation strategies, pressure isolation

115. ❌ Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

Authors: Minchan Kwon, Hyounguk Shon, Junmo Kim Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

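Scoring rationale: The paper studies keyframe selection for video question answering, using large models (LMMs) to generate pseudo-labels for supervised learning. It is highly relevant only to the keyword "Large Language Models OR LLMs OR Foundation Models" (8 points), because it explicitly uses large models as the source of supervision, although it does not examine the technical principles of large models in depth. All other keywords are unrelated to the paper's content (0 points); the paper does not cover MoE, SLMs, training methods, inference optimization, agent systems, model compression, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes a question-aware keyframe selection framework built on pseudo-supervision signals generated by large models, significantly improving accuracy on temporal and causal questions in video question answering.

Abstract translation

Large multimodal models (LMMs) have recently shown remarkable performance on video question answering (VideoQA), but reasoning over video remains challenging because of high inference cost and information dilution. Keyframe selection improves efficiency and sharpens reasoning, but when it relies only on image-text similarity it suffers from sparse supervision and redundant frame choices. We propose a question-aware keyframe selection framework with two core components: pseudo keyframe labels extracted from LMMs, which provide informative supervision, and a coverage regularization mechanism, which encourages the model to select diverse, complementary evidence across time. Experiments on the NExT-QA dataset show that our method significantly improves question-answering accuracy, especially for temporal and causal question types, establishing keyframe selection as an efficient and learnable module for VideoQA.

Abstract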

Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.

Keywords: Video Question Answering, Keyframe Selection, Large Multimodal Models, Synthetic Supervision, Coverage Regularization, Temporal Reasoning, Causal Questions, NExT-QA

116. ❌ FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data

Authors: Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14947v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

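Scoring rationale: The paper focuses on fairness, explainability, and bias mitigation in healthcare AI, using XGBoost and Bayesian optimization; it does not involve large models, the technical principles of deep learning, or the specific techniques named in the scoring keywords (such as LLMs, MoE, SFT, or RAG). The only relevant keywords are "AI for Science OR Bioinformatics OR Cheminformatics" (a healthcare AI application, scored 8) and "Mechanistic Interpretability OR Explainable AI" (SHAP explainability, scored 5); all other keywords are unrelated (scored 0).

!!! tip deepseek-chat TL;DR

The paper targets gender bias in clinical machine learning models and proposes the FairMed-XGB framework. By integrating a fairness-aware loss function with Bayesian optimization, it substantially reduces bias metrics on the MIMIC-IV-ED and eICU datasets (for example, Statistical Parity Difference falls by 10-51%) while preserving predictive accuracy (AUC-ROC drop <0.02), and it provides explainability through SHAP.

Abstract translation

Machine learning models deployed in critical care settings exhibit demographic bias, especially gender disparities, which undermines clinical trust and equitable treatment. This paper proposes FairMed-XGB, a novel framework that systematically detects and mitigates gender-based prediction bias while preserving model performance and transparency. The framework takes a fairness-aware loss function that combines Statistical Parity Difference, Theil Index, and Wasserstein Distance, jointly optimizes it via Bayesian search, and integrates it into an XGBoost classifier. Post-mitigation evaluation on seven clinically heterogeneous cohorts derived from the MIMIC-IV-ED and eICU databases shows a marked reduction in bias: Statistical Parity Difference falls by 40% to 51% on MIMIC-IV-ED and by 10% to 19% on eICU; the Theil Index drops by four to five orders of magnitude to near zero; and Wasserstein Distance decreases by 20% to 72%. These improvements come with almost no loss of predictive accuracy (AUC-ROC drop <0.02). SHAP-based explainability analysis shows that the framework reduces reliance on gender-proxy features, giving clinicians actionable insight into how and where bias is corrected. FairMed-XGB offers a robust, interpretable, and ethically sound solution for equitable clinical decision-making, paving the way for trustworthy deployment of AI in high-stakes healthcare settings.

Abstract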

Machine learning models deployed in critical care settings exhibit demographic biases, particularly gender disparities, that undermine clinical trust and equitable treatment. This paper introduces FairMed-XGB, a novel framework that systematically detects and mitigates gender-based prediction bias while preserving model performance and transparency. The framework integrates a fairness-aware loss function combining Statistical Parity Difference, Theil Index, and Wasserstein Distance, jointly optimised via Bayesian Search into an XGBoost classifier. Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19 percent on eICU; Theil Index collapses by four to five orders of magnitude to near-zero values; Wasserstein Distance is reduced by 20 to 72 percent. These gains are achieved with negligible degradation in predictive accuracy (AUC-ROC drop <0.02). SHAP-based explainability reveals that the framework diminishes reliance on gender-proxy features, providing clinicians with actionable insights into how and where bias is corrected. FairMed-XGB offers a robust, interpretable, and ethically aligned solution for equitable clinical decision-making, paving the way for trustworthy deployment of AI in high-stakes healthcare environments.
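
Of the three fairness metrics named above, Statistical Parity Difference is the simplest to state: the gap in positive-prediction rates between two groups. The sketch below computes it with NumPy; the sign convention (unprivileged minus privileged), the 0/1 group encoding, and the toy data are assumptions, not details taken from the paper.

```python
import numpy as np


def statistical_parity_difference(y_pred, group):
    """Positive-prediction rate of group 0 minus that of group 1.

    y_pred: binary predictions (0 or 1).
    group:  binary group membership; 0 = unprivileged, 1 = privileged
            (an assumed encoding for this illustration).
    """
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    rate_unprivileged = y_pred[group == 0].mean()
    rate_privileged = y_pred[group == 1].mean()
    return rate_unprivileged - rate_privileged


# Toy example: 3 of 4 positives in group 1, 1 of 4 positives in group 0.
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
spd = statistical_parity_difference(y_pred, group)  # 0.25 - 0.75 = -0.5
```

A value of 0 means both groups receive positive predictions at the same rate, so a reported reduction such as "40 to 51 percent" refers to the magnitude of this gap shrinking toward 0.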

Keywords: Fairness, Healthcare AI, Bias Mitigation, XGBoost, Bayesian Optimization, Explainable AI, Clinical Decision-making, Demographic Equity

117. ❌ Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models

Authors: Xiyu Liu, Qingyi Si, Zhengxiao Liu, Chenxu Yang, Naibin Gu, Zheng Lin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）的知识编辑问题，特别是同主题知识编辑中的泛化失败问题，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文提到’LLM agents’作为应用背景，有一定关联（5分）。其他关键词如MoE、SLMs、训练方法、推理技术、AI for Science等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文研究大语言模型在同主题知识编辑中出现的泛化失败问题，并提出RoSE方法通过各向同性几何对齐和分层知识集成来提升指令跟随能力。

Abstract

While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model’s geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.

Keywords: Large Language Models, Knowledge Editing, Generalization Failure, Same-Subject Editing, Instruction Following, Robust Editing, Geometric Alignment, LLM Agents

118. ❌ ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

Authors: Duy Vu Minh Nguyen, Chinh Thanh Truong, Phuc Hoang Tran, Hung Tuan Le, Nguyen Van-Thanh Dat, Trung Hieu Pham, Kiet Van Nguyen Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

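Scoring rationale: The paper focuses on building ViX-Ray, a Vietnamese chest X-ray dataset, fine-tuning several open-source vision-language models (VLMs) on it, and comparing them with proprietary models such as GPT-4V and Gemini. The relevant keywords are: 1. "Post-training OR Supervised Fine-tuning OR SFT" (10 points): the paper explicitly fine-tunes several VLMs, a typical supervised fine-tuning application. 2. "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points): the paper evaluates hallucination in model outputs and reports excessive hallucination in impression generation, which bears directly on factuality and truthfulness evaluation. 3. "AI for Science OR Bioinformatics OR Cheminformatics" (10 points): the paper applies AI to biomedicine, covering the construction of a Vietnamese medical dataset and model evaluation, which falls under AI for Science. 4. "Large Language Models OR LLMs OR Foundation Models" (5 points): the paper involves VLMs such as GPT-4V and Gemini, which are foundation models extended to vision-language tasks; the match is partial because they are not text-only LLMs. The remaining keywords, such as MoE, quantization, inference acceleration, and RAG, are not addressed in the paper and score 0.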
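The score table can be read as a weighted sum checked against the pass mark. The sketch below is only an illustration: the pass-mark formula (5.0 + 0.8 × total weight, with 27 keywords and a total weight of 27.0) comes from the report's configuration, while the rule that each row's score equals weight × relevance is an assumption, and the function names are hypothetical. Under that assumption, the per-row weight of 0.0 shown in the table explains why the total is 0.0 despite several relevance values of 10.

```python
# Hypothetical sketch of the scoring arithmetic; the report does not show its implementation.


def pass_mark(total_weight: float) -> float:
    """Pass mark from the report's configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight x relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero-relevance rows of the ViX-Ray table as (weight, relevance).
# Every other row is (0.0, 0.0) and contributes nothing to the sum.
vix_ray_rows = [(0.0, 5.0), (0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]

threshold = pass_mark(total_weight=27.0)
score = paper_score(vix_ray_rows)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With the values above the comparison is 0.0 against 26.6, which matches the ❌ shown for this paper.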

!!! tip deepseek-chat TL;DR

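The paper builds ViX-Ray, a Vietnamese chest X-ray dataset, fine-tunes several vision-language models on it, and finds that the models hallucinate substantially when generating diagnostic impressions. The dataset serves as a benchmark for evaluating AI in the Vietnamese clinical domain.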

Abstract

Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.

Keywords: Vietnamese chest X-ray dataset, vision-language models, fine-tuning, hallucination, clinical diagnosis, medical AI, benchmark evaluation, GPT-4V

119. ❌ Invisible failures in human-AI interactions

Authors: Christopher Potts, Moritz Sudhof Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

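Scoring rationale: The paper studies invisible failure modes in human-AI interaction by analyzing AI system failures in the WildChat dataset; it belongs to AI system evaluation and interaction design. Because it analyzes real-world failures of AI systems (which may include large language models), it is partially relevant to "Large Language Models OR LLMs OR Foundation Models" (5 points). It does not examine LLM technical principles, training methods, inference optimization, or new application domains, so it has no direct connection to the other 26 technical keywords (0 points).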

!!! tip deepseek-chat TL;DR

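Analyzing the WildChat dataset, the paper finds that 78% of AI system failures are invisible and groups them into eight failure archetypes. It reports that 91% of the failures involve interactional dynamics and estimates that 94% of those would persist even with a more capable model. The taxonomy gives product developers and policy makers a framework for reliable failure monitoring.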

Abstract

AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users’ needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes

Keywords: human-AI interactions, invisible failures, failure archetypes, WildChat dataset, interactional dynamics, failure monitoring, AI limitations, systematic analysis

120. ❌ SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Authors: Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15409v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

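Scoring rationale: The paper is about building and evaluating a benchmark for multilingual document and scene text understanding, covering document parsing and visual question answering. It does not touch the core keywords: innovations in LLM or deep learning principles, training methods, inference optimization, alignment techniques, or agent systems. Although it evaluates multimodal large language models (MLLMs), its focus is the benchmark dataset and evaluation method rather than LLM technology itself. No keyword is directly related to the paper's content, so all scores are 0.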

!!! tip deepseek-chat TL;DR

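The paper introduces SEA-Vision, a benchmark that evaluates multilingual document parsing and text-centric visual question answering across 11 Southeast Asian languages, and finds that current multimodal models degrade markedly on low-resource languages.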

Abstract

Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.

Keywords: multilingual document understanding, scene text understanding, benchmark, Southeast Asian languages, document parsing, text-centric visual question answering, low-resource languages, multimodal models

121. ❌ Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Authors: Zehao Chen, Rong Pan Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

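Scoring rationale: The paper's core topic is personality control in LLMs, which directly involves LLMs (10 points), SFT (10 points), and LoRA (10 points). Model Merging is partially relevant (5 points) because Fusian fuses multiple LoRA adapters, which resembles model merging. The remaining keywords, such as MoE, SLMs, Scaling Laws, RLHF, and RAG, are not addressed in the paper and score 0.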

!!! tip deepseek-chat TL;DR

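The paper proposes Fusian, a framework that fuses multiple LoRA adapters under a reinforcement-learning policy to achieve fine-grained, continuous control of MBTI personality traits in large language models, significantly outperforming baseline methods.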

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., “Extroverted” vs. “Introverted”), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model’s output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.

Keywords: Large Language Models, Personality Control, Supervised Fine-Tuning, LoRA, Parameter-efficient Fine-tuning, Model Fusion, Reinforcement Learning, Continuous Spectrum
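The fusion step described in the Fusian abstract (mixing weights drawn from a Dirichlet distribution and applied to frozen adapters) can be illustrated with a toy sketch. Everything concrete here is an assumption: the adapters are reduced to short flat lists of numbers, the fusion is taken to be a weighted sum of adapter updates, and the concentration parameters `alpha` are fixed placeholders, whereas the paper obtains them from a trained policy network. The helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw mixing weights from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def fuse_adapters(adapter_deltas, weights):
    """Weighted sum of per-adapter weight updates (each a flat list of floats)."""
    size = len(adapter_deltas[0])
    return [
        sum(w * delta[i] for w, delta in zip(weights, adapter_deltas))
        for i in range(size)
    ]


rng = random.Random(0)

# Three toy adapter "checkpoints" saved along an SFT trajectory (made-up values).
adapter_deltas = [
    [0.0, 0.1, 0.0, 0.1],
    [0.2, 0.3, 0.1, 0.2],
    [0.5, 0.6, 0.4, 0.5],
]

# Placeholder concentration parameters; the paper's policy network would output these.
alpha = [1.0, 2.0, 4.0]

weights = sample_dirichlet(alpha, rng)
fused = fuse_adapters(adapter_deltas, weights)
print(weights, fused)
```

Because Dirichlet samples are positive and sum to 1, the fused update is a convex combination of the saved checkpoints, which is one plausible way to land between discrete trait intensities.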

122. ❌ When Does Sparsity Mitigate the Curse of Depth in LLMs

Authors: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

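Scoring rationale: The paper's core topic is the curse of depth in large language models (LLMs), with a focus on how sparsity (including expert-activation sparsity in MoE architectures) mitigates it, so it is highly relevant to "Large Language Models" and "Mixture of Experts" (10 points each). It mentions attention sparsity induced by long-context inputs, which is partially relevant to "Context Window Extension" (5 points). The remaining keywords, such as small models, alignment, reasoning, and AI for Science applications, are not addressed in the paper and score 0.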

!!! tip deepseek-chat TL;DR

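The paper studies the curse of depth in large language models and finds that sparsity (including expert-activation sparsity in MoE architectures) regulates variance propagation and improves layer utilization, which makes depth scaling more effective. The resulting training recipe yields a 4.6% accuracy improvement on downstream tasks.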

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that, sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.

Keywords: large language models, sparsity, curse of depth, Mixture of Experts, variance propagation, layer utilization, depth scaling, Grouped-Query Attention
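The link the abstract draws between variance growth under Pre-Layer Normalization and near-identity deep blocks can be made concrete with a deliberately simplified toy model. This is not the paper's analysis. It assumes that each block adds an output of fixed variance `block_var` that is uncorrelated with the residual stream, so stream variance grows linearly with depth, and that a block's sensitivity to its input scales with the inverse standard deviation of the stream because LayerNorm divides by it. It also ignores any effect sparsity has on the block itself. All names are hypothetical.

```python
import math


def stream_variance(depth, block_var, init_var=1.0):
    """Residual-stream variance after `depth` Pre-LN blocks in the toy model.

    Assumes each block adds an output of variance `block_var` that is
    uncorrelated with the stream, so variance grows linearly with depth.
    """
    return init_var + depth * block_var


def block_sensitivity(depth, block_var, init_var=1.0):
    """Toy proxy for how strongly a block at `depth` reacts to its input.

    LayerNorm divides by the stream's standard deviation, so the proxy is
    the inverse square root of the stream variance at that depth.
    """
    return 1.0 / math.sqrt(stream_variance(depth, block_var, init_var))


# 0.25 stands in for blocks with lower output variance (for example, sparser ones).
for block_var in (1.0, 0.25):
    values = [block_sensitivity(d, block_var) for d in (0, 16, 64)]
    print(block_var, [round(v, 3) for v in values])
```

In this toy setting the proxy shrinks with depth in both cases, but more slowly when each block contributes less variance, which is the direction of the effect the abstract attributes to sparsity.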

123. ❌ DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Authors: Xueyu Zhou, Yangrong Hu, Jian Huang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

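Scoring rationale: The paper studies decoding strategies for Masked Diffusion Language Models (MDLMs), which falls within large-model techniques, so it is relevant to "Large Language Models" (5 points). It refers to pre-trained MDLMs, which relates to "Pre-training" (5 points). The remaining keywords, such as MoE, SLMs, Scaling Laws, SFT, RLHF, RAG, CoT, Agents, Quantization, and AI for Science, are either absent from the abstract or unrelated to the paper's topic and score 0.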

!!! tip deepseek-chat TL;DR

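Decoding for Masked Diffusion Language Models (MDLMs) tends to overlook sequence-level information and inter-token dependencies. To address this, the paper proposes the Dependency-Oriented Sampler (DOS), a decoding strategy that uses attention matrices to approximate token dependencies. DOS improves performance on code generation and mathematical reasoning tasks and can be combined with existing parallel sampling methods to improve efficiency.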

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.

Keywords: Masked Diffusion Language Models, MDLMs, decoding strategy, Dependency-Oriented Sampler, DOS, inter-token dependencies, attention matrices, parallel sampling
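The abstract says DOS uses attention matrices to approximate inter-token dependencies and emphasizes unmasked tokens when updating masked positions, but it does not give the scoring rule. The sketch below shows one possible reading, as an assumption rather than the authors' algorithm: score each masked position by the attention mass it places on already-unmasked tokens and decode the highest-scoring position next. The function names and the toy attention matrix are hypothetical.

```python
def unmasked_attention_mass(attention, is_masked):
    """For each masked position, sum its attention weights on unmasked tokens."""
    scores = {}
    for i, masked in enumerate(is_masked):
        if masked:
            scores[i] = sum(
                weight for j, weight in enumerate(attention[i]) if not is_masked[j]
            )
    return scores


def next_position_to_unmask(attention, is_masked):
    """Pick the masked position most strongly tied to already-decoded tokens.

    Raises ValueError if no position is masked.
    """
    scores = unmasked_attention_mass(attention, is_masked)
    return max(scores, key=scores.get)


# Toy 4-token sequence: rows are query positions and each row sums to 1.
attention = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.05, 0.05, 0.10, 0.80],
]
is_masked = [False, True, True, False]  # positions 1 and 2 are still masked

print(next_position_to_unmask(attention, is_masked))  # prints 1
```

Position 1 puts about 0.7 of its attention on unmasked tokens and position 2 about 0.5, so position 1 is chosen. A token-level uncertainty criterion, which the abstract describes as the existing default, would ignore this dependency signal.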

124. ❌ PYTHEN: A Flexible Framework for Legal Reasoning in Python

Authors: Ha-Thanh Nguyen, Ken Satoh Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15317v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

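Scoring rationale: The paper presents PYTHEN, a Python-based legal reasoning framework focused on symbolic logical reasoning and on modeling defeasible legal argumentation. It is unrelated to the keywords on deep learning, large-model techniques, AI training methods, inference optimization, and AI agents. The paper belongs to traditional symbolic AI and expert systems and involves no deep learning or large-model techniques.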

!!! tip deepseek-chat TL;DR

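The paper proposes PYTHEN, a flexible Python-based framework for modeling defeasible legal reasoning. It aims to make formal legal reasoning more accessible by combining symbolic reasoning with the ease of use of Python.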

Abstract

This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python’s built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.

Keywords: defeasible legal reasoning, Python framework, symbolic reasoning, legal argumentation, autoformalization, legal AI systems, PROLEG, The Zen of Python

125. ❌ Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

Authors: Giuseppe Samo, Paola Merlo Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15295v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

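Scoring rationale: The paper explicitly studies how large language models (LLMs) perform on linguistic tasks, specifically verb alternations, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not address the other keywords (MoE, SLMs, training techniques, reasoning methods, agent systems, compression, AI for science), so those score 0.

!!! tip deepseek-chat TL;DR

The paper examines how large language models handle verb alternations across languages. It builds paradigm-based datasets and reports baseline results, showing that the datasets are useful for diagnosing models' systematic cross-sentence knowledge.

Abstract (Chinese translation)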

大语言模型（LLM）在各种基于句子的语言现象上表现出卓越的性能，但其捕捉跨句子聚合范式模式（如动词交替）的能力仍未得到充分探索。在本研究中，我们为四种语言构建了基于范式的精选数据集，旨在探究模型对动词交替（英语、德语和意大利语中的状态变化和宾语省略结构，以及希伯来语的动词词干系统）的系统性跨句子知识。这些数据集包含数千个“黑鸟语言矩阵”（Blackbird Language Matrices, BLMs）问题。BLM任务——一种专为语言设计的、类似于RPM/ARC的认知测试——是一种受控的语言谜题，模型必须根据句法和语义规则选择能够完成模式的句子。我们引入了三种复杂度不同的模板类型，并在合成数据与自然数据上应用了基于语言学理论的数据增强策略。我们提供了英语、意大利语、德语和希伯来语的简单基线性能结果，证明了这些数据集的诊断价值。

Abstract

Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of the Blackbird Language Matrices (BLMs) problems. The BLM task – an RPM/ARC-like task devised specifically for language – is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew, that demonstrate the diagnostic usefulness of the datasets.

Keywords: Large language models, Verb alternations, Cross-linguistic datasets, BLM templates, Data augmentation, Linguistic puzzles, Systematic knowledge, Diagnostic evaluation

126. ❌ Practicing with Language Models Cultivates Human Empathic Communication

Authors: Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Bruce Lambert, Matthew Groh Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15245v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

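Scoring rationale: The paper centers on LLMs in human empathic communication: participants practice offering empathic support to an LLM that role-plays personal and workplace troubles, and an LLM coaching intervention improves their communication patterns. It is therefore highly relevant to "Large Language Models" (10 points) and somewhat related to "Instruction Tuning/Alignment" (5 points, since it concerns aligning communication patterns with normative ones). Other keywords such as MoE, SLMs, and Scaling Laws are not addressed and score 0.

!!! tip deepseek-chat TL;DR

In blinded evaluations, LLM responses are often judged more empathic than human-written ones, yet recipients feel less heard and validated when a response is attributed to AI. The study builds a personalized LLM coaching intervention that significantly improves how closely participants' communication matches normative empathic patterns.

Abstract (Chinese translation)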

共情能力是人类联结的核心，然而人们常常难以有效表达共情。在盲审评估中，大型语言模型（LLMs）生成的回应通常被认为比人类撰写的回应更具共情力。但当回应被标注为AI生成时，接收者会感到比标注为人类生成的类似回应更少被倾听和认可。为探究并弥补这一共情沟通技能的差距，我们开发了实验性对话平台“Lend an Ear”，邀请参与者向扮演个人及职场困境的LLM提供共情支持。基于968名参与者与其LLM对话伙伴进行的2,904场文本对话中产生的33,938条消息，我们通过数据驱动的方式构建了自然对话中惯用共情表达的分类体系。通过一项预注册的随机对照实验，我们证明：相较于对照组和接受非个性化视频反馈的组别，接受基于LLM的简短辅导干预（针对如何有效传达共情提供个性化反馈）能显著提升参与者沟通模式与规范性共情沟通模式的一致性。此外，我们发现了“沉默共情效应”的证据，即人们虽能感知共情却系统性地未能表达它。尽管如此，参与者仍能可靠地识别出符合规范性共情沟通标准的回应，并认为其更能体现共情。这些研究结果共同推进了对共情表达方式及其价值判定的科学理解，并展示了一种可扩展的、基于人工智能的干预方法，用于构建和培养共情能力。

Abstract

Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants’ communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect that people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.

Keywords: Large Language Models, Empathy, Human-AI Interaction, Communication Skills, Coaching Intervention, Personalized Feedback, Conversational Platform, Normative Patterns

127. ❌ Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation

Authors: Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15227v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

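Scoring rationale: The paper focuses on building and evaluating a dataset of passive sentences for machine translation, which is applied NLP research. The abstract mentions evaluating LLMs, so there is some connection to "Large Language Models OR LLMs OR Foundation Models" (3 points), but LLMs are only one of the evaluated systems, not the paper's core contribution. The remaining keywords concern large-model principles, training methods, inference optimization, and agent systems, none of which relate directly to the dataset and its evaluation, so they score 0.

!!! tip deepseek-chat TL;DR

The paper builds a bidirectional Chinese-English passive-sentence dataset for machine translation evaluation. Models tend to preserve the voice of the source text; commercial NMT models score higher on metrics, while LLMs offer more diverse alternative translations.

Abstract (Chinese translation)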

机器翻译评估已超越单纯指标度量，正朝着更具体的语言现象分析方向发展。针对英汉语言对，被动句因语言差异在结构与分布上存在不同特点，需在机器翻译中予以特别关注。本文提出一个双向多领域被动句数据集，该数据集从五个汉英平行语料库中提取，并依据人工翻译自动标注结构标签，同时包含经人工校验标注的测试集。数据集共包含73,965个平行句对（2,358,731个英文单词，3,498,229个汉字字符）。我们使用该数据集评估了两个前沿开源机器翻译系统，并利用测试集评估了四个商业模型。结果表明：与人类译者不同，模型更受源文本语态的影响而非源语言的整体语态使用规律，因此在双向翻译中均倾向于保留被动语态。然而，模型展现出对汉语被动句低频出现及主要存在于消极语境的认知，导致英译汉时与人类译者的语态一致性高于汉译英。商业神经机器翻译模型在指标评估中得分更高，但大语言模型展现出更强的多样化替代翻译能力。数据集与标注脚本将根据需求提供共享。

Abstract

Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text rather than the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. Datasets and annotation script will be shared upon request.

Keywords: Machine Translation, Passive Sentences, Chinese-English, Dataset, Evaluation, Large Language Models, Neural Machine Translation, Linguistic Phenomena

128. ❌ Efficient Document Parsing via Parallel Token Prediction

Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15206v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

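Scoring rationale: The paper focuses on inference acceleration and hallucination reduction for vision-language models (VLMs) in document parsing, and is unrelated to most keywords. It is highly relevant only to "Speculative Decoding OR Inference Acceleration" (8 points), since the core contribution is parallel token prediction to speed up decoding, and somewhat related to "Hallucination Mitigation OR Factuality OR Truthfulness" (5 points), since the method also reduces hallucinations. No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper proposes Parallel-Token Prediction (PTP), a pluggable method that lets vision-language models generate multiple future tokens in parallel. It speeds up document-parsing decoding by 1.6x to 2.2x while reducing hallucinations and showing strong generalization.

Abstract (Chinese translation)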

文档解析作为一项基础而关键的视觉任务，正受到视觉语言模型（Vision-Language Models, VLMs）的革命性影响。然而，VLMs固有的自回归（Autoregressive, AR）解码方式造成了显著的性能瓶颈，严重限制了解析速度。本文提出并行令牌预测（Parallel-Token Prediction, PTP），这是一种可插拔、模型无关且简洁高效的方法，使VLMs能够以更高的样本效率并行生成多个未来令牌。具体而言，我们在输入序列中插入若干可学习的令牌，并设计相应的训练目标，使模型获得面向文档解析的并行解码能力。此外，为支持有效训练，我们开发了一套完整的数据生成流程，能够高效地为VLMs生产大规模、高质量的文档解析训练数据。在OmniDocBench和olmOCR-bench上的大量实验表明，我们的方法不仅显著提升了解码速度（1.6倍至2.2倍），同时减少了模型幻觉现象，并展现出强大的泛化能力。

Abstract

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.

Keywords: Document Parsing, Vision-Language Models, Parallel Token Prediction, Autoregressive Decoding, Inference Acceleration, Model Hallucinations, Generalization, Data Generation Pipeline

129. ❌ The Hrunting of AI: Where and How to Improve English Dialectal Fairness

Authors: Wei Li, Adrian de Wynter Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15187v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

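Scoring rationale: The paper centers on the fairness of large language models (LLMs) across English dialects, directly matching the LLMs keyword (10 points). It examines how data quality and availability affect the feasibility of improving LLMs, which is somewhat related to "Scaling Laws AND Data Quality" (5 points). It discusses the effect of fine-tuning on model behavior, relating to Post-training/SFT (5 points). It also concerns LLM alignment with human consensus, relating to Alignment (5 points). Other keywords such as MoE, SLMs, RLHF, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper studies LLM fairness across English dialects. It finds that human-human agreement on generation quality directly shapes LLM-as-a-judge performance and that fine-tuning may amplify this pattern, while the ability of some LLMs to generate high-quality data offers a way around data scarcity.

Abstract (Chinese translation)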

已知大型语言模型（LLM）在英语方言中表现不佳，且由于数据稀缺，改进这些模型十分困难。本研究探讨了在此背景下，数据质量与可用性如何影响改进LLM的可行性。为此，我们评估了三种极少被研究的英语方言（约克郡方言、乔迪方言和康沃尔方言），并加入非裔美国人白话英语作为研究对象，以西弗里斯兰语作为对照。我们发现，人类在判定LLM生成质量时的一致性程度会直接影响LLM作为评判者的表现。具体而言，LLM与人类的一致性模式模仿了人类之间的一致性模式，准确率等指标也呈现相同趋势。这成为一个问题，因为LLM与人类的一致性衡量的是LLM与人类共识的对齐程度；因此在人口较少导致共识度低的地区，提升LLM性能的可行性受到质疑。我们还注意到，微调不仅未能消除英语方言中的这种模式，反而可能加剧它。但研究也发现了积极信号，例如某些LLM能够生成高质量数据，从而实现了可扩展性。我们认为，必须仔细评估数据以确保LLM改进的公平性与包容性；且在数据稀缺的情况下，需要新的工具来处理所发现的模式。

Abstract

It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. It is an issue because LLM-human agreement measures an LLM’s alignment with the human consensus; and hence raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. But also find encouraging signals, such as some LLMs’ ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.
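To make the comparison between human-human and LLM-human agreement concrete, here is a minimal sketch that computes Cohen's kappa for two pairs of raters. The abstract does not say which agreement statistic the authors use, so Cohen's kappa is an assumption chosen because it is a common choice; the label values and the toy annotations are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined, so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy quality judgments on four generations (illustrative data only).
human_1 = ["good", "good", "bad", "bad"]
human_2 = ["good", "bad", "bad", "bad"]
llm_judge = ["good", "good", "bad", "bad"]

human_human = cohen_kappa(human_1, human_2)
llm_human = cohen_kappa(llm_judge, human_1)
print(f"human-human kappa: {human_human:.2f}")
print(f"LLM-human kappa:   {llm_human:.2f}")
```

In this toy data the LLM judge is scored against only one annotator, which illustrates the concern raised in the abstract: when the annotators themselves disagree, an LLM-human agreement figure depends on which human reference it is compared with.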

Keywords: large language models, English dialects, fairness, data scarcity, fine-tuning, LLM evaluation, human agreement, model alignment

130. ❌ Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

Authors: Miriam Winkler, Verena Blaschke, Barbara Plank Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15130v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

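Scoring rationale: The paper studies the Indirect Question Answering (IQA) task, uses GPT-4o-mini to generate data, and tests multilingual transformer models (mBERT, XLM-R, mDeBERTa). It is moderately related only to "Large Language Models" (5 points) because GPT-4o-mini is used for data generation, but the core is NLP task evaluation rather than large-model technology itself. All other keywords are unrelated (0 points): the paper does not cover MoE, SLMs, scaling laws, training methods, inference optimization, agent systems, or model compression.

!!! tip deepseek-chat TL;DR

The paper studies indirect question answering in English, German, and Bavarian and finds it a hard task: even with GPT-4o-mini-generated training data and multilingual transformer models, performance stays low and overfitting is severe.

Abstract (Chinese translation)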

间接性是日常交际的普遍特征，但在自然语言处理研究中，无论对于低资源语言还是高资源语言，该现象均未得到充分探索。间接问答任务旨在对间接回答的极性进行分类。本文提出了两个质量各异的多语言间接问答语料库，两者均涵盖英语、标准德语以及无标准拼写体系的德语方言巴伐利亚语：其一是小规模高质量评估数据集InQA+，包含人工标注标签；其二是更大规模的训练数据集GenIQA，其中包含由GPT-4o-mini生成的人工数据。基于对多语言Transformer模型（mBERT、XLM-R和mDeBERTa）的多种实验变体，我们发现间接问答是一项语用层面极具挑战性的任务，伴随多重困难。我们提出并采用了应对这些挑战的建议方案。实验结果显示，即使对于英语，模型性能也处于较低水平，且存在严重的过拟合现象。我们分析了影响结果的多种因素，包括标签歧义性、标签集规模及数据集大小。研究发现，间接问答任务在高资源语言（英语、德语）和低资源语言（巴伐利亚语）中表现均不佳，而大量训练数据对性能提升具有积极作用。此外，GPT-4o-mini在所有测试语言中均未展现出足够的语用理解能力以生成高质量的间接问答数据。

Abstract

Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset, that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that the IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.

Keywords: Indirect Question Answering, multilingual corpora, GPT-4o-mini, transformer models, low-resource languages, pragmatic understanding, overfitting, label ambiguity

131. ❌ MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge

Authors: Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng, Muhao Xu, Hongmei Yan, Weiye Song, Yi Wan Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15117v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

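Scoring rationale: The paper studies multimodal knowledge updating, proposes the MMKU-Bench benchmark, and evaluates SFT, RLHF, and knowledge editing. Relevance by keyword: 1) it involves multimodal models (arguably a kind of foundation model) but does not explicitly use LLMs: 5 points; 2) it explicitly evaluates SFT and RLHF, which are core to the experiments: 10 points each; 3) it raises the problem of outdated pretraining knowledge, which is somewhat related to pre-training/domain adaptation: 5 points; 4) other keywords such as MoE, quantization, and inference acceleration are not addressed: 0 points.

!!! tip deepseek-chat TL;DR

To address outdated knowledge in multimodal models, the paper proposes the MMKU-Bench evaluation benchmark. Experiments show that SFT and RLHF are prone to catastrophic forgetting, while knowledge editing has clear limitations in continual updating.

Abstract (Chinese translation)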

随着现实世界知识持续演进，多模态模型在预训练阶段获得的参数化知识越来越难以与现实世界知识保持一致。现有关于多模态知识更新的研究仅聚焦于学习先前未知的知识，而忽视了更新模型已掌握但后续发生变化的知识的需求；此外，评估局限于单一模态，缺乏对跨模态一致性的系统分析。为解决这些问题，本文提出MMKU-Bench——一个用于多模态知识更新的综合性评估基准，该基准包含超过2.5万个知识实例和4.9万余张图像，涵盖知识更新与知识未知两种场景，从而支持对不同知识类型学习效果的比较分析。基于此基准，我们评估了多种代表性方法，包括监督微调（Supervised Fine-Tuning, SFT）、基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）以及知识编辑（Knowledge Editing, KE）。实验结果表明，SFT和RLHF容易产生灾难性遗忘，而KE能更好地保留通用能力，但在持续更新方面存在明显局限。总体而言，MMKU-Bench为多模态知识更新领域提供了一个可靠且全面的评估基准，推动了该领域的进展。

Abstract

As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to keep consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.

Keywords: multimodal knowledge updating, evaluation benchmark, supervised fine-tuning, reinforcement learning from human feedback, knowledge editing, catastrophic forgetting, cross-modal consistency, MMKU-Bench

132. ❌ Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15061v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

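Rationale: The paper studies how to optimize LLMs for creative writing, covering multi-agent collaboration, self-reflection, and a reinforcement-learning training paradigm. Highly relevant keywords: LLMs (the core model), Supervised Fine-tuning (training method), Self-Correction (self-reflection mechanism), LLM Agents (multi-agent framework), and Multi-agent Systems (collaborative system). Moderately relevant keywords: Small Language Models (a 4B-parameter model), RLHF (reinforcement-learning optimization), System 2 Thinking (in-depth reflection), and Explainable AI (interpretable criteria). The remaining keywords are unrelated to the paper.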

!!! tip deepseek-chat TL;DR

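To address the lack of verifiable reference answers in creative writing, the paper proposes a multi-agent collaborative method for dynamically generating evaluation criteria, together with the Memory-augmented Replay Policy Optimization (MRPO) algorithm. The resulting 4B-parameter Writer-R1 model outperforms baselines on multiple creative writing tasks and surpasses some open-source models with more than 100B parameters.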

Abstract (Chinese translation)

作为一种典型的开放式生成任务，创意写作缺乏可验证的参考答案，长期以来因高昂的人工标注成本、评估偏差及粗粒度的反馈信号，制约了奖励建模与自动评估的发展。为应对这些挑战，本文首先基于扎根理论设计了一种多智能体协同工作流程，对问题进行维度分解与层次化归纳，动态生成可解释、可复用的细粒度评估标准。进一步，我们提出了记忆增强回放策略优化算法：一方面，该算法无需额外训练，即可引导模型基于动态标准进行自我反思，实现可控的迭代改进；另一方面，我们采用监督微调与强化学习相结合的训练范式，将评估标准转化为奖励信号，实现端到端的优化。实验结果表明，自动构建的评估标准取得了与人工标注相当的性能提升。通过此方法训练的Writer-R1-4B模型在多项创意写作任务中均优于基线模型，并超越了一些参数量超过1000亿的开源模型。

Abstract

As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.

Keywords: creative writing, multi-agent collaboration, self-reflection, reinforcement learning, supervised fine-tuning, memory-augmented replay, policy optimization, LLM evaluation

133. ❌ Attention Residuals

Authors: Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15031v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

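Rationale: The paper's core contribution is an architectural innovation for large language models (LLMs): Attention Residuals (AttnRes), proposed as a replacement for standard residual connections. It explicitly targets LLMs (a weight-1.0 keyword, scored 10), runs scaling-law experiments (a weight-1.0 keyword, scored 5), and involves pre-training (a weight-1.0 keyword, scored 5). Other keywords such as MoE, SFT, and RAG are unrelated to the paper (scored 0). The paper does not address scientific applications, so AI for Science and similar keywords also score 0.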

!!! tip deepseek-chat TL;DR

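The paper addresses hidden-state growth and the dilution of per-layer contributions caused by standard residual connections in LLMs. It proposes Attention Residuals (AttnRes), which selectively aggregates earlier-layer representations through softmax attention. Experiments show more uniform output magnitudes and gradient distributions across depth, along with improved downstream performance.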

Abstract (Chinese translation)

采用预归一化（PreNorm）的残差连接是现代大语言模型（LLM）的标准设计，但其以固定的单位权重累加所有层的输出。这种均匀聚合会导致隐藏状态随深度增长而失控，逐渐稀释每一层的贡献。我们提出注意力残差（Attention Residuals, AttnRes），用基于先前层输出的softmax注意力机制取代这种固定累加，使每一层能够以学习到的、输入依赖的权重有选择性地聚合先前的表征。为了解决在大规模模型训练中关注所有先前层输出所带来的内存和通信开销，我们引入了分块注意力残差（Block AttnRes），将层划分为多个块，并在块级表征上进行注意力计算，从而在保留完整AttnRes大部分优势的同时减少了内存占用。结合基于缓存的流水线通信和两阶段计算策略，Block AttnRes成为一种实用的即插即用替代方案，能以极小的开销取代标准残差连接。

缩放定律实验证实，该改进在不同模型规模下均保持一致，消融研究验证了内容依赖的深度选择机制的有效性。我们进一步将AttnRes集成到Kimi Linear架构（总参数量480亿/激活参数量30亿）中，并在1.4万亿词元上进行预训练。结果表明，AttnRes缓解了PreNorm带来的稀释效应，使得不同深度的输出幅度和梯度分布更加均匀，并在所有评估的下游任务中提升了性能。

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.

Keywords: Attention Residuals, Large Language Models, Residual Connections, PreNorm, Layer Aggregation, Scaling Laws, Model Architecture, Gradient Distribution
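
The contrast drawn in the abstract above, between fixed unit-weight accumulation and softmax attention over preceding layer outputs, can be illustrated with a toy sketch. This is not the authors' implementation: the `query` vector, the scaled dot-product scoring, and both function names are assumptions made for illustration, and Block AttnRes, pipeline caching, and the two-phase computation are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(layer_outputs):
    """Fixed unit-weight accumulation: every earlier output is simply summed."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def attention_residual(layer_outputs, query):
    """Softmax-weighted aggregation over earlier layer outputs.

    `query` is a hypothetical learned vector (an assumption of this sketch).
    Scores are scaled dot products, so the weights depend on the outputs.
    """
    dim = len(query)
    scores = [
        sum(q * o for q, o in zip(query, out)) / math.sqrt(dim)
        for out in layer_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * out[i] for w, out in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return mixed, weights


# Toy outputs from three earlier layers, hidden size 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summed = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query=[1.0, 0.0])
print(summed, mixed, weights)
```

Because the softmax weights are non-negative and sum to one, the aggregated vector is a convex combination of the earlier outputs and stays within their range, whereas the plain sum grows as more layers are added.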

134. ❌ MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal

Authors: Yiqi Nie, Fei Wang, Junjie Chen, Kun Li, Yudi Cai, Dan Guo, Chenglong Li, Meng Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

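Rationale: The paper studies multimodal meme reappraisal, proposing the MER-Bench benchmark and an evaluation framework based on MLLM-as-a-Judge. It is unrelated to most keywords and only partially related to 'Large Language Models OR LLMs OR Foundation Models', because it uses multimodal large language models (MLLMs) for evaluation, though this is not the core contribution. Keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, agent systems, and model compression are not covered. Application-area keywords such as AI for Science are also unrelated.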

!!! tip deepseek-chat TL;DR

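The paper introduces meme reappraisal, a new multimodal generation task, and builds the MER-Bench benchmark together with an evaluation framework based on multimodal large language models. Experiments show that existing systems fall substantially short on structure preservation and affective transformation.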

Abstract (Chinese translation)

模因代表一种紧密耦合的多模态社会表达形式，其视觉语境与叠加文本共同传递微妙的情感和评论。受心理学中认知重评的启发，我们提出“模因重评”这一新颖的多模态生成任务，旨在将负面框架的模因转化为建设性表达，同时保持其内在场景、实体和结构布局。与先前关于模因理解或生成的研究不同，模因重评需要在多重语义和风格约束下实现情感可控、结构保留的多模态转换。为支持此任务，我们构建了MER-Bench基准数据集，包含具有细粒度多模态标注的真实世界模因，涵盖源情感与目标情感、正向重写的模因文本、视觉编辑规范，以及覆盖视觉类型、情感极性和布局结构的分类标签。我们进一步提出基于多模态大语言模型即评判员范式的结构化评估框架，将性能分解为模态级生成质量、情感可控性、结构保真度和全局情感对齐度。通过对代表性图像编辑和多模态生成系统的大量实验，我们发现现有方法在满足结构保留、语义一致性和情感转换约束方面存在显著差距。我们相信MER-Bench为可控模因编辑和情感感知多模态生成研究奠定了基础。代码已开源：https://github.com/one-seven17/MER-Bench。

Abstract

Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: https://github.com/one-seven17/MER-Bench.

Keywords: Meme Reappraisal, multimodal generation, MLLM-as-a-Judge, affect controllability, structural fidelity, emotion-aware generation, benchmark evaluation, visual-text transformation

135. ❌ Pretraining and Benchmarking Modern Encoders for Latvian

Authors: Arturs Znotins Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15005v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

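Rationale: The paper pretrains encoder models (RoBERTa, DeBERTaV3, ModernBERT), including long-context variants, for a low-resource language (Latvian) and benchmarks them. The most relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10), since pretraining is the paper's core. 'Context Window Extension OR Long Context LLMs' (5) is partially related because the paper mentions long-context variants. The other keywords mainly concern LLM-specific techniques (instruction tuning, RLHF, RAG, inference acceleration, and so on), agent systems, or AI-for-science applications, whereas this paper covers pretraining and evaluating encoder models (not generative LLMs) for a specific language, so they are unrelated (0).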

!!! tip deepseek-chat TL;DR

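The paper pretrains a suite of encoder models for the low-resource language Latvian, based on the RoBERTa, DeBERTaV3, and ModernBERT architectures (including long-context variants), and benchmarks them extensively. The best model, lv-deberta-base, outperforms larger multilingual baselines and prior Latvian-specific encoders.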

Abstract (Chinese translation)

仅编码器Transformer模型在实际自然语言处理任务中仍不可或缺。尽管多语言模型的最新进展提升了跨语言能力，但如拉脱维亚语等低资源语言在预训练语料库中仍代表性不足，且目前极少存在单语种拉脱维亚编码器。为填补这一空白，我们基于RoBERTa、DeBERTaV3和ModernBERT架构（包括长上下文变体）预训练了一系列拉脱维亚语专用编码器，并在多样化的拉脱维亚语诊断与语言学基准测试中对其进行了评估。我们的模型在与现有单语及多语言编码器的竞争中表现相当，同时受益于最新的架构与效率改进。其中最佳模型lv-deberta-base（1.11亿参数）实现了最强的综合性能，超越了规模更大的多语言基线模型及先前的拉脱维亚语专用编码器。我们公开所有预训练模型与评估资源，以支持拉脱维亚语自然语言处理领域的进一步研究与实践应用。

Abstract

Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained models and evaluation resources to support further research and practical applications in Latvian NLP.

Keywords: encoder-only transformers, Latvian NLP, pretraining, low-resource languages, RoBERTa, DeBERTaV3, ModernBERT, benchmarking

136. ❌ Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Authors: Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

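Rationale: The paper proposes a trustworthiness evaluation framework for agentic AI systems. It is highly relevant to 'LLM Agents' and 'Tool Use' (10) because it explicitly studies tool-augmented agentic workflows. It is partially related to 'Alignment' and 'Hallucination Mitigation' (5) because the framework includes social-ethical alignment and considers hallucination risk, and to 'Multi-agent Systems' (5) because it covers interaction dynamics and social contexts. It is related to 'Large Language Models' (8) because agentic AI is usually built on LLMs. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper (0).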

!!! tip deepseek-chat TL;DR

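To address the lack of representative trustworthiness evaluation for agentic AI systems in open-world workflows, the paper proposes the Holographic Agent Assessment Framework (HAAF), which combines scenario-manifold characterization, interactive simulation, and social-ethical alignment assessment to systematically evaluate and improve the trustworthiness of agentic AI.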

Abstract (Chinese translation)

随着智能体AI系统从静态问答迈向开放式、工具增强的多步骤现实工作流，其权限的扩大带来了更高的系统滥用和操作失败风险。然而，当前的评估实践仍处于碎片化状态，仅在狭隘定义的环境中测量编码能力、幻觉现象、越狱抵抗或工具使用等孤立性能。我们认为核心局限不仅在于评估维度覆盖不足，更在于缺乏原则性的代表性概念：智能体的可信度应基于具有代表性的社会技术场景分布进行评估，而非通过一系列互不关联的基准测试实例来衡量。为此，我们提出全息智能体评估框架（Holographic Agent Assessment Framework, HAAF），这是一种系统化评估范式，通过跨越任务类型、工具接口、交互动态、社会情境和风险等级的场景流形来刻画智能体可信度。该框架整合了四个互补组件：（i）静态认知与策略分析，（ii）交互式沙盒模拟，（iii）社会伦理对齐评估，以及（iv）具备分布感知能力的代表性采样引擎——该引擎联合优化覆盖度与风险敏感性，尤其关注传统基准测试系统性忽略的罕见但高影响的尾部风险。这些组件通过迭代式可信度优化工厂相互连接。通过红队探测与蓝队加固的循环机制，该范式能逐步缩小系统漏洞以满足部署标准，推动智能体评估从孤立的基准测试转向具有现实代表性的可信度验证。示例实现的代码与数据已发布于 https://github.com/TonyQJH/haaf-pilot。

Abstract

As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent’s trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity – particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at https://github.com/TonyQJH/haaf-pilot.

Keywords: Agentic AI, Trustworthiness Evaluation, Tool-augmented Workflows, Representative Assessment, Holographic Agent Assessment Framework, Social-technical Scenarios, Risk Sensitivity, Benchmark Islands

137. ❌ LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs

Authors: Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

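Rationale: The paper's core innovation is recasting LLMs as a graph-native aggregation operator for message passing on text-rich graphs, which is directly and highly relevant to 'Large Language Models OR LLMs OR Foundation Models'. It does not cover the specific techniques behind other keywords (MoE, SFT, RAG, quantization, and so on), particular application areas (such as bioinformatics), or general AI concepts (such as world models or explainable AI).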

!!! tip deepseek-chat TL;DR

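Conventional methods for text-rich graphs compress text into static embeddings, which creates an information bottleneck. The paper proposes RAMP, which recasts the LLM as a graph kernel for raw-text anchored message passing, bridging graph propagation and deep text reasoning and achieving competitive performance across multiple tasks.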

Abstract (Chinese translation)

文本富集图（Text-rich graphs）融合了复杂的结构依赖性与丰富的文本信息，其广泛存在却对现有学习范式构成持续挑战。传统方法乃至结合大语言模型（LLM）的混合方法，通常在结构推理之前将丰富文本压缩为静态嵌入或摘要，这造成了信息瓶颈，并使更新过程与原始内容脱节。我们认为，在文本富集图中，文本不仅是节点属性，更是结构关系得以呈现的主要媒介。本文提出RAMP（Raw-text Anchored Message Passing，原始文本锚定消息传递方法），该方法超越将大语言模型仅用作特征提取器的传统思路，转而将大语言模型本身重塑为一种图原生（graph-native）的聚合算子。RAMP通过一种新颖的双重表示机制充分利用图的文本富集特性：它在每次迭代中基于每个节点的原始文本进行推理锚定，同时传播来自邻居的动态优化消息。此外，该方法在统一的生成式框架下，同时处理判别式与生成式任务。大量实验表明，RAMP有效弥合了图传播与深度文本推理之间的鸿沟，取得了具有竞争力的性能，并为理解大语言模型作为通用图学习“图核”（graph kernels）的作用提供了新视角。

Abstract

Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node’s raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.

Keywords: Text-rich graphs, Message passing, Large Language Models (LLMs), Graph kernels, Raw-text anchored, Dual-representation scheme, Generative formulation, Graph learning

138. ❌ Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs

Authors: Nikita Mosievskiy Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14911v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

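Rationale: The paper applies supervised fine-tuning (SFT) to the 125M-parameter RoBERTa-base model for CVE-to-CWE classification, an application of AI to cybersecurity (which can be regarded as a scientific application area). It is highly relevant to SFT (10) and to AI for Science (10). It is related to LLMs/SLMs (5) because it compares against an 8B-parameter LLM (the Cisco model) and shows that a small model is competitive. It is related to data quality (5) because it uses AI-refined and agreement-filtered data, and to pre-training/domain adaptation (5) because it fine-tunes from RoBERTa. Other keywords such as MoE, RLHF, and RAG are not covered and score 0.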

!!! tip deepseek-chat TL;DR

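The study fine-tunes RoBERTa-base (125M parameters) with supervision for CVE-to-CWE classification, reaching accuracy on the external CTI-Bench benchmark comparable to an 8B-parameter model, and releases the dataset, model, and training code.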

Abstract (Chinese translation)

我们提出了一种基于RoBERTa-base架构的微调分类器（1.25亿参数），用于将通用漏洞披露（CVE）描述映射到通用缺陷枚举（CWE）类别。我们利用Claude Sonnet 4.6构建了一个包含234,770条CVE描述的大规模训练数据集，其中CWE标签经过人工智能优化，并创建了仅保留美国国家漏洞数据库（NVD）标签与人工智能标签一致样本的一致性过滤评估集。在预留测试集（27,780个样本，涵盖205个CWE类别）上，该模型实现了87.4%的Top-1准确率和60.7%的宏平均F1分数，相较于已达到84.9% Top-1准确率的TF-IDF基线模型，宏平均F1分数提升了15.5个百分点，证明了该模型在罕见缺陷类别上的优势。在外部基准测试CTI-Bench（NeurIPS 2024）上，该模型达到75.6%的严格准确率（95%置信区间：72.8-78.2%），与思科Foundation-Sec-8B-Reasoning模型（75.3%，80亿参数）在统计上无显著差异，而参数量仅为后者的1/64。我们公开了数据集、模型及训练代码。

Abstract

We present a fine-tuned RoBERTa-base classifier (125M parameters) for mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories. We construct a large-scale training dataset of 234,770 CVE descriptions with AI-refined CWE labels using Claude Sonnet 4.6, and agreement-filtered evaluation sets where NVD and AI labels agree. On our held-out test set (27,780 samples, 205 CWE classes), the model achieves 87.4% top-1 accuracy and 60.7% Macro F1 – a +15.5 percentage-point Macro F1 gain over a TF-IDF baseline that already reaches 84.9% top-1, demonstrating the model’s advantage on rare weakness categories. On the external CTI-Bench benchmark (NeurIPS 2024), the model achieves 75.6% strict accuracy (95% CI: 72.8-78.2%) – statistically indistinguishable from Cisco Foundation-Sec-8B-Reasoning (75.3%, 8B parameters) at 64x fewer parameters. We release the dataset, model, and training code.

Keywords: fine-tuning, RoBERTa, CVE classification, CWE mapping, cybersecurity AI, model efficiency, dataset construction, supervised learning
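
The abstract above reports both top-1 accuracy and Macro F1, and the gap between them is what reveals behavior on rare weakness categories. The following sketch uses toy data and hypothetical helper names (not the paper's evaluation code) to show how a classifier that ignores a rare class can keep high accuracy while its Macro F1 drops sharply.

```python
from collections import Counter


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: one frequent class and one rare class that is never predicted.
y_true = ["CWE-79"] * 8 + ["CWE-352"] * 2
y_pred = ["CWE-79"] * 10

accuracy = top1_accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(accuracy, macro)
```

In this toy case accuracy is 0.8 while Macro F1 is about 0.44, because the rare class contributes an F1 of zero to the unweighted mean.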

139. ❌ LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Authors: Jon-Paul Cacioli Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

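Rationale: The paper studies LLM calibration by treating LLMs as signal detectors and applying Signal Detection Theory to analyze sensitivity and bias, so it is highly relevant to 'Large Language Models' (10). It touches on model calibration and factuality evaluation, which gives it some relation to 'Hallucination Mitigation' and 'Mechanistic Interpretability' (5 each), though neither is central. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated (0).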

!!! tip deepseek-chat TL;DR

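The study treats LLMs as signal detectors and uses Signal Detection Theory to decompose calibration error into sensitivity and bias components. It finds that temperature affects both sensitivity and the decision criterion, and that conventional calibration metrics alone cannot distinguish models that occupy different positions in sensitivity-bias space.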

Abstract (Chinese translation)

在评估大语言模型（LLM）的校准时，通常使用预期校准误差等指标，这些指标混淆了两个不同的组成部分：模型区分正确答案与错误答案的能力（敏感性）以及其倾向于自信或谨慎回答的倾向（偏差）。信号检测理论（Signal Detection Theory, SDT）能够分解这些成分。尽管SDT衍生的指标（如AUROC）正被越来越多地使用，但完整的参数框架——包括不等方差模型拟合、判断标准估计和z-ROC分析——尚未被应用于将LLM作为信号检测器进行分析。在这项预先注册的研究中，我们将三个LLM视为执行事实判别任务的观察者，在168,000次试验中测试温度参数是否起到类似人类心理物理学中奖惩操纵所引发的判断标准移动的作用。关键在于，这种类比可能不成立，因为温度不仅影响模型分配给答案的置信度，还会改变生成的答案本身。我们的结果证实了这种类比失效：温度同时提高了敏感性（AUC）并移动了判断标准。所有模型均表现出不等方差的证据分布（z-ROC斜率在0.52-0.84之间），其中指令微调模型显示出比基础模型（0.77-0.87）或人类再认记忆（约0.80）更极端的不对称性（0.52-0.63）。SDT分解表明，仅凭校准指标无法区分那些在敏感性-偏差空间中占据不同位置的模型，这证明完整的参数框架能够提供现有指标所不具备的诊断信息。

Abstract

Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model’s ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.

Keywords: Large Language Models, Signal Detection Theory, Calibration, Sensitivity, Bias, Temperature, AUROC, Factual Discrimination
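
As a reading aid for the abstract above, the following minimal Python sketch shows how the SDT quantities it mentions are conventionally computed: sensitivity d' and criterion c from a hit rate and a false-alarm rate under the textbook equal-variance Gaussian model, and a z-ROC slope as the least-squares slope of z(hit rate) against z(false-alarm rate). The function names are hypothetical, and this is not the paper's fitting procedure, which the abstract does not describe.

```python
from statistics import NormalDist

# Probit (inverse standard-normal CDF) transform used throughout SDT.
_z = NormalDist().inv_cdf


def dprime_and_criterion(hit_rate, fa_rate):
    """Equal-variance SDT: return (d', c) for rates strictly between 0 and 1."""
    zh, zf = _z(hit_rate), _z(fa_rate)
    d_prime = zh - zf            # sensitivity: separation of the two distributions
    criterion = -0.5 * (zh + zf)  # bias: 0 is neutral, > 0 is conservative
    return d_prime, criterion


def zroc_slope(hit_rates, fa_rates):
    """Least-squares slope of z(hit) on z(false alarm) across several criteria."""
    xs = [_z(f) for f in fa_rates]
    ys = [_z(h) for h in hit_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Under an unequal-variance Gaussian model, the z-ROC slope estimates the ratio of the noise standard deviation to the signal standard deviation, so a slope below 1 (as reported for all models in the abstract) corresponds to a wider signal distribution.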

140. ❌ Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Authors: Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14889v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

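Scoring rationale: The paper focuses on reward modeling for end-to-end spoken dialogue systems, studying the modality gap (prosody, emotion) and the colloquialness gap (written scripts versus natural speech), and proposes the SDiaReward model and the ESDR-Bench benchmark. Although it involves audio processing and dialogue systems, it does not center on large language models (LLMs), on new deep learning principles, or on any of the specific techniques among the scoring keywords (such as MoE, Scaling Laws, or RLHF). The paper compares against general-purpose audio LLMs, but its own work is not built on LLM techniques. All keywords were judged unrelated to the paper's core content, so every score is 0.

!!! tip deepseek-chat TL;DR

To address the modality gap and the colloquialness gap in spoken dialogue systems, the paper proposes the SDiaReward reward model and the ESDR-Bench benchmark. Experiments show that the model reaches state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs.

Abstract (Chinese translation)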

端到端口语对话系统的快速发展要求超越单纯的文本语义，纳入副语言细微特征和人类对话的自发性。然而，现有方法面临两个关键差距：涉及韵律和情感的模态差距，以及区分书面脚本与自然口语的通俗性差距。为解决这些挑战，我们提出了SDiaReward——一个基于SDiaReward-Dataset训练的端到端多轮次奖励模型。该数据集是专门针对上述差距构建的新型篇章级偏好对集合。该模型直接处理完整的多轮次语音篇章，并通过成对偏好监督进行优化，从而能在单一评估器中联合评判模态与通俗性。我们进一步建立了ESDR-Bench分层基准，用于鲁棒的篇章级评估。实验表明，SDiaReward在成对偏好准确率上达到最先进水平，显著优于通用音频大语言模型（audio LLMs）。深入分析表明，SDiaReward能捕捉超越表层合成线索的相对对话表现力，提升跨领域和录音条件的泛化能力。代码、数据及演示详见https://sdiareward.github.io/。

Abstract

The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.

Keywords: spoken dialogue systems, reward modeling, modality gap, colloquialness gap, SDiaReward, episode-level evaluation, audio processing, preference accuracy

141. ❌ Customizing ChatGPT for Second Language Speaking Practice: Genuine Support or Just a Marketing Gimmick?

Author: Fanfei Meng Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14884v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

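Scoring rationale: The paper studies the use of ChatGPT for ESL speaking practice, an application of large models in education. It is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because ChatGPT is a typical large language model and the paper's core is an evaluation of how its customization features perform in language teaching. The other keywords concern model architecture, training methods, inference optimization, and specific application domains, none of which the paper covers, so they score 0.

!!! tip deepseek-chat TL;DR

The study evaluates customized ChatGPT for ESL speaking practice and finds that customized versions provide more balanced feedback and emotional support, while cultural responsiveness shows no significant improvement and the standard model is already capable of meeting learning needs.

Abstract (Chinese translation)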

ChatGPT凭借其定制化功能与语音模式，为英语作为第二语言（ESL）教育提供了更具吸引力与个性化的可能性。本研究探讨了定制化ChatGPT对话功能在促进ESL口语练习中的效能，比较了四种ChatGPT语音模式的表现：未定制标准模式、未定制高级模式、定制标准模式以及定制高级模式。定制过程遵循提示工程原则，并以相关理论为基础，包括动机理论、文化回应性教学（CRT）、交际语言教学法（CLT）以及情感过滤假说。内容分析表明，定制版本通常能提供更均衡的反馈与情感支持，有助于营造积极且激励性的学习环境。然而，尽管进行了针对性定制，文化回应性并未显示出显著提升。这些初步发现表明，定制化有望增强ChatGPT作为语言辅导工具的有效性，而标准模型已具备满足学习需求的能力。本研究强调了提示工程与人工智能素养在最大化人工智能于语言学习领域潜力方面的重要性。

Abstract

ChatGPT, with its customization features and Voice Mode, has the potential for more engaging and personalized ESL (English as a Second Language) education. This study examines the efficacy of customized ChatGPT conversational features in facilitating ESL speaking practices, comparing the performance of four versions of ChatGPT Voice Mode: uncustomized Standard mode, uncustomized Advanced mode, customized Standard mode, and customized Advanced mode. Customization was guided by prompt engineering principles and grounded in relevant theories, including Motivation Theory, Culturally Responsive Teaching (CRT), Communicative Language Teaching (CLT), and the Affective Filter Hypothesis. Content analysis found that customized versions generally provided more balanced feedback and emotional support, contributing to a positive and motivating learning environment. However, cultural responsiveness did not show significant improvement despite targeted customization efforts. These initial findings suggest that customization could enhance ChatGPT’s capacity as a more effective language tutor, with the standard model already capable of meeting the learning needs. The study underscores the importance of prompt engineering and AI literacy in maximizing AI’s potential in language learning.

Keywords: ChatGPT, ESL education, speaking practice, customization, prompt engineering, Voice Mode, language learning, AI tutor

142. ❌ Developing an English-Efik Corpus and Machine Translation System for Digitization Inclusion

Authors: Offiong Bassey Edet, Mbuotidem Sunday Awak, Emmanuel Oyo-Ita, Benjamin Okon Nyong, Ita Etim Bassey Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

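Scoring rationale: The paper studies machine translation for the low-resource language Efik by fine-tuning the mT5 and NLLB-200 models, which is an application of supervised fine-tuning (SFT); it is therefore highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points). The other keywords concern the technical principles of large models, reasoning, alignment, compression, scientific applications, and so on, which the paper does not cover, so they all score 0.

!!! tip deepseek-chat TL;DR

By building an English-Efik parallel corpus and fine-tuning the mT5 and NLLB-200 models, the study develops a machine translation system for the low-resource language Efik. NLLB-200 performs better on both BLEU and chrF, demonstrating that practical machine translation tools for low-resource languages are feasible.

Abstract (Chinese translation)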

低资源语言作为人类历史的宝贵储存库，维系着文化与智识的多样性。尽管其意义重大，这些语言在现代自然语言处理系统中仍普遍缺席。虽然针对斯瓦希里语、约鲁巴语和阿姆哈拉语等使用较广的非洲语言已取得进展，但埃菲克语等规模较小的原住民语言在机器翻译研究中依然代表性不足。本研究基于一个由社区构建的小规模平行语料库（包含13,865个句对），评估了最先进的多语言神经机器翻译模型在英语-埃菲克语翻译任务上的效能。我们使用该数据集对mT5多语言模型和NLLB200模型进行了微调。实验结果表明，NLLB-200模型表现优于mT5，在英译埃菲克任务中BLEU得分达到26.64，埃菲克译英任务中达到31.21，对应的chrF分数分别为51.04和47.92，显示出其在流畅度与语义保真度方面的提升。本研究证实了为低资源语言开发实用机器翻译工具的可行性，并强调了包容性数据实践及基于文化背景的评估对推进公平自然语言处理研究的重要性。

Abstract

Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.

Keywords: low-resource languages, machine translation, Efik, multilingual neural machine translation, fine-tuning, mT5, NLLB-200, BLEU scores

143. ❌ ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Authors: Hankun Kang, Xin Miao, Jianhao Chen, Jintao Wen, Mayi Xu, Weiyu Zhang, Wenpeng Lu, Tieyun Qian Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14843v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

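Scoring rationale: The paper mainly studies a continual learning framework for toxicity detection that uses an LLM for semantic enrichment (the abstract mentions an 'LLM-powered semantic enriching strategy'), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It does not involve the specific techniques or applications behind the other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, agent systems, or model compression, nor scientific AI applications such as bioinformatics, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ContiGuard framework, which combines LLM-powered semantic enrichment with a discriminability-driven feature learning strategy to address the challenges toxicity detectors face when learning continually against evolving evasive perturbations, allowing the detector to keep updating its capability and remain resilient to those perturbations.

Abstract (Chinese translation)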

毒性检测旨在遏制有害内容（如在线社交行为中的仇恨性评论、帖子和消息）的传播，以维护健康的网络社交环境。然而，恶意用户持续设计规避性扰动来伪装有害内容，以逃避检测器的识别。传统检测器或方法长期处于静态，难以应对这些不断演变的规避策略。因此，持续学习成为一种合理的动态更新检测能力以应对演化扰动的途径。然而，不同扰动之间的差异阻碍了检测器对扰动文本的持续学习。更重要的是，扰动引入的噪声会扭曲语义，降低文本理解质量，同时损害关键特征学习，导致检测对扰动过于敏感。这些因素加剧了针对演化扰动进行持续学习的挑战。本研究提出ContiGuard，这是首个专为检测器在时序演化的扰动文本上进行持续学习（称为持续毒性检测）而设计的框架，使检测器能够持续更新能力，并保持对演化扰动的持久韧性。具体而言，为增强文本理解，我们提出一种基于大语言模型（LLM）的语义增强策略，动态地将LLM挖掘出的可能含义及与毒性相关的线索融入扰动文本，以提升理解效果。为抑制非关键特征并强化关键特征，我们提出一种可区分性驱动的特征学习策略，通过增强判别性特征同时抑制弱判别性特征，以构建鲁棒的检测分类边界……

Abstract

Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector’s continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection…

Keywords: toxicity detection, continual learning, evasive perturbations, LLM-powered semantic enriching, discriminative feature learning, perturbed text, online social environment, ContiGuard framework

144. ❌ VorTEX: Various overlap ratio for Target speech EXtraction

Authors: Ro-hoon Oh, Jihwan Seol, Bugeun Kim Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

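Scoring rationale: The paper focuses on target speech extraction in speech signal processing, proposing a text-prompted architecture named VorTEX and a new diagnostic metric, SuRE. It covers audio processing, dataset construction, and model architecture design, but does not involve large language models, new deep learning principles, or any technique listed among the scoring keywords (such as MoE, RLHF, RAG, or quantization). It is a conventional audio AI application rather than an application of large models to a particular domain or a new large-model technique.

!!! tip deepseek-chat TL;DR

The paper examines how target speech extraction performs across different overlap ratios and proposes the VorTEX architecture and the SuRE diagnostic metric. Experiments show that VorTEX achieves the highest separation fidelity across 20-100% overlap while maintaining zero SuRE, that is, without suppression-driven artifacts.

Abstract (Chinese translation)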

目标语音提取（Target speech extraction, TSE）旨在从混合语音中恢复目标说话人的声音。尽管近期基于文本提示的方法展现出潜力，但大多数方法假设语音完全重叠，这限制了对实际重叠比例下模型行为的深入理解。本文提出VorTEX（面向目标语音提取的多重叠比例模型），这是一种文本提示驱动的TSE架构，其核心是解耦自适应多分支融合模块，该模块将主提取路径与辅助正则化路径分离。为进行可控分析，我们构建了PORTE数据集，该数据集包含重叠比例从0%到100%的双人对话语音。我们进一步提出基于能量的抑制比（Suppression Ratio on Energy, SuRE），这是一种诊断性指标，用于检测传统度量未能捕捉的抑制行为。实验表明，现有模型在不同重叠条件下会出现抑制或残留干扰问题，而VorTEX在20%-100%的重叠范围内均实现了最高的分离保真度（例如，20%重叠时达到5.50 dB，100%重叠时达到2.04 dB），同时保持SuRE为零，这表明其提取过程稳健，且未产生由抑制行为导致的伪影。

Abstract

Target speech extraction (TSE) aims to recover a target speaker’s voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.

Keywords: Target speech extraction, Text-prompted TSE, Overlap ratio, Decoupled Adaptive Multi-branch Fusion, PORTE dataset, Suppression Ratio on Energy, Speech separation, Audio processing

145. ❌ Universe Routing: Why Self-Evolving Agents Need Epistemic Control

Author: Zhaohui Geoffrey Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

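Scoring rationale: The paper's core topic is how agents select a reasoning framework. It proposes 'universe routing': classifying questions into mutually exclusive belief spaces before invoking specialized solvers. It is highly relevant to 'Mixture of Experts (MoE)' (10 points), because it explicitly compares hard routing with soft MoE and discusses the efficiency advantage. It is highly relevant to 'LLM Agents/Autonomous Agents' (10 points), because it focuses on the reliability of self-evolving agents. It has some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points each), since it concerns the choice of multi-step and in-depth reasoning frameworks. It has some connection to 'Large Language Models' (5 points): the abstract does not mention LLMs explicitly, but agent research is usually tied to large models. Other keywords, covering training methods, optimization techniques, and specific application domains, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the 'universe routing' problem that self-evolving agents face when choosing between reasoning frameworks (such as frequentist hypothesis testing and Bayesian inference). It proposes hard routing of questions into mutually exclusive belief spaces; experiments show that this matches soft MoE accuracy while being 7x faster, with a smaller generalization gap than keyword-matching baselines and zero forgetting under rehearsal-based continual learning.

Abstract (Chinese translation)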

当前终身智能体的一个关键失效模式并非知识匮乏，而是无法决定如何进行推理。当智能体遇到“这枚硬币是否均匀？”这类问题时，它必须识别是应调用频率学派的假设检验还是贝叶斯后验推断——这两种认识论上互不兼容的框架。混合使用它们不仅会产生微小误差，更会导致结构性失效，并在决策链中传播。我们将此形式化为宇宙路由问题：在调用专用求解器之前，先将问题分类至互斥的信念空间。我们的核心发现挑战了传统假设：（1）硬路由至异构求解器的准确度与软混合专家模型相当，但速度快7倍，因为认识论上不兼容的框架无法进行有意义的平均；（2）一个4.65亿参数的路由器相比关键词匹配基线，其泛化差距缩小了2.3倍，表明其进行的是语义层面而非表面层次的推理；（3）当扩展至新的信念空间时，基于复演的持续学习实现了零遗忘，性能超越弹性权重巩固75个百分点，这表明模块化的认识论架构本质上比基于正则化的方法更适合终身学习。这些结果指向一个更广泛的架构原则：可靠的自进化智能体可能需要一个显式的认识论控制层，以管理推理框架的选择。

Abstract

A critical failure mode of current lifelong agents is not lack of knowledge, but the inability to decide how to reason. When an agent encounters “Is this coin fair?” it must recognize whether to invoke frequentist hypothesis testing or Bayesian posterior inference - frameworks that are epistemologically incompatible. Mixing them produces not minor errors, but structural failures that propagate across decision chains. We formalize this as the universe routing problem: classifying questions into mutually exclusive belief spaces before invoking specialized solvers. Our key findings challenge conventional assumptions: (1) hard routing to heterogeneous solvers matches soft MoE accuracy while being 7x faster because epistemically incompatible frameworks cannot be meaningfully averaged; (2) a 465M-parameter router achieves a 2.3x smaller generalization gap than keyword-matching baselines, indicating semantic rather than surface-level reasoning; (3) when expanding to new belief spaces, rehearsal-based continual learning achieves zero forgetting, outperforming EWC by 75 percentage points, suggesting that modular epistemic architectures are fundamentally more amenable to lifelong learning than regularization-based approaches. These results point toward a broader architectural principle: reliable self-evolving agents may require an explicit epistemic control layer that governs reasoning framework selection.

Keywords: universe routing, self-evolving agents, epistemic control, belief spaces, Mixture of Experts, hard routing, lifelong learning, reasoning frameworks
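
To make the contrast in finding (1) of the abstract above concrete, here is a toy Python sketch of hard routing versus a soft mixture over solvers. The solvers, their placeholder outputs, and the router logits are all hypothetical; the abstract does not specify the paper's router or solvers, so this only illustrates the structural difference between the two schemes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_route(logits, solvers, question):
    """Hard routing: pick the single highest-scoring solver and run only it."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return solvers[best](question)


def soft_mixture(logits, solvers, question):
    """Soft mixture: run every solver and average outputs by router weight."""
    weights = softmax(logits)
    return sum(w * solver(question) for w, solver in zip(weights, solvers))


# Hypothetical solvers returning placeholder numbers on different scales.
def frequentist_solver(question):
    return 0.03  # placeholder p-value


def bayesian_solver(question):
    return 0.71  # placeholder posterior probability


solvers = [frequentist_solver, bayesian_solver]
router_logits = [2.0, 0.0]  # hypothetical router scores for one question
routed = hard_route(router_logits, solvers, "Is this coin fair?")
blended = soft_mixture(router_logits, solvers, "Is this coin fair?")
```

In this toy setup the hard router calls a single solver, while the soft mixture calls every solver and averages a p-value with a posterior probability, which is the kind of blending the abstract argues cannot be done meaningfully.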

146. ❌ Vietnamese Automatic Speech Recognition: A Revisit

Authors: Thi Vu, Linh The Nguyen, Dat Quoc Nguyen Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14779v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

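Scoring rationale: The paper focuses on dataset construction for Vietnamese automatic speech recognition (ASR), covering data aggregation, preprocessing, and quality improvement, but does not involve large models, deep learning principles, or AI applications in science. All of the keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, which are unrelated to the paper's core ASR data engineering content, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

For Vietnamese, a low-resource language, the paper proposes a generalizable data aggregation and preprocessing pipeline and uses it to build a unified, high-quality 500-hour Vietnamese ASR dataset to support training and evaluating state-of-the-art ASR systems.

Abstract (Chinese translation)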

自动语音识别（Automatic Speech Recognition，ASR）的性能在很大程度上依赖于大规模、高质量数据集的可用性。对于资源稀缺的语言，现有的开源ASR数据集往往存在质量不足和标注不一致的问题，这阻碍了鲁棒模型的开发。为解决这些挑战，我们提出了一种新颖且可推广的数据聚合与预处理流程，旨在从多样化、可能含有噪声的开源资源中构建高质量的ASR数据集。我们的流程包含严格的处理步骤，以确保数据的多样性、平衡性，并纳入词级时间戳等关键特征。我们通过将该方法应用于越南语，验证了其有效性，最终构建了一个统一、高质量的500小时数据集，为训练和评估最先进的越南语ASR系统提供了基础。我们的项目页面位于 https://github.com/qualcomm-ai-research/PhoASR。

Abstract

Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm-ai-research/PhoASR.

Keywords: Automatic Speech Recognition, ASR, Vietnamese, low-resource languages, data aggregation, preprocessing pipeline, high-quality dataset, word-level timestamps

147. ❌ Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

Authors: Renhao Pei, Siyao Peng, Verena Blaschke, Robert Litschko, Barbara Plank Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14782v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

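Scoring rationale: The paper studies information asymmetry in LLMs on QA tasks across language varieties (Cantonese-Mandarin, Bavarian-German), which makes it an application and evaluation study of LLMs. It is highly relevant to "Large Language Models" (10 points) because the entire paper centers on evaluating LLM performance. It is somewhat relevant to "Retrieval-Augmented Generation" (5 points) because the study improves LLM answers by supplying context (such as Wikipedia lead sections). It is somewhat relevant to "Hallucination Mitigation" (5 points) because it concerns LLM reliability when information is missing. It is somewhat relevant to "In-context Learning" (5 points) because the experiments improve performance by providing context. The other keywords (MoE, Scaling Laws, RLHF, and so on) are unrelated to the paper's technical content and score 0.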
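The arithmetic behind the score table can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) and the total weight of 27.0 come from this report's configuration section. The per-keyword rule used here (weight × relevance, summed over rows) is an assumption that is merely consistent with the rows shown, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

BASE_PASS_SCORE = 5.0        # constant term of the report's passing-score formula
PASS_SCORE_PER_WEIGHT = 0.8  # multiplier on the total keyword weight


def passing_score(total_weight: float) -> float:
    """Passing score as stated in the report: 5.0 + 0.8 * total weight."""
    return BASE_PASS_SCORE + PASS_SCORE_PER_WEIGHT * total_weight


def paper_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum of per-keyword scores.

    ASSUMPTION: each row contributes weight * relevance. The report does not
    state this rule; it only matches the tabulated rows.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Rows as (weight, relevance); with every weight at 0.0 the total is 0.0
# regardless of relevance, matching the "0.0 / 26.6" shown for this paper.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]
total = paper_score(rows)
threshold = passing_score(27.0)
passed = total >= threshold
```

Under this reading, a paper with high relevance values still fails whenever the applied weights are zero, which is what the table for this entry shows.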

!!! tip deepseek-chat TL;DR

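The paper studies information asymmetry in large language models across language varieties such as Cantonese-Mandarin and Bavarian-German. It finds that LLMs cannot answer questions whose answers exist only in local Wikipedia editions, but that providing context and translation substantially improves performance.

Abstract translation

Large language models (LLMs) are increasingly a common way for people to obtain knowledge, yet their knowledge coverage and reliability vary considerably. For local language varieties in particular, information is highly asymmetric: for example, information on local Wikipedia pages is often missing from the standard-language edition. However, how LLMs perform under such information asymmetry, especially between closely related languages, remains understudied. This study manually constructs a novel challenge question-answering (QA) dataset that captures knowledge found only on local Wikipedia pages and absent from the higher-resource editions, covering two pairs: Mandarin and Cantonese, and German and Bavarian. Experiments show that LLMs cannot answer questions about information that exists only in local Wikipedia editions. Providing lead sections as context substantially improves performance, and translation yields further gains. Our topical classification, geographic annotation, and stratified evaluation reveal the value of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about the inclusivity and cultural coverage of LLMs.

Abstract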

Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page, which is absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical, geographic annotations, and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.

Keywords: Large Language Models, Information Asymmetry, Language Varieties, Question Answering, Cantonese-Mandarin, Bavarian-German, Wikipedia Coverage, Cultural Inclusivity

148. ❌ Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark

Authors: Wei Shao, Lemao Liu, Yinqiao Li, Guoping Huang, Shuming Shi, Linqi Song Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14756v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

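Scoring rationale: The paper focuses on defining a privacy-protection task for machine translation, building benchmarks, and exploring methods, which places it in a specific security/privacy subfield of NLP applications. All scoring keywords concern large-model and deep-learning principles, training methods, inference optimization, application paradigms (such as agents), or specific scientific domains (AI for Science). This paper involves no large-model techniques, training methods, inference acceleration, agent systems, or scientific AI applications; its core is the problem definition and benchmark construction for privacy protection in conventional machine translation models, so it has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

To address the privacy-leakage risk of online translation services at the inference stage, the paper proposes a new “Privacy-Preserving Machine Translation” task, builds three benchmark test datasets, designs evaluation metrics, and proposes a series of baseline methods as a starting point for the task.

Abstract translation

Current online translation services must send user text to cloud servers, which creates a privacy-leakage risk when the text contains sensitive information and hinders the use of online translation in privacy-sensitive scenarios. One way to mitigate this risk is to introduce privacy-protection mechanisms at the inference stage of the translation model. However, compared with NLP subfields such as text classification and summarization, the machine translation community has explored inference-stage privacy protection only to a limited extent: the task has not been clearly defined, and dedicated evaluation datasets, evaluation metrics, and reference baseline methods are lacking. The absence of these elements has seriously constrained in-depth research in this direction. To fill this gap, the paper proposes a novel “Privacy-Preserving Machine Translation” (PPMT) task, which aims to protect private information in text during model inference. For this task, we construct three benchmark test datasets, design corresponding evaluation metrics, and propose a series of baseline methods as a starting point. The definition of privacy is complex and varied; because named entities often contain a large amount of personal private information and commercial secrets, we focus on protecting only the named entities in the text. We hope this work provides a new perspective and a solid foundation for privacy protection in machine translation.

Abstract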

Current online translation services require sending user text to cloud servers, posing a risk of privacy leakage when the text contains sensitive information. This risk hinders the application of online translation services in privacy-sensitive scenarios. One way to mitigate this risk for online translation services is introducing privacy protection mechanisms targeting the inference stage of translation models. However, compared to subfields of NLP like text classification and summarization, the machine translation research community has limited exploration of privacy protection during the inference stage. There is no clearly defined privacy protection task for the inference stage, dedicated evaluation datasets and metrics, and reference benchmark methods. The absence of these elements has seriously constrained researchers’ in-depth exploration of this direction. To bridge this gap, this paper proposes a novel “Privacy-Preserving Machine Translation” (PPMT) task, aiming to protect the private information in text during the model inference stage. For this task, we constructed three benchmark test datasets, designed corresponding evaluation metrics, and proposed a series of benchmark methods as a starting point for this task. The definition of privacy is complex and diverse. Considering that named entities often contain a large amount of personal privacy and commercial secrets, we have focused our research on protecting only the named entity’s privacy in the text. We expect this research work will provide a new perspective and a solid foundation for the privacy protection problem in machine translation.

Keywords: Privacy-Preserving Machine Translation, Inference Stage, Benchmark, Named Entity Protection, Evaluation Metrics, Machine Translation, Privacy Protection, Online Translation Services

149. ❌ Learning Constituent Headedness

Authors: Zeyao Qi, Yige Chen, KyungTae Lim, Haihua Pan, Jungyeul Park Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

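Scoring rationale: The paper “Learning Constituent Headedness” focuses on syntactic analysis in NLP. It studies how to learn constituent headedness as an explicit representational layer from aligned treebanks, framed as a supervised prediction task. The content covers traditional NLP topics such as syntactic parsing, constituency trees, dependency trees, and head prediction, but involves no large models, no innovation in deep-learning principles, and no application of large models to other domains. All scoring keywords concern large-model techniques, training methods, inference optimization, alignment, or application areas, which are entirely unrelated to this paper's syntactic-analysis topic.

!!! tip deepseek-chat TL;DR

The paper treats constituent headedness as an explicit representational layer and predicts heads with supervised learning from aligned treebanks, achieving near-ceiling accuracy on English and Chinese data and outperforming rule-based methods.

Abstract translation

Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly, and most processing pipelines recover it procedurally through percolation rules. We treat constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, constructing the supervision signal by defining each constituent's head as the head of the dependency span. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the finding that the induced binarized training targets are largely equivalent across head choices. They also improve the fidelity of deterministic constituency-to-dependency conversion and transfer across resources and languages through a simple label-mapping interface.

Abstract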

Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.

Keywords: constituent headedness, syntactic analysis, constituency treebanks, dependency annotations, supervised prediction, head-driven binarization, constituency-to-dependency conversion
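The supervision signal described in the abstract, where each constituent's head is taken to be the head of the corresponding dependency span, can be illustrated with a small sketch. The function name `span_head`, the 0-based half-open span indexing, and the `-1` root convention are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Optional


def span_head(heads: List[int], start: int, end: int) -> Optional[int]:
    """Return the index of the token that heads the span [start, end).

    heads[i] is the index of token i's dependency head, or -1 for the root.
    A token heads the span if its own head lies outside the span.
    Returns None when the span does not have exactly one such token,
    i.e. when it does not correspond to a single dependency subtree.
    """
    outside = [i for i in range(start, end)
               if not (start <= heads[i] < end)]
    return outside[0] if len(outside) == 1 else None


# Toy sentence "the old dog barked" with arcs:
# the -> dog, old -> dog, dog -> barked, barked -> root
heads = [2, 2, 3, -1]
np_head = span_head(heads, 0, 3)     # noun phrase "the old dog"
s_head = span_head(heads, 0, 4)      # whole sentence
no_head = span_head(heads, 0, 2)     # "the old" has two externally headed tokens
```

Spans for which the function returns `None` are the cases where constituency and dependency annotations do not align cleanly; how such cases are handled is not specified in the abstract.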

150. ❌ Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Authors: Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14732v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

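Scoring rationale: The paper's core topic is the use of LLMs in physics assessment (LLM-as-a-judge). It directly involves the “Large Language Models” keyword (10 points) and counts as an “AI for Science” application in science education assessment (5 points). The paper does not involve other specific innovations in large-model techniques (such as MoE, quantization, or inference acceleration) or training methods (such as pre-training, alignment, or PEFT), nor advanced capabilities such as agents, tool use, or chain of thought, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study evaluates how valid large language models (LLMs) are as markers across different physics assessment formats (structured questions, essays, and scientific plots). It finds that marking validity tracks the task's “criterion-referenceability” (whether the grading criteria are explicit and observable) rather than model capability alone.

Abstract translation

As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted becomes essential. This study evaluates LLM-as-a-judge marking across three physics assessment formats: structured questions, written essays, and scientific plots. Under blind, solution-provided, false-solution, and exemplar-anchored conditions, it compares GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers. For $n=771$ blind university exam questions, models achieve a fractional mean absolute error (fMAE) of about 0.22 with robust discriminative validity (Spearman $\rho > 0.6$). For secondary and university structured questions ($n=1151$), providing reference solutions reduces MAE and strengthens validity (committee $\rho = 0.88$); false solutions degrade absolute accuracy but largely preserve rank ordering (committee $\rho = 0.77$; individual models $\rho \geq 0.59$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, and discriminative validity is already poor ($\rho \approx 0.1$). Adding a mark scheme does not improve discrimination ($\rho \approx 0$; all confidence intervals include zero). Anchored exemplars bring the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near zero: distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve very high discriminative validity ($\rho > 0.84$) with near-linear calibration. Across all task types, validity tracks the task's criterion-referenceability (the extent to which the task maps to explicit, observable grading features) and benchmark reliability, rather than raw model capability alone.

Abstract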

As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $\rho > 0.6$). For secondary and university structured questions ($n=1151$), providing official solutions reduces MAE and strengthens validity (committee $\rho = 0.88$); false solutions degrade absolute accuracy but leave rank ordering largely intact (committee $\rho = 0.77$; individual models $\rho \geq 0.59$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor ($\rho \approx 0.1$). Adding a mark scheme does not improve discrimination ($\rho \approx 0$; all confidence intervals include zero). Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near-zero - distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve exceptionally high discriminative validity ($\rho > 0.84$) with near-linear calibration. Across all task types, validity tracks criterion-referenceability - the extent to which a task maps to explicit, observable grading features - and benchmark reliability, rather than raw model capability.

Keywords: Large Language Models, LLM-as-a-judge, automated assessment, physics education, criterion-referenceability, validity, grading, educational technology
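The two statistics the abstract relies on, a fractional mean absolute error and Spearman's rank correlation, measure different things, which is why absolute accuracy can degrade while rank ordering survives. The sketch below is a dependency-free illustration. The fMAE definition used here (absolute error divided by the marks available for each question, then averaged) is an assumption, since the abstract does not define it, and the marks are made-up toy values.

```python
from typing import List, Sequence


def fractional_mae(pred: Sequence[float], truth: Sequence[float],
                   max_marks: Sequence[float]) -> float:
    """ASSUMED definition: mean of |pred - truth| / available marks."""
    return sum(abs(p - t) / m
               for p, t, m in zip(pred, truth, max_marks)) / len(pred)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: the AI marker is consistently generous but ranks scripts
# in the same order as the human marker.
human = [1, 4, 6, 9]
ai = [3, 6, 8, 10]
fmae = fractional_mae(ai, human, [10, 10, 10, 10])
rho = spearman_rho(ai, human)
```

In this toy case the marks are off by up to two points per question while the ranking is identical, so the error is non-zero and the rank correlation is perfect.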

151. ❌ Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

Authors: Xinran Zhang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

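Scoring rationale: The paper's core topic is low-data LoRA safety fine-tuning, so several keywords are directly and highly relevant: PEFT/LoRA (the core method, 15 points), Post-training/SFT (fine-tuning technique, 10 points), Instruction Tuning/Alignment (safety alignment, 10 points), and Large Language Models (the models evaluated, 10 points). Other keywords such as MoE, Scaling Laws, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the supervision format affects model safety in low-data LoRA safety fine-tuning, and finds that non-identity framing raises refusal rates on HarmBench more effectively than identity framing without harming model capability.

Abstract translation

How safety supervision instructions are written may matter more than the explicit identity content they contain. This study examines low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with an added worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). On three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench with a reconciled dual-judge pipeline that combines Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreements and boundary cases resolved manually.

On the full 320-behavior HarmBench set, the non-identity condition D is the strongest on all three model families, with refusal rates of 74.4% on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma but remains substantially below D, giving an overall descriptive ordering of $D > B > C \geq A > baseline$. This poses a bounded empirical challenge to the strong version of the identity-framing hypothesis: the strongest gains observed in this study do not depend on explicit creed-style identity language. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.

Abstract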

How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > baseline$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.

Keywords: LoRA fine-tuning, safety supervision, identity framing, HarmBench, instruction-tuned models, low-data fine-tuning, refusal rate, constitutional rules

152. ❌ Towards Next-Generation LLM Training: From the Data-Centric Perspective

Authors: Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Weinan E, Wentao Zhang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14712v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

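Scoring rationale: The paper's core topic is data problems in LLM training. It is highly relevant to “Large Language Models” (10 points) because the whole paper centers on LLM training. It is somewhat relevant to “Scaling Laws AND Data Quality” (5 points) because it discusses the effect of data quality on training. It is relevant to “Pre-training” (5 points) because data preparation is a key part of pre-training. It is relevant to “LLM Agents” (5 points) because it proposes building an agent-based automatic data preparation system. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address inefficient data preparation and utilization in large language model training, the paper proposes two research directions: building an agent-based automatic data preparation system and a data-model interaction training system.

Abstract translation

Large language models (LLMs) have shown remarkable performance across a wide range of tasks and domains, with data playing a central role in these advances. Despite this success, preparing and effectively using the massive datasets required for LLM training remains a major bottleneck. In current practice, LLM training data is often built with ad hoc scripts, and there is still no mature, agent-based data preparation system that can automatically construct robust and reusable data workflows and free data scientists from repetitive, error-prone engineering work. In addition, once collected, datasets are usually consumed more or less in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout training, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.

Abstract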

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.

Keywords: Large Language Models, LLM training, data preparation, agent-based systems, data workflows, data selection, data utilization, training efficiency

153. ❌ Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing

Authors: Nudrat Habib, Elisa Barney Smith, Steven Olsen Smith Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14674v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

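Scoring rationale: The paper studies computational semantic similarity for literary analysis, using BERTScore to analyze potential influence between Herman Melville's reading and his writing. Its topic belongs to computational literary analysis and does not involve large models, deep-learning principles, or scientific applications, so it has no direct connection to any scoring keyword.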

!!! tip deepseek-chat TL;DR

This study uses computational semantic similarity analysis with BERTScore to investigate potential influences of Herman Melville's reading on his writings, providing a framework for literary source and influence studies.

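Abstract translation

This study uses computational semantic similarity analysis to examine the potential influence of Herman Melville's reading on his own writing. Drawing on documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The method segments texts at both the sentence level and the non-overlapping 5-gram level and then computes similarity with BERTScore. Rather than applying a fixed threshold to decide text reuse, we interpret precision, recall, and F1 scores as indicators of semantic alignment that may suggest literary influence. Experimental results show that the approach captures similarity cases already identified by experts and surfaces additional passages that warrant further qualitative analysis. The findings indicate that semantic similarity methods offer a useful computational framework for source and influence studies in literary scholarship.

Abstract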

This study investigates the potential influence of Herman Melville reading on his own writings through computational semantic similarity analysis. Using documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The methodology involves segmenting texts at both sentence level and non-overlapping 5-gram level, followed by similarity computation using BERTScore. Rather than applying fixed thresholds to determine reuse, we interpret precision, recall, and F1 scores as indicators of possible semantic alignment that may suggest literary influence. Experimental results demonstrate that the approach successfully captures expert-identified instances of similarity and highlights additional passages warranting further qualitative examination. The findings suggest that semantic similarity methods provide a useful computational framework for supporting source and influence studies in literary scholarship.

Keywords: computational semantic similarity, BERTScore, literary influence, Herman Melville, text analysis, source studies, precision recall F1, literary scholarship
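The two segmentation granularities named in the abstract, sentences and non-overlapping 5-grams, can be sketched as below. This is a minimal illustration, not the authors' preprocessing: the regex-based sentence splitter, whitespace tokenization, and the choice to keep a shorter trailing chunk are all assumptions. The BERTScore step itself is omitted; the resulting segments would be passed pairwise to a BERTScore implementation.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def non_overlapping_ngrams(text: str, n: int = 5) -> List[str]:
    """Cut whitespace tokens into consecutive chunks of n tokens.

    Chunks do not overlap; the final chunk may be shorter than n
    (an assumption, the abstract does not say how remainders are handled).
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]


passage = "Call me Ishmael. Some years ago never mind how long precisely"
sentences = split_sentences("Call me Ishmael. Some years ago.")
chunks = non_overlapping_ngrams(passage, n=5)
```

Sentence-level segments preserve syntactic units, while fixed 5-token chunks give a finer and more uniform grain for locating short echoes between a source text and a passage.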

154. ❌ Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

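Scoring rationale: This paper studies visual perception failures of computer-using agents (CUAs) in graphical user interfaces and proposes a safety guardrail. Although it concerns AI agents, its core content lies in computer vision, human-computer interaction, and security, focusing on screen perception errors, adversarial attacks, and defense mechanisms; it does not address the large language model, deep learning, or scientific application topics covered by the keywords. The keywords all concern LLM technology, training methods, inference optimization, agent systems, or AI for science, whereas this paper addresses the security of vision-based GUI agents, so it was judged unrelated to every keyword.

!!! tip deepseek-chat TL;DR

This paper studies visual perception failures of computer-using agents in graphical user interfaces and proposes a dual-channel contrastive classification guardrail that independently verifies the visual click target and the agent's reasoning about the action in order to block security threats.

Abstract (Chinese translation)

使用计算机的智能体（Computer-using Agents, CUAs）直接在图形用户界面上进行操作，但其对屏幕的感知常常不可靠。现有研究大多将这些故障视为性能限制，只关注动作是否成功，而非智能体是否在操作正确的对象。我们认为这本质上是一个安全问题。我们形式化了“视觉混淆代理”这一故障模式：即智能体由于感知锚定错误、对抗性屏幕截图篡改或“检查时间与使用时间”（TOCTOU）竞争条件，基于错误感知的屏幕状态授权了某个动作。这一漏洞具有实际可利用性：即使简单的屏幕级操控也能将常规点击重定向至特权操作，同时与普通的智能体错误难以区分。为缓解此威胁，我们提出了首个在智能体感知循环之外运行的防护机制。我们的方法——双通道对比分类，独立评估（1）视觉点击目标与（2）智能体基于部署特定知识库对动作的推理，并在任一通道提示风险时阻止执行。其核心洞见在于，这两个通道捕捉了互补的故障模式：视觉证据检测目标级不匹配，而文本推理则揭示视觉无害控件背后危险的意图。在受控攻击、真实GUI截图和智能体操作轨迹的测试中，组合防护机制的表现始终优于任一单独通道。我们的结果表明，CUA的安全性不仅需要更好的动作生成，还需对其自认为点击的对象及原因进行独立验证。相关材料已提供（模型、基准测试及代码：https://github.com/vllm-project/semantic-router）。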

Abstract

Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent’s perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent’s reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Materials are provided (model, benchmark, and code: https://github.com/vllm-project/semantic-router).

Keywords: computer-using agents, visual perception failures, security problem, visual confused deputy, dual-channel contrastive classification, GUI screenshots, agent safety, independent verification
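
The abstract above says the guardrail evaluates two channels independently and blocks execution if either one indicates risk. A minimal sketch of that decision rule follows. The `ChannelVerdict` type, the 0 to 1 risk scale, and the thresholds are hypothetical illustrations, not the paper's implementation or the API of the linked repository.

```python
from dataclasses import dataclass


@dataclass
class ChannelVerdict:
    """Output of one guardrail channel (hypothetical structure)."""
    risk_score: float  # assumed scale: 0.0 (safe) to 1.0 (risky)
    threshold: float   # assumed per-channel decision threshold

    @property
    def risky(self) -> bool:
        return self.risk_score >= self.threshold


def should_block(visual: ChannelVerdict, reasoning: ChannelVerdict) -> bool:
    """Block the action if EITHER channel flags risk (logical OR)."""
    return visual.risky or reasoning.risky


# The click target looks harmless, but the stated reasoning is flagged.
visual = ChannelVerdict(risk_score=0.10, threshold=0.50)
reasoning = ChannelVerdict(risk_score=0.80, threshold=0.50)
print(should_block(visual, reasoning))
```

Combining the channels with OR means that an action passes only when both channels consider it safe, which matches the abstract's point that the two channels catch complementary failure modes.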

155. ❌ Seamless Deception: Larger Language Models Are Better Knowledge Concealers

Authors: Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

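Scoring rationale: The paper studies the detection of deceptive behavior in large language models (LLMs), focusing on LLMs' ability to conceal harmful knowledge, so it is highly relevant to 'Large Language Models' (10 points). It examines how model scale affects deception detection, which gives it some relevance to 'Scaling Laws' (5 points). It aims to detect whether a model is concealing knowledge, which makes it highly relevant to 'Factuality' and 'Hallucination Mitigation' (10 points). It uses classifiers to detect deception, which touches on the interpretability of model behavior and gives it some relevance to 'Explainable AI' (5 points). The other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, quantization and compression, and scientific applications, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that as language models grow larger they become better at concealing harmful knowledge, which undermines black-box auditing: classifiers cannot reliably detect concealment in models with more than 70 billion parameters.

Abstract (Chinese translation)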

语言模型（Language Models, LMs）可能习得有害知识，并在接受审查时对这些话题佯装不知。受近期在语言模型中发现与欺骗相关行为模式的启发，我们旨在训练分类器以检测语言模型何时在主动隐藏知识。在较小模型上的初步研究表明，分类器检测知识隐藏的可靠性高于人类评估者，其中基于梯度的隐藏方法比基于提示的方法更易被识别。然而，与先前研究相反，我们发现这些分类器无法可靠地泛化至未见过的模型架构和隐藏知识的主题。最令人担忧的是，随着模型规模的扩大，与隐藏行为相关的可识别痕迹会逐渐减弱——在参数量超过700亿的任何模型上，分类器的表现均不优于随机猜测。我们的研究结果揭示了仅依靠黑盒审查语言模型存在关键局限，并强调需要开发更稳健的方法来检测那些主动隐藏其内部知识的模型。

Abstract

Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.

Keywords: Language Models, Deception, Knowledge Concealment, Model Auditing, Scalability, Black-box Detection, Harmful Knowledge, Classifier Generalization

156. ❌ Argumentation for Explainable and Globally Contestable Decision Support with LLMs

Authors: Adam Dejl, Matthew Williams, Francesca Toni Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14643v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

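Scoring rationale: The paper studies the use of LLMs for medical decision support, using computational argumentation to make LLM decisions more explainable and contestable. Highly relevant keywords: LLMs (core technology), Explainable AI (core goal), AI for Science (medical application domain), and Chain of Thought / System 2 Thinking (the reasoning process is involved). Somewhat relevant: Hallucination Mitigation (addresses the unreliability of LLMs). The remaining keywords concern model architecture, training methods, optimization techniques, and similar topics that the paper does not address.

!!! tip deepseek-chat TL;DR

To address the opacity and unpredictability of LLMs deployed in high-stakes domains, the paper proposes ArgEval, a framework that uses structured argumentative evaluation to give explainable and globally contestable recommendations for medical decisions, and shows on glioblastoma treatment recommendation that it produces explainable guidance aligned with clinical practice.

Abstract (Chinese translation)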

大语言模型（LLM）展现出强大的通用能力，但其不透明性与不可预测性阻碍了它们在高风险领域中的部署。近期研究通过基于计算论证的事后推理增强LLM，在解决这些问题上取得了重要进展——该方法能提供忠实解释，并允许用户对错误决策提出质疑。然而，该范式目前局限于预定义的二元选择，且仅支持针对具体个案的局部性质疑，其底层决策逻辑并未改变，容易重复出现错误。本文提出ArgEval框架，该框架将焦点从个案推理转向对通用决策选项的结构化评估。ArgEval并非仅为单个案例挖掘论证，而是系统性地构建任务相关的决策空间，建立对应的选项本体，并为每个选项构建通用论证框架。这些框架可被实例化，从而为具体案例提供可解释的推荐，同时仍支持通过修改共享论证框架实现全局可争议性。我们在胶质母细胞瘤（一种侵袭性脑肿瘤）的治疗推荐任务上验证ArgEval的有效性，结果表明该框架能生成符合临床实践且具有可解释性的指导建议。

Abstract

Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.

Keywords: Large Language Models, Explainable AI, Computational Argumentation, Decision Support, Global Contestability, Treatment Recommendation, Glioblastoma, Clinical Practice

157. ❌ Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

Authors: Mark Baciak, Thomas A. Cellucci, Deanna M. Falkowski Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

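Scoring rationale: The paper challenges the conventional assumptions that AI development is continuous and that capability scales monotonically with model size. It proposes a punctuated-equilibrium account and the Institutional Fitness Manifold, derives the Institutional Scaling Law, proves that institutional fitness is non-monotonic in model scale, and argues that in most institutional deployment environments systems of smaller, domain-adapted models can mathematically outperform frontier generalist models. It is therefore highly relevant to 'Scaling Laws AND Data Quality' (10 points), because it directly challenges and extends classical scaling laws; relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' (8 points), because it discusses how model scale, large and small, affects institutional fitness; and somewhat relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation', 'Post-training OR Supervised Fine-tuning OR SFT', and 'Instruction Tuning OR Alignment OR Value Alignment' (5 points), because it cites domain adaptation and the evolution of post-training alignment as supporting evidence. Other keywords, such as MoE, RLHF, RAG, reasoning techniques, agents, and compression, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper challenges the conventional assumptions that AI development is continuous and that capability scales monotonically with model size. Drawing on punctuated equilibrium theory, it proposes the Institutional Fitness Manifold and the Institutional Scaling Law, proves that institutional fitness is non-monotonic in model scale, and argues that systems of smaller, domain-adapted models can mathematically outperform frontier generalist models in most institutional deployment environments.

Abstract (Chinese translation)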

人工智能发展的主流叙事假定进步是连续的，且能力随模型规模单调扩展。我们挑战这两个假设。借鉴进化生物学中的间断平衡理论，我们表明人工智能的发展并非通过平稳推进，而是通过长期的停滞期被快速相变所打断，这些相变重组了竞争格局。我们识别出自1943年以来的五个这样的时代，以及当前生成式人工智能时代内的四个纪元，每个纪元都由一个不连续事件所开启——从Transformer架构到DeepSeek时刻——这些事件使得先前的范式变得从属。为了形式化驱动这些转变的选择压力，我们提出了“制度适应度流形”，这是一个数学框架，从四个维度评估人工智能系统：能力、制度信任、可负担性和主权合规性。核心成果是“制度缩放定律”，它证明制度适应度在模型规模上并非单调。超过环境特定的最优值后，进一步缩放会降低适应度，因为信任侵蚀和成本惩罚超过了边际能力收益。这直接与经典缩放定律相矛盾，并蕴含一个强有力的推论：在大多数制度部署环境中，经过协调的、规模较小且适应特定领域的模型系统，在数学上可以超越前沿的通用模型。我们推导了这种逆转成立的形式化条件，并提供了支持性的经验证据，涵盖前沿实验室动态、训练后对齐的演变，以及作为地缘政治选择压力的主权人工智能的兴起。

Abstract

The dominant narrative of artificial intelligence development assumes that progress is continuous and that capability scales monotonically with model size. We challenge both assumptions. Drawing on punctuated equilibrium theory from evolutionary biology, we show that AI development proceeds not through smooth advancement but through extended periods of stasis interrupted by rapid phase transitions that reorganize the competitive landscape. We identify five such eras since 1943 and four epochs within the current Generative AI Era, each initiated by a discontinuous event – from the transformer architecture to the DeepSeek Moment – that rendered the prior paradigm subordinate. To formalize the selection pressures driving these transitions, we develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance. The central result is the Institutional Scaling Law, which proves that institutional fitness is non-monotonic in model scale. Beyond an environment-specific optimum, scaling further degrades fitness as trust erosion and cost penalties outweigh marginal capability gains. This directly contradicts classical scaling laws and carries a strong implication: orchestrated systems of smaller, domain-adapted models can mathematically outperform frontier generalists in most institutional deployment environments. We derive formal conditions under which this inversion holds and present supporting empirical evidence spanning frontier laboratory dynamics, post-training alignment evolution, and the rise of sovereign AI as a geopolitical selection pressure.

Keywords: punctuated equilibrium, institutional scaling law, sovereign AI, model scale, domain-adapted models, institutional fitness, phase transitions, generative AI era

158. ❌ Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

Authors: Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang, Yu-Han Huang, An-Yu Cheng, Hung-yi Lee Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

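Scoring rationale: The paper studies training-free model steering to improve chain-of-thought reasoning in large audio-language models (LALMs), so it is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (15 points) and somewhat relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points), since improving reasoning involves deliberate thinking. It explicitly concerns large language models applied to audio, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, training methods (pre-training, SFT, RLHF, and so on), inference optimization (RAG, context window, KV cache), agents, and quantization, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies training-free model steering, using three strategies to improve chain-of-thought reasoning in large audio-language models. It reports accuracy gains of up to 4.4% across four benchmarks and identifies a cross-modal transfer effect in which steering vectors derived from text samples guide speech-based reasoning.

Abstract (Chinese translation)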

思维链提示方法已被扩展至大型音频语言模型以激发其推理能力，然而在不进行训练的情况下提升其有效性仍具挑战性。本研究探索了一种无需训练的推理时模型引导方法，旨在增强大型音频语言模型的推理性能。我们提出了三种利用不同信息源的引导策略，并在四种大型音频语言模型和四个基准测试上进行了评估。实验结果表明，相较于思维链提示方法，这些策略普遍实现了最高达4.4%的准确率提升。值得注意的是，我们发现了一种跨模态迁移现象：仅基于少量文本样本推导出的引导向量能有效指导基于语音的推理任务，展现出极高的数据效率。同时，我们通过分析超参数敏感性来评估这些方法的鲁棒性。我们的研究结果表明，模型引导是强化大型音频语言模型推理能力的一个实用方向。

Abstract

Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.

Keywords: Large Audio-Language Models, Chain-of-Thought Reasoning, Training-Free Model Steering, Inference-Time Steering, Cross-Modal Transfer, Audio-Language Models, Model Steering, Reasoning Enhancement
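
The abstract above refers to steering vectors applied at inference time but does not describe the three strategies. As a rough illustration of the general idea only, the sketch below uses the common difference-of-means recipe: average hidden states from two sets of examples, take the difference as a steering vector, and add a scaled copy to a hidden state. The recipe, the function names, and the toy numbers are assumptions, not the paper's method.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(target_states, baseline_states):
    """Difference of means: direction from baseline toward target behaviour."""
    target_mean = mean_vector(target_states)
    baseline_mean = mean_vector(baseline_states)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def steer(hidden, vector, alpha=1.0):
    """Nudge one hidden state along the steering vector, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional hidden states (hypothetical values).
with_reasoning = [[1.0, 2.0], [3.0, 4.0]]
without_reasoning = [[0.0, 0.0], [2.0, 2.0]]
v = steering_vector(with_reasoning, without_reasoning)
print(steer([0.0, 0.0], v, alpha=0.5))
```

The scale `alpha` is the kind of hyperparameter whose sensitivity the abstract says the authors examine; in a real model the addition would be applied to a chosen layer's activations during the forward pass.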

159. ❌ Anterior’s Approach to Fairness Evaluation of Automated Prior Authorization System

Authors: Sai P. Selvaraj, Khadija Mahmoud, Anuj Iravane Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14631v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

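Scoring rationale: The paper presents a fairness evaluation framework for automated prior authorization systems in healthcare, dealing mainly with fairness evaluation methods, statistical analysis, and regulatory alignment for healthcare AI systems. It does not touch the technical areas represented by the keywords, such as large models, deep learning principles, or AI for Science. The keywords all concern large-model technology, deep learning principles, or scientific AI applications, whereas this paper focuses on fairness evaluation of a healthcare administration system, which falls under applied ethics and evaluation methodology and has no direct connection to these technical keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a fairness evaluation framework for automated prior authorization systems based on model error rates rather than approval outcomes. An analysis of 7,166 cases finds consistent model error rates across most demographic groups, while the evidence for race/ethnicity is inconclusive because of limited sample sizes.

Abstract (Chinese translation)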

随着预先授权（PA）领域人员配置限制与审核时效压力的日益增加，决策系统的自动化程度不断提升以支持PA审核。在此类系统中评估公平性面临独特挑战，因为合理的临床指南和医疗必要性标准常因人口统计学群体而异，使得批准率的平等性成为不恰当的公平性衡量指标。我们提出一个基于模型错误率而非批准结果的预先授权模型公平性评估框架。利用涵盖27项医疗必要性指南的7,166例人工审核案例，我们评估了模型在性别、年龄、种族/民族及社会经济地位维度上的一致性。我们的评估综合了错误率比较、预设±5个百分点容差带的容忍区间分析、统计功效评估以及协议控制的逻辑回归分析。在大多数人口统计学维度上，模型错误率保持稳定，置信区间均落在预设容差带内，表明不存在有意义的性能差异。在种族/民族维度上，点估计值虽小，但由于亚组样本量有限，导致置信区间较宽且统计检验功效不足，在当前探索的数据集中未能得出确定性结论。这些发现展示了一种严格且符合监管要求的公平性评估方法，适用于医疗保健行政管理人工智能系统。

Abstract

Increasing staffing constraints and turnaround-time pressures in Prior authorization (PA) have led to increasing automation of decision systems to support PA review. Evaluating fairness in such systems poses unique challenges because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups, making parity in approval rates an inappropriate fairness metric. We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes. Using 7,166 human-reviewed cases spanning 27 medical necessity guidelines, we assessed consistency in sex, age, race/ethnicity, and socioeconomic status. Our evaluation combined error-rate comparisons, tolerance-band analysis with a predefined ±5 percentage-point margin, statistical power evaluation, and protocol-controlled logistic regression. Across most demographics, model error rates were consistent, and confidence intervals fell within the predefined tolerance band, indicating no meaningful performance differences. For race/ethnicity, point estimates remain small, but subgroup sample sizes were limited, resulting in wide confidence intervals and underpowered tests, with inconclusive evidence within the dataset we explored. These findings illustrate a rigorous and regulator-aligned approach to fairness evaluation in administrative healthcare AI systems.

Keywords: fairness evaluation, prior authorization, automated decision systems, model error rates, healthcare AI, demographic groups, statistical analysis, regulatory alignment
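
The tolerance-band analysis mentioned in the abstract above can be illustrated with a small sketch: compute a confidence interval for the difference in error rates between two groups and check whether it lies entirely inside the ±5 percentage-point band. The abstract does not say how the intervals were constructed, so the Wald (normal-approximation) interval, the function names, and the counts below are assumptions for illustration.

```python
import math


def error_rate_diff_ci(errors_a, n_a, errors_b, n_b, z=1.96):
    """Approximate 95% Wald interval for the difference in error rates (A minus B)."""
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z * se, diff + z * se


def within_tolerance(ci, margin=0.05):
    """True only if the whole interval sits inside [-margin, +margin]."""
    low, high = ci
    return -margin <= low and high <= margin


# Large subgroups: the interval is narrow and fits inside the band.
large = error_rate_diff_ci(100, 2000, 110, 2000)
print(large, within_tolerance(large))

# Small subgroups: the interval is wide, so the result is inconclusive.
small = error_rate_diff_ci(5, 50, 2, 50)
print(small, within_tolerance(small))
```

The second case mirrors the abstract's race/ethnicity finding: a small point estimate does not establish consistency when subgroup sizes make the interval wider than the band.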

160. ❌ $PA^3$: Policy-Aware Agent Alignment through Chain-of-Thought

Authors: Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14602v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

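Scoring rationale: The paper studies the alignment of LLMs on tool-use tasks, achieving policy-aware agent alignment through chain-of-thought reasoning. Highly relevant keywords: LLMs (explicitly mentioned), Alignment (core research problem), Chain of Thought (core method), LLM Agents (object of study), and Tool Use (application setting). Moderately relevant keywords: Context Window Extension (the paper mentions long-context problems), System 2 Thinking (deliberate reasoning is involved), Hallucination Mitigation (a hallucination penalty is mentioned), and In-context Learning (in-context learning is involved). The other keywords have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

To address the difficulty LLMs have in following complex business rules on tool-use tasks, the paper proposes a multi-stage alignment method that achieves policy-aware agent alignment through chain-of-thought reasoning, so that the model recalls and applies the relevant business policies at inference time. The best model outperforms the baseline by 16 points.

Abstract (Chinese translation)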

基于大语言模型（LLM）的对话助手在工具使用任务上表现出色，但在遵循复杂且业务特定的规则方面存在困难。虽然模型能够对上下文提供的业务规则进行推理，但为每个查询包含全部策略会引入高延迟并浪费算力。此外，这些冗长的提示会导致长上下文，由于“大海捞针”问题而损害整体性能。为应对这些挑战，我们提出一种多阶段对齐方法，该方法教导模型在推理时的思维链过程中回忆并应用相关业务策略，而无需将完整业务策略置于上下文中。进一步，我们引入了一种基于杰卡德相似系数的新型策略召回奖励，以及用于GRPO训练的幻觉惩罚机制。总体而言，我们最优模型的表现超越基线16个百分点，并在使用词数减少40%的情况下，以3个百分点的优势超越同等模型规模的上下文基线。

Abstract

Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the “needle-in-the-haystack” problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.

Keywords: Large Language Models, Agent Alignment, Chain-of-Thought, Tool Use, Business Policies, GRPO Training, PolicyRecall, Hallucination Penalty
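
The abstract above says the PolicyRecall reward is based on the Jaccard score. The sketch below shows only the Jaccard score itself, computed between the set of policies a model recalls and the set of policies that actually apply. Representing policies as string identifiers, the empty-set convention, and the function name are assumptions; the paper's exact reward and its Hallucination Penalty are not reproduced here.

```python
def jaccard_score(recalled, relevant):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of policy IDs."""
    a, b = set(recalled), set(relevant)
    if not a and not b:
        # Assumed convention: nothing to recall and nothing recalled is a perfect match.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical policy identifiers.
recalled = {"refund_window", "id_check"}
relevant = {"id_check", "escalation"}
print(jaccard_score(recalled, relevant))
```

Because the union appears in the denominator, recalling extra, irrelevant policies lowers the score as well as missing relevant ones, which is one reason a set-overlap measure suits a recall-style reward.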

161. ❌ Parameter-Efficient Quality Estimation via Frozen Recursive Models

Authors: Umar Abubacar, Roman Bauer, Diptesh Kanojia Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14593v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

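Scoring rationale: The paper studies parameter-efficient methods for quality estimation (QE). Its core contribution is the finding that frozen pretrained embeddings combined with weight sharing (a recursive mechanism) enable parameter-efficient QE. It is unrelated to most keywords because it does not involve: (1) large language models, small language models, or foundation models; (2) MoE, scaling laws, pre-training or post-training, alignment, RLHF, or similar techniques; (3) advanced capabilities such as reasoning, agents, tool use, or multi-agent systems; (4) optimization techniques such as quantization, inference acceleration, hallucination mitigation, or interpretability; (5) concepts such as world models, model merging, or in-context learning; (6) scientific AI applications. The only highly relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because the paper's core is parameter-efficient methods (frozen embeddings cut trainable parameters by 37×), which falls within parameter-efficient fine-tuning (PEFT).

!!! tip deepseek-chat TL;DR

The paper studies the parameter efficiency of recursive models for quality estimation in low-resource languages and finds that frozen pretrained embeddings combined with weight sharing match full fine-tuning while cutting trainable parameters by 37×.

Abstract (Chinese translation)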

微型递归模型（TRM）通过迭代优化共享网络，在推理任务上取得了优异表现。本研究采用三阶段方法，探讨此类递归机制是否适用于低资源语言的翻译质量估计（Quality Estimation, QE）。在低资源QE数据集上对8个语言对进行的实验揭示了三点发现：首先，TRM的递归机制未能有效迁移至QE任务。外部迭代会损害性能，而内部递归仅带来有限增益。其次，表征质量对架构选择具有主导性影响。最后，冻结的预训练词嵌入在保持性能的同时，将可训练参数量减少了37倍（700万 vs 2.62亿）。采用冻结XLM-R词嵌入的TRM-QE模型获得0.370的斯皮尔曼相关系数，与全参数微调版本（0.369）持平，且优于同等深度的标准Transformer模型（0.336）。在印地语和泰米尔语任务中，冻结版TRM-QE以仅1/80的可训练参数量，超越了拥有5.6亿参数的MonoTransQuest模型，这表明权重共享与冻结词嵌入相结合能为QE任务实现显著的参数效率。我们已公开代码以促进后续研究。代码地址：https://github.com/surrey-nlp/TRMQE。

摘要 (Abstract)

Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on $8$ language pairs on a low-resource QE dataset reveal three findings. First, TRM’s recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Next, representation quality dominates architectural choices, and lastly, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37$\times$ (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman’s correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80$\times$ fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at https://github.com/surrey-nlp/TRMQE.

Keywords: Parameter-Efficient, Quality Estimation, Frozen Embeddings, Recursive Models, Low-resource Languages, Weight Sharing, Tiny Recursive Models, TRM-QE
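The parameter ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures stated in the abstract (7M, 262M, 560M); the variable and function names are illustrative and do not come from the paper's code.

```python
# Rounded parameter counts as quoted in the abstract (in millions).
FROZEN_TRM_QE_TRAINABLE_M = 7    # TRM-QE with frozen XLM-R embeddings
FINE_TUNED_TRAINABLE_M = 262     # fully fine-tuned variant
MONOTRANSQUEST_M = 560           # MonoTransQuest baseline


def reduction_factor(baseline_m: float, reduced_m: float) -> float:
    """Return how many times fewer trainable parameters the reduced model has."""
    return baseline_m / reduced_m


vs_fine_tuned = reduction_factor(FINE_TUNED_TRAINABLE_M, FROZEN_TRM_QE_TRAINABLE_M)
vs_monotransquest = reduction_factor(MONOTRANSQUEST_M, FROZEN_TRM_QE_TRAINABLE_M)

print(f"vs fine-tuned TRM-QE: {vs_fine_tuned:.1f}x")
print(f"vs MonoTransQuest:    {vs_monotransquest:.1f}x")
```

Both ratios round to the 37× and 80× figures reported in the abstract.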

162. ❌ Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes

Authors: Deepon Halder, Raj Dabre Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

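Score rationale: The paper proposes Top-b, a new decoding strategy for autoregressive language generation, which is an innovation in the technical foundations of large models. The most relevant keyword is "Large Language Models" (8), since the paper directly studies decoding methods for language models. It is moderately related to "Chain of Thought" and "System 2 Thinking" (5 each), because it is validated on reasoning benchmarks such as GSM8K and involves logical-reasoning generation, though it does not study CoT or System 2 thinking directly. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

To address the mismatch between static decoding strategies (e.g., Top-k, Top-p) and the dynamic information density of language in autoregressive generation, the paper proposes Top-b, a decoding method that adjusts the candidate set dynamically based on the instantaneous Shannon entropy. It significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy.

Abstract (Chinese translation)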

概率语言生成器在理论上被建模为离散随机过程，然而标准解码策略（Top-k、Top-p）采用静态截断规则，无法适应自然语言动态的信息密度。这种错位常导致次优权衡：静态边界对于高熵创造性生成过于严格，或对于低熵逻辑推理过于宽松。本研究将生成过程形式化为相对概率流形上的轨迹。我们提出Top-b（自适应相对带宽采样），这是一种通过动态带宽系数调节候选集的解码策略，该系数严格耦合于模型分布的瞬时香农熵。我们构建的理论框架证明，Top-b可作为尾部分布的方差最小化算子。在GPQA和GSM8K基准测试上的实证验证表明，Top-b在保持竞争力推理精度的同时，显著降低了生成熵与解码间方差，有效实现了自回归生成的自调节控制系统近似。

Abstract

Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model’s distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.

Keywords: autoregressive language generation, decoding strategy, Top-b, entropy regulation, probability bands, Shannon entropy, variance minimization, reasoning benchmarks
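To make the idea of an entropy-coupled relative band concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not give the formula for the bandwidth coefficient, so the linear interpolation below, the function name `top_b_filter`, and the parameters `ratio_low_entropy` and `ratio_high_entropy` are all assumptions made for illustration. The only property taken from the abstract is the direction of the coupling: a tight candidate set when entropy is low, a wider one when entropy is high.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_b_filter(probs, ratio_low_entropy=0.5, ratio_high_entropy=0.05):
    """Keep tokens whose probability is within a relative band of the top token.

    The band ratio is interpolated linearly between two hypothetical bounds
    according to the normalized entropy of `probs` (an assumed coupling rule).
    """
    h_max = math.log(len(probs))
    norm_h = shannon_entropy(probs) / h_max if h_max > 0 else 0.0
    # Low entropy -> ratio near ratio_low_entropy (tight band);
    # high entropy -> ratio near ratio_high_entropy (wide band).
    ratio = ratio_low_entropy + (ratio_high_entropy - ratio_low_entropy) * norm_h
    cutoff = ratio * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]  # renormalize the surviving tokens


peaked = top_b_filter([0.90, 0.05, 0.03, 0.02])   # low-entropy step
flat = top_b_filter([0.25, 0.25, 0.25, 0.25])     # high-entropy step
print(sum(1 for p in peaked if p > 0), sum(1 for p in flat if p > 0))
```

With these assumed bounds, the peaked distribution keeps only its top token while the flat one keeps all four, which is the adaptive behavior the abstract contrasts with fixed Top-k and Top-p cutoffs.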

163. ❌ Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children’s Stories for Training Small Language Models

Authors: Deepon Halder, Angira Mukherjee Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14563v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

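Score rationale: The paper's core contribution is a multilingual dataset built specifically for training Small Language Models (SLMs), so it is highly relevant to "Small Language Models OR SLMs OR On-device AI" (10). It uses the Sarvam-M language model to generate data, hence a moderate link to "Large Language Models OR LLMs OR Foundation Models" (5). It focuses on creating high-quality data for low-resource languages, giving a moderate link to "Scaling Laws AND Data Quality" (5). The dataset can be used for pre-training, giving a moderate link to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5). Other keywords such as MoE, SFT, RLHF, and RAG are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of high-quality training data for low-resource Indian languages by creating Multilingual TinyStories, a large-scale synthetic corpus of 132,942 children's stories in 17 Indian languages, designed for training and evaluating small language models.

Abstract (Chinese translation)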

为低资源语言开发鲁棒的语言模型常受限于高质量、连贯且领域适配的训练语料的稀缺。本文介绍了多语言TinyStories数据集，这是一个大规模、合成生成的儿童故事集合，涵盖17种印度语言。该语料库专为小型语言模型的训练与评估而设计，提供了严格本地化于原生文字的简单叙事性文本。我们详述了混合构建流程：该流程利用Sarvam-M语言模型及一种新颖的组合式提示工程框架进行原生文本生成，并结合谷歌翻译API实现大规模跨语言扩展。通过严格的程序化过滤，我们在发布版本中收录了132,942个故事，总计超过9390万词元，为印度语言领域的多语言建模与迁移学习提供了基础资源。

Abstract

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children’s stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.

Keywords: Small Language Models, Multilingual dataset, Indic languages, Synthetic corpus, Training corpora, Low-resource languages, Children’s stories, Transfer learning
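A combinatorial prompt framework of the kind the abstract describes can be sketched as a Cartesian product over prompt slots. The slot names, slot values, and template below are hypothetical; the abstract does not list the slots the authors actually used.

```python
import itertools

# Hypothetical slot values, chosen only for illustration.
CHARACTERS = ["a curious rabbit", "a little girl", "an old fisherman"]
SETTINGS = ["a village fair", "a mango orchard"]
MORALS = ["sharing", "honesty"]
LANGUAGES = ["Hindi", "Tamil", "Bengali"]

TEMPLATE = (
    "Write a short children's story in {language}, using its native script, "
    "about {character} at {setting}. The story should teach {moral}."
)


def build_prompts():
    """Yield one prompt per slot combination, each combination exactly once."""
    for character, setting, moral, language in itertools.product(
        CHARACTERS, SETTINGS, MORALS, LANGUAGES
    ):
        yield TEMPLATE.format(
            language=language, character=character, setting=setting, moral=moral
        )


prompts = list(build_prompts())
print(len(prompts))  # 3 * 2 * 2 * 3 = 36 distinct prompts
print(prompts[0])
```

The appeal of this design is that a handful of short slot lists multiplies into many distinct prompts, which encourages diversity in the generated stories without hand-writing each prompt.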

164. ❌ MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Authors: Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, Adam Wierzbicki Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

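Score rationale: The paper studies the use of large language models (LLMs) and small language models (SLMs) for disinformation detection, in particular strengthening detection by introducing malicious-intent analysis. It is therefore highly relevant to "Large Language Models" and "Small Language Models" (core content). It proposes "intent-augmented reasoning", which involves a reasoning process, giving a moderate link to "Chain of Thought" and "System 2 Thinking" (reasoning is mentioned). Its study of disinformation detection relates directly to "Hallucination Mitigation OR Factuality OR Truthfulness" (core application). Other keywords, such as MoE, Scaling Laws, training methods, acceleration techniques, agent systems, and AI for science, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper introduces MALINT, the first human-annotated English disinformation corpus labeled with malicious intent, and benchmarks 12 small and large language models on intent classification. Drawing on inoculation theory from psychology, it then shows that intent-augmented reasoning improves zero-shot disinformation detection with LLMs.

Abstract (Chinese translation)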

虚假信息的蓄意制造与传播对公共话语构成了严重威胁。然而，现有的英文数据集与研究很少关注虚假信息背后的意图性。本研究提出了MALINT，这是首个与专业事实核查员合作构建的人工标注英文语料库，旨在捕捉虚假信息及其恶意意图。我们利用这一新颖语料库，对包括BERT等小型语言模型（SLMs）和Llama 3.3等大型语言模型（LLMs）在内的12种语言模型，在二元及多标签意图分类任务上进行了基准测试。此外，受心理学与传播学中“接种理论”的启发，我们探讨了引入恶意意图知识是否能提升虚假信息检测能力。为此，我们提出了基于意图的接种方法——一种面向LLMs的意图增强推理框架，该框架通过整合意图分析来削弱虚假信息的说服性影响。在六个虚假信息数据集、五种LLMs及七种语言上的分析表明，意图增强推理能有效提升零样本虚假信息检测性能。为支持意图感知的虚假信息检测研究，我们公开了包含各标注步骤注释的MALINT数据集。

Abstract

The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.

Keywords: disinformation detection, malicious intent, large language models, small language models, intent-augmented reasoning, zero-shot learning, MALINT dataset, inoculation theory

165. ❌ CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language

Authors: Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang, Li Zhang Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14501v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

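Score rationale: The paper evaluates LLM performance on a low-resource general-purpose programming language. It directly involves "Large Language Models", "Retrieval-Augmented Generation" (one of the evaluation settings), and "LLM Agents" (another evaluation setting). Other keywords such as MoE, SLMs, training methods, inference optimization, alignment techniques, and scientific applications are not addressed in the paper.

!!! tip deepseek-chat TL;DR

The paper studies LLM performance on Cangjie, a low-resource general-purpose programming language. It builds the CangjieBench benchmark and evaluates four generation settings, finding that syntax-constrained generation gives the best trade-off between accuracy and computational cost, while the agent setting reaches the highest accuracy at the cost of heavy token consumption.

Abstract (Chinese translation)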

大型语言模型在高资源编程语言中表现卓越，但在低资源语言中却面临困难。现有关于低资源编程语言的研究主要集中于领域特定语言，而对数据稀缺的通用编程语言探索不足。为填补这一空白，我们推出了CangjieBench——一个针对代表性低资源通用语言“仓颉”（Cangjie）的无污染基准测试。该基准包含从HumanEval和ClassEval人工翻译的248个高质量样本，涵盖文本到代码和代码到代码两类任务。我们在四种设置下对多种大型语言模型进行了系统评估：直接生成、语法约束生成、检索增强生成和智能体模式。实验表明，直接生成表现较差，而语法约束生成在准确性与计算成本之间实现了最佳平衡。智能体模式达到了最先进的准确率，但消耗了大量令牌。此外，我们观察到代码到代码翻译的表现往往不及文本到代码生成，这提示了一种负迁移现象：模型过度拟合源语言模式。我们希望这项工作能为大型语言模型向未见过的低资源编程语言的泛化提供有价值的见解。我们的代码与数据公开于https://github.com/cjhCoder7/CangjieBench。

Abstract

Large Language Models excel in high-resource programming languages but struggle with low-resource ones. Existing research related to low-resource programming languages primarily focuses on Domain-Specific Languages (DSLs), leaving general-purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark comprises 248 high-quality samples manually translated from HumanEval and ClassEval, covering both Text-to-Code and Code-to-Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost. Agent achieve state-of-the-art accuracy but incur high token consumption. Furthermore, we observe that Code-to-Code translation often underperforms Text-to-Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low-resource programming languages. Our code and data are available at https://github.com/cjhCoder7/CangjieBench.

Keywords: Large Language Models, Low-resource Programming Language, Benchmark, Retrieval-Augmented Generation, Agent, Code Generation, Cangjie, Generalization

166. ❌ Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

Authors: He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

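Score rationale: The paper studies fine-tuning methods for multimodal large language models (MLLMs). It is highly relevant to "Large Language Models" (10), since MLLMs extend LLMs, and to "Post-training/SFT" (10), since the focus is on fine-tuning. It is moderately related to "PEFT/LoRA" (5), because it mentions constraining the number of trainable parameters as a regularizer, which falls within parameter-efficient fine-tuning. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are neither mentioned in the abstract nor relevant to it, and receive 0.

!!! tip deepseek-chat TL;DR

The paper finds that simple adjustments to the fine-tuning recipe (such as regularization and data-hybrid training) effectively mitigate catastrophic forgetting when fine-tuning multimodal large language models, and shows that the approach outperforms existing methods in continual learning.

Abstract (Chinese translation)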

本文研究表明，仅需对多模态大语言模型（MLLM）的微调方法进行简单调整，便足以缓解灾难性遗忘问题。在视觉问答任务中，我们设计了一个2x2实验框架，以评估模型在分布内与分布外的图像及文本输入上的表现。结果显示，适当的正则化策略（如限制可训练参数数量或采用较低学习率）能有效应对分布外图像输入时的遗忘现象。然而，我们发现当面对分布内图像与分布外文本组合时，模型会出现一种特殊的遗忘形式。我们将此归因于任务特异性过拟合，并通过引入融合多数据集与多任务的数据混合训练策略解决了该问题。最后，我们证明该方法可自然延伸至持续学习场景，其性能优于依赖复杂辅助机制的现有方法。总体而言，我们的研究挑战了当前普遍假设，揭示了多模态大语言模型固有的鲁棒性，并为在适应新任务时保持其通用能力提供了实用指导。

Abstract

The paper demonstrate that simple adjustments of the fine-tuning recipes of multimodal large language models (MLLM) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting as task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.

Keywords: multimodal large language models, fine-tuning, catastrophic forgetting, regularization, data-hybrid training, continual learning, visual question answering, out-of-distribution generalization
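One way to picture a data-hybrid training strategy is a batch builder that mixes examples from the new target task with examples drawn from other datasets and tasks. The sketch below is an assumption-laden illustration, not the authors' recipe: the function name `hybrid_batches`, the `general_fraction` knob, and the toy dataset labels are all hypothetical, and the abstract does not state a mixing ratio.

```python
import random


def hybrid_batches(target_examples, general_examples, batch_size=4,
                   general_fraction=0.5, seed=0):
    """Yield batches mixing target-task examples with general-purpose ones.

    `general_fraction` is a hypothetical knob; the abstract gives no ratio.
    """
    rng = random.Random(seed)
    n_general = int(round(batch_size * general_fraction))
    n_target = batch_size - n_general
    if n_target <= 0:
        raise ValueError("general_fraction leaves no room for target examples")
    target = list(target_examples)
    rng.shuffle(target)
    # Walk through the target data once; top up each batch with general data.
    for start in range(0, len(target) - n_target + 1, n_target):
        batch = target[start:start + n_target]
        batch += rng.sample(general_examples, n_general)
        rng.shuffle(batch)
        yield batch


# Toy data: (dataset_name, example_id) pairs stand in for real samples.
target = [("vqa_new", i) for i in range(6)]
general = [("caption", i) for i in range(10)] + [("ocr", i) for i in range(10)]
batches = list(hybrid_batches(target, general))
print(len(batches), batches[0])
```

Because every batch contains examples from tasks other than the target, no gradient step is driven by the target task alone, which is the intuition behind using data mixing against task-specific overfitting.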

167. ❌ AI Can Learn Scientific Taste

Authors: Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

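Score rationale: The paper proposes the RLCF training paradigm, which uses large-scale community signals as supervision and frames scientific-taste learning as a preference modeling and alignment problem. It centers on LLMs (Scientific Judge and Scientific Thinker), preference alignment (Alignment), and reinforcement learning (RLHF/RLAIF/DPO), applied to the AI for Science domain. Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies how AI can learn scientific taste. It proposes the RLCF training paradigm, which trains the Scientific Judge and Scientific Thinker models through preference modeling and alignment. Experiments show that AI can learn scientific taste and propose research ideas with high potential impact.

Abstract (Chinese translation)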

杰出科学家具备卓越的判断力与前瞻性，这与其所拥有的科学品味紧密相关。在本文中，我们使用“科学品味”这一术语来指代判断并提出具有高潜在影响力研究构想的能力。然而，现有相关研究大多聚焦于提升AI科学家的执行能力，而如何增强AI的科学品味仍探索不足。在本研究中，我们提出了基于社区反馈的强化学习范式，该范式利用大规模社区信号作为监督，并将科学品味学习构建为一个偏好建模与对齐问题。在偏好建模方面，我们基于70万个领域与时间相匹配的高被引与低被引论文对，训练了“科学评审家”模型以评判研究构想。在偏好对齐方面，我们以“科学评审家”作为奖励模型，训练了一个策略模型——“科学思考者”，以提出具有高潜在影响力的研究构想。实验表明，“科学评审家”的表现优于当前最先进的大型语言模型（例如GPT-5.2、Gemini 3 Pro），并能泛化至未来年份的测试、未见过的领域以及同行评审偏好。此外，“科学思考者”所提出的研究构想，其潜在影响力高于基线模型。我们的研究结果表明，AI能够学习科学品味，这标志着向实现人类水平AI科学家迈出了关键一步。

Abstract

Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most relative research focuses on improving an AI scientist’s executive capability, while enhancing an AI’s scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.

Keywords: scientific taste, Reinforcement Learning from Community Feedback, preference modeling, alignment, large language models, AI scientists, research ideas, potential impact
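Preference modeling on paired examples is commonly trained with a Bradley-Terry style pairwise loss, sketched below for a high-citation versus low-citation pair. This is an assumption: the abstract does not say which loss Scientific Judge uses, and the function name and scalar scores here are purely illustrative.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_preferred - score_rejected)).

    Computed in a numerically stable form so large margins do not overflow.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical judge scores for a (high-citation, low-citation) paper pair.
print(pairwise_preference_loss(2.0, 0.0))  # judge agrees with the label: small loss
print(pairwise_preference_loss(0.0, 0.0))  # judge is undecided: loss = ln 2
print(pairwise_preference_loss(0.0, 2.0))  # judge disagrees: large loss
```

Minimizing this loss pushes the judge to score the preferred item of each pair higher, and the resulting scalar score is the kind of signal that can then serve as a reward when training a policy model.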

168. ❌ Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Authors: Aditya Sharan, Sriram Hebbale, Dhruv Kumar Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14486v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	7.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

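Score rationale: The paper's core is an agentic framework, the Infinite Problem Generator (IPG), which generates verifiable, high-quality physics problem datasets to ease the data-scarcity bottleneck in training LLMs for complex reasoning. The work directly involves LLM training-data generation, agentic workflows, and AI applications in physical science. It is strongly related to the LLM keywords (LLMs, Post-training/SFT, Chain of Thought, System 2 Thinking), since it aims to supply data for training LLMs on complex reasoning. It is highly relevant to the agent keywords (LLM Agents/Agentic Workflow, Tool Use), since IPG is itself an agentic framework, and to the scientific-application keyword (AI for Science), since the work focuses on physics. It is strongly related to the data-quality and verifiability keywords (Scaling Laws AND Data Quality, Hallucination Mitigation), since the paper emphasizes verifiable, high-quality data that avoids hallucination. Other keywords (e.g., MoE, SLMs, RLHF, RAG) are unrelated to the paper or not mentioned in it.

!!! tip deepseek-chat TL;DR

To address the scarcity of verifiable, high-quality data for training LLMs on complex reasoning, the paper proposes the Infinite Problem Generator (IPG), an agentic framework that generates physics problems with guaranteed solvability through a Formula-as-Code paradigm. It releases the ClassicalMechanicsV1 dataset and shows that code complexity can serve as a precise measure of problem difficulty, supporting controllable curriculum generation.

Abstract (Chinese translation)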

训练大型语言模型进行复杂推理的瓶颈在于可验证高质量数据的稀缺性。在物理学等领域，标准文本增强方法常产生幻觉，而静态基准数据集则缺乏微调所需的推理轨迹。我们提出无限问题生成器（IPG），这是一个通过“公式即代码”范式合成具有可解性保证的物理问题的智能体框架。与概率性文本生成不同，IPG将解决方案构建为可执行的Python程序，从而确保严格的数学一致性。作为概念验证，我们发布了ClassicalMechanicsV1数据集——这是一个包含1,335个经典力学问题的高保真语料库，由165个专家种子问题扩展生成。该语料库展现出高度的结构多样性，涵盖102个独立物理公式，平均每个问题包含3.05个公式。此外，我们发现了复杂度蓝图（Complexity Blueprint），证明公式数量与验证代码长度之间存在强线性相关性（$R^2 \approx 0.95$）。这种关联确立了代码复杂度可作为问题难度的精确无代理度量指标，从而实现可控的课程生成。我们完整开源IPG流水线、ClassicalMechanicsV1数据集及评估报告，以支持推理密集型领域的可复现研究。

Abstract

Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.

Keywords: Infinite Problem Generator, agentic workflow, physics reasoning, verifiable data generation, Formula-as-Code, ClassicalMechanicsV1, complexity blueprint, curriculum generation
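The Formula-as-Code idea, where a problem's solution is an executable Python program, can be illustrated with a small sketch. Everything below is hypothetical: the projectile problem, the function names `solve_projectile` and `is_solvable`, and the acceptance rule are assumptions for illustration and are not taken from the IPG pipeline.

```python
import math


def solve_projectile(initial_speed, angle_deg, g=9.81):
    """Solution program for a sample problem: a ball is launched from level
    ground at `initial_speed` m/s and `angle_deg` degrees above the horizontal.
    Find the time of flight and the horizontal range.
    """
    angle = math.radians(angle_deg)
    v_x = initial_speed * math.cos(angle)     # formula 1: horizontal component
    v_y = initial_speed * math.sin(angle)     # formula 2: vertical component
    time_of_flight = 2 * v_y / g              # formula 3: t = 2 * v_y / g
    horizontal_range = v_x * time_of_flight   # formula 4: R = v_x * t
    return {"time_of_flight": time_of_flight, "range": horizontal_range}


def is_solvable(solver, **params):
    """Accept a generated problem only if its solution program runs and
    returns finite numbers (an assumed, simplified verification rule)."""
    try:
        answer = solver(**params)
    except (ValueError, ZeroDivisionError, OverflowError):
        return False
    return all(math.isfinite(value) for value in answer.values())


print(is_solvable(solve_projectile, initial_speed=20.0, angle_deg=45.0))
print(is_solvable(solve_projectile, initial_speed=20.0, angle_deg=45.0, g=0.0))
```

Because each formula is a line of code, the number of formulas and the length of the solution program grow together, which is consistent with the abstract's observation that code length tracks formula count.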

169. ❌ PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Authors: Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

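Score rationale: The paper focuses on a benchmark for Persian audio-language models and evaluates large models on a specific language and culture, so it is moderately related to "Large Language Models" (5 points). It does not propose innovations in the underlying model techniques, nor does it touch the specific techniques behind the other keywords (MoE, quantization, inference acceleration, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper introduces PARSA-Bench, the first benchmark for Persian audio-language models, with 16 tasks and more than 8,000 samples, to evaluate models on Persian language and cultural understanding; it finds that existing models perform poorly on culturally grounded tasks such as poetic meter perception.

Abstract translation (Chinese)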

波斯语因其古典诗歌、传统音乐和普遍存在的语码转换现象，在音频理解方面提出了独特挑战——现有基准测试均未涵盖这些维度。我们推出PARSA-Bench（波斯音频推理与语音评估基准），这是首个针对波斯语言与文化评估大型音频-语言模型的基准测试，包含16项任务、超过8000个样本，涵盖语音理解、副语言分析和文化音频理解三大领域。其中十项任务为全新引入，包括诗歌格律与风格检测、传统波斯音乐理解及语码转换检测。纯文本基线模型在各项任务中持续优于音频模型，表明现有模型可能未能利用超越文本转录本身的音频特异性信息。基于文化的任务揭示了一种本质不同的失效模式：所有模型在诗歌韵律（vazn）检测任务上的表现均接近随机猜测，且不受模型规模影响，这暗示当前模型尚未掌握韵律感知能力。本数据集已公开于https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench。

Abstract

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench

Keywords: Persian audio-language models, benchmark evaluation, cultural audio understanding, code-switching detection, speech understanding, paralinguistic analysis, poetry meter detection, traditional music understanding

170. ❌ Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Authors: Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei, Ziliang Qiu, Renfen Hu, Jiamou Liu Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14430v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

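Score rationale: The paper's core topic is LLM performance in narrative generation, specifically homogenization in Chinese web literature, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not address the techniques, applications, or innovations behind the other keywords (MoE, SLMs, training methods, reasoning techniques, agent systems, compression and acceleration, interpretability, AI for science, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies the single narrative logic and severe homogenization of LLM-generated Chinese web literature, and attributes them mainly to current LLMs failing to understand the meaning of narrative functions and instead following rigid generation paradigms.

Abstract translation (Chinese)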

大型语言模型（LLMs）在叙事生成方面已展现出显著能力，但其生成的故事常呈现结构同质化现象，频繁遵循重复的情节事件编排组合及刻板化的结局模式。本文提出一种融合普罗普叙事学与叙事功能的新型理论分析框架，用以解析LLMs生成文本的叙事构成，从而揭示其内在的叙事逻辑。以中国网络文学为研究对象，我们拓展了普罗普的叙事理论，定义了适用于现代网络叙事结构的34种叙事功能，并进一步构建了人工标注语料库以支持对LLM生成文本中叙事结构的分析。实验表明，当前LLMs无法正确理解叙事功能的意义，反而固守僵化的叙事生成范式，这是导致生成文本叙事逻辑单一、同质化严重的根本原因。

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp’s narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.

Keywords: Large Language Models, narrative generation, homogenization, Proppian narratology, Chinese web literature, narrative functions, text analysis, generative AI

171. ❌ Echoes Across Centuries: Phonetic Signatures of Persian Poets

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14443v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

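Score rationale: The paper analyzes the phonetic features of Persian poetry, applying traditional computational-linguistics methods (statistical models, phonetic metrics) to a large poetry corpus; it does not involve large models, deep learning, the principles of AI techniques, or AI-for-science applications. Every keyword concerns large-model techniques, deep-learning innovations, or scientific applications of AI, none of which relate to the paper's computational literary-studies topic.

!!! tip deepseek-chat TL;DR

The study uses computational phonetics to analyze phonetic texture in Persian poetry, finds systematic phonetic differences between poets within shared prosodic structures, and shows that shifts in phonetic distributions across centuries reflect literary-historical change.

Abstract translation (Chinese)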

本研究将波斯诗歌中的语音质感作为一种文学史现象加以考察，而非将其视为格律的副产品或仅用于分类的特征。分析基于一个大型语料库，该库包含来自83位诗人创作的31,988首诗歌的1,116,306个半行诗句（mesras），并限定于五种主要古典格律以实现受控比较。每个诗行被转换为字形到音位（grapheme-to-phoneme）的表征，并通过六项语音指标进行分析：硬辅音度、响亮度、咝擦度、元音比率、音位熵（phoneme entropy）以及辅音簇比率。

统计模型在控制格律、诗歌形式和诗行长度的前提下，估算了诗人层面的差异。结果表明，尽管格律和形式解释了语音变异的相当大部分，但并未消除诗人之间的系统性差异。因此，波斯诗歌的语音特征表现为共享韵律结构内的条件性变异，而非纯粹的个人风格或简单的格律残留。

多维风格图谱揭示了几种反复出现的语音特征模式，包括高响亮度抒情风格、硬辅音驱动的修辞或史诗风格、咝擦音突出的神秘主义轮廓以及高熵值的复杂质感。历史分析表明，语音分布在数个世纪中发生变迁，反映了体裁主导地位、文学制度及表演语境的变化，而非突然的风格断裂。

本研究为波斯诗歌的语音分析建立了一个语料库规模的框架，并展示了计算语音学如何能在关注塑造波斯诗歌的形式结构的同时，为文学史阐释作出贡献。

Abstract

This study examines phonetic texture in Persian poetry as a literary-historical phenomenon rather than a by-product of meter or a feature used only for classification. The analysis draws on a large corpus of 1,116,306 mesras from 31,988 poems written by 83 poets, restricted to five major classical meters to enable controlled comparison. Each line is converted into a grapheme-to-phoneme representation and analyzed using six phonetic metrics: hardness, sonority, sibilance, vowel ratio, phoneme entropy, and consonant-cluster ratio. Statistical models estimate poet-level differences while controlling for meter, poetic form, and line length. The results show that although meter and form explain a substantial portion of phonetic variation, they do not eliminate systematic differences between poets. Persian poetic sound therefore appears as conditioned variation within shared prosodic structures rather than as either purely individual style or simple metrical residue. A multidimensional stylistic map reveals several recurrent phonetic profiles, including high-sonority lyric styles, hardness-driven rhetorical or epic styles, sibilant mystical contours, and high-entropy complex textures. Historical analysis indicates that phonetic distributions shift across centuries, reflecting changes in genre prominence, literary institutions, and performance contexts rather than abrupt stylistic breaks. The study establishes a corpus-scale framework for phonetic analysis in Persian poetry and demonstrates how computational phonetics can contribute to literary-historical interpretation while remaining attentive to the formal structures that shape Persian verse.

Keywords: Persian poetry, phonetic analysis, computational phonetics, literary history, stylistic variation, corpus linguistics, poetic sound, statistical modeling
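
Two of the six metrics named in the abstract, phoneme entropy and vowel ratio, are simple enough to sketch. The abstract does not give their exact definitions, so the versions below are assumptions: Shannon entropy (in bits) of the phoneme distribution within a line, and the fraction of phonemes that are vowels. The vowel inventory and the toy phoneme sequence are hypothetical and are not real grapheme-to-phoneme output.

```python
import math
from collections import Counter

# Hypothetical romanized vowel inventory, for illustration only.
VOWELS = {"a", "e", "o", "i", "u", "â"}

def phoneme_entropy(phonemes: list[str]) -> float:
    """Shannon entropy in bits of the phoneme distribution in one line."""
    counts = Counter(phonemes)
    total = len(phonemes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def vowel_ratio(phonemes: list[str]) -> float:
    """Fraction of phonemes in the line that belong to the vowel inventory."""
    return sum(p in VOWELS for p in phonemes) / len(phonemes)

# Toy phoneme sequence standing in for one converted line.
line = ["b", "e", "sh", "e", "n", "o"]
entropy_bits = phoneme_entropy(line)
ratio = vowel_ratio(line)
```

Per-line values like these would then feed the poet-level statistical models the abstract describes, which control for meter, poetic form, and line length.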

172. ❌ BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation

Authors: Zhaoyi Li, Xu Zhang, Xiaojun Wan Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

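Score rationale: The paper's core contribution is BiT-MCTS, a framework that combines Monte Carlo Tree Search (MCTS) with LLMs for Chinese fiction generation, so it is highly relevant to "Monte Carlo Tree Search OR MCTS AND LLM" (10 points). It runs experiments on three contemporary LLM backbones, so it is also highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address LLMs' difficulty in guaranteeing global structure and narrative diversity when generating long-form fiction from open-ended themes, the paper proposes BiT-MCTS, a theme-based bidirectional MCTS framework that builds a structured outline with a "climax-first, bidirectional expansion" strategy and then realizes the full narrative; experiments show it outperforms baselines in narrative coherence, plot structure, and thematic depth.

Abstract translation (Chinese)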

基于开放主题生成长篇线性小说对大语言模型而言仍是一个重大挑战，采用基于前提或线性大纲的方法时，模型往往难以保证整体结构和叙事多样性。我们提出BiT-MCTS，这是一个主题驱动的框架，其采用受弗莱塔格金字塔启发的“高潮优先、双向扩展”策略。给定一个主题，本方法首先提取核心戏剧冲突并生成明确的高潮，随后通过双向蒙特卡洛树搜索（MCTS）向后（发展行动、开端）和向前（下降行动、结局）扩展情节，以生成结构化大纲。最终生成阶段依据优化后的大纲实现完整叙事。我们构建了一个中文主题语料库用于评估，并在三种当代大语言模型骨干上进行了广泛实验。结果表明，相较于强基线模型，BiT-MCTS在叙事连贯性、情节结构和主题深度方面均有提升，且根据自动指标和人工评估，能够生成明显更长、更连贯的故事。

Abstract

Generating long-form linear fiction from open-ended themes remains a major challenge for large language models, which frequently fail to guarantee global structure and narrative diversity when using premise-based or linear outlining approaches. We present BiT-MCTS, a theme-driven framework that operationalizes a “climax-first, bidirectional expansion” strategy motivated by Freytag’s Pyramid. Given a theme, our method extracts a core dramatic conflict and generates an explicit climax, then employs a bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution) to produce a structured outline. A final generation stage realizes a complete narrative from the refined outline. We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones. Results show that BiT-MCTS improves narrative coherence, plot structure, and thematic depth relative to strong baselines, while enabling substantially longer, more coherent stories according to automatic metrics and human judgments.

Keywords: Chinese fiction generation, Monte Carlo Tree Search, large language models, theme-driven framework, bidirectional expansion, narrative coherence, plot structure, thematic depth
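
The abstract does not specify how BiT-MCTS selects which plot branch to expand. As a rough illustration of the selection step in a generic MCTS, the sketch below uses the standard UCT rule; the `Node` fields, the `direction` labels, the exploration constant, and the example rewards are all assumptions and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    event: str                  # candidate plot event at this outline step
    direction: str              # "backward" (toward exposition) or "forward" (toward resolution)
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus; unvisited nodes go first."""
    if child.visits == 0:
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))

# Hypothetical tree rooted at the climax, with candidates in both directions.
climax = Node("climax", "root", visits=20)
rising_a = Node("betrayal revealed", "backward", visits=10, total_reward=7.0)
rising_b = Node("journey begins", "backward", visits=10, total_reward=4.0)
falling_c = Node("aftermath", "forward")  # not yet visited
climax.children = [rising_a, rising_b, falling_c]
next_node = select_child(climax)
```

Here the unvisited forward candidate is selected first; once every child has been visited, the choice balances mean reward against the exploration bonus.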

173. ❌ Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Authors: Andrew Katz Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14400v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

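Score rationale: The paper proposes an evaluation framework based on information-theoretic surprisal (negative log probability) that extends language-model evaluation to several applied domains, which bears directly on evaluation methods for large language models (8 points) and on explainable AI (8 points). It explores four applied domains (social-ecological-technological systems classification, causal statement identification, figurative language detection, and deductive qualitative coding), giving it some connection to "AI for Science" (5 points). The other keywords concern specific model architectures, training methods, inference optimization, agent systems, and similar techniques that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper extends surprisal-based evaluation of language models from binary grammaticality judgments to ordinal classification and scoring tasks across domains: it measures the surprisal a model assigns to each position on a rating scale to reveal the model's preferred response and its uncertainty, and explores the approach in four applied domains.

Abstract translation (Chinese)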

通过对比模型对对立补全句的概率进行最小对立对分析的研究范式，已被证明对评估语言模型的语言知识具有重要价值，但其应用目前主要局限于句法现象的二值语法性判断。此外，基于标准提示的评估方法需要昂贵的文本生成过程，可能引发模型的事后合理化解释而非真实判断，并且会丢弃模型不确定性的相关信息。我们通过将基于惊异值的评估方法从二值语法性对比扩展到多领域的序数分类与评分任务，以应对这两方面的局限。我们不要求模型生成答案，而是测量模型为评级量表（如1-5分或1-9分）上每个位置分配的信息论“惊异值”（即负对数概率），从而得到完整的惊异值曲线。这些曲线既能通过最小值揭示模型的偏好响应，也能通过熵值反映其不确定性。我们在四个领域中探索了这一框架：社会-生态-技术系统分类、因果陈述识别（二值与序数）、比喻语言检测以及演绎定性编码。在这些领域中，惊异值曲线产生了可解释的分类信号，其清晰的最小值均出现在预期的序数量表位置附近，而补全句的熵值则能有效区分真正模糊的条目与相对简单的条目。

Abstract

The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic “surprise” (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model’s preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.

Keywords: surprisal-based evaluation, language models, ordinal classification, model uncertainty, information-theoretic surprise, applied domains, minimal pairs paradigm, entropy analysis
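
The surprisal curve described in the abstract can be sketched in a few lines. The log-probabilities below are invented for illustration, and renormalizing over the scale labels before computing entropy is an assumption; the paper may compute entropy differently. Surprisal is the negative log probability of each scale position, the preferred response is the position with minimum surprisal, and entropy summarizes uncertainty.

```python
import math

def surprisal_curve(logprobs: dict[str, float]) -> dict[str, float]:
    """Surprisal in nats: negative log probability of each scale label."""
    return {label: -lp for label, lp in logprobs.items()}

def scale_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy in nats after renormalizing over the scale labels only."""
    probs = [math.exp(lp) for lp in logprobs.values()]
    z = sum(probs)
    return -sum((p / z) * math.log(p / z) for p in probs if p > 0)

# Hypothetical log-probabilities for the next token being "1".."5" on a 1-5 scale.
logprobs = {
    "1": math.log(0.05),
    "2": math.log(0.10),
    "3": math.log(0.60),
    "4": math.log(0.20),
    "5": math.log(0.05),
}
curve = surprisal_curve(logprobs)
preferred = min(curve, key=curve.get)   # scale position with minimum surprisal
uncertainty = scale_entropy(logprobs)   # higher means a flatter curve
```

A flat curve (entropy near `math.log(5)` on a five-point scale) would mark an item as ambiguous, while a sharp minimum marks a confident rating.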

174. ❌ Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Authors: Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14355v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

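Score rationale: The paper's core topic is a safety-evaluation method for LLMs. It directly involves LLMs and safety-tuning techniques such as SFT and RLHF, and it relates to safety alignment and factuality, but it does not touch other technical areas such as MoE, quantization, or inference acceleration.

!!! tip deepseek-chat TL;DR

The paper proposes PDPS, an efficient diverse response sampling method that systematically exposes rare but critical safety failures in large language models, matching the attack success rate of large-scale IID sampling at 8% to 29% of the compute cost and improving success rates by 26% to 40% under limited-response settings.

Abstract translation (Chinese)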

通过监督微调与人类反馈强化学习进行的安全调优，已显著提升了大语言模型（LLMs）的鲁棒性。然而，这种方法往往只是抑制而非彻底消除不安全行为，使得罕见但关键的安全失效问题仍隐藏于输出分布的长尾之中。尽管多数红队测试工作侧重于对抗性提示搜索（输入空间优化），本研究表明，针对固定的安全关键提示，通过多样化响应生成（输出空间探索）同样能系统性地暴露安全缺陷——增加采样响应的数量与多样性可使越狱成功率趋近于100%。为高效发现此类失效案例，我们提出渐进式多样化群体采样（Progressive Diverse Population Sampling, PDPS），该方法结合随机令牌级采样与多样性感知选择，以探索大规模的候选响应池并保留一个紧凑且语义多样化的子集。在多个越狱基准测试及开源大语言模型上，PDPS仅需8%至29%的计算成本，即可达到与大规模独立同分布（IID）采样相当的攻击成功率。在有限响应生成设置下，其成功率较IID采样及多样化束搜索（Diverse Beam Search）提升26%至40%。此外，PDPS生成的响应在数量与多样性上均表现出更多的不安全输出，证明其在揭示更广泛安全失效范围方面的有效性。

Abstract

Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.

Keywords: Large Language Models, Safety Failures, Jailbreak, Diverse Response Sampling, Supervised Fine-tuning, Reinforcement Learning from Human Feedback, Progressive Diverse Population Sampling, Attack Success Rate
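
The abstract says PDPS pairs stochastic sampling with diversity-aware selection to keep a compact, semantically diverse subset, but it does not give the selection rule. The sketch below substitutes a generic technique, greedy farthest-point selection over response embeddings with cosine distance, purely to illustrate what "diversity-aware selection" can look like. The function names and the toy two-dimensional embeddings are assumptions, and this is not the PDPS algorithm.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def select_diverse(embeddings: list[list[float]], k: int) -> list[int]:
    """Greedy farthest-point selection: repeatedly add the candidate whose
    nearest already-chosen neighbor is farthest away."""
    if not embeddings or k <= 0:
        return []
    chosen = [0]  # seed with the first candidate (an arbitrary choice)
    while len(chosen) < min(k, len(embeddings)):
        remaining = (i for i in range(len(embeddings)) if i not in chosen)
        best = max(
            remaining,
            key=lambda i: min(cosine_distance(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy embeddings standing in for sampled responses; index 1 nearly duplicates index 0.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
subset = select_diverse(embeddings, k=3)
```

With these toy vectors the near-duplicate at index 1 is skipped, which is the behavior a diversity-aware filter is meant to provide over plain IID sampling.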

175. ❌ Motivation in Large Language Models

Authors: Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14347v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

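Score rationale: The paper's core question is whether LLMs exhibit something like human motivation, so it is highly relevant to "Large Language Models" (10 points). It mentions the alignment of LLMs with human preferences, which gives it some connection to "Instruction Tuning OR Alignment OR Value Alignment" (5 points), though it is not a study of alignment techniques. The other keywords concern specific techniques (MoE, quantization, inference acceleration, and so on) or application areas (such as AI for science) that the paper does not cover, so they all score 0.

!!! tip deepseek-chat TL;DR

The study asks whether large language models exhibit human-like motivation; experiments find that LLMs' self-reported motivation is consistent with their behavioral patterns and can be influenced by external factors, revealing motivational dynamics similar to those documented in human psychology.

Abstract translation (Chinese)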

动机是人类行为的核心驱动力，它塑造决策、目标与任务表现。随着大语言模型（LLMs）日益与人类偏好对齐，我们探讨其是否展现出类似动机的特性。本研究检验了大语言模型是否会“报告”不同水平的动机，这些报告如何与其行为相关联，以及外部因素是否能够对其产生影响。实验结果显示出一致且结构化的模式，呼应了人类心理学规律：模型自我报告的动机与其不同的行为特征相符，随任务类型变化，并能受外部操控调节。这些发现表明，动机是理解大语言模型行为的一个连贯组织性构念，它系统性地将报告、选择、投入和表现联系起来，并揭示了与人类心理学记载相似的动机动态机制。这一视角深化了我们对于模型行为及其与人类启发概念之间联系的理解。

Abstract

Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs “report” varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.

Keywords: Large Language Models, Motivation, Human Psychology, Behavioral Signatures, Alignment, Self-reported Motivation, Task Performance, Model Behavior

176. ❌ ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Authors: Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

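Score rationale: The paper evaluates the reasoning abilities of multimodal large language models in ECG diagnosis. Its core topics are large models (MLLMs), step-by-step reasoning (Chain of Thought / System 2 Thinking), and a scientific application of AI (AI for Science), so it is highly relevant to these keywords (10 points each). Other keywords such as MoE, quantization, and alignment are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

Using the ECG-Reasoning-Benchmark evaluation framework, the paper finds that current multimodal large language models lack genuine step-by-step reasoning in ECG diagnosis and rely mainly on superficial visual cues rather than the actual visual evidence in the signal.

Abstract translation (Chinese)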

尽管多模态大语言模型在自动心电图解读方面展现出有前景的性能，但其是否真正执行逐步推理，还是仅依赖表面视觉线索，目前尚不明确。为探究这一问题，我们引入了 ECG-Reasoning-Benchmark，这是一个新颖的多轮次评估框架，包含超过 6,400 个样本，用于系统评估涵盖 17 种核心心电图诊断的逐步推理能力。我们对前沿模型进行的全面评估揭示了一个关键缺陷：它们在执行多步骤逻辑推理方面存在严重失败。尽管模型具备检索诊断所需临床标准的医学知识，但在维持完整推理链方面成功率极低（完成度仅为 6%），主要失败在于未能将相应的心电图发现与心电图信号中的实际视觉证据进行有效关联。这些结果表明，当前的多模态大语言模型绕过了实际的视觉解读，暴露了现有训练范式的关键缺陷，并凸显了构建以推理为核心、鲁棒的医疗人工智能的必要性。代码与数据可在 https://github.com/Jwoo5/ecg-reasoning-benchmark 获取。

Abstract

While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step-by-step reasoning or just rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.

Keywords: Multimodal Large Language Models, ECG interpretation, step-by-step reasoning, clinical reasoning, medical AI, evaluation benchmark, visual evidence grounding, reasoning chain

177. ❌ Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models

Authors: Yixuan Tang, Yi Yang Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14313v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

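Scoring rationale: The paper's core is using frozen large language models (LLMs) to decode monetary-policy stance from FOMC statements, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The work applies large models to economics and finance, which gives it some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): that keyword covers AI applications in scientific fields, and economics can be regarded as part of the social sciences. The paper does not involve the specific techniques behind the other keywords (such as MoE, SFT, or RAG), nor does it mention any expert from the author list, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes Delta-Consistent Scoring (DCS), an annotation-free framework that uses frozen large language models to decode continuous hawkish-dovish stance scores from FOMC statements. By jointly modeling absolute stance and relative inter-meeting shifts, it outperforms supervised probes and LLM-as-judge baselines across four LLM backbones, and its scores correlate with inflation indicators and Treasury yield movements.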

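Abstract translation

Federal Open Market Committee (FOMC) statements are a primary source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A core task is therefore to measure the hawkish-dovish stance these texts convey. Existing methods usually treat stance detection as a standard classification problem and label each statement independently. However, understanding monetary-policy communication is inherently relative: market reactions depend not only on a statement's tone but also on how that tone changes across meetings. This paper proposes Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting changes. DCS does not rely on manual hawkish-dovish labels; it uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements, and a delta-consistency objective encourages changes in the absolute scores to agree with the relative shifts. This lets DCS reconstruct a temporally coherent stance trajectory without manual annotation. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, reaching up to 71.1% accuracy on sentence-level hawkish-dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.

Abstract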

Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish–dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish–dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish–dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.

Keywords: Large Language Models, FOMC statements, monetary policy stance, hawkish-dovish classification, Delta-Consistent Scoring, temporal coherence, self-supervision, economic indicators
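
The delta-consistency idea in the abstract (changes in absolute scores should agree with predicted relative shifts) can be illustrated with a small numeric sketch. The abstract does not give the exact objective, so the squared-error form below, the function name `delta_consistency_loss`, and the array layout are all assumptions made for illustration.

```python
import numpy as np

def delta_consistency_loss(abs_scores, rel_shifts):
    """Mean squared gap between implied and predicted inter-meeting shifts.

    abs_scores: shape (T,), one absolute stance score per meeting, in time order.
    rel_shifts: shape (T-1,), predicted shift from meeting t to meeting t+1.
    The squared-error form is an assumption, not the paper's stated loss.
    """
    abs_scores = np.asarray(abs_scores, dtype=float)
    rel_shifts = np.asarray(rel_shifts, dtype=float)
    if rel_shifts.shape[0] != abs_scores.shape[0] - 1:
        raise ValueError("need exactly one shift per consecutive pair of meetings")
    implied = np.diff(abs_scores)  # s[t+1] - s[t]
    return float(np.mean((implied - rel_shifts) ** 2))
```

A trajectory whose score differences match the predicted shifts gives a loss of zero; any disagreement between the two heads raises it.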

178. ❌ SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging

Authors: Shunlong Wu, Hai Lin, Shaoshen Chen, Tingwei Lu, Yongqin Zeng, Shaoxiong Zhan, Hai-Tao Zheng, Hong-Gee Kim Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14303v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

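Scoring rationale: SemantiCache focuses on KV cache compression, which is the paper's core contribution, so it is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15 points). The paper concerns accelerating LLM inference, so it is also relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) and 'Speculative Decoding OR Inference Acceleration' (10 points). It mentions reducing memory footprint, which gives it some connection to 'Quantization OR Model Compression OR Low-bit Weights' (5 points). Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SemantiCache, a new KV cache compression framework that preserves semantic integrity through semantic chunking and clustered merging. It speeds up the decoding stage of inference by up to 2.61x and substantially reduces memory footprint while maintaining performance comparable to the original model.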

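Abstract translation

Existing key-value (KV) cache compression methods usually operate on discrete tokens or non-semantic chunks. Such methods often cause semantic fragmentation, in which linguistically coherent units are broken apart, leading to irreversible information loss and degraded model performance. To address this, we propose SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchy of language. Specifically, we first partition the cache into semantically coherent chunks using delimiters, which act as natural semantic boundaries. Within each chunk, we introduce a computationally efficient greedy seed-based clustering algorithm that groups tokens into semantic clusters. These clusters are then merged into semantic cores and enhanced by a proportional attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models show that SemantiCache accelerates the decoding stage of inference by up to 2.61x and substantially reduces memory footprint while maintaining performance comparable to the original model.

Abstract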

Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. However, such approaches often lead to semantic fragmentation, where linguistically coherent units are disrupted, causing irreversible information loss and degradation in model performance. To address this, we introduce SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchical nature of language. Specifically, we first partition the cache into semantically coherent chunks by delimiters, which are natural semantic boundaries. Within each chunk, we introduce a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm to group tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to the original model.

Keywords: KV cache compression, semantic chunking, clustered merging, inference acceleration, memory footprint reduction, decoding stage, semantic integrity, attention mechanism
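
To make the "cluster, then merge" step concrete, here is a minimal sketch of one way a greedy seed-based grouping could work inside a single chunk. The abstract does not specify the GSC algorithm, so the seeding rule (first unassigned token), the cosine-similarity threshold, the mean-based merge, and both function names are assumptions; the Proportional Attention step is not modeled.

```python
import numpy as np

def greedy_seed_clusters(vectors, threshold=0.8):
    """Group row vectors greedily: the first unassigned row seeds a cluster and
    absorbs every remaining row whose cosine similarity to it is >= threshold.
    This seeding rule is an illustrative assumption, not the paper's GSC."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    unassigned = list(range(len(unit)))
    clusters = []
    while unassigned:
        seed, rest = unassigned[0], unassigned[1:]
        sims = unit[np.array(rest, dtype=int)] @ unit[seed]
        members = [seed] + [i for i, s in zip(rest, sims) if s >= threshold]
        clusters.append(members)
        unassigned = [i for i in rest if i not in members]
    return clusters

def merge_clusters(vectors, clusters):
    """Replace each cluster with the mean of its members (one merged entry)."""
    vectors = np.asarray(vectors, dtype=float)
    return np.stack([vectors[c].mean(axis=0) for c in clusters])
```

With three toy key vectors where the first two point in nearly the same direction, the sketch keeps two merged entries instead of three, which is the kind of cache reduction the abstract describes.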

179. ❌ Automatic Inter-document Multi-hop Scientific QA Generation

Authors: Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

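Scoring rationale: The paper's core is using LLMs to automatically generate scientific QA datasets. Highly relevant keywords: LLMs (used directly), RAG (the dataset targets retrieval-augmented scientific reasoning), and AI for Science (applied to PubMed biomedical literature). Moderately relevant: Chain of Thought / System 2 Thinking (multi-hop reasoning is involved) and Factuality (factual consistency is validated). The other keywords are not addressed.

!!! tip deepseek-chat TL;DR

The study proposes AIM-SciQA, a framework that uses large language models to automatically generate inter-document multi-hop scientific QA datasets, addressing the limitation of existing methods to single-document factoid QA. It builds the IM-SciQA dataset, with more than 410,000 single-hop and more than 13,000 multi-hop QAs, which supports evaluation of retrieval-augmented scientific reasoning.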

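Abstract translation

Existing research on automatic scientific question generation focuses mainly on single-document factoid QA and does not cover the multi-document reasoning that scientific understanding requires. We propose AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. The framework extracts single-hop QA pairs using large language models (LLMs) combined with machine reading comprehension, builds cross-document relations through embedding-based semantic alignment, and selectively uses citation information. Applied to 8,211 PubMed Central papers, it generated 411,409 single-hop and 13,672 multi-hop QA pairs, which form the IM-SciQA dataset. Human and automatic validation both confirm that the dataset has high factual consistency, and experiments show that IM-SciQA effectively distinguishes reasoning capabilities across the retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend the framework to build CIM-SciQA, a citation-guided variant whose performance is comparable to the Oracle setting, reinforcing the dataset's validity and generality.

Abstract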

Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset’s validity and generality.

Keywords: scientific QA generation, multi-document reasoning, large language models, retrieval-augmented generation, PubMed Central, multi-hop QA, citation-guided, benchmark dataset
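
The "embedding-based semantic alignment" used to connect QAs across documents can be pictured as a similarity search restricted to pairs from different papers. The sketch below is only an illustration under assumptions: the embeddings are taken as given, and the cosine threshold and the function name `cross_document_links` are hypothetical and do not come from the paper.

```python
import numpy as np

def cross_document_links(embeddings, doc_ids, threshold=0.75):
    """Return (i, j, similarity) for item pairs from different documents whose
    cosine similarity is at least `threshold` (threshold is illustrative)."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    links = []
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            # Same-document pairs are skipped: only inter-document hops count.
            if doc_ids[i] != doc_ids[j] and sims[i, j] >= threshold:
                links.append((i, j, float(sims[i, j])))
    return links
```

Each returned pair is a candidate bridge between two single-hop QAs; composing an actual multi-hop question from such a pair is a separate step that this sketch does not cover.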

180. ❌ Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14251v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

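Scoring rationale: The paper studies overthinking in Large Reasoning Language Models (LRLMs) during chain-of-thought reasoning and proposes an early-exit method based on monitoring reasoning path deviation. Core keywords: 'Large Language Models' (the object of study), 'Chain of Thought' (the core reasoning method), and 'System 2 Thinking' (deep reasoning processes are involved). 'Self-Correction' and 'Inference Acceleration' are somewhat relevant because the method aims to improve model performance and efficiency by terminating redundant reasoning. Other keywords such as MoE, SFT, and RAG are not addressed in the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the tendency of large reasoning language models to produce redundant steps (overthinking) during chain-of-thought reasoning. It proposes an early-exit method based on monitoring reasoning path deviation, and experiments indicate that it improves model performance more effectively than existing methods.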

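Abstract translation

Large Reasoning Language Models (LRLMs) show impressive capabilities on complex tasks by using long chain-of-thought reasoning. However, they are prone to overthinking, producing redundant reasoning steps that reduce both performance and efficiency. Recently, researchers have proposed early-exit strategies that mitigate overthinking by dynamically and adaptively terminating redundant reasoning. Existing early-exit methods, however, either rely on proxy models and thus add training overhead, or limit inference throughput because they frequently switch between reasoning and generating probing answers. In addition, most early-exit methods harm LRLM performance through over-truncation. Our insight comes from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, and this is frequently accompanied by high-entropy transition tokens. Based on this, we propose an early-exit method deeply coupled with the native reasoning process. It uses a path deviation index as a dedicated monitoring metric and dynamically identifies and terminates overthinking trajectories by detecting the frequent occurrence of high-entropy transition tokens. We run experiments on multiple benchmarks with LRLMs of different types and scales, and the results show that, compared with existing early-exit methods, our method achieves the largest performance improvement over vanilla chain-of-thought (CoT).

Abstract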

Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies have been proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLM performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.

Keywords: Large Reasoning Language Models, Chain-of-Thought, overthinking, early-exit, reasoning path deviation, high-entropy transition tokens, performance improvement, inference efficiency
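
A rough sketch of entropy-based monitoring may help picture the mechanism. The abstract does not define the path deviation index, so the code below is a stand-in under stated assumptions: it flags any decoding step whose next-token distribution has high Shannon entropy (it does not check whether the token is a "transition" token) and signals an exit when such steps dominate a sliding window. The thresholds, window size, and function names are all hypothetical.

```python
import math
from collections import deque

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_exit_step(step_distributions, entropy_threshold=1.0,
                    window=4, max_fraction=0.5):
    """Return the first step index at which more than `max_fraction` of the
    last `window` steps were high-entropy, or None if that never happens.
    This windowed rule is an assumption, not the paper's path deviation index."""
    recent = deque(maxlen=window)
    for step, probs in enumerate(step_distributions):
        recent.append(token_entropy(probs) > entropy_threshold)
        if len(recent) == window and sum(recent) / window > max_fraction:
            return step
    return None
```

Because the check reads only distributions the model already produces while decoding, a monitor of this kind needs no proxy model and no interleaved probing answers, which are the two costs the abstract attributes to earlier early-exit methods.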

181. ❌ Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Authors: Mohamed Aghzal, Gregory J. Stein, Ziyu Yao Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

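Scoring rationale: The paper's core is a study of why LLM-based web agents fail, analyzed through a hierarchical planning framework. Highly relevant keywords: LLMs (the core object of study), LLM Agents (the research topic), and Tool Use (web agents involve tool use). Moderately relevant: Chain of Thought and System 2 Thinking (analysis of the reasoning process), Self-Correction (replanning is involved), and Explainable AI (a diagnostic framework is provided). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, optimization techniques, and specific application domains, have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper uses a hierarchical planning framework to analyze why LLM-based web agents fail. It finds that low-level execution is the main bottleneck and that improving perceptual grounding and adaptive control is critical to reaching human-level reliability.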

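Abstract translation

Large language model (LLM) web agents are increasingly used for web navigation, yet they remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus mainly on end-to-end success rates and offer limited insight into where failures arise. We propose a hierarchical planning framework that analyzes web agents at three levels (high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Experiments show that structured Planning Domain Definition Language (PDDL) plans yield more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the main bottleneck. These results indicate that improving perceptual grounding and adaptive control, and not only high-level reasoning, is critical to reaching human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

Abstract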

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

Keywords: LLM web agents, hierarchical planning, web navigation, PDDL plans, perceptual grounding, adaptive control, process-based evaluation, replanning

182. ❌ QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Authors: Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

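Scoring rationale: The paper's core is using LLMs to generate SVAs for hardware verification, an application of AI to science and engineering. Highly relevant keywords: 'Large Language Models' (the paper trains specialized LLMs), 'Post-training/SFT' (training SVA generation models), and 'AI for Science' (a hardware verification application). Moderately relevant keywords: 'Scaling Laws AND Data Quality' (data scarcity and quality issues are involved) and 'Pre-training/Domain Adaptation' (domain adaptation may be involved). Other keywords such as MoE, SLMs, RLHF, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a data synthesis framework that uses RTL-guided LLMs to generate high-quality SVA data and trains the CodeV-SVA model series, which matches or exceeds GPT-5 and DeepSeek-R1 on the NL2SVA hardware verification task.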

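Abstract translation

SystemVerilog Assertions (SVAs) are essential to hardware verification. Recent work uses general-purpose large language models to translate natural-language properties into SVAs (NL2SVA), but performance is poor because data is limited. We propose a data synthesis framework that addresses two challenges: the scarcity of high-quality real-world SVA corpora and the lack of a reliable method for judging semantic equivalence between natural language and SVA. For the former, we use large-scale open-source register-transfer level (RTL) designs to guide LLMs in generating real-world SVAs; for the latter, we use bidirectional translation as a data selection mechanism. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B reaches 75.8% and 84.0% Func.@1 on NL2SVA-Human and NL2SVA-Machine respectively, matching or exceeding advanced LLMs such as GPT-5 and DeepSeek-R1.

Abstract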

SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.

Keywords: SystemVerilog Assertions, hardware verification, LLMs, data synthesis, RTL, NL2SVA, CodeV-SVA, bidirectional translation
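
Bidirectional translation as a data filter can be sketched as a round-trip check: describe an SVA in natural language, translate the description back, and keep the pair only if the result matches the original. The abstract does not spell out the procedure, so everything here is an assumption: `sva_to_nl` and `nl_to_sva` are hypothetical callables standing in for LLM calls, and whitespace-normalized string equality stands in for whatever equivalence check the authors actually use.

```python
from typing import Callable, Iterable, List, Tuple

def normalize(sva: str) -> str:
    """Collapse whitespace so formatting differences do not count as mismatches."""
    return " ".join(sva.split())

def round_trip_filter(
    svas: Iterable[str],
    sva_to_nl: Callable[[str], str],   # hypothetical: SVA -> description
    nl_to_sva: Callable[[str], str],   # hypothetical: description -> SVA
) -> List[Tuple[str, str]]:
    """Keep (description, sva) pairs whose round trip reproduces the SVA.
    String equality is a crude placeholder for semantic equivalence."""
    kept = []
    for sva in svas:
        description = sva_to_nl(sva)
        recovered = nl_to_sva(description)
        if normalize(recovered) == normalize(sva):
            kept.append((description, sva))
    return kept
```

The point of the round trip is that an ambiguous or wrong description tends not to regenerate the assertion it came from, so the filter removes pairs whose natural-language side is unreliable.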

183. ❌ Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

Authors: Tianyi Zhang, David Traum Journal/Source: arXiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

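Scoring rationale: The paper mainly studies evaluation methods for retrieval-augmented dialogue systems and centers on retrieval-augmented generation (RAG), so that keyword is highly relevant (10 points). The paper uses LLMs as one of its evaluation tools, so it has some connection to the LLM keyword (8 points). Other keywords such as MoE, quantization, inference acceleration, and alignment are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper re-examines how retrieval-augmented personalized dialogue systems are evaluated. It finds that current surface-similarity metrics fail to capture the deeper quality of dialogue and proposes a more reliable evaluation framework grounded in cognitive and linguistic principles.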

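Abstract translation

In cognitive science and linguistic theory, dialogue is viewed not as a chain of independent utterances but as a joint activity sustained by coherence, consistency, and shared understanding. However, many open-domain and personalized dialogue systems report surface-level similarity metrics (such as BLEU, ROUGE, and F1) as one of their main measures, and these metrics cannot capture the deeper dimensions of conversational quality. We re-examine LAPDOG, a well-known retrieval-augmented framework for personalized dialogue, as a case study in evaluation methodology. Using both human judges and large language model (LLM) judges, we identify several limitations in current evaluation practice, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments agree closely but differ markedly from lexical similarity metrics, which highlights the need for cognitively grounded evaluation methods. Overall, this work charts an evaluation path for retrieval-augmented dialogue systems, aiming at more reliable assessment frameworks that better reflect the basic principles of natural human communication.

Abstract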

In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.

Keywords: retrieval-augmented dialogue, personalized dialogue, evaluation methodology, LLM-based judges, cognitive science, linguistic theory, dialogue coherence, human communication
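
The weakness of surface-level metrics is easy to reproduce with a token-overlap F1, one common form of the "F1" mentioned in the abstract. The sketch below is an illustration only: lowercased whitespace tokenization is an assumption, the example sentences are invented, and it does not reproduce the paper's evaluation setup.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "i love hiking with my dog"
contradiction = "i hate hiking with my dog"               # opposite meaning
paraphrase = "walking trails with my puppy is my favorite"  # same persona

# The contradictory reply shares more tokens with the reference than the
# consistent paraphrase does, so it receives the higher lexical score.
assert token_f1(contradiction, reference) > token_f1(paraphrase, reference)
```

A reply that contradicts the persona outranks one that is consistent with it, which matches the abstract's observation that lexical similarity can diverge from human and LLM judgments.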

184. ❌ Towards Generalizable Robotic Manipulation in Dynamic Environments

Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

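Scoring rationale: The paper focuses on robotic manipulation and proposes the DOMINO dataset and the PUMA architecture to address Vision-Language-Action models in dynamic environments. All of the keywords concern language models, training methods, reasoning techniques, or specific AI applications, whereas this paper studies the specific domain of robotic manipulation. It has only a weak connection to 'World Models AND General World Models' (it mentions 'world queries' and state prediction); the other keywords are entirely unrelated.

!!! tip deepseek-chat TL;DR

The paper addresses the weak manipulation performance of Vision-Language-Action models in dynamic environments. It proposes the DOMINO dataset and the PUMA architecture, which integrates historical optical flow and world queries for short-horizon prediction, and achieves a 6.3% absolute improvement in success rate on dynamic tasks.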

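Abstract translation

Vision-Language-Action (VLA) models perform well on static manipulation tasks but poorly in dynamic environments with moving targets. This performance gap stems mainly from the scarcity of dynamic manipulation datasets and from mainstream VLA models' reliance on single-frame observations, which limits their spatiotemporal reasoning. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, comprising 35 tasks of hierarchical complexity, more than 110K expert demonstration trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLA models on dynamic tasks, explore effective training strategies for improving dynamic awareness, and validate the generalization value of dynamic data. We also propose PUMA, a dynamics-aware VLA architecture. By fusing scene-centric historical optical flow and using dedicated world queries to implicitly predict object-centric future states, it combines history-aware perception with short-horizon prediction. Experiments show that PUMA achieves state-of-the-art performance, with a 6.3% absolute improvement in success rate over baselines. We also find that training on dynamic data produces robust spatiotemporal representations that transfer to static tasks. All code and data are publicly available at https://github.com/H-EmbodVis/DOMINO.

Abstract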

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

Keywords: Robotic Manipulation, Dynamic Environments, Vision-Language-Action Models, Spatiotemporal Reasoning, Optical Flow, World Queries, Dataset Benchmark, Generalizability

185. ❌ Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15618v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

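Scoring rationale: The paper's core subject is Vision-Language-Action (VLA) models, an application of large models (LLMs/foundation models) to robotics. It explicitly proposes the Vision-Language Mixture-of-Transformers (VL-MoT) framework, which directly relates to Mixture of Experts (MoE) techniques. The other keywords, such as SLMs, scaling laws, the various training, alignment, and inference-optimization techniques, and AI for Science, are either not covered or mentioned only as background, with no direct contribution or discussion.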
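
The score table above pairs a 10.0/10 relevance with a 0.0 score because the weight column is 0.0. The sketch below illustrates that arithmetic. The pass-threshold formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section; the per-keyword rule in `keyword_score` is an assumption made for illustration and may not match how the report generator actually computes scores.

```python
# Sketch of the report's scoring arithmetic. The pass-threshold formula is
# taken from the report's configuration; keyword_score() is a hypothetical
# per-keyword rule, assumed here for illustration only.

MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_threshold(total_weight: float) -> float:
    """Pass threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Hypothetical rule: weight * relevance (0-10 scale), capped at the maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows):
    """Sum per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# Two keywords rated 10/10 and 25 rated 0/10, all with weight 0.0,
# mirroring the table above.
rows = [(0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25
print(round(pass_threshold(27.0), 1))  # threshold for 27 keywords of weight 1.0
print(total_score(rows))               # every term is multiplied by weight 0.0
```

Under this assumed rule, any paper scored with zero weights totals 0.0 regardless of relevance, which is consistent with the table but does not confirm the generator's real formula.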

!!! tip deepseek-chat TL;DR

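This paper addresses the underuse of visual information by Vision-Language-Action (VLA) models in robotic manipulation. It proposes the DeepVision-VLA framework, which uses VL-MoT shared attention and Action-Guided Visual Pruning, and outperforms prior state-of-the-art methods by 9.0% on simulated tasks and 7.5% on real-world tasks.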

摘要翻译

视觉-语言-动作（Vision-Language-Action, VLA）模型近年来已成为机器人操作领域一种颇具前景的范式，其可靠的动作预测关键在于准确解读并整合以语言指令为条件的视觉观察。尽管近期研究致力于增强VLA模型的视觉能力，但多数方法将大语言模型（LLM）主干视为黑箱，未能深入揭示视觉信息如何被融入动作生成过程。为此，我们对多种不同动作生成范式下的VLA模型进行了系统性分析，发现其在生成动作时，对视觉令牌的敏感性随网络层数加深而逐渐降低。基于此观察，我们提出了基于视觉-语言混合变换器（Vision-Language Mixture-of-Transformers, VL-MoT）框架的DeepVision-VLA。该框架实现了视觉基础模型与VLA主干之间的注意力共享，将来自视觉专家的多层次视觉特征注入VLA主干的更深层，从而增强视觉表征以支持精确且复杂的操作任务。此外，我们引入了**动作引导的视觉剪枝（Action-Guided Visual Pruning, AGVP）**机制，利用浅层注意力来剪除无关视觉令牌，同时保留与任务相关的部分，以最小的计算开销强化对操作至关重要的视觉线索。DeepVision-VLA在模拟任务和真实世界任务上分别以9.0%和7.5%的优势超越了先前的最优方法，为设计视觉增强型VLA模型提供了新的思路。

摘要 (Abstract)

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose **DeepVision-VLA**, built on a **Vision-Language Mixture-of-Transformers (VL-MoT)** framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce **Action-Guided Visual Pruning (AGVP)**, which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
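
The abstract describes pruning irrelevant visual tokens using shallow-layer attention. As a rough illustration of attention-based token pruning in general, the sketch below keeps the highest-scoring tokens and drops the rest. It is not the authors' AGVP implementation: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the use of a single score per token are all assumptions.

```python
import math


def prune_visual_tokens(tokens, attention_scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order.

    `attention_scores[i]` is assumed to summarize how much attention
    token i receives in a shallow layer (hypothetical input format).
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: attention_scores[i], reverse=True)
    # Restore positional order so downstream layers see a consistent sequence.
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


tokens = ["a", "b", "c", "d", "e", "f"]
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
print(prune_visual_tokens(tokens, scores, keep_ratio=0.5))
```

Keeping the surviving tokens in their original order is a design choice of this sketch; the abstract does not say how the paper handles ordering or position information after pruning.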

Keywords: Vision-Language-Action (VLA) models, robotic manipulation, Vision-Language Mixture-of-Transformers (VL-MoT), Mixture of Experts, visual representation enhancement, Action-Guided Visual Pruning (AGVP), attention mechanisms, foundation models

186. ❌ GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

Authors: Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

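Scoring rationale: The paper studies glyph accuracy in visual text rendering and proposes GlyphPrinter, a method based on Direct Preference Optimization (DPO). It is highly relevant to the keyword "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points), because its core innovation adapts DPO to visual text rendering. The paper does not address large models, innovations in the technical principles of deep learning, or the other keyword areas, so all other keywords score 0. Although it is an AI application, it is not an application in a scientific field such as biomedicine and does not involve large-model techniques, so it meets neither the "applications of large models and deep learning in scientific fields" requirement nor the "innovations in the technical principles of large models and deep learning" requirement of the research background.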

!!! tip deepseek-chat TL;DR

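This paper tackles glyph accuracy in visual text rendering. It proposes GlyphPrinter, built on Region-Grouped Direct Preference Optimization (R-GDPO), which substantially improves glyph accuracy while maintaining a favorable balance between stylization and precision.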

摘要翻译

生成精确的字形以实现视觉文本渲染至关重要，却也极具挑战性。现有方法通常通过在海量高质量场景文本图像上进行训练来增强文本渲染效果，但字形变体的有限覆盖和过度风格化往往会损害字形准确性，尤其是对于复杂或领域外字符。一些方法利用强化学习来缓解此问题，但其奖励模型通常依赖于对细粒度字形错误不敏感的文本识别系统，因此包含错误字形的图像仍可能获得高奖励。受直接偏好优化（Direct Preference Optimization, DPO）的启发，我们提出了GlyphPrinter，一种基于偏好的文本渲染方法，它消除了对显式奖励模型的依赖。然而，标准DPO目标仅建模两个样本间的整体偏好，这对于字形错误通常出现在局部区域的视觉文本渲染而言是不够的。为解决此问题，我们构建了带有区域级字形偏好标注的GlyphCorrector数据集，并提出了区域分组DPO（Region-Grouped DPO, R-GDPO），这是一种基于区域的目标函数，通过优化标注区域内的样本间及样本内偏好，显著提升了字形准确性。此外，我们引入了区域奖励引导，这是一种从具有可控字形准确性的最优分布中进行采样的推理策略。大量实验表明，所提出的GlyphPrinter在字形准确性上优于现有方法，同时在风格化与精确度之间保持了良好的平衡。

摘要 (Abstract)

Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
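
For context on the objective that R-GDPO builds on, the sketch below computes the standard DPO loss for a single preference pair, -log σ(β · margin), where the margin compares policy and reference log-probabilities of the chosen and rejected samples. This is the generic DPO formulation only; the region-grouped variant and how log-probabilities are obtained from an image generator are not specified in the abstract and are not modeled here. The `beta` default is an arbitrary illustrative value.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (inputs are log-probabilities)."""
    # Implicit reward margin: how much more the policy favors the chosen
    # sample over the rejected one, relative to the reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference, the margin is zero and the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# When the policy prefers the chosen sample more than the reference does, the loss drops.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because this loss scores each sample as a whole, it illustrates the limitation the abstract points out: a localized glyph error contributes to the pair-level preference only through the overall log-probability.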

Keywords: GlyphPrinter, Direct Preference Optimization, DPO, visual text rendering, glyph accuracy, Region-Grouped DPO, R-GDPO, Regional Reward Guidance

187. ❌ Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Authors: Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15614v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

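Scoring rationale: The paper studies a control framework for video diffusion models and focuses on computer vision and video generation. It does not involve large language models, innovations in the technical principles of deep learning, or applications in scientific fields. The keywords all concern large language models, training optimization, inference acceleration, AI agents, AI for science, and related topics, none of which relate to the paper's video-generation subject.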

!!! tip deepseek-chat TL;DR

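This paper proposes the Tri-Prompting framework, which addresses the lack of unified control over scene, subject, and motion in video diffusion models. It enables multi-view-consistent subjects and 3D-aware video generation, and significantly outperforms specialized baselines such as Phantom and DaS.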

摘要翻译

近期视频扩散模型在视觉质量方面取得了显著进展，但精确、细粒度的控制仍是限制内容创作实际可定制性的关键瓶颈。对于AI视频创作者而言，三种控制形式至关重要：（一）场景构图，（二）多视角一致的主体定制，以及（三）相机位姿或物体运动调整。现有方法通常孤立处理这些维度，对任意姿态变化下的多视角主体合成与身份保持的支持有限。这种统一架构的缺失使得难以实现多功能、可联合控制的视频生成。我们提出了Tri-Prompting，一个统一的框架及两阶段训练范式，整合了场景构图、多视角主体一致性与运动控制。该方法采用由三维跟踪点驱动的双条件运动模块处理背景场景，并利用下采样RGB线索处理前景主体。为确保可控性与视觉真实感之间的平衡，我们进一步提出了推理阶段ControlNet尺度调度策略。Tri-Prompting支持创新工作流，包括将三维感知主体插入任意场景，以及对图像中现有主体进行操控。实验结果表明，Tri-Prompting在多视角主体身份保持、三维一致性与运动准确性方面显著优于Phantom、DaS等专业基线模型。

摘要 (Abstract)

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.

Keywords: video diffusion models, scene composition, multi-view subject consistency, motion control, 3D tracking, ControlNet, subject insertion, 3D consistency

188. ❌ HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

Authors: Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15612v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

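Scoring rationale: HSImul3R focuses on 3D reconstruction, physics simulation, and robotics applications. Its core is physics-constrained optimization, reinforcement learning, and simulation feedback; it does not involve large language models, the technical principles of deep learning, or specific AI for Science applications such as bioinformatics. The keywords all concern large models, deep learning techniques, or specific scientific AI applications and have no direct connection to the paper's computer vision, physics simulation, and robotics subject matter, so the relevance of every keyword is 0.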

!!! tip deepseek-chat TL;DR

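This paper proposes the HSImul3R framework, which closes the perception-simulation gap in human-scene interaction reconstruction through physically grounded bi-directional optimization. According to the authors, it produces the first stable, simulation-ready 3D reconstructions and can be deployed directly to humanoid robots.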

摘要翻译

我们提出HSImul3R，这是一个用于从日常捕捉数据（包括稀疏视角图像和单目视频）中实现人-场景交互（Human-Scene Interaction, HSI）仿真就绪三维重建的统一框架。现有方法存在感知与仿真的鸿沟：视觉上看似合理的重建结果常违反物理约束，导致在物理引擎中不稳定，并在具身智能应用中失效。为弥合这一差距，我们引入了一种基于物理的双向优化流程，将物理仿真器作为主动监督器，共同优化人体动力学与场景几何。在正向过程中，我们采用场景导向强化学习，在运动保真度与接触稳定性的双重监督下优化人体运动。在反向过程中，我们提出直接仿真奖励优化方法，利用仿真器对重力稳定性和交互成功率的反馈来优化场景几何。我们还进一步提出了HSIBench，一个包含多样化物体与交互场景的新基准测试集。大量实验表明，HSImul3R首次实现了稳定、仿真就绪的HSI重建，并可直接部署于真实世界的人形机器人。

摘要 (Abstract)

We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.

Keywords: 3D reconstruction, human-scene interaction, physics simulation, reinforcement learning, simulation-ready, embodied AI, physical constraints, humanoid robots

189. ❌ Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Authors: Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu, Chuhang Zou, Yue Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

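Scoring rationale: The paper focuses on 3D human mesh recovery in computer vision and accelerates the SAM 3D Body model through architectural optimization and a reformulated inference pathway to enable real-time use. Its content is unrelated to all scoring keywords, which center on large language model techniques, training methods, inference optimization, alignment, agent systems, and so on. It involves no innovation in the technical principles of large models or deep learning and no application to a scientific field such as bioinformatics, so the relevance of every keyword is 0.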

!!! tip deepseek-chat TL;DR

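This paper addresses the high inference latency of SAM 3D Body in monocular 3D human mesh recovery. Its training-free acceleration framework delivers up to a 10.9x end-to-end speedup while maintaining reconstruction accuracy, and it is deployed in a real-time, vision-only teleoperation system for humanoid robots.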

摘要翻译

SAM 3D Body (3DB) 在单目三维人体网格恢复中实现了最先进的精度，但其每张图像数秒的推理延迟阻碍了实时应用。我们提出了 Fast SAM 3D Body，一个无需重新训练的加速框架，它重构了 3DB 的推理流程以实现交互式速率。通过解耦串行的空间依赖性并应用架构感知剪枝，我们实现了并行化的多裁剪特征提取和精简的 Transformer 解码。此外，为了提取与现有人形机器人控制和策略学习框架兼容的关节级运动学参数（SMPL），我们用直接前馈映射取代了迭代的网格拟合，将此特定转换过程加速超过 10,000 倍。总体而言，我们的框架实现了高达 10.9 倍的端到端加速，同时保持了相当的复原保真度，甚至在 LSPET 等基准测试中超越了 3DB。我们通过将 Fast SAM 3D Body 部署在一个纯视觉遥操作系统中展示了其实用性——与依赖可穿戴惯性测量单元（IMU）的方法不同，该系统能够实现实时人形机器人控制，并直接从单一 RGB 视频流中采集操作策略。

摘要 (Abstract)

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
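
The abstract reports a stage-level speedup of over 10,000x alongside an end-to-end speedup of up to 10.9x. Amdahl's law gives one way to see why these differ: speeding up a single stage is bounded by the share of runtime that stage occupied. The sketch below applies the law to one stage only. The runtime fractions used are hypothetical, since the abstract does not report a per-stage breakdown, and the paper combines several optimizations that this single-stage model does not capture.

```python
def overall_speedup(stage_fraction, stage_speedup):
    """Amdahl's law for a single accelerated stage.

    stage_fraction: share of the original runtime spent in the stage (0 to 1).
    stage_speedup:  factor by which that stage alone becomes faster.
    """
    if not 0.0 <= stage_fraction <= 1.0:
        raise ValueError("stage_fraction must be in [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)


# Hypothetical fractions: even a 10,000x faster stage yields a bounded overall gain.
for fraction in (0.25, 0.5, 0.9):
    print(fraction, round(overall_speedup(fraction, 10_000.0), 2))
```

Under this model, a stage that took half of the runtime can contribute at most a 2x overall gain on its own, so reaching a large end-to-end factor requires accelerating the remaining stages as well.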

Keywords: 3D human mesh recovery, inference acceleration, real-time application, training-free framework, vision-only teleoperation, SMPL kinematics, transformer decoding, architecture-aware pruning

190. ❌ AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Authors: Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

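Scoring rationale: The paper studies video-to-audio (V2A) synthesis and proposes AC-Foley, a reference-audio-guided model, which places it in multimedia generation. It focuses on audio synthesis, acoustic feature transfer, and video-audio alignment, and does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications in scientific fields. The scoring keywords all concern large models, deep learning techniques, or AI for science, and the paper's topic has no direct connection to them, so every keyword receives a relevance score of 0.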

!!! tip deepseek-chat TL;DR

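This paper addresses the semantic ambiguity of text prompts in existing video-to-audio generation methods. It proposes AC-Foley, which conditions directly on reference audio for fine-grained sound control and achieves state-of-the-art performance on Foley generation.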

摘要翻译

现有视频到音频（V2A）生成方法主要依赖文本提示与视觉信息来合成音频。然而，两个关键瓶颈持续存在：训练数据中的语义粒度差距（例如在粗糙标签下混淆声学特性不同的声音），以及描述微观声学特征时的文本模糊性。这些瓶颈使得利用文本控制模式进行细粒度声音合成变得困难。为应对这些局限，我们提出了AC-Foley——一种以音频为条件的V2A模型，其直接利用参考音频实现对生成声音的精确细粒度控制。该方法支持细粒度声音合成、音色迁移、零样本声音生成及音频质量提升。通过直接以音频信号为条件，我们的方法规避了文本描述的语义模糊性，同时实现了对声学属性的精准操控。实验表明，在参考音频条件下，AC-Foley在拟音生成任务中取得了最先进的性能；即使在没有音频条件的情况下，仍能与当前最先进的视频到音频方法保持竞争力。

摘要 (Abstract)

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.

Keywords: video-to-audio synthesis, reference audio guidance, acoustic transfer, fine-grained sound control, Foley generation, audio-conditioned model, timbre transfer, zero-shot sound generation

191. ❌ Grounding World Simulation Models in a Real-World Metropolis

Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

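Scoring rationale: The paper studies a world simulation model grounded in a real city, the Seoul World Model. It belongs to computer vision and generative modeling rather than being a direct study of large language models (LLMs) or the technical principles of deep learning. Its core innovation combines world models with real geographic data and applies retrieval-augmented generation (RAG) by retrieving nearby street-view images to anchor video generation. It is therefore highly relevant only to the keywords "World Models AND General World Models" and "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points each); the other keywords are not covered (0 points).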

!!! tip deepseek-chat TL;DR

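This paper proposes the Seoul World Model (SWM), which uses retrieval-augmented conditioning, cross-temporal pairing, and related techniques to generate spatially and temporally consistent long-horizon videos from real street-view data, and outperforms existing methods in evaluations across three cities.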

摘要翻译

如果世界模拟模型能够渲染的不是虚构环境，而是真实存在的城市呢？以往的生成式世界模型通过想象所有内容来合成视觉上合理但人工的环境。我们提出了首尔世界模型（Seoul World Model, SWM），这是一个以真实城市首尔为基础的城市级世界模型。SWM通过检索增强条件机制，利用邻近街景图像锚定自回归视频生成。然而，这种设计带来了若干挑战：包括检索参考图像与动态目标场景之间的时间错位、车载相机稀疏间隔采集导致的轨迹多样性有限和数据稀疏性。我们通过跨时间配对、支持多样化相机轨迹的大规模合成数据集，以及从稀疏街景图像合成连贯训练视频的视角插值流程来解决这些挑战。我们还引入了虚拟前瞻锚点（Virtual Lookahead Sink），通过持续将每个视频片段重新锚定到未来位置的检索图像，以稳定长时序生成。我们在首尔、釜山和安娜堡三个城市中，将SWM与近期视频世界模型进行比较评估。SWM在生成空间真实、时间连贯、基于实际城市环境的长时序视频（轨迹可达数百米）方面优于现有方法，同时支持多样化的相机运动和文本提示的场景变化。

摘要 (Abstract)

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
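
The abstract says SWM conditions generation on nearby street-view images retrieved for the current location. One simple way to picture the retrieval step is a nearest-neighbor lookup by geographic distance, sketched below. This is an illustration under assumptions, not the paper's retrieval pipeline: the record format (`id`, `lat`, `lon`), the function names, and the coordinates are hypothetical, and the paper may use different criteria such as heading or capture time.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_nearby(query, database, k=2):
    """Return the ids of the k database images closest to the query position."""
    lat, lon = query
    ranked = sorted(database, key=lambda rec: haversine_m(lat, lon, rec["lat"], rec["lon"]))
    return [rec["id"] for rec in ranked[:k]]


# Hypothetical street-view records.
database = [
    {"id": "img_a", "lat": 37.5665, "lon": 126.9780},
    {"id": "img_b", "lat": 37.5700, "lon": 126.9830},
    {"id": "img_c", "lat": 37.5000, "lon": 127.0000},
]
print(retrieve_nearby((37.5666, 126.9781), database, k=2))
```

A linear scan is adequate for a sketch; a city-scale image database would normally sit behind a spatial index, which is outside the scope of this example.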

Keywords: world simulation model, Seoul World Model, retrieval-augmented conditioning, autoregressive video generation, street-view images, temporal consistency, long-horizon generation, virtual lookahead sink

192. ❌ Benchmarking Machine Learning Approaches for Polarization Mapping in Ferroelectrics Using 4D-STEM

Authors: Matej Martinc, Goran Dražič, Anton Kokalj, Katarina Žiberna, Janina Roknić, Matic Poberžnik, Sašo Džeroski, Andreja Benčan Golob Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

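Scoring rationale: The paper benchmarks established machine learning models (ResNet, VGG, a custom CNN, and PCA-kNN) on 4D-STEM diffraction patterns to automate the detection of polarization directions in ferroelectric materials, which makes it an application of AI to materials science. None of the keywords on large models, alignment, reasoning, or agents applies, so every keyword receives 0 except 'AI for Science', which receives 5 because the work applies AI in a scientific domain. The paper proposes no innovation in large-model or deep-learning fundamentals and uses no large-model techniques.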
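
The pass mark in the score line can be checked against the formula in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The sketch below is only a consistency check: the function names are hypothetical, and treating the paper's total as the sum of the Score column is an assumption, since the report does not state its per-keyword scoring rule.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report's configuration: base + slope * total weight."""
    return base + slope * total_weight


def total_score(keyword_scores):
    """Assumption: the paper's total is the sum of its per-keyword scores."""
    return sum(keyword_scores)


def passes(keyword_scores, total_weight):
    """True when the summed score reaches the pass mark."""
    return total_score(keyword_scores) >= pass_mark(total_weight)


threshold = pass_mark(27.0)   # 27 keywords, each configured with weight 1.0
scores = [0.0] * 27           # Score column of the table above
print(round(threshold, 1), passes(scores, 27.0))
```

Every row of the table reports a weight of 0.0, so any score that scales with the weight would be 0.0 regardless of relevance. This would explain why a relevance of 5.0/10 still yields a score of 0.0, though the report does not confirm the rule.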

!!! tip deepseek-chat TL;DR

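The study systematically compares several machine learning models for automatically detecting polarization directions from 4D-STEM diffraction patterns in ferroelectric potassium sodium niobate. Models trained on synthetic data perform well on idealized data, but a domain gap separates simulation from experiment; a custom prototype representation training regime and PCA-based methods, combined with data augmentation, bridge this gap better. Irregularities in the models' prediction patterns correlate with crystal structure defects, suggesting that supervised models could be used to detect structural defects.

Abstract translation (Chinese)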

四维扫描透射电子显微镜（4D-STEM）为材料结构提供了丰富的原子尺度信息。然而，从中提取特定物理性质——例如理解铁电体功能特性所必需的自发极化方向——仍然是一个重大挑战。本研究系统地对多种机器学习模型（包括ResNet、VGG、定制卷积神经网络以及基于主成分分析的k近邻算法）进行了基准测试，旨在实现从铁电体铌酸钾钠的4D-STEM衍射图谱中自动识别极化方向。尽管基于合成数据训练的模型在等效厚度的理想化合成衍射图谱上取得了高准确率，但模拟与实验之间的领域差距仍是实际应用的关键障碍。在此背景下，一种定制的原型表征训练方案与基于主成分分析的方法，结合数据增强和过滤技术，能够更好地弥合这一差距。误差分析揭示了周期性的误分类模式，表明并非所有衍射图谱都携带足够信息以完成成功分类。此外，我们的定性分析表明，模型预测模式中的不规则性与晶体结构缺陷相关，这提示监督学习模型可用于检测结构缺陷。这些发现为开发鲁棒且可迁移的电子显微镜分析机器学习工具提供了指导。

Abstract

Four-dimensional scanning transmission electron microscopy (4D-STEM) provides rich, atomic-scale insights into materials structures. However, extracting specific physical properties - such as polarization directions essential for understanding functional properties of ferroelectrics - remains a significant challenge. In this study, we systematically benchmark multiple machine learning models, namely ResNet, VGG, a custom convolutional neural network, and PCA-informed k-Nearest Neighbors, to automate the detection of polarization directions from 4D-STEM diffraction patterns in ferroelectric potassium sodium niobate. While models trained on synthetic data achieve high accuracy on idealized synthetic diffraction patterns of equivalent thickness, the domain gap between simulation and experiment remains a critical barrier to real-world deployment. In this context, a custom-made prototype representation training regime and PCA-based methods, combined with data augmentation and filtering, can better bridge this gap. Error analysis reveals periodic misclassification patterns, indicating that not all diffraction patterns carry enough information for a successful classification. Additionally, our qualitative analysis demonstrates that irregularities in the model’s prediction patterns correlate with defects in the crystal structure, suggesting that supervised models could be used for detecting structural defects. These findings guide the development of robust, transferable machine learning tools for electron microscopy analysis.

Keywords: 4D-STEM, ferroelectrics, polarization mapping, machine learning benchmarking, diffraction patterns, domain gap, structural defects, materials science
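
The abstract names PCA-informed k-Nearest Neighbors among the benchmarked models but gives no configuration. As a rough illustration of the general idea only, the sketch below projects feature vectors onto a single principal direction (approximated by power iteration) and classifies a query by majority vote among its nearest neighbors. The function names, the single-component projection, `k=3`, and the toy data are all assumptions, not the authors' setup.

```python
import math
import random
from collections import Counter


def center(rows):
    """Subtract the per-feature mean; return the centered rows and the mean."""
    dims = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dims)]
    return [[r[j] - mean[j] for j in range(dims)] for r in rows], mean


def leading_component(centered, iters=100, seed=0):
    """Approximate the first principal direction by power iteration on X^T X."""
    rng = random.Random(seed)
    dims = len(centered[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(dims)) for r in centered]  # X v
        w = [sum(r[j] * p for r, p in zip(centered, proj)) for j in range(dims)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k nearest training points in 1-D PCA space."""
    centered, mean = center(train_rows)
    v = leading_component(centered)
    coords = [sum(r[j] * v[j] for j in range(len(v))) for r in centered]
    q = sum((query[j] - mean[j]) * v[j] for j in range(len(v)))
    nearest = sorted(zip(coords, train_labels), key=lambda p: abs(p[0] - q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy stand-in for flattened diffraction features with two polarization classes.
train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels = ["up", "up", "up", "down", "down", "down"]
print(knn_predict(train, labels, [4.8, 5.0]))
```

A real pipeline would likely keep several components and operate on full diffraction patterns; one component is used here only to keep the example dependency-free.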

193. ❌ Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments

Authors: Aaditya Khanal, Junxiu Zhou Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15574v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

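Scoring rationale: The paper studies severe domain shift in skeleton-based action recognition and involves a Transformer model, uncertainty estimation, OOD detection, and fine-tuning, but none of the keywords on large-model or deep-learning fundamentals or on scientific applications applies. It focuses on action recognition in computer vision and covers neither large-model techniques such as large language models, MoE, quantization, inference acceleration, alignment, or RAG, nor scientific AI applications such as bioinformatics.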

!!! tip deepseek-chat TL;DR

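The paper studies the severe domain shift from controlled 3D skeleton data to unconstrained 2D pose estimation, finds that standard uncertainty methods fail to detect the resulting performance drop, and proposes a lightweight gating mechanism that restores calibration and reduces wrong predictions.

Abstract translation (Chinese)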

实践部署鸿沟——从受控的多视角三维骨架捕捉转向无约束的单目二维姿态估计——引入了复合域偏移，其安全影响仍亟待深入探究。本研究通过新颖的Gym2D数据集（风格/视角偏移）和UCF101数据集（语义偏移），系统分析了这一严重的域偏移现象。我们的骨架变换器（Skeleton Transformer）在NTU-120数据集上实现了63.2%的跨主体准确率，但在零样本迁移至Gym域时骤降至1.6%，在UCF101数据集上仅为1.16%。关键发现表明，高分布外（Out-Of-Distribution, OOD）检测AUROC指标并不能保证安全的选择性分类。标准不确定性方法未能检测到这种性能下降：模型在两个OOD数据集上即使仅覆盖50%样本时，仍以99.6%的风险率保持“自信地错误”状态。虽然基于能量的评分（AUROC ≥ 0.91）和马氏距离（Mahalanobis distance）能提供可靠的分布检测信号，但在实际决策时，这种高AUROC分数却与糟糕的风险-覆盖行为并存。我们提出的轻量级微调门控机制恢复了校准能力，实现了优雅的弃权决策，显著降低了自信错误预测的发生率。本研究挑战了标准部署假设，为骨架识别在语义和几何层面的部署提供了系统的安全性分析框架。

Abstract

The practical deployment gap – transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation – introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high Out-Of-Distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC >= 0.91) and Mahalanobis distance provide reliable distributional detection signals, such high AUROC scores coexist with poor risk-coverage behavior when making decisions. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment.

Keywords: skeleton-based action recognition, domain shift, uncertainty failure, out-of-distribution detection, transformer model, fine-tuning, safety analysis, gym environments
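
The abstract's central claim is that a good OOD detection signal and safe selective classification are different things. The sketch below shows the two quantities involved in their commonly used forms: an energy score computed from logits, and the risk (error rate) on the most confident fraction of samples. The function names and toy numbers are illustrative assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def energy_score(logits):
    """Negative log-sum-exp of the logits (temperature 1); higher suggests OOD."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def max_softmax(logits):
    """Maximum softmax probability, used here as the confidence."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of samples with highest confidence."""
    n_keep = max(1, int(round(coverage * len(confidences))))
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    kept = ranked[:n_keep]
    return 1.0 - sum(1 for _, ok in kept if ok) / len(kept)


# Toy case: the model is most confident exactly where it is wrong.
logits = [[9.0, 0.0, 0.0], [8.0, 0.5, 0.0], [7.0, 1.0, 0.0], [1.0, 0.8, 0.5]]
correct = [False, False, False, True]
confidences = [max_softmax(row) for row in logits]
print(risk_at_coverage(confidences, correct, 0.5))
print([round(energy_score(row), 2) for row in logits])
```

In this toy case, abstaining on the least confident half leaves only wrong predictions, which is the "confidently incorrect" pattern the abstract describes.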

194. ❌ Panoramic Affordance Prediction

Authors: Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15558v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

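Scoring rationale: The paper studies panoramic affordance prediction in embodied AI. Its core contributions are PAP-12K, a large-scale panoramic image dataset, and PAP, a training-free, coarse-to-fine visual processing pipeline. The content is focused entirely on computer vision, panoramic image processing, and perception for embodied AI, and involves none of the techniques described by the keywords: large language models (LLMs), innovations in deep-learning fundamentals, model training methods (such as pre-training, fine-tuning, and alignment), inference optimization, agent systems, or AI for Science. No keyword is relevant to the paper's topic, so every relevance score is 0.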

!!! tip deepseek-chat TL;DR

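The paper presents the first exploration of panoramic affordance prediction. By introducing the large-scale PAP-12K panoramic image dataset and the training-free PAP pipeline, it addresses the severe performance degradation of existing perspective-image methods on panoramic vision and substantially improves the robustness of panoramic perception for embodied intelligence.

Abstract translation (Chinese)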

可供性预测在具身人工智能中是连接感知与行动的关键桥梁。然而，现有研究局限于针孔相机模型，其存在视野狭窄、观测碎片化的问题，常常遗漏关键的整体环境上下文。本文首次探索全景可供性预测，利用360度图像捕捉全局空间关系与整体场景理解。为推进这一新任务，我们首先提出了PAP-12K：一个大规模基准数据集，包含超过1,000张超高分辨率（12k，11904 x 5952）全景图像，并标注了超过12,000组精心构建的问答对及可供性掩码。此外，我们受人类中央凹视觉系统启发，提出一种无需训练、由粗到精的处理流程PAP，以应对全景图像固有的超高分辨率与严重畸变问题。PAP通过网格提示进行递归视觉路由以逐步定位目标，采用自适应注视机制校正局部几何畸变，并利用级联定位流程提取精确的实例级掩码。在PAP-12K上的实验结果表明，专为标准透视图像设计的现有可供性预测方法因全景视觉的独特挑战而出现严重性能下降甚至失效。相比之下，PAP框架有效克服了这些障碍，显著优于现有先进基线方法，凸显了全景感知对于构建鲁棒具身智能的巨大潜力。

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.

Keywords: Panoramic Affordance Prediction, Embodied AI, 360-degree imagery, PAP-12K dataset, Training-free pipeline, Ultra-high-resolution, Holistic scene understanding, Visual routing

195. ❌ Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

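Scoring rationale: The paper focuses on diagnosing hallucinations in vision-language models (VLMs). It is highly relevant to 'Hallucination Mitigation / Factuality' and 'Mechanistic Interpretability / Explainable AI' (10 points each), because its core is detecting and attributing hallucinations to make reasoning transparent. It has some relevance to 'Large Language Models' (5 points), because VLMs extend LLMs; to 'Chain of Thought / Reasoning' and 'System 2 Thinking / In-depth Reasoning' (5 points each), because it models generation as a cognitive trajectory and analyzes reasoning failures; and to 'Self-Correction / Self-Reflection' (5 points), because the diagnostic framework supports self-improvement. Other keywords, such as MoE, quantization, and RAG, are not covered and receive 0.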

!!! tip deepseek-chat TL;DR

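The paper proposes a multi-stage diagnostic framework grounded in the principle of computational rationality. By recasting hallucinations in vision-language models as dynamic cognitive pathologies, it achieves state-of-the-art hallucination detection using information-theoretic probes and geometric anomaly detection, and it attributes errors to pathological states such as perceptual instability, logical-causal failure, and decisional ambiguity.

Abstract translation (Chinese)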

视觉语言模型（Vision-Language Models, VLMs）常出现“幻觉”现象——即生成看似合理但事实错误的陈述——这对其可信部署构成了关键障碍。本研究提出了一种诊断幻觉的新范式，将其从静态的输出错误重新定义为模型计算认知的动态病理状态。我们的框架基于计算理性原则，使我们能够将VLM的生成过程建模为动态认知轨迹。我们设计了一套信息论探针，将该轨迹投射到一个可解释的低维认知状态空间中。我们的核心发现是一个称为几何-信息对偶的支配原则：认知轨迹在此空间内的几何异常性，本质上等同于其信息论上的高惊奇度。幻觉检测由此可归结为几何异常检测问题。通过在多样化场景中的评估——从严格的二元问答（POPE）和综合推理（MME）到无约束开放式描述生成（MS-COCO）——我们的框架均实现了最先进的性能。关键的是，该方法在弱监督下高效运行，即使在标定数据受到严重污染时仍保持高度鲁棒性。这一方法能够对故障进行因果归因，将可观测的错误映射到不同的病理状态：感知不稳定性（通过感知熵度量）、逻辑-因果失效（通过推理冲突度量）以及决策模糊性（通过决策熵度量）。最终，这为构建具有透明、可审计且可诊断推理过程的人工智能系统开辟了道路。

Abstract

Vision-Language Models (VLMs) frequently “hallucinate” - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model’s computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM’s generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory’s geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus reduces to a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.

Keywords: Vision-Language Models, Hallucination Diagnosis, Cognitive Trajectory, Geometric-Information Duality, Anomaly Detection, Computational Rationality, Interpretable AI, Pathological States
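
The abstract does not define its probes (Perceptual Entropy, Inferential Conflict, Decision Entropy) or its Cognitive State Space, so the sketch below shows only the generic building blocks such a framework could rest on: Shannon entropy of a probability distribution, surprisal of an observed outcome, and a simple per-dimension standardized distance as a stand-in for geometric anomaly scoring. All names and the distance choice are assumptions, not the paper's definitions.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surprisal(probs, index):
    """Negative log-probability (nats) of the outcome at `index`."""
    return -math.log(probs[index])


def standardized_distance(point, reference, eps=1e-8):
    """Distance of `point` from the reference mean, scaled per dimension by variance."""
    n = len(reference)
    total = 0.0
    for j in range(len(point)):
        mean = sum(r[j] for r in reference) / n
        var = sum((r[j] - mean) ** 2 for r in reference) / n
        total += (point[j] - mean) ** 2 / (var + eps)
    return math.sqrt(total)


# Hypothetical next-token distribution and a 2-D state described by two probe values.
token_probs = [0.7, 0.2, 0.1]
calibration_states = [[0.2, 0.1], [0.3, 0.2], [0.25, 0.15], [0.35, 0.1]]
print(shannon_entropy(token_probs), surprisal(token_probs, 2))
print(standardized_distance([1.5, 1.2], calibration_states))
```

A state far from the calibration set in this scaled space would be flagged as anomalous; the threshold and the choice of probes would have to come from the paper itself.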

196. ❌ Learning Latent Proxies for Controllable Single-Image Relighting

Authors: Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun Li, Xiaogang Xu, Harry Yang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

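Scoring rationale: The paper studies single-image relighting, a problem in computer vision and graphics rather than a direct study of large language models or deep-learning fundamentals. It uses a diffusion model for image generation and mentions a DPO (Direct Preference Optimization) objective for enforcing physical consistency, so it has some relevance to the keyword 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (5 points). No other keyword is mentioned in the title or abstract, and the topic has no direct connection to those large-model techniques, so they receive 0.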

!!! tip deepseek-chat TL;DR

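For single-image relighting, the paper proposes LightCtrl, a method that introduces physical priors and a DPO objective to strengthen control of a diffusion model, achieving more accurate and controllable relighting.

Abstract translation (Chinese)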

单图像重光照是一个高度欠约束问题：微小的光照变化可能导致阴影、高光等产生非线性剧烈变化，而几何结构与材质属性却无法直接观测。现有基于扩散模型的方法要么依赖需要密集脆弱监督的本征分解或G缓冲管线，要么完全在隐空间运行而缺乏物理依据，导致对光照方向、强度和色彩的细粒度控制不可靠。我们发现，精确重光照并不需要完整冗余的本征分解。相反，稀疏但具有物理意义的提示线索——指示光照应如何变化及材质应如何响应——足以引导扩散模型。基于此洞见，我们提出LightCtrl框架，通过两个层级整合物理先验：一个少样本隐空间代理编码器从有限PBR（基于物理的渲染）监督中提取紧凑的材质-几何线索，以及一个光照感知掩码模块，用于识别光照敏感区域并引导去噪器关注与着色相关的像素。为弥补PBR数据稀缺问题，我们采用基于DPO的目标函数优化代理分支，以增强预测线索的物理一致性。同时，我们构建了ScaLight数据集——一个包含系统化光照变化及完整相机-光照元数据的大规模物体级数据集，支持物理一致且可控的训练。在物体与场景级基准测试中，本方法实现了光度精确的重光照效果与精准的连续控制能力，超越了现有基于扩散和本征分解的基线方法，在受控光照变化下最高提升+2.4 dB PSNR并降低35%的RMSE误差。

Abstract

Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object and scene level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.

Keywords: single-image relighting, diffusion model, physical priors, DPO, LightCtrl, controllable, PBR supervision, photometrically faithful
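
The abstract says the proxy branch is refined with a "DPO-based objective" but does not give its form. For orientation, the sketch below computes the standard DPO loss for a single preference pair as it is commonly written for language models. How LightCtrl builds its preferred and rejected cues, and which log-likelihoods it uses, is not stated, so the argument names and `beta=0.1` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin = (policy log-ratio on the chosen sample)
           - (policy log-ratio on the rejected sample),
    where each log-ratio is measured against a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable evaluation of -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# The loss falls below log(2) once the policy prefers the chosen sample
# more strongly than the reference does.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```

The reference terms keep the refined model from drifting arbitrarily far from its starting point, which is the usual motivation for this objective.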

197. ❌ Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

Authors: Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15553v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

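Scoring rationale: The paper studies a self-supervised representation learning method for computer vision (Bootleg), focusing on image classification and semantic segmentation. All scoring keywords target large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), whereas this paper involves no language models, text processing, or any part of the LLM technology stack. It is conventional self-supervised learning for computer vision and unrelated to the large-model topics of the scoring keywords.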

!!! tip deepseek-chat TL;DR

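The paper proposes Bootleg, a self-supervised learning method that learns features at multiple levels of abstraction by predicting latent representations from multiple hidden layers of a teacher network, and that significantly outperforms comparable baselines on tasks such as ImageNet classification and ADE20K semantic segmentation.

Abstract translation (Chinese)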

当前自监督学习（SSL）领域主要由两类方法主导：一类是重建原始低级数据的生成式方法（如MAE），另一类是预测高级抽象嵌入的预测式方法（如I-JEPA）。生成式方法虽能提供坚实的底层基础，但对于图像等高冗余度模态而言计算效率低下，且其训练目标并不优先学习高级概念特征。相反，预测式方法因其依赖于最终层自蒸馏的非平稳目标，常面临训练不稳定的问题。我们提出Bootleg方法，通过让模型预测教师网络多个隐藏层的潜在表征来弥合这一鸿沟。这种分层目标迫使模型同时捕获不同抽象层次的特征。实验表明，在ImageNet-1K和iNaturalist-21的分类任务，以及ADE20K和Cityscapes的语义分割任务中，Bootleg显著优于同类基线方法（较I-JEPA提升10%）。

Abstract

The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.

Keywords: self-supervised learning, representation learning, hidden layers, self-distillation, hierarchical objective, image classification, semantic segmentation, Bootleg
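
The hierarchical objective in the abstract, predicting teacher representations from several hidden layers at once, can be pictured with a small sketch. The version below averages a per-layer mean squared error and updates the teacher as an exponential moving average of the student. Both the MSE choice and the EMA teacher are assumptions borrowed from common self-distillation practice; the abstract does not specify Bootleg's loss or teacher update.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multilayer_distillation_loss(student_preds, teacher_hiddens):
    """Average the per-layer MSE between student predictions and teacher features.

    Each argument is a list with one feature vector per targeted hidden layer.
    """
    if len(student_preds) != len(teacher_hiddens):
        raise ValueError("one student prediction is needed per teacher layer")
    per_layer = [mse(s, t) for s, t in zip(student_preds, teacher_hiddens)]
    return sum(per_layer) / len(per_layer)


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter a small step toward the student (assumed EMA teacher)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy example with two targeted layers of two features each.
student = [[1.0, 2.0], [0.0, 0.0]]
teacher = [[1.0, 2.0], [1.0, 1.0]]
print(multilayer_distillation_loss(student, teacher))
print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

Supervising several depths at once gives the student both low-level and high-level targets, which is the contrast the abstract draws with final-layer-only self-distillation.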

198. ❌ Kimodo: Scaling Controllable Human Motion Generation

Authors: Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

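Scoring rationale: The paper studies Kimodo, a human motion generation model, which belongs to computer vision and graphics rather than to large language models or innovations in deep-learning fundamentals. It is only weakly related to two keywords: 1) 'Scaling Laws AND Data Quality' (5 points): the paper examines how dataset scale and model scale affect performance, touching on data quality and scaling behavior; 2) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): it can be viewed as an application of AI to science (motion analysis and simulation), though not to core bioinformatics or cheminformatics. The remaining keywords, which concern large language models, reasoning, alignment, efficient fine-tuning, and similar topics, are unrelated and receive 0.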

!!! tip deepseek-chat TL;DR

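To address the limits that small public motion capture datasets place on motion generation quality, the paper proposes Kimodo, which uses a large-scale dataset (700 hours) and a two-stage denoiser architecture to generate high-quality, controllable human motion.

Abstract translation (Chinese)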

高质量人体运动数据对于机器人学、仿真及娱乐领域的应用正变得日益重要。近期的生成模型通过文本提示或姿态运动学约束等直观输入实现人体运动合成，为数据获取提供了潜在来源。然而，公共动作捕捉数据集规模较小，限制了此类模型的运动质量、控制精度与泛化能力。本研究提出Kimodo——一种基于700小时光学动作捕捉数据训练的表达性强且可控的运动学扩散模型。该模型能够生成高质量运动，同时可通过文本及全面的运动学约束体系进行便捷控制，包括全身关键帧、稀疏关节位置/旋转、二维路径点及密集二维轨迹。这一能力得益于精心设计的运动表征与两阶段去噪器架构：该架构通过分解根节点与身体部位预测来最小化运动伪影，同时支持灵活的约束条件调节。基于大规模动作捕捉数据集的实验验证了关键设计决策，并分析了数据集规模与模型尺寸的扩展如何影响性能表现。

Abstract

High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affects performance.

Keywords: human motion generation, kinematic motion diffusion model, motion capture data, controllable motion synthesis, two-stage denoiser architecture, scaling dataset size, text and kinematic constraints, motion quality improvement

199. ❌ Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models

Authors: Amy Rafferty, Rishi Ramaesh, Ajitha Rajan Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

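Scoring rationale: The paper focuses on AI diagnostic models for medical imaging (chest X-rays) and uses a synthetic data generation method (the CARS framework) to address insufficient coverage of clinical features, making it an application of AI in biomedicine. Its content is unrelated to nearly all of the keywords (which concern the technical principles, training methods, inference optimization, and agents of large models). It relates only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it applies AI in a biomedical (Bioinformatics-adjacent) domain. Its core contribution, however, is not an innovation in large models or deep learning principles but an application of existing methods to a domain-specific problem, hence a relevance of 8 (related to some degree, but not highly relevant).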
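The report's configuration gives the pass mark as 5.0 + 0.8 × total weight, which for 27 keywords of weight 1.0 each yields the 26.6 shown in every score line. A minimal sketch of that arithmetic follows. The function names and the rule that a score at or above the mark passes are assumptions for illustration, not the report generator's actual code.

```python
# Constants taken from the report's "pass mark formula": 5.0 + 0.8 * total weight.
BASE = 5.0
PER_WEIGHT = 0.8


def pass_mark(total_weight: float) -> float:
    """Return the pass mark for a given total keyword weight."""
    return round(BASE + PER_WEIGHT * total_weight, 1)


def passes(total_score: float, total_weight: float) -> bool:
    """Assumption: a paper passes when its score reaches the pass mark."""
    return total_score >= pass_mark(total_weight)


threshold = pass_mark(27.0)  # 27 keywords, each with weight 1.0
print(threshold)             # 26.6
print(passes(0.0, 27.0))     # False
```

With a total score of 0.0, this entry falls below the 26.6 mark, which matches the ❌ in its score line.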

!!! tip deepseek-chat TL;DR

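To address insufficient clinical feature coverage in chest X-ray AI diagnostic models, the paper proposes CARS, a clinically aware synthetic image generation framework that produces anatomically faithful synthetic data to improve model performance, calibration, and trustworthiness, and validates it on the MIMIC-CXR benchmark.

Abstract (Chinese translation)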

人工智能诊断模型的临床部署不仅需要基准准确性，更要求其在疾病全谱表现中具备鲁棒性。然而，当前公开的胸部X光数据集系统性地缺失关键临床特征组合，导致模型恰恰在临床风险最高的场景下训练不足。我们提出CARS框架——一种具备临床意识且基于解剖学原理的合成图像生成方法，通过原则性的合成图像生成来弥补这一缺陷。CARS对临床特征向量实施定向扰动，在明确保持解剖结构的前提下，实现对病理征象的可控插入与删除。我们在七种骨干架构上评估CARS，通过在合成子集上微调模型并在预留的MIMIC-CXR基准上进行测试。与先前的特征扰动方法相比，使用CARS生成图像进行微调能持续提升精确率-召回率性能，降低预测不确定性，并改善模型校准度。结构与语义分析显示其具备高解剖保真度、强特征对齐性和低语义不确定性。两位放射学专家的独立评估进一步证实了图像的逼真度与临床一致性。随着领域向受监管的临床人工智能迈进，CARS证明：通过解剖学保真的合成数据生成以改善特征空间覆盖，是提升胸部X光分类系统性能与可信度的可行有效策略——且无需牺牲临床完整性。

Abstract

The clinical deployment of AI diagnostic models demands more than benchmark accuracy - it demands robustness across the full spectrum of disease presentations. However, publicly available chest radiographic datasets systematically underrepresent critical clinical feature combinations, leaving models under-trained precisely where clinical stakes are highest. We present CARS, a clinically aware and anatomically grounded framework that addresses this gap through principled synthetic image generation. CARS applies targeted perturbations to clinical feature vectors, enabling controlled insertion and deletion of pathological findings while explicitly preserving anatomical structure. We evaluate CARS across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior feature perturbation approaches, fine-tuning on CARS-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong feature alignment, and low semantic uncertainty. Independent evaluation by two expert radiologists further confirms realism and clinical agreement. As the field moves toward regulated clinical AI, CARS demonstrates that anatomically faithful synthetic data generation for better feature space coverage is a viable and effective strategy for improving both the performance and trustworthiness of chest X-ray classification systems - without compromising clinical integrity.

Keywords: synthetic image generation, chest X-ray, clinical feature coverage, anatomical fidelity, model calibration, AI diagnostic models, CARS framework, MIMIC-CXR

200. ❌ FreeTalk: Emotional Topology-Free 3D Talking Heads

Authors: Federico Nocentini, Thomas Besnier, Claudio Ferrari, Stefano Berretti, Mohamed Daoudi Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

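Scoring rationale: The paper 'FreeTalk: Emotional Topology-Free 3D Talking Heads' focuses on speech-driven 3D facial animation and proposes a two-stage framework (Audio-To-Sparse and Sparse-To-Mesh) that generates emotion-conditioned talking-head animation on unregistered meshes of arbitrary topology. The work belongs to computer vision, graphics, and multimedia processing, and involves audio processing, 3D modeling, and emotion synthesis, but it does not involve large language models (LLMs), innovations in deep learning principles, or applications of AI to science. All scoring keywords relate directly to large models, deep learning techniques, or AI-for-science applications, and the paper has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the poor generalization of speech-driven 3D facial animation to unregistered meshes and the difficulty of modeling emotional dynamics. It proposes the FreeTalk framework, which predicts sparse landmark displacements and transfers them to the target mesh, achieving robust emotional talking-head animation on face meshes of arbitrary topology.

Abstract (Chinese translation)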

语音驱动的三维面部动画技术发展迅速，但多数方法仍依赖于已注册的模板网格，导致其难以有效应用于具有任意拓扑结构的原始三维扫描数据。同时，在唇部动作之外建立可控的情感动态模型仍具挑战，且通常受限于基于模板的参数化方法。为应对这些挑战，我们提出FreeTalk——一个适用于未注册面部网格（具有任意顶点数量与连接关系）的双阶段情感条件三维说话头动画框架。第一阶段，音频到稀疏映射模块通过语音音频预测时序连贯的三维标志点位移序列，该过程受情感类别与强度条件控制。这种稀疏表征既能捕捉发音动作也能捕捉情感运动，同时保持与网格拓扑无关的特性。第二阶段，稀疏到网格映射模块通过将内在曲面特征与标志点-顶点条件相结合，将预测的标志点运动迁移至目标网格，在测试时无需模板拟合或对应关系监督即可生成稠密的逐顶点形变。大量实验表明，FreeTalk在域内训练时与专业基线模型性能相当，同时对未见过的身份与网格拓扑结构展现出显著提升的鲁棒性。代码与预训练模型将公开提供。

Abstract

Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.

Keywords: 3D talking heads, speech-driven facial animation, emotional dynamics, topology-free, landmark displacement, mesh deformation, audio-to-sparse, sparse-to-mesh

201. ❌ Federated Learning of Binary Neural Networks: Enabling Low-Cost Inference

Authors: Nitin Priyadarshini Shankar, Soham Lahiri, Sheetal Kalyani, Saurav Prakash Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

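Scoring rationale: The paper focuses on binary neural networks in federated learning (FedBNN), using weight binarization (1-bit representation) to achieve model compression and faster inference under the resource constraints of edge deployment. Relevance to the scoring keywords: 1) Highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (10), because the core contribution is binarization (1-bit quantization) to reduce memory footprint and compute requirements; 2) Somewhat relevant to 'Small Language Models OR SLMs OR On-device AI' (5), because the work concerns lightweight models for edge deployment, although it does not explicitly target language models; 3) Somewhat relevant to 'Speculative Decoding OR Inference Acceleration' (5), because binarization lowers inference-time FLOPs and memory requirements, which indirectly speeds up inference; 4) The remaining keywords (LLMs, MoE, alignment, RAG, and so on) are unrelated (0), because the paper does not involve large language models, mixture of experts, alignment techniques, or retrieval-augmented generation, and it is not applied to scientific domains such as bioinformatics.

!!! tip deepseek-chat TL;DR

The paper proposes the FedBNN framework, which trains binary neural networks (weights encoded as 1 bit) directly within federated learning. This addresses the excessive resource consumption of deep neural networks on edge devices, substantially reducing model memory footprint and inference compute while maintaining performance comparable to real-valued models.

Abstract (Chinese translation)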

联邦学习（Federated Learning, FL）通过将训练过程分布到各设备上来保护隐私。然而，在推理阶段，使用深度神经网络（DNNs）对低功耗边缘设备而言计算负担沉重。边缘部署要求模型能同时优化内存占用和计算效率，而传统DNNs因超出资源限制而无法应对这一困境。传统的训练后二值化方法虽能减小模型尺寸，但由于量化误差会导致严重的精度损失。为解决这些挑战，我们提出FedBNN，一种旋转感知的二进制神经网络框架，可在本地训练过程中直接学习二进制表示。通过将每个权重编码为单比特 $\{+1, -1\}$（而非32位浮点数），FedBNN缩小了模型占用空间，与使用实数模型的联邦学习方法相比，显著降低了推理时的浮点运算量（FLOPs）和内存需求。在多个基准数据集上的评估表明，FedBNN在资源消耗显著降低的同时，性能与使用实值模型的现有联邦学习方法相当。

Abstract

Federated Learning (FL) preserves privacy by distributing training across devices. However, using DNNs is computationally intensive at the low-powered edge during inference. Edge deployment demands models that simultaneously optimize memory footprint and computational efficiency, a dilemma where conventional DNNs fail by exceeding resource limits. Traditional post-training binarization reduces model size but suffers from severe accuracy loss due to quantization errors. To address these challenges, we propose FedBNN, a rotation-aware binary neural network framework that learns binary representations directly during local training. By encoding each weight as a single bit $\{+1, -1\}$ instead of a $32$-bit float, FedBNN shrinks the model footprint, significantly reducing runtime (during inference) FLOPs and memory requirements in comparison to federated methods using real models. Evaluations across multiple benchmark datasets demonstrate that FedBNN significantly reduces resource consumption while performing similarly to existing federated methods using real-valued models.

Keywords: Federated Learning, Binary Neural Networks, Model Compression, Edge Deployment, Inference Efficiency, Low-bit Weights, Resource Optimization, FedBNN
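The abstract's claim that each weight is stored as a single bit in $\{+1, -1\}$ instead of a 32-bit float can be illustrated with plain sign binarization and bit packing. This is a generic sketch, not FedBNN's rotation-aware training procedure, which the abstract does not detail; the helper names `binarize` and `pack_bits` and the sample weights are hypothetical.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 by its sign (0 maps to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def pack_bits(signs):
    """Pack +1/-1 values into bytes, one bit per weight (+1 -> 1, -1 -> 0)."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


# Hypothetical real-valued weights, for illustration only.
weights = [0.37, -1.20, 0.05, -0.44, 0.91, -0.02, 0.0, -3.3]
signs = binarize(weights)
packed = pack_bits(signs)

float32_bytes = 4 * len(weights)   # storage as 32-bit floats
print(signs)                       # [1, -1, 1, -1, 1, -1, 1, -1]
print(float32_bytes, len(packed))  # 32 1
```

Eight weights shrink from 32 bytes to 1 byte, the 32× storage reduction implied by moving from 32-bit floats to 1-bit weights.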

202. ❌ Real-Time Oriented Object Detection Transformer in Remote Sensing Images

Authors: Zeyu Ding, Yong Zhou, Jiaqi Zhao, Wen-Liang Du, Xixi Li, Rui Yao, Abdulmotaleb El Saddik Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15497v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

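Scoring rationale: The paper focuses on oriented object detection in computer vision, specifically real-time detection in remote sensing images. Its core contributions include angle distribution refinement, a Chamfer distance matching cost, and oriented contrastive denoising training. All keywords except the last explicitly concern large language models (LLMs) or innovations in deep learning principles, whereas this paper studies a Transformer-based visual detector that is unrelated to LLMs. The last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', receives 5 because remote sensing image analysis can be viewed as an application of AI to science (Earth observation), although that is not the paper's core focus.

!!! tip deepseek-chat TL;DR

The paper proposes a real-time oriented object detection Transformer that uses angle distribution refinement, Chamfer distance matching, and oriented contrastive denoising to handle objects appearing at arbitrary rotation angles in remote sensing images. On the DOTA1.0 dataset it reaches 77.73% to 80.15% AP50 at 119 to 132 FPS.

Abstract (Chinese translation)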

近期实时检测Transformer因其简洁高效而备受关注。然而，这些检测器未显式建模物体旋转，尤其在遥感影像中物体以任意角度出现时，导致角度表示、匹配代价和训练稳定性方面存在挑战。本文提出一种实时旋转目标检测Transformer，据我们所知，这是首个实时端到端的旋转目标检测器，旨在解决上述问题。具体而言，我们提出角度分布细化方法，将角度回归重新定义为概率分布的迭代优化，从而捕捉物体旋转的不确定性并提供更细粒度的角度表示。随后，我们将倒角距离代价引入二分图匹配，通过顶点集度量框体距离，实现更精确的几何对齐并消除模糊匹配。此外，我们提出旋转对比去噪策略以稳定训练过程，并分析了四种噪声模式。我们观察到同一真实标注框在不同解码层可能被分配给不同索引查询，并利用提出的不稳定性度量指标对该问题进行分析。我们设计了一系列模型变体与实验以验证所提方法。值得注意的是，我们的O2-DFINE-L、O2-RTDETR-R50和O2-DEIM-R50在DOTA1.0数据集上分别达到77.73%/78.45%/80.15% AP50，在2080ti GPU上实现132/119/119 FPS。代码发布于https://github.com/wokaikaixinxin/ai4rs。

Abstract

Recent real-time detection transformers have gained popularity due to their simplicity and efficiency. However, these detectors do not explicitly model object rotation, especially in remote sensing imagery where objects appear at arbitrary angles, leading to challenges in angle representation, matching cost, and training stability. In this paper, we propose a real-time oriented object detection transformer, the first real-time end-to-end oriented object detector to the best of our knowledge, that addresses the above issues. Specifically, angle distribution refinement is proposed to reformulate angle regression as an iterative refinement of probability distributions, thereby capturing the uncertainty of object rotation and providing a more fine-grained angle representation. Then, we incorporate a Chamfer distance cost into bipartite matching, measuring box distance via vertex sets, enabling more accurate geometric alignment and eliminating ambiguous matches. Moreover, we propose oriented contrastive denoising to stabilize training and analyze four noise modes. We observe that a ground truth can be assigned to different index queries across different decoder layers, and analyze this issue using the proposed instability metric. We design a series of model variants and experiments to validate the proposed method. Notably, our O2-DFINE-L, O2-RTDETR-R50 and O2-DEIM-R50 achieve 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on the 2080ti GPU. Code is available at https://github.com/wokaikaixinxin/ai4rs.

Keywords: oriented object detection, transformer, remote sensing images, real-time detection, angle distribution refinement, Chamfer distance, contrastive denoising, DOTA dataset
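The abstract describes a Chamfer distance cost that measures box distance via vertex sets. The sketch below shows one common symmetric form of the Chamfer distance applied to the four corners of two rotated boxes. The paper's exact cost (normalization, weighting, how it enters bipartite matching) is not given in the abstract, so the formulation and the names `box_corners` and `chamfer_distance` are assumptions.

```python
import math


def box_corners(cx, cy, w, h, angle):
    """Return the four corners of a w x h box rotated by `angle` radians about its center."""
    c, s = math.cos(angle), math.sin(angle)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, summed over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


gt = box_corners(0.0, 0.0, 4.0, 2.0, 0.0)
pred = box_corners(0.0, 0.0, 4.0, 2.0, math.radians(10))
flipped = box_corners(0.0, 0.0, 4.0, 2.0, math.pi)  # same box, angle differs by 180 degrees

print(chamfer_distance(gt, gt))       # 0.0
print(chamfer_distance(pred, gt))     # small positive value
print(chamfer_distance(flipped, gt))  # effectively zero despite the 180-degree angle gap
```

Because the cost depends only on the vertex sets, two angle parameterizations of the same rectangle score as (near) identical. This is one plausible reading of how a vertex-based cost avoids the ambiguous matches the abstract mentions.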

203. ❌ ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Authors: Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15478v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

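Scoring rationale: The paper focuses on a tuning framework for Video Diffusion Transformers. Its core contribution is a method that needs no video training data and adapts the model using only 2D images to achieve video generation and editing. All scoring keywords relate directly to large language models (LLMs) and associated techniques (fine-tuning, alignment, reasoning, agents, and so on) or to AI applications in specific scientific fields, whereas this paper studies diffusion models for video generation and control, which belongs to computer vision and generative modeling and has no direct connection to LLMs or the listed keywords. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ViFeEdit, a tuning framework for video diffusion transformers that addresses the lack of paired video data and the high compute cost of video control and editing. Trained only on 2D images, it achieves visually faithful and temporally consistent video generation and editing.

Abstract (Chinese translation)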

扩散变换器（Diffusion Transformers, DiTs）在图像和视频生成中展现出卓越的可扩展性与生成质量，这促使学界日益关注将其扩展至可控生成与编辑任务。然而，与图像领域相比，视频控制与编辑的进展仍相对有限，这主要源于配对视频数据的稀缺以及训练视频扩散模型所需的高昂计算成本。为解决这一问题，本文提出了一种无需视频数据的调优框架——ViFeEdit，专为视频扩散变换器设计。该方法无需任何形式的视频训练数据，仅通过二维图像适配即可实现多样化的视频生成与编辑。我们方法的核心在于架构重参数化，该技术将现代视频扩散变换器中的全三维注意力机制解耦为空间独立部分，从而在仅引入极少额外参数的情况下，实现视觉保真度高的编辑效果，同时保持时间一致性。此外，该设计采用双路径流程，配备独立的噪声调度时间步嵌入，展现出对不同条件信号的强大适应能力。大量实验表明，我们的方法仅需对二维图像数据进行极少量训练，即可在可控视频生成与编辑任务中取得令人满意的结果。代码已公开于 https://github.com/Lexie-YU/ViFeEdit。

Abstract

Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to the image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any forms of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results of controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.

Keywords: Video Diffusion Transformers, Video Generation, Video Editing, Video-free Tuning, Architectural Reparameterization, Temporal Consistency, 2D Image Training, Controllable Video Generation

204. ❌ Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation

Authors: Yuanfan Zheng, Kunyu Peng, Xu Zheng, Kailun Yang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

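Scoring rationale: The paper focuses on panoramic semantic segmentation in computer vision, specifically the cross-domain adaptation problem. Its core technique is domain adaptation, so it is highly relevant to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10). However, its content is unrelated to all other keywords (mainly large language models, model training techniques, reasoning methods, AI agents, model compression, and so on), because those keywords target natural language processing or general AI techniques, while this paper is purely computer vision research and involves no language models or related techniques.

!!! tip deepseek-chat TL;DR

The paper proposes the EDA-PSeg framework to address geometric field-of-view distortion and open-set semantic inconsistency in cross-domain panoramic semantic segmentation, achieving state-of-the-art performance across multiple scenarios through Euler-Margin Attention and a Graph Matching Adapter.

Abstract (Chinese translation)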

跨域全景语义分割因其能为现实应用提供全面的360°场景理解而日益受到关注。然而，由于严重的几何视场畸变以及跨域间开放集语义的不一致性，该任务仍面临显著挑战。本研究构建了一个开放集域自适应设定，并提出外推式域自适应全景分割框架，该框架在局部透视视图上进行训练，并在完整的360°全景图像上进行测试，显式地处理跨域的几何视场偏移以及来自未见类别所产生的语义不确定性。为此，我们提出欧拉边际注意力机制，该机制通过引入角度边际以增强视角不变的语义表征，同时执行幅度与相位调制以提升对未见类别的泛化能力。此外，我们设计了图匹配适配器，该模块构建高阶图关系以对齐视场偏移下的共享语义，同时通过结构自适应有效分离新类别。在相机偏移、天气条件及开放集场景下的四个基准数据集上的大量实验表明，EDA-PSeg实现了最先进的性能，对多样化观测几何具有鲁棒的泛化能力，并在不同环境条件下保持稳定性。代码公开于https://github.com/zyfone/EDA-PSeg。

Abstract

Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.

Keywords: panoramic semantic segmentation, domain adaptation, open-set, field of view distortion, Euler-Margin Attention, Graph Matching Adapter, cross-domain, 360° scene understanding

205. ❌ Anchor then Polish for Low-light Enhancement

Authors: Tianle Du, Mingjia Li, Hainuo Wang, Xiaojie Guo Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15472v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

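Scoring rationale: The paper focuses on low-light image enhancement in computer vision and proposes a novel anchor-then-polish framework. It involves computer vision techniques such as image processing, deep learning architecture design, and wavelet transforms, but it does not involve large language models (LLMs), innovations in deep learning principles, AI for Science applications, or any of the specific techniques listed in the scoring keywords (MoE, RLHF, RAG, quantization, and so on). All keywords relate to large models, deep learning principles, or scientific AI applications, while this paper is purely applied computer vision research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address global distortion in low-light image enhancement, the paper proposes an anchor-then-polish framework that separates global energy alignment from local detail refinement and achieves state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)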

低光图像增强因存在光照不足、色彩偏移与纹理干扰等交织的退化问题而极具挑战。现有方法常依赖复杂架构联合处理这些退化，但可能过度拟合简单的物理约束，导致全局失真。本研究提出一种新颖的“锚定-优化”（ATP）框架，从本质上将全局能量对齐与局部细节细化进行解耦。首先，通过仅学习具有12个自由度的场景自适应投影矩阵，定制宏观锚定模块以（显著）稳定亮度分布并校正色彩，证明简单的线性算子即可有效对齐全局能量。随后，宏观锚定将任务简化为微观优化，该模块在矩阵引导下于小波域与色度空间进一步细化细节。我们设计了约束性亮度更新策略，在确保全局一致性的同时引导网络专注于细粒度优化。在多个基准数据集上的大量实验表明，本方法取得了最先进的性能，能够生成视觉自然、量化指标优越的低光增强结果。

Abstract

Low-light image enhancement is challenging due to entangled degradations, mainly including poor illumination, color shifts, and texture interference. Existing methods often rely on complex architectures to address these issues jointly but may overfit simple physical constraints, leading to global distortions. This work proposes a novel anchor-then-polish (ATP) framework to fundamentally decouple global energy alignment from local detail refinement. First, macro anchoring is customized to (greatly) stabilize luminance distribution and correct color by learning a scene-adaptive projection matrix with merely 12 degrees of freedom, revealing that a simple linear operator can effectively align global energy. The macro anchoring then reduces the task to micro polishing, which further refines details in the wavelet domain and chrominance space under matrix guidance. A constrained luminance update strategy is designed to ensure global consistency while directing the network to concentrate on fine-grained polishing. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance, producing visually natural and quantitatively superior low-light enhancements.

Keywords: low-light image enhancement, anchor-then-polish framework, global energy alignment, local detail refinement, wavelet domain, luminance distribution, color correction, matrix guidance
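The abstract states that macro anchoring learns a scene-adaptive projection matrix with 12 degrees of freedom and that a simple linear operator can align global energy. One arrangement consistent with that count is a 3×4 affine map on RGB values (a 3×3 color matrix plus a 3-element offset). The abstract does not say how the 12 parameters are laid out, so this layout, the function names, and the matrix values below are assumptions for illustration.

```python
def apply_projection(pixel, matrix):
    """Apply a 3x4 affine map to an (r, g, b) pixel: out = A @ rgb + offset."""
    r, g, b = pixel
    return tuple(row[0] * r + row[1] * g + row[2] * b + row[3] for row in matrix)


def clamp01(pixel):
    """Clamp each channel to the displayable range [0, 1]."""
    return tuple(min(1.0, max(0.0, v)) for v in pixel)


# Hypothetical matrix: 4x gain on each channel plus a small offset (12 parameters).
matrix = [
    [4.0, 0.0, 0.0, 0.02],
    [0.0, 4.0, 0.0, 0.02],
    [0.0, 0.0, 4.0, 0.02],
]

dark_pixel = (0.05, 0.10, 0.08)
bright_pixel = clamp01(apply_projection(dark_pixel, matrix))
print(bright_pixel)  # roughly (0.22, 0.42, 0.34)
```

In the paper the matrix is predicted per scene by a network. Here it is fixed by hand to show only the linear operator itself.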

206. ❌ Automated Counting of Stacked Objects in Industrial Inspection

Authors: Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu, Hieu Le, Pascal Fua Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

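Scoring rationale: The paper focuses on industrial inspection in computer vision, specifically 3D counting of stacked objects, using multi-view images, geometric reconstruction, and deep-learning-based depth analysis. All scoring keywords concern large-model topics such as large language models, training optimization, inference acceleration, alignment techniques, and agent systems, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a novel 3D counting method that combines geometric reconstruction with deep-learning-based depth analysis to count stacked objects that are hard to count accurately under heavy occlusion in industrial inspection, and validates its robust performance on synthetic and real data.

Abstract (Chinese translation)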

视觉物体计数是工业检测中的一项基础计算机视觉任务，其精准、高吞吐量的库存追踪与质量保障至关重要。此外，制造部件往往因过轻而难以通过重量可靠推断其数量，或因过重而无法安全、实际地移动堆叠物进行称重，这使得自动化视觉计数在许多场景中成为更稳健的解决方案。然而，现有方法在处理容器、托盘或料箱中堆叠的3D物体时面临挑战，因为大多数物体被严重遮挡，仅少数可直接可见。为应对这一重要但尚未被充分探索的难题，我们提出了一种新颖的3D计数方法，将任务分解为两个互补的子问题：从多视角图像中估计堆叠物的3D几何结构及其占用率。通过将几何重建与基于深度学习的深度分析相结合，我们的方法能够精准计数容器内相同的制造部件，即使它们被不规则堆叠且部分隐藏。我们在经过人工验证总数的大规模合成数据及多样化真实数据上验证了所提出的3D计数流程，证明了其在现实检测条件下的稳健性能。

摘要 (Abstract)

Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.

Keywords: visual object counting, industrial inspection, 3D counting, stacked objects, multi-view images, geometric reconstruction, deep learning, occlusion handling
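
The decomposition described in the abstract (stack geometry plus occupancy ratio) suggests a simple way to turn the two estimates into a count. The following is a minimal sketch, assuming the volume of a single part is known and that the count is the occupied volume divided by it. The function name, its parameters, and this combination rule are illustrative assumptions and are not taken from the paper.

```python
def estimate_count(stack_volume: float, occupancy_ratio: float, part_volume: float) -> int:
    """Hypothetical helper: combine a stack volume and an occupancy ratio into a count.

    stack_volume    -- volume enclosed by the reconstructed stack geometry
    occupancy_ratio -- fraction of that volume filled by parts, in [0, 1]
    part_volume     -- volume of one part (assumed known, same units as stack_volume)
    """
    if stack_volume < 0 or part_volume <= 0:
        raise ValueError("stack_volume must be >= 0 and part_volume must be > 0")
    if not 0.0 <= occupancy_ratio <= 1.0:
        raise ValueError("occupancy_ratio must lie in [0, 1]")
    occupied_volume = stack_volume * occupancy_ratio
    # Round to the nearest whole number of parts.
    return round(occupied_volume / part_volume)


if __name__ == "__main__":
    # A 2000 cm^3 stack that is half full of 10 cm^3 parts.
    print(estimate_count(2000.0, 0.5, 10.0))
```

Splitting the problem this way means errors in the geometry estimate and in the occupancy estimate can be inspected separately.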

Authors: Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu, Xuechen Liu, Fuwen Luo, Peng Li, Yang Liu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15467v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

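Scoring rationale: The paper evaluates time awareness and proactive cross-modal perception of multimodal large language models (MLLMs) in a 4D environment. It is highly relevant to 'Large Language Models' (10) because it explicitly studies MLLMs, and to 'LLM Agents' (10) because it studies how models reason and act as agents in an environment. It is partly relevant to 'Chain of Thought' and 'System 2 Thinking' (5 each) because it involves multi-step spatio-temporal reasoning, and to 'Explainable AI' (5) because it analyzes how modality interactions influence decisions. Other keywords such as MoE, SFT, and RAG are not covered and score 0.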
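
The table above lists a weight of 0.0 for every keyword, so even 10/10 relevance values contribute nothing to the total. The sketch below illustrates that arithmetic. The pass threshold (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report configuration; the per-keyword rule `weight * relevance / 10 * max` is an assumption made for illustration, since the report does not state how individual scores are computed.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Pass threshold from the report configuration: 5.0 + 0.8 x total weight."""
    return round(5.0 + 0.8 * total_weight, 1)


def paper_score(rows) -> float:
    """Sum keyword scores for (weight, relevance_out_of_10) pairs.

    ASSUMPTION: each keyword contributes weight * (relevance / 10) * MAX_PER_KEYWORD.
    Any rule that is proportional to the weight gives 0 when the weight is 0.
    """
    return sum(w * (rel / 10.0) * MAX_PER_KEYWORD for w, rel in rows)


if __name__ == "__main__":
    # Non-zero relevance rows from the table above, with the listed weight of 0.0.
    rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 10.0), (0.0, 5.0)]
    threshold = passing_score(27.0)  # 27 keywords, each configured with weight 1.0
    score = paper_score(rows)
    print(score, threshold, score >= threshold)
```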

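!!! tip deepseek-chat TL;DR

The paper builds a 4D Escape Room task environment to evaluate time awareness and proactive cross-modal perception in multimodal large language models, finds that existing models have significant difficulty integrating multimodal information under time constraints, and reveals a modality bias problem.

Abstract translation

Multimodal large language models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments focus mainly on 2D or 3D visual contexts and vision-language tasks, and offer limited support for time-dependent auditory signals and selective cross-modal integration, a capability that is essential for realistic multimodal reasoning when different modalities may provide complementary or interfering information. Whether models can actively coordinate multiple modalities and reason under time-varying, irreversible conditions therefore remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for evaluating selective cross-modal perception and time awareness in Omni models. The environment incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we construct a benchmark to evaluate these abilities across a range of powerful models. The results show that models struggle with modality bias and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.

Abstract

Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information, which are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.

Keywords: Multimodal Large Language Models, 4D environment, time awareness, cross-modal perception, spatio-temporal reasoning, EscapeCraft-4D, modality bias, multimodal integration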

208. ❌ MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts

Authors: Zheng Zhang, Qinchuan Zhang, Yuteng Ye, Zhi Chen, Penglei Ji, Mengfei Li, Wenxiao Zhang, Yuan Liu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15436v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

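Scoring rationale: The paper focuses on texture generation for 3D assets and proposes MV2UV, a method that combines multiview generation with UV inpainting. Although it uses generative models (possibly diffusion-based), all keywords target large language models (LLMs) and related techniques (training, alignment, inference optimization, agents, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper has no direct connection to these LLM-specific techniques or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MV2UV, a new method that combines multiview generation with UV inpainting to address multiview inconsistency and missing textures on occluded parts in 3D asset texture generation; experiments show that it outperforms existing methods in generation quality.

Abstract translation

Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from multiview inconsistency and missing textures on unseen parts, while UV inpainting methods generalize poorly because UV data is scarce and cannot make full use of 2D image diffusion priors. This paper proposes a new method called MV2UV that combines the 2D generative priors of multiview generation with the inpainting ability of UV refinement to obtain high-quality texture maps. The core idea is to adopt a UV-space generative model that inpaints the unseen parts of the multiview images while resolving their inconsistencies. Experiments show that our method achieves better texture generation quality than existing methods, especially in unseen occluded regions and multiview-inconsistent parts.

Abstract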

Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from the multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot well utilize 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method enables a better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.

Keywords: texture generation, 3D assets, multiview inconsistency, UV inpainting, generative priors, diffusion models, occluded parts, texture maps

209. ❌ Real-Time Human Frontal View Synthesis from a Single Image

Authors: Fangyu Lin, Yingdong Hu, Lunjie Zhu, Zhening Liu, Yushi Huang, Zehong Lin, Jun Zhang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15433v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

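Scoring rationale: The paper "Real-Time Human Frontal View Synthesis from a Single Image" belongs to computer vision and graphics and studies real-time synthesis of a human frontal view from a single image. Its PrismMirror framework involves geometric feature learning, rendering supervision, and a lightweight linear attention model, but none of this matches the keywords, which target large models, deep learning technical principles, or AI applications in science. Specifically: 1) the paper does not involve language models (LLMs/SLMs), pre-training or post-training, alignment, RLHF, RAG, reasoning techniques (CoT, MCTS), agent systems, tool use, quantization, hallucination mitigation, interpretability, world models, model merging, or in-context learning. 2) Although it mentions "linear attention", this is part of a lightweight model and differs from what the keyword "Linear Attention" means in the context of Transformer optimization. 3) The paper is a computer vision application and does not fall under the bioinformatics or cheminformatics subfields of "AI for Science". Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper tackles real-time synthesis of photorealistic human frontal views from a single image and proposes the PrismMirror framework, which uses cascade geometric feature learning and a lightweight linear attention model to achieve, for the first time, real-time inference at 24 FPS while significantly outperforming previous methods in visual authenticity and structural accuracy.

Abstract translation

Photorealistic human novel view synthesis from a single image is crucial for making immersive 3D telepresence widely accessible, because it removes the dependence on complex multi-camera systems. However, current rendering-centric methods prioritize visual fidelity over an understanding of explicit geometry and struggle with intricate regions such as faces and hands, which leads to temporal instability. Meanwhile, human-centric frameworks usually rely on an auxiliary model to supply informative structural priors for geometric modeling, which creates a memory bottleneck and limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model to achieve real-time inference at 24 FPS, and it significantly outperforms previous methods in both visual authenticity and structural accuracy.

Abstract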

Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.

Keywords: Human Frontal View Synthesis, Single Image, Real-Time, PrismMirror, Geometry-Guided, Cascade Learning, Linear Attention, SMPL-X

210. ❌ Gym-V: A Unified Vision Environment System for Agentic Vision Research

Authors: Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

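Scoring rationale: The paper introduces Gym-V, a unified platform for vision agent research comprising 179 procedurally generated visual environments. Although it involves agents and reinforcement learning (RL), its core contribution is infrastructure and an evaluation toolkit rather than large-model techniques themselves. It is entirely unrelated to most keywords (LLMs, MoE, scaling laws, the various training methods, reasoning techniques, and so on). It is only weakly related to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5), because it mentions 'agentic systems' and 'agentic VLMs' without discussing specific LLM agent techniques in depth. Other keywords such as AI for Science are not covered either.

!!! tip deepseek-chat TL;DR

To address the lack of standardized infrastructure for vision agent research, the paper proposes the unified Gym-V platform and runs experiments on 179 procedurally generated visual environments, finding that observation scaffolding matters more for training success than the choice of RL algorithm and that training on diverse tasks yields broader generalization.

Abstract translation

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents currently lack such infrastructure, which limits systematic study of the key factors that drive their learning and of where existing models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, which makes possible controlled experiments that were previously hard to run across fragmented toolkits. Using the platform, we find that observation scaffolding matters more for training success than the choice of reinforcement learning algorithm, with environment captions and game rules directly determining whether learning succeeds. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly, whereas narrow training can cause negative transfer, and that multi-turn interaction amplifies all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic vision-language models.

Abstract

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.

Keywords: Gym-V, vision agents, reinforcement learning, visual environments, observation scaffolding, cross-domain transfer, agentic VLMs, evaluation toolkit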

211. ❌ AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Authors: Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma, Yujian Zheng, Jing Wang, Zheng Chong, Xujie Zhang, Xianhang Cheng, Xiaodan Liang, Hao Li Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

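Scoring rationale: The paper "AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation" belongs to computer vision and video generation, specifically multi-character animation generation. It proposes a Diffusion Transformer (DiT)-based framework that uses an Instance-Isolated Latent Representation (IILR), Tri-Stage Decoupled Attention (TSDA), and Adaptive Gated Fusion (AGF) to resolve identity entanglement and pose binding in multi-character animation. However, all scoring keywords relate directly to large models, deep learning technical principles, or AI applications in science, whereas this paper is about video generation and animation control, a subfield of computer graphics and vision, and does not involve large-model techniques, training methods, inference optimization, AI agents, or scientific applications. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses identity entanglement and inconsistent pose binding in multi-character animation generation and proposes AnyCrowd, a Diffusion Transformer-based framework that uses instance-isolated representations and attention mechanisms to generate controllable animation for an arbitrary number of characters.

Abstract translation

Controllable character animation has advanced rapidly in recent years, but multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, causing identity confusion and reduced controllability. Learning precise, spatio-temporally consistent correspondences between reference identities and driving pose sequences also becomes increasingly difficult, often leading to identity-pose mis-binding and inconsistency in the generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework that scales to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes each character instance independently before DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA), which binds identities to driving poses by decomposing self-attention into (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. In addition, to mitigate token ambiguity in overlapping regions, we integrate an Adaptive Gated Fusion (AGF) module into TSDA that predicts identity-aware weights, effectively fusing competing token groups into identity-consistent representations.

Abstract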

Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations…

Keywords: multi-character animation, Diffusion Transformer (DiT), identity entanglement, pose binding, video generation, instance-isolated representation, attention mechanism, controllable animation

212. ❌ Pointing-Based Object Recognition

Authors: Lukáš Hajdúch, Viktor Kocur Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

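Scoring rationale: The paper studies a system that recognizes the objects targeted by human pointing gestures in RGB images, integrating object detection, pose estimation, depth estimation, and vision-language models. Although it uses vision-language models (a kind of multimodal model), it does not cover any of the specifics named in the scoring keywords: LLM technical principles, training methods, inference optimization, alignment techniques, agent systems, or scientific AI applications. All keywords concern the core techniques, training paradigms, optimization methods, or specific application areas of large language models, whereas this paper is mainly an integrated application of computer vision and robot interaction and has no direct connection to these LLM-specific keywords.

!!! tip deepseek-chat TL;DR

The paper proposes an integrated system for recognizing the objects targeted by human pointing gestures in RGB images, combining object detection, pose estimation, depth estimation, and vision-language models; experiments show that depth information significantly improves target identification accuracy in complex scenes.

Abstract translation

This paper presents a complete pipeline for recognizing the objects targeted by human pointing gestures in RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify the targets of non-verbal communication becomes crucial. The proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the usefulness of image captioning models for correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification accuracy, especially in complex scenes where objects overlap. The modular design allows the approach to be deployed in environments that lack dedicated depth sensors.

Abstract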

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.

Keywords: pointing gesture recognition, object detection, human-robot interaction, monocular depth estimation, vision-language models, 3D spatial information, target identification, RGB images

213. ❌ Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation

Authors: Xiaoxian Zhang, Minghai Shi, Lei Li Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15374v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

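Scoring rationale: The paper focuses on parameter-efficient adaptation of vision foundation models for depth estimation in medical imaging (colonoscopy); its core innovation is an adaptive spectral rectification module. The highly relevant keywords are Foundation Models (used directly in the paper), Domain Adaptation (the core research problem), Parameter-efficient Fine-tuning (the defining property of the SpecDepth framework), and AI for Science (a medical imaging application). The other keywords mainly concern language models, reasoning, alignment, compression, and similar topics, and are entirely unrelated to the paper's computer vision and medical imaging application.

!!! tip deepseek-chat TL;DR

The paper proposes SpecDepth, a parameter-efficient adaptive spectral rectification framework that amplifies the high-frequency features of colonoscopy images to address the generalization problem of foundation models in medical depth estimation, and achieves state-of-the-art performance on two public datasets.

Abstract translation

Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images do not generalize directly to the colonoscopy domain. We find that the core issue is not a semantic gap but a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edges and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained model while adapting it to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Unlike conventional fine-tuning, which may distort high-level semantic features, this targeted low-level adjustment realigns the input signal with the original inductive bias of the foundation model. On the public C3VD and SimCol3D datasets, SpecDepth achieves absolute relative errors of 0.022 and 0.027 respectively, reaching state-of-the-art performance. Our work shows that directly addressing spectral mismatch is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.

Abstract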

Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.

Keywords: Foundation Models, Domain Adaptation, Parameter-efficient Fine-tuning, Colonoscopy, Depth Estimation, Spectral Rectification, Medical Imaging, AI for Science
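
The abstract describes amplifying attenuated high-frequency components through a wavelet decomposition. As a rough illustration of that idea only, the sketch below applies one level of the Haar wavelet transform to a 1D signal, scales the detail (high-frequency) coefficients by a fixed gain, and reconstructs the signal. The function name, the 1D setting, the Haar basis, and the fixed `gain` are all assumptions for illustration; SpecDepth itself operates on feature maps with a learnable decomposition whose details are not given in this report.

```python
import math


def haar_rectify(signal, gain):
    """Scale the high-frequency band of a 1D signal using one Haar wavelet level.

    signal -- sequence of floats with even length
    gain   -- multiplier for the detail coefficients (1.0 leaves the signal unchanged)
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    out = []
    for i in range(0, len(signal), 2):
        x0, x1 = signal[i], signal[i + 1]
        approx = (x0 + x1) / root2   # low-frequency (average) coefficient
        detail = (x0 - x1) / root2   # high-frequency (difference) coefficient
        detail *= gain               # amplify or attenuate the high-frequency band
        # Inverse Haar transform of the modified coefficients.
        out.append((approx + detail) / root2)
        out.append((approx - detail) / root2)
    return out


if __name__ == "__main__":
    weak_edge = [1.0, 1.2, 1.0, 1.2]
    print(haar_rectify(weak_edge, gain=3.0))
```

With `gain` above 1 the local contrast between neighbouring samples grows while each pair's average is preserved, which is the sense in which the low-frequency content stays intact.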

Authors: Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

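Scoring rationale: This paper studies the vision-and-language navigation (VLN) task and proposes NavGRPO, a reinforcement learning framework aimed at making navigation policies more robust. It covers reinforcement learning, trajectory diversity, and policy optimization, but does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or applications of AI to science. The scoring keywords all concern LLM technology, model training and optimization, inference acceleration, and AI for science, whereas this paper is reinforcement learning research in computer vision and robot navigation, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the poor generalization of imitation learning methods and their weak robustness to execution perturbations in vision-and-language navigation, the paper proposes NavGRPO, a reinforcement learning framework based on Group Relative Policy Optimization. Its gains on the R2R and REVERIE benchmarks indicate that goal-directed RL training yields more robust navigation policies.

Abstract (Chinese translation)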

视觉与语言导航任务要求智能体依据自然语言指令在逼真环境中进行导航。现有方法主要依赖模仿学习，其泛化能力有限且对执行扰动的鲁棒性较差。本文提出NavGRPO，一种通过组内相对策略优化学习目标导向导航策略的强化学习框架。该方法通过探索多样化的轨迹并利用组内性能比较进行优化，使智能体能够超越专家路径识别有效策略，且无需额外的价值网络。基于ScaleVLN构建的NavGRPO在R2R和REVERIE基准测试中实现了卓越的鲁棒性，在未见环境中分别获得+3.0%和+1.71%的SPL提升。在极端早期阶段扰动下，相较于基线方法我们取得了+14.89%的SPL增益，证实了目标导向的强化学习训练能构建显著更鲁棒的导航策略。代码与模型将公开发布。

Abstract

Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.

Keywords: Vision-and-Language Navigation, Reinforcement Learning, Robust Navigation, Group Relative Policy Optimization, Trajectory Diversity, Execution Perturbations, Goal-directed Policies, SPL Improvement
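
The "within-group performance comparisons" in the abstract correspond to the group-relative advantage commonly used in Group Relative Policy Optimization: several trajectories are sampled for the same instruction, and each reward is standardized against its own group, which is why no separate value network is needed. The sketch below shows only that normalization step. The reward values are hypothetical, and NavGRPO's actual reward design is not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against the other rollouts in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for one instruction.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat the group average receive positive advantages and the rest negative ones. If every trajectory earns the same reward, all advantages are zero and the group contributes no gradient signal.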

215. ❌ IRIS: Intersection-aware Ray-based Implicit Editable Scenes

Authors: Grzegorz Wilczyński, Mikołaj Zieliński, Krzysztof Byrski, Joanna Waczyńska, Dominik Belter, Przemysław Spurek Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

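Scoring rationale: The IRIS paper belongs to computer vision and graphics. It studies efficient rendering and scene editing for Neural Radiance Fields (NeRF) and 3D Gaussian splatting. Its core contribution is an analytical sampling strategy based on exact intersections between rays and scene primitives, together with a continuous feature aggregation mechanism, enabling real-time rendering and flexible shape editing. The scoring keywords all concern LLMs and related techniques (training, alignment, inference optimization, agent systems, and so on) or specific AI-for-science applications (such as bioinformatics). This paper involves neither language models, nor innovations in the underlying principles of deep learning, nor any subfield named by the keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

IRIS is a new framework for efficient, interactive scene editing. An analytical sampling strategy pinpoints the intersections between rays and scene primitives, eliminating empty-space processing, and a continuous feature aggregation mechanism bypasses costly 3D searches, giving high-fidelity real-time rendering and flexible shape editing.

Abstract (Chinese translation)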

神经辐射场能够实现高保真的场景表征，但其训练与渲染成本高昂，而三维高斯泼溅技术虽能提供实时性能且实证效果优异，却存在明显局限。近期研究尝试结合两者优势，以高斯分布作为代理来引导神经场评估，但仍受计算效率低下的困扰。这类方法通常依赖随机体素采样进行特征聚合，严重限制了渲染性能。为解决此问题，本文提出名为IRIS（基于射线交点感知的可编辑隐式场景）的新型框架，旨在实现高效交互式的场景编辑。为突破传统射线步进法的局限，本方法采用解析采样策略，精确计算射线与场景基元之间的交点，从而有效避免无效空间处理。此外，针对空间邻近查询的计算瓶颈，本框架引入沿射线操作的连续特征聚合机制。通过对排序交点的隐式属性进行插值计算，该方法规避了昂贵的三维搜索，在确保几何一致性的同时，实现了高保真实时渲染与灵活的形体编辑。代码发布于https://github.com/gwilczynski95/iris。

Abstract

Neural Radiance Fields achieve high-fidelity scene representation but suffer from costly training and rendering, while 3D Gaussian splatting offers real-time performance with strong empirical results. Recently, solutions that harness the best of both worlds by using Gaussians as proxies to guide neural field evaluations, still suffer from significant computational inefficiencies. They typically rely on stochastic volumetric sampling to aggregate features, which severely limits rendering performance. To address this issue, a novel framework named IRIS (Intersection-aware Ray-based Implicit Editable Scenes) is introduced as a method designed for efficient and interactive scene editing. To overcome the limitations of standard ray marching, an analytical sampling strategy is employed that precisely identifies interaction points between rays and scene primitives, effectively eliminating empty space processing. Furthermore, to address the computational bottleneck of spatial neighbor lookups, a continuous feature aggregation mechanism is introduced that operates directly along the ray. By interpolating latent attributes from sorted intersections, costly 3D searches are bypassed, ensuring geometric consistency, enabling high-fidelity, real-time rendering, and flexible shape editing. Code can be found at https://github.com/gwilczynski95/iris.

Keywords: Neural Radiance Fields, 3D Gaussian splatting, real-time rendering, scene editing, ray marching, analytical sampling, feature aggregation, interactive editing
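
The two ideas in the abstract, analytic ray-primitive intersection instead of ray marching and interpolation of attributes along sorted intersections, can be sketched with a deliberately simplified stand-in. Spheres carrying a single scalar feature are an assumption made here for illustration; IRIS itself uses Gaussians as proxies, and its exact intersection and aggregation formulas are not given in the abstract.

```python
import math
from bisect import bisect_right


def ray_sphere_hit(origin, direction, center, radius):
    """Nearest non-negative hit distance along a unit-length direction, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c  # discriminant of t^2 + 2bt + c = 0
    if disc < 0.0:
        return None  # the ray misses the sphere entirely
    root = math.sqrt(disc)
    t_near, t_far = -b - root, -b + root
    if t_far < 0.0:
        return None  # the sphere lies behind the ray origin
    return t_near if t_near >= 0.0 else t_far


def sorted_hits(origin, direction, primitives):
    """Collect (t, feature) only where the ray actually meets a primitive."""
    hits = []
    for center, radius, feature in primitives:
        t = ray_sphere_hit(origin, direction, center, radius)
        if t is not None:
            hits.append((t, feature))
    hits.sort(key=lambda h: h[0])
    return hits


def interpolate_feature(hits, t):
    """Linearly interpolate the feature at depth t, clamping outside the hit range."""
    if not hits:
        return None
    ts = [h[0] for h in hits]
    if t <= ts[0]:
        return hits[0][1]
    if t >= ts[-1]:
        return hits[-1][1]
    i = bisect_right(ts, t)
    (t0, f0), (t1, f1) = hits[i - 1], hits[i]
    w = (t - t0) / (t1 - t0)
    return (1.0 - w) * f0 + w * f1
```

No samples are spent on empty space because depths come only from the closed-form solutions, and the lookup along the ray is a 1D binary search rather than a 3D neighbor query.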

216. ❌ Oscillating Dispersion for Maximal Light-throughput Spectral Imaging

Authors: Jiuyun Zhang, Zhan Shi, Linsen Chen, Xun Cao Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15348v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

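Scoring rationale: This paper focuses on a hardware innovation for computational spectral imaging (ODIS) and a deep learning reconstruction algorithm (PDAUN), placing it in optical imaging and computer vision. The keywords concern LLMs, deep learning principles, AI alignment, inference optimization, and other large-model topics, none of which is relevant here. Every keyword therefore receives a relevance of 0 except 'AI for Science', which receives a relevance of 5 because the work involves a scientific application (spectral imaging). The paper does not involve any large-model technique or application.

!!! tip deepseek-chat TL;DR

The paper proposes a new Oscillating Dispersion Imaging Spectrometer (ODIS) and a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN). Together they address the low light throughput of conventional computational spectral imaging systems and achieve high-fidelity spectral reconstruction under low illumination.

Abstract (Chinese translation)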

现有计算光谱成像系统通常依赖编码孔径和分束器，这些组件会阻挡大部分入射光，导致在弱光条件下重建质量下降。为解决这一局限性，我们开发了振荡色散成像光谱仪（ODIS），该系统首次通过沿单一光路轴向平移色散元件，使其在共轭像平面与离焦位置间连续移动，依次捕获全色（PAN）图像和色散测量数据，从而实现了接近全光通量的传输。我们进一步提出了一种全色引导的色散感知深度展开网络（PDAUN），该网络能够在全色结构引导下从无掩模色散数据中重建高保真光谱信息。其数据保真度步骤通过利用ODIS前向模型的循环卷积特性，推导出基于FFT-伍德伯里分解的预处理求解器；同时，色散感知可变形卷积模块（DADC）利用全色特征校正亚像素级的光谱错位。实验表明，该方法在标准基准测试中达到了最先进的性能，跨系统比较证实ODIS在低照度条件下具有决定性优势。通过物理原型机验证了该系统可实现高保真光谱重建。

Abstract

Existing computational spectral imaging systems typically rely on coded aperture and beam splitters that block a substantial fraction of incident light, degrading reconstruction quality under light-starved conditions. To address this limitation, we develop the Oscillating Dispersion Imaging Spectrometer (ODIS), which for the first time achieves near-full light throughput by axially translating a disperser between the conjugate image plane and a defocused position, sequentially capturing a panchromatic (PAN) image and a dispersed measurement along a single optical path. We further propose a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) that recovers high-fidelity spectral information from maskless dispersion under PAN structural guidance. Its data-fidelity step derives an FFT-Woodbury preconditioned solver by exploiting the cyclic-convolution property of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment using PAN features. Experiments show state-of-the-art performance on standard benchmarks, and cross-system comparisons confirm that ODIS yields decisive gains under low illumination. High-fidelity reconstruction is validated on a physical prototype.

Keywords: computational spectral imaging, light throughput, oscillating dispersion, deep unfolding network, PAN-guided reconstruction, dispersion-aware, low illumination, spectral reconstruction

217. ❌ MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

Authors: Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, Yan Wang Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15330v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

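Scoring rationale: MeMix focuses on streaming 3D reconstruction in computer vision and proposes a training-free module that mitigates state drift and forgetting in recurrent models. Although the work falls under AI applications, the given keywords all relate directly to LLMs, deep learning principles, or specific scientific fields (such as bioinformatics), while the paper's core content (3D reconstruction, memory management, recurrent neural networks) has no direct connection to them. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MeMix, a training-free module that recasts the recurrent state as a memory mixture to mitigate state drift and forgetting in streaming 3D reconstruction. On 300 to 500 frame streams from 7-Scenes it reduces reconstruction completeness error by 15.3% on average.

Abstract (Chinese translation)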

重建是三维视觉中的基础任务，也是空间智能的核心能力。其中，流式三维重建对于实时空间感知至关重要，然而现有的循环在线模型在长序列上常因状态漂移和遗忘而出现渐进性性能退化，这促使了推理时补救方法的研究。我们提出了MeMix，一种免训练、即插即用的模块，它通过将循环状态重构为记忆混合体来改进流式重建。MeMix将状态划分为多个独立的内存块，仅更新对齐程度最低的内存块，同时严格保留其他部分。这种选择性更新缓解了灾难性遗忘，同时保持了O(1)的推理内存开销，且无需微调或额外的可学习参数，使其可直接应用于现有的循环重建模型。在标准基准数据集（ScanNet、7-Scenes、KITTI等）上，在相同骨干网络和推理设置下，MeMix在7-Scenes的300至500帧流数据上将重建完整性误差平均降低了15.3%（最高达40.0%）。代码发布于https://dongjiacheng06.github.io/MeMix/。

Abstract

Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300–500 frame streams on 7-Scenes. The code is available at https://dongjiacheng06.github.io/MeMix/

Keywords: streaming 3D reconstruction, recurrent models, memory mixture, catastrophic forgetting, state drift, training-free module, real-time spatial perception, selective update
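
The selective update described in the abstract (write only the least-aligned memory patches, preserve the rest exactly) can be sketched as follows. The abstract does not say how alignment is measured or how many patches are written per step, so the cosine similarity between each stored patch and its candidate update, and the parameter `k`, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def selective_update(state, candidate, k):
    """Overwrite the k least-aligned patches; keep every other patch untouched."""
    scores = [cosine(old, new) for old, new in zip(state, candidate)]
    order = sorted(range(len(state)), key=lambda i: scores[i])
    to_write = set(order[:k])
    return [
        list(candidate[i]) if i in to_write else state[i]
        for i in range(len(state))
    ]


# Hypothetical state with three 2D memory patches and a candidate update.
state = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]
new_state = selective_update(state, candidate, k=1)
```

The state keeps a fixed number of patches, so memory stays constant in the stream length, and untouched patches are returned as the same objects rather than recomputed.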

218. ❌ Generative Video Compression with One-Dimensional Latent Representation

Authors: Zihan Zheng, Zhaoyang Jia, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Zhenghao Chen, Houqiang Li, Yan Lu Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15302v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

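Scoring rationale: The paper focuses on video compression and proposes GVC1D, a generative video compression method that uses a one-dimensional latent representation. Although it involves generative models and deep learning, its core content is unrelated to the scoring keywords, all of which center on LLMs and their associated techniques, applications, and optimization methods. The paper does not touch on language models, MoE, scaling laws, training methods, alignment, inference acceleration, AI for Science, or similar topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GVC1D, a generative video compression method built on a one-dimensional latent representation. By reducing spatial and temporal redundancy, it achieves higher compression efficiency than prior methods on the HEVC Class B dataset (bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS).

Abstract (Chinese translation)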

近期生成式视频编解码器（GVC）通常将视频编码为二维潜在网格，并采用高容量生成式解码器进行重建。然而，该范式在充分利用时空冗余方面仍面临两个关键挑战：在空间上，二维潜在网格因其刚性结构不可避免地保留了帧内冗余，相邻图像块保持高度相似，从而导致需要更高码率。在时间上，二维潜在网格难以以紧凑且语义连贯的方式有效建模长期相关性，因为它阻碍了跨帧共有内容的聚合。为应对这些局限，我们提出了基于一维潜在表示的生成式视频压缩方法（GVC1D）。GVC1D将视频数据编码为极端紧凑的一维潜在标记，并基于短期与长期上下文进行条件化。由于摆脱了刚性的二维空间对应关系，这些一维潜在标记能够自适应地关注语义区域，并自然地促进标记精简，从而降低空间冗余。此外，所提出的一维记忆模块在保持低计算成本的同时提供了语义丰富的长期上下文，进一步减少了时间冗余。实验结果表明，GVC1D实现了卓越的压缩效率：在HEVC Class B数据集上，其码率在LPIPS指标下降低60.4%，在DISTS指标下降低68.8%，超越了现有视频压缩方法。项目地址：https://gvc1d.github.io/

Abstract

Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extreme compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset, surpassing the previous video compression methods. Project: https://gvc1d.github.io/

Keywords: Generative Video Compression, One-Dimensional Latent Representation, Video Codec, Spatial-Temporal Redundancy, Latent Tokens, Compression Efficiency, Bitrate Reduction, HEVC Dataset

219. ❌ GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection

Authors: Aggelos Psiris, Yannis Panagakis, Maria Vakalopoulou, Georgios Th. Papadopoulos Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15300v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

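Scoring rationale: The paper focuses on industrial visual anomaly detection using a Graph Attention Network (GAT) and a reconstruction-based approach. It does not involve LLMs, innovations in deep learning principles, or AI applications in science. The keywords all relate to large models, deep learning techniques, or AI for science, while this paper belongs to anomaly detection within computer vision and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper proposes GATE-AD, a few-shot industrial visual anomaly detection method based on Graph Attention Network encoding, which achieves state-of-the-art detection accuracy and faster inference on several benchmarks.

Abstract (Chinese translation)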

少样本工业视觉异常检测（Few-Shot Industrial Visual Anomaly Detection，简称FS-IVAD）是现代制造业中的一项关键任务，其要求自动化产品检测系统仅利用少量正常/无缺陷训练样本即可识别罕见缺陷。在此背景下，本研究提出了一种新颖的基于重构的方法，命名为GATE-AD。具体而言，该框架采用一种掩码化、表示对齐的图注意力网络（Graph Attention Network，GAT）编码方案，以学习正常样本的鲁棒外观模式。通过将密集的图像块级视觉特征标记作为图节点，模型利用堆叠的自注意力层自适应地编码复杂、不规则、非欧几里得的局部关系。该图通过一个基于可学习潜在空间的表示对齐组件进行增强，其中高重构残差区域（即缺陷）使用一种缩放余弦误差（Scaled Cosine Error，SCE）目标函数进行评估。在MVTec AD、VisA和MPDD工业缺陷检测基准数据集上进行的大量对比实验表明，与现有性能最优的文献方法相比，GATE-AD在1至8样本设置下均达到了最先进的性能，同时实现了最高的检测精度（在MPDD数据集的8样本情况下，图像AUROC指标最高提升1.8%）与最低的单图像推理延迟（至少加快25.05%）。为促进可复现性与进一步研究，GATE-AD的源代码已公开于https://github.com/gthpapadopoulos/GATE-AD。

Abstract

Few-Shot Industrial Visual Anomaly Detection (FS-IVAD) comprises a critical task in modern manufacturing settings, where automated product inspection systems need to identify rare defects using only a handful of normal/defect-free training samples. In this context, the current study introduces a novel reconstruction-based approach termed GATE-AD. In particular, the proposed framework relies on the employment of a masked, representation-aligned Graph Attention Network (GAT) encoding scheme to learn robust appearance patterns of normal samples. By leveraging dense, patch-level, visual feature tokens as graph nodes, the model employs stacked self-attentional layers to adaptively encode complex, irregular, non-Euclidean, local relations. The graph is enhanced with a representation alignment component grounded on a learnable, latent space, where high reconstruction residual areas (i.e., defects) are assessed using a Scaled Cosine Error (SCE) objective function. Extensive comparative evaluation on the MVTec AD, VisA, and MPDD industrial defect detection benchmarks demonstrates that GATE-AD achieves state-of-the-art performance across the $1$- to $8$-shot settings, combining the highest detection accuracy (increase up to $1.8\%$ in image AUROC in the 8-shot case in MPDD) with the lowest per-image inference latency (at least $25.05\%$ faster), compared to the best-performing literature methods. In order to facilitate reproducibility and further research, the source code of GATE-AD is available at https://github.com/gthpapadopoulos/GATE-AD.

Keywords: Few-Shot Industrial Visual Anomaly Detection, Graph Attention Network, Reconstruction-based approach, Masked representation-aligned encoding, Scaled Cosine Error, MVTec AD, VisA, MPDD
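
The Scaled Cosine Error named in the abstract is usually written as (1 - cos(x, z))^γ averaged over nodes, the form used in GraphMAE. The sketch below follows that form. The abstract does not give GATE-AD's exact formula or its γ, so `gamma=2.0`, the `eps` guard, and the function names are illustrative assumptions.

```python
import math


def sce_per_node(target, recon, gamma=2.0, eps=1e-12):
    """Per-node scaled cosine error between target and reconstructed features."""
    errors = []
    for x, z in zip(target, recon):
        dot = sum(a * b for a, b in zip(x, z))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
        cos = dot / max(norm, eps)
        # Clamp at 0 so rounding noise cannot give a negative base for the power.
        errors.append(max(0.0, 1.0 - cos) ** gamma)
    return errors


def scaled_cosine_error(target, recon, gamma=2.0):
    """Mean SCE over all nodes (patch tokens in the abstract's setting)."""
    errors = sce_per_node(target, recon, gamma)
    return sum(errors) / len(errors)


# Hypothetical 2D features for two nodes: one reconstructed well, one poorly.
node_errors = sce_per_node([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [1.0, 0.0]])
```

Because the error depends only on the angle between vectors, rescaling a reconstruction does not change it. Raising to a power γ above 1 down-weights nodes that are already reconstructed well, so large residuals, which the abstract associates with defects, dominate.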

220. ❌ Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling

Authors: Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15279v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

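Scoring rationale: The paper studies Conditional Flow Matching (CFM), a generative modeling method, and focuses on training and inference acceleration for image and video generation. It covers data-noise coupling optimization, optimal transport theory, and improved sampling trajectories, but does not involve LLMs, innovations in deep learning principles, AI-for-science applications, or any specific technique among the keywords (MoE, RLHF, RAG, and so on). The keywords all concern LLMs and related techniques or AI for science, whereas this paper studies continuous normalizing flows and an alternative to diffusion models, a different branch of generative modeling, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LOOM-CFM, which improves Conditional Flow Matching by optimizing data-noise pairings across minibatches, yielding consistent improvements in the sampling speed-quality trade-off on multiple datasets.

Abstract (Chinese translation)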

条件流匹配是一种用于训练连续归一化流的免模拟方法，为图像和视频生成等关键任务提供了扩散模型的高效替代方案。条件流匹配在此类任务中的性能取决于数据与噪声的耦合方式。近期研究采用小批量最优传输方法，通过在每一步训练中重新分配噪声-数据对来优化采样轨迹，从而加速推理过程。然而，该方法的优化仅限于单个小批量数据，限制了其在大规模数据集上的有效性。为克服这一局限，我们提出LOOM-CFM（跨小批量优化条件流匹配），这是一种通过在整个训练过程中跨小批量保持并优化噪声-数据分配，从而扩展小批量最优传输作用范围的新方法。我们的方法在多个数据集上均实现了采样速度与质量权衡的持续改进。LOOM-CFM同时提升了蒸馏初始化效果，并支持潜在空间训练中的高分辨率合成。

Abstract

Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.

Keywords: Conditional Flow Matching, continuous normalizing flows, minibatch optimal transport, sampling acceleration, data-noise coupling, generative models, inference speed, latent space training
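
The minibatch optimal transport step that the abstract takes as its starting point can be sketched for a tiny batch. With equal-sized batches and uniform weights, the OT coupling reduces to an assignment problem, solved here by brute force over permutations. This illustrates only the per-minibatch baseline, not LOOM-CFM's cross-minibatch scheme, and the squared Euclidean cost is an assumption. For realistic batch sizes a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def minibatch_ot_pairing(noise, data):
    """Return (perm, cost) where noise[perm[i]] is paired with data[i].

    Brute force over all permutations: only viable for very small batches.
    """
    n = len(data)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(noise[perm[i]], data[i]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Hypothetical 1D minibatch: three noise samples and three data samples.
noise = [[0.0], [10.0], [5.0]]
data = [[9.0], [1.0], [4.0]]
pairing, cost = minibatch_ot_pairing(noise, data)
```

Pairing each data point with nearby noise shortens and straightens the paths the flow has to learn, which is the mechanism the abstract credits for faster inference. The limitation it points out is that such a pairing is only optimal within the current batch.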

221. ❌ Dataset Diversity Metrics and Impact on Classification Models

Authors: Théo Sourget, Niclas Claßen, Jack Junchi Xu, Rob van der Goot, Veronika Cheplygina Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15276v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

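Scoring rationale: The paper studies dataset diversity metrics and their impact on classification models, focusing on diversity assessment for images, text, and metadata, with experiments on MorphoMNIST and PadChest. Its content is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, agents, and so on). Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat relevant: the paper analyzes a medical imaging (chest X-ray) dataset, which counts as an application of AI in science and biomedicine, but it is not research on a core technical innovation, hence a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper studies how dataset diversity metrics are defined and evaluated and how they affect classification performance. It finds that AUC correlates only weakly with reference-free image and metadata diversity metrics but more strongly with FID and semantic diversity metrics. A clinical expert identifies scanners as the main source of diversity in practice, yet adding another scanner to the training set leads to shortcut learning.

Abstract (Chinese translation)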

训练数据集的多样性通常被视为获得稳健模型的重要方面。然而，多样性的定义往往未被明确界定或在各文献中存在差异，且尽管存在一些度量指标，在开发新算法时对这种多样性的量化却常被忽视。在本研究中，我们利用 MorphoMNIST（一个具有可控扰动的玩具数据集）和 PadChest（一个公开可用的胸部 X 射线数据集），探究了针对图像、文本和元数据的多种数据集多样性度量的行为。我们评估了这些度量是否彼此相关，同时也评估了它们是否与临床专家的直觉相符。我们还检验了这些度量是否与下游任务性能相关，以及它们如何影响模型的训练动态。我们发现，AUC 与图像或元数据的无参考多样性度量之间相关性有限，但与 FID（弗雷歇起始距离）和语义多样性度量的相关性较高。最后，临床专家指出，扫描仪在实践中是多样性的主要来源。然而，我们发现，在训练集中添加另一台扫描仪会导致捷径学习。本研究所用代码可在 https://github.com/TheoSourget/dataset_diversity_evaluation 获取。

Abstract

The diversity of training datasets is usually perceived as an important aspect to obtain a robust model. However, the definition of diversity is often not defined or differs across papers, and while some metrics exist, the quantification of this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other but also with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamic of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that the addition of another scanner to the training set leads to shortcut learning. The code used in this study is available at https://github.com/TheoSourget/dataset_diversity_evaluation

Keywords: dataset diversity, diversity metrics, classification models, chest X-ray, MorphoMNIST, PadChest, shortcut learning, downstream-task performance

222. ❌ Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Authors: Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

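Scoring rationale: The paper focuses on inference acceleration for native unified multimodal models and proposes the FlashU framework, which combines task-specific network pruning, dynamic layer skipping, diffusion head caching, and dynamic token pruning. Relevance to the keyword list: 1) it is highly relevant only to "Speculative Decoding OR Inference Acceleration" (8 points), because the core contribution is an inference acceleration framework; 2) the other keywords mainly concern large language model techniques, training methods, alignment, agent systems, and the like, whereas this paper accelerates multimodal models (image generation and understanding) and does not touch those techniques; 3) although the research background notes that applications of large models may receive partial credit, the paper explicitly targets "native unified multimodal models" rather than large language models and uses none of the specific techniques in the keyword list, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper targets the computational bottleneck of native unified multimodal models on generation and understanding tasks and proposes FlashU, a training-free, task-aware acceleration framework. Using task-specific network pruning, dynamic layer skipping, diffusion head caching, and dynamic token pruning, it achieves a 1.78× to 2.01× inference speedup while maintaining performance.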

Abstract

Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task’s demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78× to 2.01× inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.

Keywords: unified multimodal models, inference acceleration, task-aware optimization, network pruning, dynamic layer skipping, diffusion head cache, dynamic token pruning, computational efficiency

223. ❌ Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Authors: Kim Ouan, Noémie Moreau, Katarzyna Bozek Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

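Scoring rationale: The paper addresses tortuosity grading of corneal nerve fibers in medical imaging by fine-tuning a self-supervised pretrained DINO model. It is unrelated to most keywords because it does not involve large language models, reasoning techniques, alignment, agents, and so on. It is relevant to three keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the paper transfers ImageNet-pretrained features to a new domain but does not involve continual pre-training or domain adaptation techniques; 2) 'Post-training OR Supervised Fine-tuning OR SFT' (8 points): the core of the paper is improving model performance through fine-tuning; 3) 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): it is an AI application in bioinformatics/medical imaging. All other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The paper fine-tunes a self-supervised pretrained DINO model to grade corneal nerve fiber tortuosity in in vivo confocal microscopy images without segmentation maps, surpassing existing methods in accuracy and sensitivity.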

Abstract

The tortuosity of corneal nerve fibers is used as an indicator for different diseases. Current state-of-the-art methods for grading the tortuosity heavily rely on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

Keywords: self-supervised learning, medical imaging, confocal microscopy, tortuosity grading, fine-tuning, DINO, corneal nerve fibers, segmentation-free

224. ❌ Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels

Authors: Victor Wåhlstrand, Jennifer Alvén, Ida Häggström Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15267v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

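Scoring rationale: The paper focuses on object detection in medical images and proposes a framework that exploits labels already available at inference time (exemplars), building on diffusion-based detection methods to obtain a training-free performance gain. The keywords all relate directly to large language models, the technical principles of deep learning, or specific AI techniques (MoE, RLHF, RAG, and so on), whereas this paper applies diffusion models from computer vision to medical image detection and has no direct connection to them. The only weakly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper involves medical image analysis (arguably a peripheral application of bioinformatics or AI for Science), but this is not its core content, hence 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes 'exemplar diffusion', a training-free framework that uses labels available at inference time to improve object detection in medical images. Experiments show an across-the-board increase in average precision and recall, along with robustness to exemplar label quality.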

Abstract

We present a framework to take advantage of existing labels at inference, called exemplars, in order to improve the performance of object detection in medical images. The method, exemplar diffusion, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits openly available online: https://github.com/waahlstrand/ExemplarDiffusion

Keywords: exemplar diffusion, medical object detection, diffusion methods, opportunistic labels, training-free approach, predictive uncertainty, medical images, average precision

225. ❌ IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning

Authors: Konstantinos Almpanakis, Anna Kreshuk Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

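Scoring rationale: The paper focuses on preventing representation collapse in self-supervised representation learning and proposes the IConE framework to handle small-batch training. The keywords all relate to large-model-specific techniques such as large language models, model training techniques, inference optimization, alignment, and agent systems, whereas this paper studies a general self-supervised representation learning framework and does not involve large language models or innovations in the underlying principles of deep learning. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper runs experiments on 2D and 3D biomedical modalities, an application of AI in science, but this is not the core contribution, hence 5 points (some relevance). The other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes IConE, a framework that decouples representation-collapse prevention from the training batch size, addressing collapse in small-batch self-supervised learning, and validates it on biomedical data.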

Abstract

Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction – via negative sampling or statistical regularization – to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.

Keywords: Self-supervised learning, Representation collapse, Joint-Embedding Architectures, Batch size independence, Instance-Contrasted Embeddings, Biomedical data, Class imbalance, Intrinsic dimensionality

226. ❌ HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

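Scoring rationale: The paper focuses on hallucination detection for vision-language models (VLMs) in image captioning and has no direct connection to most techniques in the keyword list (LLM architectures, training methods, inference optimization, and so on). The only highly relevant keyword is "Hallucination Mitigation OR Factuality OR Truthfulness", because the paper's core is evaluating and reducing hallucinations in VLM-generated captions, which maps directly onto hallucination mitigation and factuality. No other keyword is mentioned or implied in the title or abstract, so the rest have a relevance of 0.

!!! tip deepseek-chat TL;DR

The study introduces HalDec-Bench, a benchmark for systematically evaluating how well vision-language models detect hallucinations in image captions. It finds that detectors tend to judge sentences at the beginning of a response as correct, and that using strong VLMs as filters can substantially reduce dataset noise.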

Abstract

Hallucination detection in captions (HalDec) assesses a vision-language model’s ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.

Keywords: hallucination detection, vision-language models, image captioning, benchmark, dataset curation, multimodal alignment, evaluation, VLM performance

227. ❌ Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Authors: Yao Gu, Xiaohao Xu, Yingna Wu Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15237v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

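Scoring rationale: The paper proposes a physics-informed instruction tuning framework for physics-grounded anomaly detection with vision-language models (VLMs). Its core innovation is to encode physical priors (object properties, motion paradigms, dynamic constraints) into structured prompts delivered through multi-turn dialogue, strengthening the model's understanding of and reasoning about dynamic anomalies. The paper is highly relevant to the following keywords: 1) 'Instruction Tuning' (10 points): the core method is a 'physics-informed instruction tuning framework'; 2) 'AI for Science' (10 points): the paper applies AI in a scientific domain, specifically physics-grounded anomaly detection; 3) 'Large Language Models' (8 points): the paper builds on VLMs, which fall under foundation models; 4) 'Post-training' (8 points): instruction tuning is a post-training method; 5) 'Chain of Thought' and 'System 2 Thinking' (8 points each): multi-turn dialogue decomposes causal reasoning into incremental steps, reflecting multi-step and in-depth reasoning. 'Explainable AI' (5 points) has some relevance because the paper mentions 'causal explanations', but this is not central. The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the weak causal understanding of vision-language models in physics-grounded anomaly detection, the paper proposes a physics-informed instruction tuning framework that encodes physical priors through multi-turn dialogue. It reaches 96.7% AUROC for video-level detection on the Phys-AD benchmark, substantially outperforming the prior state of the art (66.9%).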

Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection–substantially outperforming prior SOTA (66.9%)–and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

Keywords: Vision-Language Models, Physics-informed instruction tuning, Anomaly detection, Causal reasoning, Multi-turn dialogues, Dynamic constraints, Phys-AD benchmark, AUROC

228. ❌ Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift

Authors: Wooseok Lee, Jin Mo Yang, Saewoong Bahk, Hyung-Sin Kim Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

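Scoring rationale: The paper focuses on out-of-distribution detection in deep learning systems, specifically test-time online detection under covariate shift. It involves conventional deep learning techniques such as feature-space analysis and prototype tracking, but does not touch large language models, the technical principles of large models, large-model applications, or AI for Science. All keywords relate to large models, whereas this paper studies the reliability of general deep learning systems and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes DART, a test-time online out-of-distribution detection method that dynamically tracks dual prototypes to recover the drifting discriminative axis, substantially improving detection performance under covariate shift.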

Abstract

For reliable deployment of deep-learning systems, out-of-distribution (OOD) detection is indispensable. In the real world, where test-time inputs often arrive as streaming mixtures of in-distribution (ID) and OOD samples under evolving covariate shifts, OOD samples are domain-constrained and bounded by the environment, and both ID and OOD are jointly affected by the same covariate factors. Existing methods typically assume a stationary ID distribution, but this assumption breaks down in such settings, leading to severe performance degradation. We empirically discover that, even under covariate shift, covariate-shifted ID (csID) and OOD (csOOD) samples remain separable along a discriminative axis in feature space. Building on this observation, we propose DART, a test-time, online OOD detection method that dynamically tracks dual prototypes – one for ID and the other for OOD – to recover the drifting discriminative axis, augmented with multi-layer fusion and flip correction for robustness. Extensive experiments on a wide range of challenging benchmarks, where all datasets are subjected to 15 common corruption types at severity level 5, demonstrate that our method significantly improves performance, yielding 15.32 percentage points (pp) AUROC gain and 49.15 pp FPR@95TPR reduction on ImageNet-C vs. Textures-C compared to established baselines. These results highlight the potential of the test-time discriminative axis tracking for dependable OOD detection in dynamically changing environments.

Keywords: out-of-distribution detection, covariate shift, test-time adaptation, dual prototypes, discriminative axis tracking, deep learning reliability, online OOD detection, feature space analysis

Authors: Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu, Xiao Zhang, Yang Li, Songtao Liu, Miles Yang, Yu Shi, Zhao Zhong, Liefeng Bo Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

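Scoring rationale: The HYDRA paper focuses on a unified architecture for multimodal vision models. It proposes a progressive ViT design that evolves from generation to understanding (HYDRA-TOK) and a native unified framework (HYDRA); its core contributions are unified visual representation and architectural innovation. All scoring keywords target large language models (LLMs) and related techniques (MoE, alignment, reasoning, agents, and so on), whereas this paper studies a purely visual multimodal model and does not involve language-model techniques, LLM training methods, inference optimization, or agent systems. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the representation gap between visual understanding and generation in unified multimodal models. It proposes the HYDRA-TOK progressive ViT architecture and the HYDRA native unified framework, and reports state-of-the-art results on several visual reconstruction, generation, and understanding benchmarks.

Abstract translation (Chinese)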

统一多模态模型难以弥合视觉理解所需的抽象表征与生成所需的精细基元之间的根本性差距。现有方法通常通过采用解耦编码器、在变分自编码器之上堆叠表征编码器或利用离散量化进行折中，但这些方法往往会破坏信息连贯性并导致优化冲突。为此，我们提出HYDRA-TOK——一种基于“视觉建模应从生成演进至理解”理念而设计的表征协调纯视觉Transformer（ViT）。HYDRA-TOK将标准主干网络重构为渐进式学习器，其从捕获结构保持基元的生成视觉Transformer（Gen-ViT）逐步过渡至用于语义编码的语义视觉Transformer（Sem-ViT）。关键的是，这一过渡通过生成-语义瓶颈（Generation-Semantic Bottleneck, GSB）实现：该模块将特征压缩至低维空间以过滤噪声从而实现鲁棒合成，随后恢复维度以赋能复杂语义理解。基于此架构，我们提出了HYDRA——一个在单一参数空间内原生统一感知与生成的本征融合框架。大量实验证实HYDRA达到了新的最优性能：它在视觉重建任务中创下基准记录（rFID 0.08），并在GenEval（0.86）、DPG-Bench（86.4）和WISE（0.53）上取得顶级生成性能，同时在八项具有挑战性的理解基准测试中平均超越先前原生统一多模态模型10.0个点。

Abstract

Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoders atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art. It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.

Keywords: Unified Multimodal Models, Visual Understanding, Visual Generation, Representation Harmonization, ViT Architecture, Generation-Semantic Bottleneck, Native Unified Framework, State-of-the-art Performance

230. ❌ Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

Authors: Sosuke Yamao, Natsuki Miyahara, Yuankai Qi, Shun Takeuchi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

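Scoring rationale: The paper studies a visual compression and memory feedback framework for long-term video understanding. Its core concerns the use of large multimodal models for video understanding, so it has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The other keywords, such as MoE, SLMs, scaling laws, training techniques (pre-training, fine-tuning, alignment, etc.), inference optimization, agents, model compression, and AI for science, are neither mentioned in the abstract nor directly relevant, so they are scored 0.

!!! tip deepseek-chat TL;DR

For long-term video understanding, the paper proposes a question-guided visual compression framework with memory feedback (QViC-MF); its iterative compression and memory feedback mechanism yields substantial gains on several benchmarks.

Abstract translation (Chinese)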

在大型多模态模型的长视频理解研究背景下，已有诸多框架被提出。尽管基于Transformer的视觉压缩器和记忆增强方法常被用于处理长视频，但它们通常独立压缩每一帧，因此在需要理解完整事件的任务（如MLVU和VNBench中的时序排序任务）上表现欠佳。这促使我们重新思考从感知到记忆的传统单向处理范式，转而建立一种反馈驱动的机制，使得存储在上下文记忆中的历史视觉信息能够持续优化实时感知过程。为此，我们提出了问题引导的记忆反馈视觉压缩框架（Question-guided Visual Compression with Memory Feedback, QViC-MF），用于长视频理解任务。其核心是问题引导的多模态选择性注意力模块（Question-guided Multimodal Selective Attention, QMSA），该模块学习从当前视频片段及记忆中的历史相关帧中筛选并保留与给定问题相关的视觉信息。压缩器与记忆反馈机制在整个视频的每个片段上迭代运行。这种简洁而高效的设计在长视频理解任务上带来了显著的性能提升。大量实验表明，我们的方法相较于当前最优方法取得了显著进步：在MLVU测试集上提升6.1%，在LVBench上提升8.3%，在VNBench Long上提升18.3%，在VideoMME Long上提升3.7%。代码将公开发布。

Abstract

In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be released publicly.

Keywords: long-term video understanding, large multimodal models, visual compression, memory feedback, question-guided, multimodal selective attention, temporal ordering, performance improvement

231. ❌ DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Authors: Zhengxu He, Jun Li, Zhijian Wu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

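Scoring rationale: The paper studies knowledge distillation from vision-language models (VLMs) into lightweight classifiers, which belongs to computer vision and model compression. All scoring keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, agents, and so on), whereas the paper focuses on VLMs and does not involve LLMs or those techniques. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes Distillation with Adaptive Intermediate Teacher transfer (DAIT), which addresses knowledge transfer from general-purpose vision-language models to lightweight fine-grained visual classifiers and reports significant gains on several benchmark datasets.

Abstract translation (Chinese)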

大规模视觉语言模型（VLMs）编码了丰富的多模态语义信息，这对细粒度视觉分类（FGVC）极具价值。然而，其高昂的计算成本阻碍了在资源受限环境中的实际部署。尽管知识蒸馏有助于将VLMs的能力迁移至轻量级分类器，但传统的蒸馏机制——直接从通用VLM迁移到紧凑的学生模型——常因严重的架构失准及引入任务无关信息而导致次优结果。为缓解这一局限，本研究提出自适应中间教师迁移蒸馏法（DAIT），以促进从VLM到轻量级学生模型的自适应知识迁移。DAIT引入了一个可训练的中间教师，该教师在目标细粒度任务的明确监督下，学习迁移冻结的VLM表征。此中间教师自适应地增强判别性视觉线索，从而产生紧凑且与任务对齐的知识，可被可靠地蒸馏至轻量级模型中。在多个FGVC基准数据集上使用不同学生架构进行的广泛评估表明，我们的方法在FGVC-Aircraft和CUB-200-2011数据集上分别实现了12.63%和8.34%的性能提升，确立了DAIT作为从通用VLM迁移至可部署细粒度识别模型的一种原则性范式。

Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation helps transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.

Keywords: Vision-Language Models, Knowledge Distillation, Fine-grained Visual Categorization, Lightweight Classifiers, Intermediate Teacher Transfer, Model Compression, Adaptive Knowledge Transfer, FGVC
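For readers unfamiliar with the distillation step that DAIT builds on, the following is a minimal sketch of the standard temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It is the generic textbook formulation, not the loss used in the paper, which the abstract does not specify; the function names and the default `temperature` are assumptions.

```python
import math

def log_softmax(logits, temperature=1.0):
    # Numerically stable log-softmax of logits / temperature.
    scaled = [l / temperature for l in logits]
    mx = max(scaled)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scaled))
    return [s - log_z for s in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by temperature**2 as in the common formulation.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl
```

In a DAIT-like setup, the teacher logits would presumably come from the trainable intermediate teacher rather than from the frozen VLM directly, but that wiring is an assumption here.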

232. ❌ Vision-Language Model Based Multi-Expert Fusion for CT Image Classification

Authors: Jianfa Bai, Kejin Lu, Runtian Yuan, Qingqiu Li, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15154v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

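Scoring rationale: The paper focuses on medical image (CT) classification, specifically COVID-19 detection, using a multi-expert fusion framework based on a vision-language model (MedSigLIP). Its content is unrelated to the vast majority of keywords (which concern large-model principles, training methods, inference optimization, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the work applies AI to biomedicine (medical image analysis) and is highly relevant to 'AI for Science', so it is scored 10. No other keyword is touched on, so the rest are scored 0.

!!! tip deepseek-chat TL;DR

The study proposes a vision-language-model-based multi-expert fusion framework for robust COVID-19 classification on multi-source CT images, and reports strong validation results (for example, an AUC of up to 0.9864).

Abstract translation (Chinese)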

在多机构环境下，由于显著的源偏移、源不平衡及隐藏的测试源身份，从胸部CT中稳健检测COVID-19仍面临挑战。本研究提出一种三阶段源感知多专家框架，用于多源COVID-19 CT分类。首先，我们通过结合原始CT体积与肺部提取CT体积，构建一个肺感知三维专家模型进行体积分类。其次，我们开发了两个基于MedSigLIP的专家模块：一个切片级表征与概率学习模块，以及一个基于Transformer的切片间上下文建模模块，用于捕捉跨切片依赖关系。第三，我们训练一个源分类器来预测每个测试扫描的潜在源身份。通过利用预测的源信息，我们基于不同专家进行模型融合与投票。在覆盖全部四个源的验证集上，第一阶段模型取得了最佳宏观F1值0.9711、准确率（ACC）0.9712和曲线下面积（AUC）0.9791。第二阶段a模块与第二阶段b模块分别达到最佳AUC分数0.9864和0.9854。第三阶段源分类器获得0.9107准确率与0.9114 F1值。这些结果表明，源感知专家建模与分层投票机制为异构多源条件下的稳健COVID-19 CT分类提供了有效解决方案。

Abstract

Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. The Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.

Keywords: CT image classification, COVID-19 detection, multi-expert fusion, vision-language model, MedSigLIP, source-aware, robust classification, multi-source
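The source-aware fusion step mentioned in the abstract could look roughly like the sketch below. The abstract does not give the fusion rule, so the weighted average, the per-source weight table, the expert names, and the 0.5 threshold are all assumptions chosen only to show how a predicted source can select how experts are combined.

```python
def fuse_experts(expert_probs, predicted_source, source_weights, threshold=0.5):
    """Combine per-expert COVID-19 probabilities using weights chosen by source.

    expert_probs:     {expert_name: probability} for one scan
    predicted_source: output of a (hypothetical) source classifier
    source_weights:   {source: {expert_name: weight}}, an assumed lookup table
    """
    weights = source_weights[predicted_source]
    total = sum(weights[name] for name in expert_probs)
    fused = sum(weights[name] * p for name, p in expert_probs.items()) / total
    return fused, fused >= threshold

# Hypothetical weights: source "A" trusts the inter-slice context expert most.
example_weights = {
    "A": {"lung_3d": 1.0, "slice_wise": 1.0, "inter_slice": 2.0},
    "B": {"lung_3d": 2.0, "slice_wise": 1.0, "inter_slice": 1.0},
}
```

A call such as `fuse_experts({"lung_3d": 0.2, "slice_wise": 0.9, "inter_slice": 0.8}, "A", example_weights)` returns the fused probability together with the thresholded decision.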

233. ❌ TextOVSR: Text-Guided Real-World Opera Video Super-Resolution

Authors: Hua Chang, Xin Xu, Wei Liu, Jiayi Wu, Kui Jiang, Fei Ma, Qi Tian Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15153v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

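Scoring rationale: The paper focuses on video super-resolution, specifically real-world degradation in opera videos. Although it uses a text-guided method, the 'text' here refers to textual prompts that describe the degradation process and the video content; it is not an innovation based on large language models (LLMs) or deep learning principles. The core of the paper is image and video processing in computer vision, involving conventional CV techniques such as generative adversarial networks (GANs) and feature fusion, and is unrelated to the topics in the scoring keyword list such as large models, deep learning principles, and AI for Science. None of the keywords appear in the title or abstract, nor are the related concepts implied.

!!! tip deepseek-chat TL;DR

For classic opera videos whose visual quality suffers from the limitations of early filming equipment and long-term storage degradation, the paper proposes a text-guided dual-branch opera video super-resolution network (TextOVSR). It guides super-resolution with degradation-descriptive and content-descriptive text and outperforms existing methods on the OperaLQ benchmark.

Abstract translation (Chinese)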

许多经典歌剧视频因早期拍摄设备的局限及存储过程中的长期劣化，呈现出较差的视觉质量。尽管真实世界视频超分辨率技术近年来取得了显著进展，但直接将现有方法应用于劣化歌剧视频仍面临挑战。主要困难体现在两方面：首先，准确建模真实世界劣化过程十分复杂——经典退化核的简单组合难以捕捉真实的噪声分布，而从外部数据集中提取真实噪声块的方法易因风格失配引入视觉伪影。其次，当前仅依赖退化图像特征的真实世界视频超分辨率方法，因缺乏高层语义指导，难以重建逼真细致的纹理。为解决这些问题，我们提出了一种文本引导的双分支歌剧视频超分辨率网络，通过引入两类文本提示来指导超分辨率过程。具体而言，基于退化过程生成的退化描述文本被整合至负向分支以约束解空间；同时，内容描述文本被融入正向分支及我们提出的文本增强判别器中，为增强纹理重建提供语义指导。此外，我们设计了退化鲁棒特征融合模块，在促进跨模态特征融合的同时抑制退化干扰。在自建的OperaLQ基准测试上的实验表明，本文方法在定性与定量评估上均优于现有先进技术。代码已发布于https://github.com/ChangHua0/TextOVSR。

Abstract

Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.

Keywords: Video Super-Resolution, Real-World Degradation, Text-Guided, Opera Videos, Degradation-Descriptive Text, Content-Descriptive Text, Text-Enhanced Discriminator, Degradation-Robust Feature Fusion

234. ❌ SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, Jason Kuen Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15150v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

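Scoring rationale: The paper focuses on discrete image generation and proposes a new training objective, SNCE, for optimizing generative models with large VQ codebooks. Although it is an application of deep learning to computer vision, its content has no direct connection to any scoring keyword (which mainly target LLM techniques, training methods, inference optimization, alignment, agent systems, and so on). The paper does not cover language models, MoE, quantization, inference acceleration, alignment techniques, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a new training objective, SNCE, that builds a soft categorical distribution to optimize discrete image generation models with large VQ codebooks, significantly improving convergence speed and generation quality.

Abstract translation (Chinese)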

近期离散图像生成领域的研究进展表明，扩大VQ码本规模能显著提升重建保真度。然而，使用大型VQ码本训练生成模型仍面临挑战，通常需要更大的模型规模和更长的训练周期。本工作提出随机邻域交叉熵最小化（Stochastic Neighbor Cross Entropy Minimization, SNCE），这是一种新颖的训练目标函数，旨在解决大型码本离散图像生成器的优化难题。SNCE不再使用硬性独热目标监督模型，而是构建一组相邻标记上的软分类分布。每个标记被分配的概率与其编码嵌入和真实图像嵌入之间的邻近度成正比，从而促使模型在量化嵌入空间中捕获具有语义意义的几何结构。我们在类别条件ImageNet-256生成、大规模文本到图像合成以及图像编辑任务上进行了广泛实验。结果表明，与标准交叉熵目标相比，SNCE显著提升了收敛速度与整体生成质量。

Abstract

Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.

Keywords: discrete image generation, VQ codebook, training objective, SNCE, image synthesis, convergence speed, generation quality, embedding space
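The soft-target idea in the SNCE abstract can be sketched as follows. The abstract only says that each neighboring token's probability is proportional to the proximity between its code embedding and the ground-truth image embedding; the specific choices below (squared Euclidean distance, an exponential kernel with a `temperature`, and restricting to the `k` nearest codes) are assumptions, as are all function names.

```python
import math

def soft_targets(image_emb, codebook, k=3, temperature=1.0):
    # Squared distance from the ground-truth embedding to every code.
    dists = [sum((c - e) ** 2 for c, e in zip(code, image_emb)) for code in codebook]
    # Keep only the k nearest codes (the "neighboring tokens").
    neighbors = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    d_min = dists[neighbors[0]]
    # Closer codes get larger weights; subtracting d_min avoids underflow.
    weights = {i: math.exp(-(dists[i] - d_min) / temperature) for i in neighbors}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(codebook))]

def soft_cross_entropy(logits, targets):
    # Cross-entropy between a soft target distribution and softmax(logits).
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return -sum(t * (l - log_z) for l, t in zip(logits, targets))
```

With `k=1` the target collapses to the usual one-hot vector, so this sketch reduces to standard cross-entropy in that case.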

235. ❌ Clinical Priors Guided Lung Disease Detection in 3D CT Scans

Authors: Kejin Lu, Jianfa Bai, Qingqiu Li, Runtian Yuan, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15143v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

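Scoring rationale: The paper focuses on medical image analysis (lung disease detection in 3D CT scans) and uses deep learning to address class imbalance. Its content is unrelated to the vast majority of keywords (which concern large-model principles, training methods, inference optimization, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the work applies AI to biomedicine (medical image analysis) and so falls under 'AI for Science', but that is not its core innovation (which is the gender-aware two-stage classification framework), so it is scored 5 (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes a gender-aware two-stage deep learning framework for detecting lung diseases in 3D CT scans, aimed at class imbalance in medical imaging data; experiments show improved recognition of minority disease categories such as squamous cell carcinoma.

Abstract translation (Chinese)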

胸部CT扫描的肺部疾病精确分类在计算机辅助诊断系统中具有重要作用。然而，医学影像数据集常面临严重的类别不平衡问题，这可能显著降低深度学习模型的性能，尤其对于少数疾病类别。为解决这一问题，我们提出了一种性别感知的两阶段肺部疾病分类框架。该方法将性别信息明确整合到疾病识别流程中。在第一阶段，训练一个性别分类器从CT扫描中预测患者性别。在第二阶段，输入的CT图像被路由至相应的性别特异性疾病分类器进行最终疾病预测。这一设计使模型能更好地捕捉与性别相关的影像特征，并缓解不平衡数据分布的影响。实验结果表明，所提出的方法提升了少数疾病类别（特别是鳞状细胞癌，squamous cell carcinoma）的识别性能，同时在其他类别上保持了有竞争力的表现。

Abstract

Accurate classification of lung diseases from chest CT scans plays an important role in computer-aided diagnosis systems. However, medical imaging datasets often suffer from severe class imbalance, which may significantly degrade the performance of deep learning models, especially for minority disease categories. To address this issue, we propose a gender-aware two-stage lung disease classification framework. The proposed approach explicitly incorporates gender information into the disease recognition pipeline. In the first stage, a gender classifier is trained to predict the patient’s gender from CT scans. In the second stage, the input CT image is routed to a corresponding gender-specific disease classifier to perform final disease prediction. This design enables the model to better capture gender-related imaging characteristics and alleviate the influence of imbalanced data distribution. Experimental results demonstrate that the proposed method improves the recognition performance for minority disease categories, particularly squamous cell carcinoma, while maintaining competitive performance on other classes.

Keywords: lung disease detection, 3D CT scans, class imbalance, gender-aware classification, two-stage framework, deep learning, medical imaging, computer-aided diagnosis
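The two-stage routing described in the abstract above amounts to a simple dispatch, sketched below. The classifiers here are toy stand-ins with made-up features, thresholds, and the placeholder label `"other"`; only the control flow (predict gender first, then call the matching disease classifier) follows the abstract.

```python
def two_stage_predict(scan, gender_classifier, disease_classifiers):
    """Route a scan to the disease classifier that matches its predicted gender."""
    gender = gender_classifier(scan)              # stage 1: predict gender
    disease = disease_classifiers[gender](scan)   # stage 2: gender-specific model
    return gender, disease

# Toy stand-ins (hypothetical): real models would be trained networks on CT volumes.
def toy_gender_classifier(scan):
    return "female" if scan["feature"] < 0.5 else "male"

toy_disease_classifiers = {
    "female": lambda scan: "squamous cell carcinoma" if scan["intensity"] > 0.8 else "other",
    "male": lambda scan: "squamous cell carcinoma" if scan["intensity"] > 0.6 else "other",
}
```

Because each branch has its own classifier, the two branches can use different decision boundaries, which is the property the abstract relies on for minority classes.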

236. ❌ Low-light Image Enhancement with Retinex Decomposition in Latent Space

Authors: Bolun Zheng, Qingshan Lei, Quan Chen, Qianyu Zhang, Kainan Yu, Xu Jia, Lingyu Zhu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

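Scoring rationale: The paper focuses on low-light image enhancement in computer vision and proposes a Transformer model based on Retinex theory. Although it involves deep learning, its content is unrelated to every scoring keyword (all of which concern large-model principles, training methods, inference optimization, alignment, agent systems, and so on). The paper does not mention large models, language models, AI-for-science applications, or related technical concepts.

!!! tip deepseek-chat TL;DR

The paper proposes a low-light image enhancement method based on Retinex theory and a Transformer; using latent-space decomposition and component refinement, it achieves competitive performance on several benchmark datasets.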

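Abstract (translation)

Retinex theory provides a principled basis for low-light image enhancement and has inspired many learning-based methods that incorporate its ideas. However, existing methods are limited in how accurately they decompose the reflectance and illumination components. We therefore propose a Retinex-Guided Transformer (RGT), a two-stage framework with a decomposition stage and an enhancement stage. First, we propose a latent-space decomposition strategy to separate the reflectance and illumination components. By introducing a log transform and a 1-pixel offset, we turn the inherently multiplicative relationship into an additive one, which improves the stability and accuracy of the decomposition. We then build a U-shaped component refiner that contains the proposed guidance fusion Transformer block. The refiner refines the reflectance component to preserve texture detail and optimizes the illumination distribution, effectively converting low-light inputs into normal-light images. Experiments on four benchmark datasets show that our method achieves competitive performance on low-light enhancement and a more stable training process.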

Abstract

Retinex theory provides a principled foundation for low-light image enhancement, inspiring numerous learning-based methods that integrate its principles. However, existing methods exhibit limitations in accurately decomposing reflectance and illumination components. To address this, we propose a Retinex-Guided Transformer (RGT) model, which is a two-stage model consisting of decomposition and enhancement phases. First, we propose a latent space decomposition strategy to separate reflectance and illumination components. By incorporating the log transformation and 1-pixel offset, we convert the intrinsically multiplicative relationship into an additive formulation, enhancing decomposition stability and precision. Subsequently, we construct a U-shaped component refiner incorporating the proposed guidance fusion transformer block. The component refiner refines the reflectance component to preserve texture details and optimizes the illumination distribution, effectively transforming low-light inputs to normal-light counterparts. Experimental evaluations across four benchmark datasets validate that our method achieves competitive performance in low-light enhancement and a more stable training process.

Keywords: low-light image enhancement, Retinex theory, latent space decomposition, Transformer, reflectance component, illumination component, U-shaped component refiner, benchmark datasets
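
The log-domain step described in the abstract can be illustrated with a minimal sketch. The Retinex model treats an observed intensity as the product of reflectance and illumination, so taking logarithms turns the product into a sum. The helper names below are hypothetical, and reading the "1-pixel offset" as adding 1 intensity level before the logarithm (so that log(0) never occurs) is an assumption; the paper's exact formulation is not given in this report.

```python
import math

def to_log_domain(value: float, offset: float = 1.0) -> float:
    """Map a non-negative intensity to the log domain.

    The offset (assumed here to be 1 intensity level) keeps the
    logarithm finite when the intensity is zero.
    """
    return math.log(value + offset)

def from_log_domain(log_value: float, offset: float = 1.0) -> float:
    """Invert to_log_domain."""
    return math.exp(log_value) - offset

# Retinex model for one pixel: observed = reflectance * illumination.
reflectance = 0.6      # hypothetical value in (0, 1]
illumination = 40.0    # hypothetical value on a 0-255 scale
observed = reflectance * illumination

# Without an offset, the product becomes a sum in the log domain.
log_sum = math.log(reflectance) + math.log(illumination)
assert math.isclose(math.log(observed), log_sum)

# With the offset, a completely dark pixel no longer breaks the transform,
# and the mapping remains invertible.
dark = to_log_domain(0.0)
restored = from_log_domain(to_log_domain(observed))
assert math.isclose(restored, observed)
```

Note that once the offset is applied to the observed intensity, the additive identity holds only approximately for the original components, which is one reason the offset handling is flagged as an assumption here.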

Authors: Hainuo Wang, Mingjia Li, Xiaojie Guo Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

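Scoring rationale: The paper focuses on diffusion models and generative AI in computer vision, proposing WiT, a method for disentangling pixel-space trajectories that uses pre-trained vision models to generate semantic waypoints guiding the diffusion process. All scoring keywords relate to large language models (LLMs), covering their principles, training methods, inference optimization, and applications, whereas this work studies a purely visual generative model (a diffusion transformer) and involves no language models or text processing. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address trajectory conflicts that lead to sub-optimal solutions in pixel-space diffusion models, the paper proposes Waypoint Diffusion Transformers (WiT), which disentangle generation trajectories using semantic waypoints produced from pre-trained vision models; on ImageNet 256×256, WiT outperforms existing pixel-space baselines and speeds up training convergence.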

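Abstract (translation)

Although recent flow matching models avoid the reconstruction bottleneck of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold leaves the optimal transport paths heavily entangled. This causes severe trajectory conflicts near intersections and yields sub-optimal solutions. Rather than sidestepping the problem with lossy latent representations, we untangle the pixel-space trajectories directly by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field through intermediate semantic waypoints projected from pre-trained vision models, splitting the optimal transport into prior-to-waypoint and waypoint-to-pixel segments and thereby decoupling the generation trajectories. Specifically, during iterative denoising a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. The waypoints then continuously condition the main diffusion transformer through the Just-Pixel AdaLN mechanism, steering it toward the next state and ultimately producing RGB pixels. Evaluation on ImageNet 256x256 shows that WiT surpasses strong pixel-space baselines and accelerates JiT training convergence by 2.2x. Code will be released at https://github.com/hainuo-wang/WiT.git.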

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.

Keywords: Diffusion Transformers, Pixel-space Generation, Trajectory Disentanglement, Waypoint Guidance, Optimal Transport, Semantic Waypoints, Image Generation, Vision Models

238. ❌ Context-Aware Sensor Modeling for Asynchronous Multi-Sensor Tracking in Stone Soup

Authors: Martin Vonheim Larsen, Kim Mathiassen Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15137v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

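Scoring rationale: The paper "Context-Aware Sensor Modeling for Asynchronous Multi-Sensor Tracking in Stone Soup" addresses multi-sensor target tracking, specifically context-aware modeling for asynchronous radar-lidar data fusion. Its core contribution is the DetectorContext abstraction, which improves the modeling of detection probability and clutter intensity in the open-source multi-target tracking framework Stone Soup. All of the scoring keywords relate directly to large language models (LLMs), deep learning principles, AI for Science applications (such as bioinformatics and cheminformatics), or associated training and inference techniques. The paper's subject matter (sensor tracking, probabilistic models, data fusion) is entirely different from the keyword topics (natural language processing, large-model architectures, alignment and fine-tuning, scientific AI applications, and so on), with no overlap. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the performance degradation caused by globally uniform observability assumptions in asynchronous multi-sensor tracking, the paper proposes a context-aware sensor modeling abstraction (DetectorContext) that implements state-dependent detection probability and clutter intensity in the Stone Soup framework; experiments show that it restores stable fusion and significantly improves the HOTA and GOSPA tracking metrics.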

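Abstract (translation)

Real-world multi-sensor tracking involves asynchronous sensors with partial coverage and heterogeneous detection performance. Although probabilistic tracking methods allow detection probability and clutter intensity to depend on state and sensing context, many practical frameworks enforce a globally uniform observability assumption. Under multi-rate, partially overlapping sensing, this simplification lets repeated non-detections from high-rate sensors erode tracks that are visible only to low-rate sensors, which can degrade fusion performance.

We propose DetectorContext, an abstraction for the open-source multi-target tracking framework Stone Soup. The abstraction evaluates detection probability and clutter intensity as state-dependent functions during hypothesis formation. This design integrates with existing probabilistic trackers without modifying their update equations. Experiments on asynchronous radar-lidar data show that context-aware modeling restores stable fusion performance and significantly improves the HOTA and GOSPA metrics without increasing false tracks.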

Abstract

Multi-sensor tracking in the real world involves asynchronous sensors with partial coverage and heterogeneous detection performance. Although probabilistic tracking methods permit detection probability and clutter intensity to depend on state and sensing context, many practical frameworks enforce globally uniform observability assumptions. Under multi-rate and partially overlapping sensing, this simplification causes repeated non-detections from high-rate sensors to erode tracks visible only to low-rate sensors, potentially degrading fusion performance. We introduce DetectorContext, an abstraction for the open-source multi-target tracking framework Stone Soup. DetectorContext exposes detection probability and clutter intensity as state-dependent functions evaluated during hypothesis formation. The abstraction integrates with existing probabilistic trackers without modifying their update equations. Experiments on asynchronous radar-lidar data demonstrate that context-aware modeling restores stable fusion and significantly improves HOTA and GOSPA performance without increasing false tracks.

Keywords: multi-sensor tracking, asynchronous sensors, context-aware modeling, detection probability, clutter intensity, probabilistic trackers, radar-lidar fusion, Stone Soup framework
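
The track erosion described in the abstract can be illustrated with a small numerical sketch. It uses the standard Bayesian update of a track's existence probability r after a scan with no detection, ignoring clutter: r' = r(1 - p_D) / (1 - r p_D). The function and variable names are hypothetical and are not Stone Soup's API; the probabilities and scan counts are made up for illustration.

```python
def missed_detection_update(existence: float, p_detect: float) -> float:
    """Posterior existence probability after a scan with no detection.

    Standard Bayes update, ignoring clutter:
    r' = r * (1 - p_D) / (1 - r * p_D)
    """
    return existence * (1.0 - p_detect) / (1.0 - existence * p_detect)

def run_scans(existence: float, p_detect: float, n_scans: int) -> float:
    """Apply the missed-detection update for n_scans consecutive scans."""
    for _ in range(n_scans):
        existence = missed_detection_update(existence, p_detect)
    return existence

# A target lies outside the coverage of a high-rate sensor, so that sensor
# reports 10 consecutive scans without a detection.
initial = 0.9

# Globally uniform assumption: p_D = 0.9 everywhere, even outside coverage.
uniform = run_scans(initial, p_detect=0.9, n_scans=10)

# Context-aware assumption: p_D = 0 where the sensor cannot see the target,
# so a missed detection carries no information about the track.
context_aware = run_scans(initial, p_detect=0.0, n_scans=10)

print(f"uniform p_D:       {uniform:.2e}")
print(f"context-aware p_D: {context_aware:.2f}")
```

Under the uniform assumption the existence probability collapses after a handful of scans, while the state-dependent detection probability leaves it unchanged, which is the behaviour the abstract attributes to context-aware modeling.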

239. ❌ A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements

Authors: Jan Andre Rudolph, Dennis Haitz, Markus Ulrich Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

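Scoring rationale: The paper studies a camera-to-robot calibration method for robotic vision systems, involving the fusion of laser-tracker and camera measurement modalities, pose estimation, and transformation computation. The content falls entirely within robotics, computer vision, and metrology, focusing on hardware calibration and measurement accuracy. All scoring keywords relate to large models, deep learning, AI principles, or AI-for-science applications, whereas this paper involves no AI, machine learning, or LLM techniques and has no application in AI for Science areas such as bioinformatics or cheminformatics. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new hand-eye calibration method for ground-observing mobile robots; by designing a referencing plate that combines laser-tracker 3D measurement with camera 2D imaging, it achieves sub-millimeter repeatability.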

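Abstract (translation)

This paper proposes a novel hand-eye calibration method for ground-observing mobile robots. Although mobile robots commonly carry cameras, the cameras are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. We design a referencing plate that combines two measurement modalities: laser-tracker 3D metrology and camera-based 2D imaging. The plate integrates reflector nests for laser-tracker pose acquisition and a camera calibration target observed by the robot-mounted camera. The calibration procedure estimates the plate pose, the plate-camera pose, and the robot pose, and then computes the robot-camera transformation. Experimental results indicate sub-millimeter repeatability.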

Abstract

A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are common, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera-based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.

Keywords: hand-eye calibration, mobile robots, camera calibration, laser tracker, pose estimation, robot-camera transformation, ground-observing measurement, sub-millimeter repeatability
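
The last step of the procedure, computing the robot-camera transformation from the three estimated poses, can be sketched as a chain of homogeneous transforms. This is a minimal sketch under assumed conventions: `T_a_b` denotes the pose of frame b expressed in frame a, the laser tracker defines the world frame, and all numeric poses are made up. The paper's actual frame conventions and estimation steps are not given in this report.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def translation(x: float, y: float, z: float) -> Matrix:
    """Pure translation as a 4x4 homogeneous transform."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def rot_z_90(x: float, y: float, z: float) -> Matrix:
    """Pose with a 90-degree rotation about z and origin at (x, y, z)."""
    return [[0.0, -1.0, 0.0, x],
            [1.0, 0.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

# Made-up poses in metres; the world frame is the laser tracker frame.
T_world_plate = translation(2.0, 1.0, 0.0)    # plate pose from the tracker
T_plate_camera = translation(0.0, 0.0, 0.5)   # camera pose from the target
T_world_robot = rot_z_90(1.5, 1.0, 0.0)       # robot pose from the tracker

# Chain: robot -> world -> plate -> camera.
T_robot_camera = matmul(invert_rigid(T_world_robot),
                        matmul(T_world_plate, T_plate_camera))

camera_in_robot = [row[3] for row in T_robot_camera[:3]]
print(camera_in_robot)
```

In a real calibration each of the three input poses would itself be an estimate with uncertainty, so repeatability would be assessed by repeating the chain over several acquisitions.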

240. ❌ Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors

Authors: Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He, Jincheng Dai, Li Song, Guo Lu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

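Scoring rationale: The paper focuses on image compression, using a pretrained video diffusion model (VDM) as a prior to achieve ultra-low-bitrate compression. It is unrelated to most keywords (LLMs, alignment, reasoning, and so on) and only weakly related to three: (1) "Pre-training" (it uses a pretrained VDM), 5 points; (2) "Quantization" (it involves bitrate compression), 5 points; (3) "Speculative Decoding" (it mentions decoding acceleration), 5 points. The remaining keywords have no direct connection.

!!! tip deepseek-chat TL;DR

The paper proposes a new approach to ultra-low-bitrate image compression that uses a pretrained video diffusion model; with an anchor frame and next-frame prediction, it achieves over 50% bitrate savings and a decoding speedup of up to 5×.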

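Abstract (translation)

This paper presents a new paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state in the decoding process: a compact anchor frame that preserves scene geometry and semantic layout while discarding high-frequency detail. We then reinterpret generative decoding as a virtual temporal transition from this anchor frame to the final reconstructed image. To model this progression, we use a pretrained video diffusion model (VDM) as a temporal prior: the anchor frame serves as the initial frame and the original image as the target frame, turning decoding into a next-frame prediction task. Compared with ULB-IC models based on image diffusion, our decoding starts from a visible, semantically faithful anchor frame, which improves both the fidelity and the realism of perceptual image compression. Extensive experiments show that our method performs well on both objective and subjective measures. On the CLIC2020 test set, our method achieves over 50% bitrate savings relative to DiffC on the LPIPS, DISTS, FID, and KID metrics, together with a significant decoding speedup of up to 5×. Code will be released later.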

Abstract

We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as temporal priors: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to 5×. Code will be released later.

Keywords: ultra-low-bitrate image compression, video diffusion model, anchor frame, next-frame prediction, decoding speedup, generative image compression, temporal priors, perceptual image compression

241. ❌ A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation

Authors: Nevrez Imamoglu, Ali Caglayan, Toru Kouyama Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15119v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

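Scoring rationale: The paper focuses on self-supervised pretraining for satellite imagery (SAR in particular) and the downstream semantic segmentation task, which places it in computer vision. It is unrelated to most large language model (LLM) keywords but highly relevant to "Pre-training" (it proposes the SAR-W-SimMIM method) and "Post-training" (fine-tuning for semantic segmentation), each rated 10. It has some connection to "AI for Science" (satellite image analysis is a scientific application), rated 8. All other keywords are unrelated and rated 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of semantic labeling and the high noise of synthetic aperture radar (SAR) imagery, the paper proposes the SAR-W-SimMIM self-supervised pretraining method and builds an ALOS-2 SAR dataset covering the Japan region; experiments show notable gains in semantic segmentation over random initialization.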

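Abstract (translation)

Masked autoencoders (MAE) and related methods have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) imagery remains limited because of the difficulty of semantic labeling and high noise levels. Building on our earlier work with SAR-W-MixMAE, which adds a SAR-specific intensity-weighted loss to standard MixMAE pretraining, this paper further proposes SAR-W-SimMIM, a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. The method aims to reduce the influence of speckle noise and extreme intensity values during self-supervised pretraining. We assess its effect on semantic segmentation by comparing against our earlier trial with SAR-W-MixMAE and against random initialization, and observe notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose particular challenges, especially when developing region-specific models. Imbalanced land cover distributions (for example, dominant areas of water, forest, or desert) can introduce bias that affects both pretraining and downstream tasks such as land cover segmentation. To address this, we built a SAR dataset focused on the Japan region from ALOS-2 single-channel (HH polarization) imagery, an initial step toward a national-scale foundation model. The dataset was used to pretrain a vision-transformer-based autoencoder, and the resulting encoder was fine-tuned with a task-specific decoder for semantic segmentation. Preliminary results show significant performance improvements over training from scratch with random initialization. In summary, this work offers guidance on processing and preparing ALOS-2 observations to build datasets that can be used effectively for self-supervised pretraining and for fine-tuning on downstream tasks such as semantic segmentation.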

Abstract

Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM, a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to processing and preparing ALOS2 observations to create a dataset that can be used for self-supervised pretraining of models and for fine-tuning on downstream tasks such as semantic segmentation.

Keywords: SAR, self-supervised pretraining, semantic segmentation, ALOS-2, vision transformer, satellite imagery, fine-tuning, dataset preparation

242. ❌ Sampling-guided exploration of active feature selection policies

Authors: Gabriel Bernardino, Anders Jonsson, Patrick Clarysse, Nicolas Duchateau Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15110v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

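Scoring rationale: The paper studies a reinforcement-learning-based active feature selection method for machine learning predictive models, aiming to balance feature acquisition cost against performance. All the given keywords relate directly to large language models, deep learning principles, or AI applications in science, whereas this paper addresses a traditional machine-learning feature selection problem and involves no large models, deep learning, AI for Science, or any of the listed techniques (MoE, RLHF, RAG, and so on). The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a reinforcement-learning-based active feature selection framework that handles larger datasets through a heuristic strategy and a regularization method, achieving better accuracy and more compact policies than existing methods on four binary classification datasets.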

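Abstract (translation)

Choosing the most suitable features for machine learning predictive models is challenging in terms of both performance and feature acquisition cost. In particular, global feature selection is limited because some features benefit only a subset of instances. In earlier work, we proposed a reinforcement learning method that, based on the instance-specific information already acquired, sequentially recommends which modality to acquire next so as to reach the best information/cost ratio. We modeled the problem as a Markov decision process in which the dimensionality of the state changes during the episode, which avoids data imputation, unlike existing methods. However, that method could handle only a small number of features because it considered every possible feature combination. This paper addresses these limitations with two improvements: (1) we extend the framework to larger datasets with a heuristic-based strategy that concentrates on the most promising feature combinations; (2) we introduce a post-fit regularization strategy that reduces the number of distinct feature combinations, yielding compact decision sequences. We tested the method on four binary classification datasets (one involving high-dimensional variables), the largest of which contains 56 features and 4,500 samples. We obtained better performance than state-of-the-art methods in terms of both accuracy and policy complexity.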

Abstract

Determining the most appropriate features for machine learning predictive models is challenging regarding performance and feature acquisition costs. In particular, global feature choice is limited given that some features will only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach to sequentially recommend which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process where the state’s dimensionality changes during the episode, avoiding data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.

Keywords: active feature selection, reinforcement learning, Markov Decision Process, heuristic strategy, post-fit regularization, binary classification, policy complexity, feature acquisition cost

Authors: Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

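Scoring rationale: The paper evaluates multimodal foundation models on structured data extraction from documents, with emphasis on small models of ≤4B parameters. It is therefore highly relevant (10 points) to "Large Language Models/Foundation Models" (the objects of evaluation), "Small Language Models/On-device AI" (the focus on ≤4B-parameter models), "Post-training/Supervised Fine-tuning" (the paper reports large gains from fine-tuning), and "Instruction Tuning/Alignment" (the paper identifies structured output compliance as an instruction-following deficit). Other keywords such as MoE, Scaling Laws, RAG, and CoT are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper introduces the VAREX benchmark for evaluating how well multimodal foundation models extract structured data from government forms; it finds that the main bottleneck for small models (≤4B parameters) is structured output compliance rather than extraction capability, and that layout-preserving text yields the largest accuracy gain.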

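Abstract (translation)

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating how well multimodal foundation models extract structured data from government forms. VAREX uses a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark contains 1,777 documents with 1,771 unique schemas across three structural categories, and each document is provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, which allows systematic analysis of how input format affects extraction accuracy, a capability that earlier benchmarks lack. We evaluate 20 models, from frontier proprietary models to small open models, with particular attention to models of ≤4B parameters suited to cost-sensitive and latency-constrained deployment. The results show that (1) below 4B parameters, structured output compliance, not extraction capability, is the main bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) lowers the scores of affected models by 45-65 percentage points; (2) extraction-specific fine-tuning of a 2B-parameter model yields a gain of +81 percentage points, showing that the instruction-following deficit can be addressed without scaling up the model; (3) layout-preserving text gives the largest accuracy gain (+3-18 percentage points), exceeding pixel-level visual cues; and (4) the benchmark discriminates most effectively between models in the 60-95% accuracy band. The dataset and evaluation code are publicly available.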

Abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy – a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance – not extraction capability – is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.

Keywords: multimodal foundation models, structured data extraction, benchmark evaluation, small language models, instruction following, fine-tuning, layout-preserving text, document understanding

244. ❌ PAKAN: Pixel Adaptive Kolmogorov-Arnold Network Modules for Pansharpening

Authors: Haoyu Zhang, Haojing Chen, Zhen Zhong, Liangjian Deng Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
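The table above can be read against the pass-mark formula given in the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below assumes that each row's score is weight × relevance, capped at the configured maximum of 15 per keyword. That row formula is an assumption that merely fits the rows shown; the function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores for one paper.

    rows: (weight, relevance) pairs, one per keyword.
    ASSUMPTION: row score = weight * relevance, capped per keyword.
    """
    return sum(min(w * r, max_per_keyword) for w, r in rows)


threshold = pass_mark(27.0)                    # about 26.6, as in the report
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]        # the 27 rows of the table above
total = paper_score(rows)
passed = total >= threshold
```

Because every row lists a weight of 0.0, the total under this assumption is 0.0 whatever the relevance values are, which matches the 0.0 / 26.6 shown for this paper.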

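Scoring rationale: The paper focuses on remote-sensing image fusion (pansharpening) and proposes a pixel-adaptive network module based on the Kolmogorov-Arnold Network (KAN). The keywords all concern large language models (LLMs) or general deep learning principles, whereas this work is an architecture innovation for a specific computer vision task and is unrelated to LLMs, MoE, alignment, reasoning, agents, and similar topics. The only possible connection is "AI for Science", since remote sensing is an earth-science application; because the paper does not explicitly emphasize scientific discovery or bio/cheminformatics, that keyword receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper addresses the problem that static activation functions limit nonlinear mapping capacity in remote-sensing pansharpening, proposes the Pixel Adaptive Kolmogorov-Arnold Network modules (PAKAN), and shows experimentally that they significantly improve network performance.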

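Abstract translation

Pansharpening aims to fuse the high-resolution spatial detail of panchromatic images with the rich spectral information of multispectral images. Existing deep neural networks for this task usually rely on static activation functions, which limits their ability to dynamically model the complex nonlinear mappings needed for optimal spatial-spectral fusion. Although the recently proposed Kolmogorov-Arnold Network (KAN) uses learnable activation functions, conventional KANs lack dynamic adaptability during inference. To overcome this limitation, we propose a Pixel Adaptive Kolmogorov-Arnold Network framework. Building on KAN, we design two adaptive variants: a 2D Adaptive KAN that generates spline summation weights across spatial dimensions, and a 1D Adaptive KAN that generates them across spectral channels. These two components are then assembled into the PAKAN 2to1 module for feature fusion and the PAKAN 1to1 module for feature refinement. Extensive experiments show that the proposed modules significantly improve network performance, demonstrating the effectiveness and superiority of pixel-adaptive activation for pansharpening.

Abstract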

Pansharpening aims to fuse high-resolution spatial details from panchromatic images with the rich spectral information of multispectral images. Existing deep neural networks for this task typically rely on static activation functions, which limit their ability to dynamically model the complex, non-linear mappings required for optimal spatial-spectral fusion. While the recently introduced Kolmogorov-Arnold Network (KAN) utilizes learnable activation functions, traditional KANs lack dynamic adaptability during inference. To address this limitation, we propose a Pixel Adaptive Kolmogorov-Arnold Network framework. Starting from KAN, we design two adaptive variants: a 2D Adaptive KAN that generates spline summation weights across spatial dimensions and a 1D Adaptive KAN that generates them across spectral channels. These two components are then assembled into PAKAN 2to1 for feature fusion and PAKAN 1to1 for feature refinement. Extensive experiments demonstrate that our proposed modules significantly enhance network performance, proving the effectiveness and superiority of pixel-adaptive activation in pansharpening tasks.

Keywords: Pansharpening, Kolmogorov-Arnold Network, Pixel Adaptive, Spatial-Spectral Fusion, Learnable Activation Functions, Remote Sensing, Image Fusion, Deep Neural Networks

245. ❌ Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC

Authors: Alice Natalina Caragliano, Giulia Farina, Fatih Aksu, Camillo Maria Caruso, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

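Scoring rationale: The paper studies a multimodal deep learning framework for predicting pathological response in non-small cell lung cancer, an application of AI in biomedicine. It is unrelated to most large-model keywords (such as MoE, RLHF, and quantization), which score 0. The main relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", which scores 10 because the paper is a bioinformatics/medical AI application. "Large Language Models OR LLMs OR Foundation Models" scores 5 because the abstract mentions "foundation model-based CT feature extraction", but this is a feature-extraction tool rather than the paper's core contribution.

!!! tip deepseek-chat TL;DR

The study proposes a multimodal deep learning framework that integrates foundation model-based CT feature extraction with a missing-aware architecture for clinical variables, and reports a multimodal model that consistently outperforms unimodal imaging and clinical baselines when predicting pathological response in non-small cell lung cancer under real-world conditions with limited and incomplete data.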

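Abstract translation

Major pathological response (pR) after neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer and is closely associated with improved survival. However, accurately predicting pR before surgery remains challenging, especially in real-world clinical settings where data are limited and clinical information is often incomplete. This study proposes a multimodal deep learning framework that addresses these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. The approach learns robustly from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. We use a weighted fusion mechanism to exploit the complementarity of the imaging and clinical modalities, and the resulting multimodal model consistently outperforms both the unimodal imaging baseline and the clinical baseline. These findings underline the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under real clinical conditions.

Abstract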

Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.

Keywords: multimodal deep learning, pathological response prediction, non-small cell lung cancer, foundation model-based feature extraction, missing-aware architecture, clinical variables, weighted fusion, small cohorts

246. ❌ The Good, the Better, and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning

Authors: Ana Dias, João Ribeiro Pinto, Hugo Proença, João C. Neves Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

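Scoring rationale: The paper focuses on face recognition in computer vision and proposes an attribute-aware method for learning face embeddings that improves recognition performance by jointly learning identity labels and facial attributes. Its content centres entirely on conventional deep learning applied to computer vision and does not touch large language models (LLMs), large-model techniques, large-model applications in other domains, or AI for Science. All scoring keywords relate to large models, deep learning principles, or scientific AI applications, whereas this paper is purely computer vision research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an attribute-aware face recognition architecture that improves the discriminability of face embeddings by jointly learning identity labels and facial attributes; experiments show that supervising with an identity-relevant subset of attributes and forcing the model to unlearn non-identity-related attributes significantly improves face verification performance.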

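Abstract translation

Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy for addressing these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing methods typically rely on heterogeneous, fixed attribute sets and implicitly assume that all attributes are equally important. This assumption is suboptimal, because attributes differ in their discriminative power for identity recognition and some may even introduce harmful biases. This paper proposes an attribute-aware face recognition architecture that jointly supervises the learning of face embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, so that their individual contributions can be decomposed and analyzed in a human-understandable way. Experiments on standard face verification benchmarks show that joint learning of identity and facial attributes improves the discriminability of face embeddings, and lead to two main conclusions: (i) supervision with an identity-relevant subset of facial attributes consistently outperforms supervision with a broader attribute set; and (ii) explicitly forcing the embedding model to "unlearn" non-identity-related attributes yields further gains compared with leaving those attributes unsupervised. The method can also serve as a diagnostic tool for assessing the trustworthiness of face recognition encoders: non-identity-related attributes are suppressed and the accuracy gain is measured, and any gain suggests that the model may have learned shortcuts from redundant attributes associated with each identity.

Abstract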

Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.

Keywords: face recognition, facial embeddings, attribute-aware learning, identity-relevant attributes, non-identity-related attributes, face verification, discriminability, shortcut learning

247. ❌ SRL-MAD: Structured Residual Latents for One-Class Morphing Attack Detection

Authors: Diogo J. Paulo, Hugo Proença, João C. Neves Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

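Scoring rationale: The paper focuses on face morphing attack detection (MAD) in computer vision, built on Fourier-transform and residual analysis, and does not involve large language models, AI for Science, or any other scoring keyword. It is unrelated to the large-model, deep learning technique, and scientific AI topics that the keywords cover, so relevance is 0 throughout.

!!! tip deepseek-chat TL;DR

The paper proposes SRL-MAD, a one-class face morphing attack detection method based on structured residual Fourier representations; by learning frequency-aware projections in the frequency domain it significantly improves detection of unseen attacks and outperforms existing methods on several datasets.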

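Abstract translation

Face morphing attacks pose a serious threat to biometric systems by combining multiple identities into a single face. Although supervised morphing attack detection (MAD) methods have shown good performance, their reliance on attack-labeled data limits generalization to unknown morphing attacks. This has driven growing interest in one-class MAD, in which models are trained only on bona fide face samples and aim to detect unknown attacks by identifying deviations from normal facial structure. In this context, we propose SRL-MAD, a one-class, single-image MAD method based on structured residual Fourier representations for open-set morphing attack detection. Starting from a residual frequency map that suppresses image-specific spectral trends, the method preserves the two-dimensional structure of the Fourier domain through a ring-based representation and replaces azimuthal averaging with a learnable ring-wise spectral projection. To further encode domain knowledge about where morphing artifacts arise, we impose a frequency-informed inductive bias by organizing spectral evidence into low-, mid-, and high-frequency bands and learning cross-band interactions. These structured spectral features are mapped into a latent space designed for direct scoring, avoiding reliance on reconstruction error. Extensive evaluation on the FERET-Morph, FRLL-Morph, and MorDIFF datasets shows that SRL-MAD outperforms recent one-class and supervised MAD models across experiments. Overall, the study shows that, for one-class morphing attack detection, learning frequency-aware projections provides a more discriminative alternative to azimuthal spectral summarization.

Abstract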

Face morphing attacks represent a significant threat to biometric systems as they allow multiple identities to be combined into a single face. While supervised morphing attack detection (MAD) methods have shown promising performance, their reliance on attack-labeled data limits generalization to unseen morphing attacks. This has motivated increasing interest in one-class MAD, where models are trained exclusively on bona fide samples and are expected to detect unseen attacks as deviations from the normal facial structure. In this context, we introduce SRL-MAD, a one-class single-image MAD that uses structured residual Fourier representations for open-set morphing attack detection. Starting from a residual frequency map that suppresses image-specific spectral trends, we preserve the two-dimensional organization of the Fourier domain through a ring-based representation and replace azimuthal averaging with a learnable ring-wise spectral projection. To further encode domain knowledge about where morphing artifacts arise, we impose a frequency-informed inductive bias by organizing spectral evidence into low, mid, and high-frequency bands and learning cross-band interactions. These structured spectral features are mapped into a latent space designed for direct scoring, avoiding the reliance on reconstruction errors. Extensive evaluation on FERET-Morph, FRLL-Morph, and MorDIFF demonstrates that SRL-MAD consistently outperforms recent one-class and supervised MAD models. Overall, our results show that learning frequency-aware projections provides a more discriminative alternative to azimuthal spectral summarization for one-class morphing attack detection.
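For readers unfamiliar with the baseline that the abstract says is replaced, the following is a minimal sketch of azimuthal averaging: a centred 2D magnitude spectrum is reduced to one mean value per concentric ring. It illustrates only that standard summarization, not the authors' learnable ring-wise projection. The function name and the equal-width ring binning are assumptions.

```python
import math


def azimuthal_average(spectrum, n_rings):
    """Average a centred 2D magnitude spectrum over concentric rings.

    spectrum: list of rows (list of floats), zero frequency at the centre.
    n_rings: number of equal-width radial bins (an illustrative choice).
    Returns one mean value per ring; empty rings yield 0.0.
    """
    h, w = len(spectrum), len(spectrum[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = math.hypot(cy, cx) or 1.0  # guard against a 1x1 input
    sums = [0.0] * n_rings
    counts = [0] * n_rings
    for y in range(h):
        for x in range(w):
            r = math.hypot(y - cy, x - cx)
            ring = min(int(r / max_r * n_rings), n_rings - 1)
            sums[ring] += spectrum[y][x]
            counts[ring] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


flat = [[2.0] * 8 for _ in range(8)]
profile = azimuthal_average(flat, n_rings=4)  # constant input: every ring averages 2.0
```

Every coefficient in a ring contributes equally to that ring's mean, so directional structure within the ring is discarded. That is the information a ring-wise projection with learned weights could, in principle, retain.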

Keywords: face morphing attack detection, one-class MAD, structured residual Fourier representations, frequency-aware projections, open-set detection, biometric security, spectral analysis, unsupervised anomaly detection

248. ❌ GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Authors: Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

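Scoring rationale: The paper focuses on benchmarking mobile GUI agents, which is research on large models (MLLMs) in a specific application area (human-computer interaction, mobile apps). The core relevant keywords are: LLM Agents (highly relevant; the paper evaluates MLLM-based agents), Multi-agent Systems (relevant; multi-agent systems are evaluated), Tool Use (relevant; involves GUI interaction control), Self-Correction (relevant; reflection and self-evaluation abilities are evaluated), and Chain of Thought/System 2 Thinking (some relevance; the planning and reflection dimensions are involved). Other keywords, such as MoE, SLMs, training techniques, inference optimization, and scientific AI, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the lack of a comprehensive benchmark for Chinese mobile GUI agents, the paper proposes GUI-CEval, the first comprehensive evaluation framework built on real-device environments, and finds experimentally that existing MLLMs have clear weaknesses in reflective decision-making and post-action self-evaluation.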

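Abstract translation

Recent progress in multimodal large language models (MLLMs) has given rise to mobile graphical user interface (GUI) agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are mostly English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also tend to focus on isolated skills such as GUI grounding or offline agents, and lack a unified, fine-grained framework for evaluating the full capability chain from perception to execution. To fill this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents built entirely on real-device environments. The benchmark covers 201 mainstream apps across four device types and adopts a two-level evaluation structure that systematically assesses both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that, although models such as Qwen2.5-VL and UI-TARS are competitive, most MLLMs still have clear deficiencies in reflective decision-making and post-action self-evaluation, which limits their reliability in real interaction scenarios. We hope GUI-CEval will provide a comprehensive and interpretable benchmark for diagnosing capabilities and advancing the development of Chinese mobile GUI agents.

Abstract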

Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

Keywords: Mobile GUI Agents, Multimodal Large Language Models, Benchmark, Chinese Mobile Ecosystem, Perception-Execution Chain, Multi-agent Systems, Self-evaluation, Real-world Interaction

249. ❌ Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

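Scoring rationale: The paper focuses on detecting generated videos and proposes STALL, a training-free, zero-shot detection method that identifies synthetic videos through spatial-temporal likelihoods. Its content has no direct connection to any of the scoring keywords (which mainly cover large-model principles, training methods, inference optimization, alignment, and applications); it belongs to computer vision/multimedia forensics rather than research on large-model or deep learning principles.

!!! tip deepseek-chat TL;DR

To address the challenge of synthetic video detection, the paper proposes STALL, a training-free, zero-shot detection method that scores likelihood by jointly modeling spatial and temporal evidence and outperforms existing methods on several benchmarks.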

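Abstract translation

Following major advances in text and image generation, video generation has developed rapidly and can now produce highly realistic and controllable sequences. Alongside this progress, such models raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly critical. Image-based detectors are fundamentally limited because they analyze frames one at a time and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generative models, a drawback that is especially serious given how quickly new models appear. These challenges have motivated zero-shot detection methods, which avoid synthetic data and instead score content against the statistics of real data, enabling training-free, model-agnostic detection. This paper proposes STALL, a simple, training-free, theoretically grounded detector that provides likelihood-based scores for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark that includes state-of-the-art generative models. STALL outperforms prior image-based and video-based baselines in all tests. Code and data are available at https://omerbenhayun.github.io/stall-video.

Abstract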

Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.

Keywords: synthetic video detection, training-free detection, zero-shot approach, spatial-temporal likelihoods, video generation models, misinformation detection, model-agnostic detection, probabilistic framework

250. ❌ One CT Unified Model Training Framework to Rule All Scanning Protocols

Authors: Fengzhi Xu, Ziyuan Yang, Zexin Lu, Yingyu Chen, Fenglei Fan, Hongming Shan, Yi Zhang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

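Scoring rationale: The paper focuses on deep learning enhancement for medical imaging (CT scans) and proposes a framework called UMS (Uncertainty-Guided Manifold Smoothing) to address the generalization problem caused by differing scanning protocols in non-ideal measurement CT (NICT). Its core techniques, including unsupervised learning, manifold smoothing in feature space, and a classifier-guided dynamic architecture, belong to computer vision and medical image analysis. The keywords all relate directly to large language models (LLMs) and associated techniques (such as MoE, SFT, RLHF, RAG, inference acceleration, and agents), and the paper involves no LLM or general-purpose AI technique. Only the last keyword, "AI for Science", has some connection to the paper's medical imaging application (rated 5), although the paper does not specifically involve bioinformatics or cheminformatics. All other keywords are therefore rated 0.

!!! tip deepseek-chat TL;DR

To address the image quality degradation and model generalization problems caused by differing scanning protocols in non-ideal measurement CT (NICT), the paper proposes an Uncertainty-Guided Manifold Smoothing (UMS) framework that dynamically combines global and sub-manifold features and effectively improves CT image reconstruction performance.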

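Abstract translation

Non-ideal measurement computed tomography (NICT), which reduces radiation dose at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise for NICT enhancement, most existing methods require paired data, a requirement that inevitable organ motion makes impractical. Unsupervised methods try to overcome this limitation, but their assumption of homogeneous noise ignores the variability of scanning protocols, leading to poor generalization and a risk of model collapse. We further observe that scanning protocols corresponding to different physical imaging processes form discrete sub-manifolds in feature space, which contradicts the existing assumptions and limits the methods' effectiveness. We therefore propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. The framework uses a classifier to identify sub-manifolds and predict uncertainty scores, which guide the generation of diverse samples covering the whole manifold. Using the classifier's discriminative ability, UMS effectively fills the gaps between discrete sub-manifolds and builds a continuous, dense feature space. Because the global manifold is too complex to model directly, we propose dynamically fusing global and sub-manifold-specific features. Specifically, we design a classifier-guided architecture driven by both global and sub-manifold features, which adapts dynamically to subdomain variations. This dynamic mechanism strengthens the network's ability to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on several public datasets validate the effectiveness of the method across different generation paradigms.

Abstract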

Non-ideal measurement computed tomography (NICT), which lowers radiation at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise in NICT enhancement, most methods require paired data, which is an impractical demand due to inevitable organ motion. Unsupervised approaches attempt to overcome this limitation, but their assumption of homogeneous noise neglects the variability of scanning protocols, leading to poor generalization and potential model collapse. We further observe that distinct scanning protocols, which correspond to different physical imaging processes, produce discrete sub-manifolds in the feature space, contradicting these assumptions and limiting their effectiveness. To address this, we propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. A classifier in UMS identifies sub-manifolds and predicts uncertainty scores, which guide the generation of diverse samples across the entire manifold. By leveraging the classifier’s capability, UMS effectively fills the gaps between discrete sub-manifolds, and promotes a continuous and dense feature space. Due to the complexity of the global manifold, it’s hard to directly model it. Therefore, we propose to dynamically incorporate the global- and sub-manifold-specific features. Specifically, we design a global- and sub-manifold-driven architecture guided by the classifier, which enables dynamic adaptation to subdomain variations. This dynamic mechanism improves the network’s capacity to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on public datasets are conducted to validate the effectiveness of our method across different generation paradigms.

Keywords: CT reconstruction, unsupervised learning, manifold smoothing, scanning protocols, domain adaptation, medical imaging, feature space, uncertainty guidance

251. ❌ Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization

Authors: Lehuai Xu, Weiming Zhang, Yang Li, Sidan Du, Lin Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15019v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

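Scoring rationale: This paper addresses stereo matching and depth estimation with multiple fisheye cameras, a computer vision problem. It proposes FreeOmniMVS, a reference-free framework that achieves globally consistent depth estimation through multi-view consistency maximization. The work involves Transformer architectures, attention mechanisms, and multi-view geometry, but it does not touch on large language models, innovations in deep learning principles, or AI for Science. Every scoring keyword concerns large models, deep learning principles, or scientific AI applications, whereas this paper is conventional computer vision research, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes FreeOmniMVS, a reference-free multi-view stereo matching framework that uses multi-view consistency maximization to achieve globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation that is robust to occlusions, partial overlaps, and varying baselines.

Abstract (Chinese translation)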

基于多鱼眼立体匹配的可靠全向深度估计对具身机器人等众多应用至关重要。现有方法要么依赖球面扫描与启发式融合策略构建代价柱体，要么基于校正视图进行以参考视图为中心的立体匹配。然而，这些方法未能显式利用多视图间的几何关系，导致其难以捕捉全局依赖性、可见性或尺度变化。本文提出一种新视角，通过多视图一致性最大化构建了一种无需指定参考视图的新型框架——FreeOmniMVS。该框架的核心在于能够将成对相关性聚合为鲁棒、可见性感知且具有全局一致性的共识，从而有效应对遮挡、部分重叠及变化基线等问题。具体而言，为实现全局一致性，我们提出新型视图对相关性变换器（View-pair Correlation Transformer, VCT），显式建模所有相机视图对间的成对相关体积，从而剔除因遮挡或离焦观测导致的不可靠视图对。为实现可扩展且可见性感知的共识，我们设计了一种轻量级注意力机制，自适应融合相关向量，无需指定参考视图，使所有相机能够平等参与立体匹配过程。在多个基准数据集上的大量实验表明，本方法在全向深度估计的全局一致性、可见性感知及尺度感知方面均具有优越性。

Abstract

Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing the global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it can aggregate pair-wise correlations into a robust, visibility-aware, and global consensus. As such, it is tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.

Keywords: omnidirectional depth estimation, multi-view stereo matching, reference-free framework, multi-view consistency maximization, view-pair correlation transformer, visibility-aware consensus, fisheye cameras, global coherence

252. ❌ Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching

Authors: Fangran Miao, Jian Huang, Ting Li Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

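Scoring rationale: This paper focuses on human motion generation and proposes a geometry-aware modeling framework based on Riemannian flow matching. Although it is an AI application, its content is unrelated to all of the scoring keywords, which center on large language models, training techniques, inference optimization, alignment, agents, and similar topics. The paper does not involve language models, innovations in deep learning principles, or applications of large models to other domains, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes RMG, a geometry-aware framework based on Riemannian flow matching for representing and generating human motion on non-Euclidean manifolds. It reports state-of-the-art FID on HumanML3D and surpasses strong baselines on MotionMillion.

Abstract (Chinese translation)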

人体运动生成通常在欧几里得空间中学习，但有效运动遵循结构化的非欧几里得几何。我们提出黎曼运动生成（Riemannian Motion Generation, RMG），这是一个在乘积流形上表示运动并通过黎曼流匹配学习动力学的统一框架。RMG将运动分解为多个流形因子，从而产生具有内在归一化的无尺度表示，并利用测地线插值、切空间监督和保持流形的常微分方程积分进行训练与采样。在HumanML3D数据集上，RMG在HumanML3D格式下取得了最先进的FID分数（0.043），并在MotionStreamer格式下所有已报告指标中排名第一。在MotionMillion数据集上，它也超越了强基线模型（FID 5.6，R@1 0.86）。消融实验表明，紧凑的$\mathscr{T}+\mathscr{R}$（平移+旋转）表示最为稳定有效，这凸显了几何感知建模是实现高保真运动生成的一条实用且可扩展的路径。

Abstract

Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.

Keywords: Riemannian flow matching, human motion generation, product manifold, geometry-aware modeling, motion representation, non-Euclidean geometry, state-of-the-art FID
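The abstract mentions geodesic interpolation and tangent-space supervision without giving formulas. The following is a minimal sketch of those two ingredients on the unit sphere, chosen here only as a simple stand-in manifold; the manifold factors RMG actually uses are not specified in this report. All function names are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Tuple

Vec = List[float]

def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: Vec) -> float:
    return math.sqrt(dot(a, a))

def scale(a: Vec, s: float) -> Vec:
    return [x * s for x in a]

def log_map(x: Vec, y: Vec) -> Vec:
    """Tangent vector at x whose geodesic reaches y at unit time (unit sphere)."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)                        # geodesic distance from x to y
    u = [yi - c * xi for xi, yi in zip(x, y)]   # part of y orthogonal to x
    n = norm(u)
    if n < 1e-12:                               # x == y (antipodal pairs not handled)
        return [0.0] * len(x)
    return scale(u, theta / n)

def exp_map(x: Vec, v: Vec) -> Vec:
    """Point reached by following the geodesic from x with initial velocity v."""
    n = norm(v)
    if n < 1e-12:
        return list(x)
    return [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]

def flow_matching_pair(x0: Vec, x1: Vec, t: float) -> Tuple[Vec, Vec]:
    """Geodesic interpolant x_t and its tangent-space target velocity, for 0 <= t < 1."""
    xt = exp_map(x0, scale(log_map(x0, x1), t))
    ut = scale(log_map(xt, x1), 1.0 / (1.0 - t))
    return xt, ut

# Assumed toy endpoints: a noise sample x0 and a data sample x1 on the sphere.
x0, x1 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
xt, ut = flow_matching_pair(x0, x1, 0.5)
```

In this sketch the interpolant stays on the sphere and the target velocity lies in the tangent space at `xt`, which is the property a tangent-space regression loss would rely on.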

253. ❌ Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Authors: Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15011v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

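Scoring rationale: The paper applies vision-language models to cheminformatics. Its core contributions are Identifier as Visual Prompting (IdtVP) and a reinforcement learning algorithm with verifiable rewards (Re3-DAPO), both aimed at improving chemical reaction diagram parsing. Keyword relevance: (1) "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), since the paper is a direct cheminformatics application; (2) "Post-training OR Supervised Fine-tuning OR SFT" is fairly relevant (8 points), since the paper uses standard supervised fine-tuning as a baseline; (3) "Pre-training OR Continual Pre-training OR Domain Adaptation" is weakly relevant (5 points), since the method activates knowledge acquired during VLM pre-training; (4) the remaining keywords (such as LLMs, MoE, and RLHF) have no direct connection to the paper's vision-language model and reinforcement learning application and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper tackles the difficulty of aligning visual chemical entities with pre-trained knowledge in chemical reaction diagram parsing. By proposing a molecular-identifier visual prompting method and a reinforcement learning algorithm with verifiable rewards, it improves the accuracy and generalization of vision-language models on this task.

Abstract (Chinese translation)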

反应图解分析（RxnDP）对于从文献中提取化学合成信息至关重要。尽管近期出现的视觉语言模型（VLMs）为自动化这一复杂的视觉推理任务提供了有前景的范式，但其应用从根本上受到两大瓶颈的制约：一是无法将视觉化学实体与预训练知识对齐，二是模型在标记级训练与反应级评估之间存在固有差异。为应对这双重挑战，本研究从提示表示和学习范式两个互补的角度，对基于VLM的RxnDP进行了增强。首先，我们提出了标识符作为视觉提示（Identifier as Visual Prompting, IdtVP），该方法利用自然存在的分子标识符（例如粗体数字如1a）来激活VLM预训练期间获得的化学知识。IdtVP赋予了模型强大的零样本和分布外泛化能力，其表现优于现有的提示策略。其次，为了在微调范式中进一步优化性能，我们引入了Re3-DAPO，这是一种利用可验证奖励直接优化反应级指标的强化学习算法，从而在标准监督微调基础上实现了持续的性能提升。此外，我们发布了ScannedRxn基准数据集，该数据集包含带有真实世界伪影的扫描历史反应图，用于严格评估模型的鲁棒性和分布外能力。我们的贡献提升了基于VLM的反应图解分析的准确性和泛化能力。我们将在GitHub上发布数据、模型和代码。

Abstract

Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.

Keywords: Reaction Diagram Parsing, Vision-Language Models, Molecular Identifier, Visual Prompting, Reinforcement Learning, Chemical Synthesis, Out-of-distribution, Benchmark Dataset

254. ❌ Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Authors: Kaixin zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

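Scoring rationale: The paper proposes ClueNet, a framework for video reasoning with multi-modal large language models (MLLMs). Its core contributions are: (1) a two-stage training paradigm based on supervised fine-tuning (SFT), which is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT"; (2) an emphasis on chain-based and in-depth reasoning, which is highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning"; (3) the goal of reducing hallucination and improving interpretability, which is highly relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" and "Mechanistic Interpretability OR Explainable AI"; (4) its MLLM foundation, which is highly relevant to "Large Language Models OR LLMs OR Foundation Models". Other keywords such as MoE, SLMs, RAG, and quantization are not covered in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the hallucination and poor interpretability of multi-modal large language models in video question answering, the paper proposes ClueNet, a visual-clue-aware framework that uses two-stage supervised fine-tuning and chain-based reasoning. It improves performance on several benchmarks and mitigates hallucination.

Abstract (Chinese translation)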

多模态大语言模型（MLLMs）显著推动了视频推理的发展，然而视频问答（VideoQA）因其对时序因果推理和基于证据的答案生成的要求，仍然面临挑战。主流的端到端MLLM框架缺乏视觉感知与答案推导之间的显式结构化推理，导致严重的幻觉问题和较差的解释性。现有方法也未能解决三个核心差距：可靠的视觉线索提取、效用感知的线索筛选以及端到端的线索-答案对齐。受人类分层视觉认知的启发，我们提出了ClueNet，一种线索感知的视频推理框架，采用两阶段监督微调范式，无需对基础模型进行大量修改。解耦监督机制对齐线索提取与基于链式推理的过程，而带有自适应线索过滤器的推理监督则优化高阶推理，同时配合轻量级模块实现高效推理。在NExT-QA、STAR和MVBench上的实验表明，ClueNet以$\ge$ 1.1%的优势超越现有最先进方法，并展现出卓越的泛化能力、幻觉缓解效果、推理效率以及跨骨干网络的兼容性。本研究弥合了MLLM视频理解中从感知到生成的差距，为高风险的VideoQA应用提供了一种可解释、可靠的推理范式。

Abstract

Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.

Keywords: Multi-modal Large Language Models, Video Reasoning, Video Question Answering, Supervised Fine-tuning, Chain-based Reasoning, Hallucination Mitigation, Interpretable Reasoning, ClueNet

255. ❌ Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning

Authors: Nasrin Rahimi, Mısra Yavuz, Burak Can Biner, Yunus Bilge Kurt, Ahmet Rasim Emirdağı, Süleyman Aslan, Görkay Aydemir, M. Akın Yılmaz, A. Murat Tekalp Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

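Scoring rationale: The paper's core study uses LoRA for few-shot fine-tuning of a large image-editing foundation model so that it adapts to video frame interpolation. It is therefore highly relevant to "PEFT/LoRA" (15 points), directly relevant to "Foundation Models", "Pre-training", and "SFT" (10 points each), and somewhat relevant to "In-context Learning" (5 points) because few-shot learning is its core method. Other keywords such as MoE, SLMs, RAG, and Agents are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that few-shot fine-tuning with LoRA can adapt a large foundation model built only for static image editing (Qwen-Image-Edit) to video frame interpolation, revealing latent temporal reasoning ability in spatial editing models.

Abstract (Chinese translation)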

预训练图像编辑模型通过从数十亿图像-文本对中学习，展现出强大的空间推理与对象感知变换能力，但其本身不具备显式的时间建模。本文证明，这些空间先验知识可通过极少量适应调整被重新用于解锁时间合成能力——无需引入任何视频专用架构或运动估计模块。我们研究表明，一个原本仅设计用于基于指令的静态图像编辑的大规模模型（Qwen-Image-Edit），仅需通过低秩适应（Low-Rank Adaptation, LoRA）技术使用64-256个训练样本进行微调，即可适应视频帧插值（Video Frame Interpolation, VFI）任务。我们的核心贡献在于揭示：该模型对静态场景中“物体如何变换”的固有理解中，蕴含着可通过少量样本微调激活的潜在时间推理能力。虽然基线模型完全无法生成连贯的中间帧，但我们提出的参数高效适应方法成功解锁了其插值能力。本研究并非旨在与基于海量数据从头训练的任务专用VFI方法竞争，而是证实基础图像编辑模型在时间性任务中具有尚未开发的潜力，为资源受限场景下的视频合成提供了一条数据高效的路径。这项工作弥合了图像处理与视频理解之间的鸿沟，表明在基础模型中，空间推理与时间推理可能比以往认知更具内在关联性。

Abstract

Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model’s inherent understanding of “how objects transform” in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.

Keywords: Video Frame Interpolation, Foundation Models, Few-shot Learning, Low-Rank Adaptation (LoRA), Parameter-efficient Fine-tuning, Spatial-to-Temporal Adaptation, Image Editing Models, Temporal Synthesis
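For readers unfamiliar with Low-Rank Adaptation, the sketch below illustrates the general idea the abstract relies on: a frozen weight matrix W is augmented with a trainable low-rank product, giving W + (alpha / r) · B · A. It is a generic illustration in plain Python, not the paper's training code; the rank, scaling factor, and layer sizes are assumptions chosen for the example, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(A)
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Only A and B are trained; the frozen W contributes nothing."""
    return r * (d_in + d_out)

# Toy frozen weight (2 x 3) with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]          # r x d_in
B_zero = [[0.0], [0.0]]        # common initialisation: B = 0, so the model starts unchanged
B = [[1.0], [0.0]]             # d_out x r after some hypothetical training
W_init = lora_effective_weight(W, A, B_zero, alpha=2.0)
W_adapted = lora_effective_weight(W, A, B, alpha=2.0)
```

For an assumed 4096 × 4096 layer with r = 8, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, which is why a few hundred training samples can be a plausible budget for this kind of adaptation.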

Authors: Hürkan Şahin, Huy Xuan Pham, Van Huyen Dang, Alper Yegenoglu, Erdal Kayacan Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

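Scoring rationale: The paper studies monocular depth estimation and SLAM based on thermal imaging. It uses a lightweight supervised network with recurrent blocks to refine thermal images and predict depth, then integrates the results into ORB-SLAM3. All scoring keywords relate to large language models, deep learning principles, or AI applications in science, whereas this paper focuses on computer vision, robot navigation, and sensor fusion. It does not involve large models, deep learning innovations, or AI applications in bioinformatics or cheminformatics, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method that estimates depth from monocular thermal images with a lightweight recurrent network and integrates it into ORB-SLAM3, achieving competitive depth accuracy and robust SLAM performance under low-light conditions.

Abstract (Chinese translation)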

在GPS拒止与视觉退化环境中实现无人机自主导航仍具挑战性。为此，本研究探索在无人机平台上使用单目热成像相机作为独立传感器，以实现实时深度估计与同步定位与建图（SLAM）。为从热图像中提取深度信息，我们提出一种新颖流程，采用集成循环模块（RBs）的轻量化监督网络以捕捉时序依赖性，从而获得更鲁棒的预测结果。该网络将轻量化卷积主干与热优化网络（T-RefNet）相结合，以优化原始热成像输入并增强特征可见性。优化后的热图像与预测深度图被集成至ORB-SLAM3系统中，实现仅依赖热成像的定位功能。与现有方法不同，本网络使用自定义非辐射测量数据集进行训练，无需依赖高成本的辐射测量热像仪。在数据集与无人机实际飞行中的实验结果表明，该方法在弱光条件下具有竞争力的深度精度与鲁棒的SLAM性能。在辐射测量数据集VIVID++（室内暗光）上，本方法的绝对相对误差约为0.06，而基线方法误差超过0.11；在自建的非辐射测量室内数据集中，基线误差仍高于0.24，而本方法误差始终低于0.10。仅使用热成像的ORB-SLAM3系统平均轨迹误差保持在0.4米以下。

Abstract

Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.

Keywords: thermal image refinement, depth estimation, recurrent networks, monocular ORB-SLAM3, UAV navigation, low-light conditions, simultaneous localization and mapping
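The depth results above are quoted as absolute relative error. As a reminder of what that number means, here is a minimal sketch of the metric as it is commonly defined in the depth-estimation literature: the mean of |predicted − true| / true over valid pixels. The validity threshold `min_depth` and the function name are assumptions for illustration; the paper's exact evaluation protocol is not described in this report.

```python
from typing import Sequence

def abs_rel(pred: Sequence[float], gt: Sequence[float], min_depth: float = 1e-3) -> float:
    """Mean absolute relative depth error over pixels with valid ground truth."""
    # Skip pixels whose ground-truth depth is missing or non-positive.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# Toy depths in metres; the last ground-truth value is invalid and ignored.
predicted = [1.1, 1.8, 4.0, 2.0]
ground_truth = [1.0, 2.0, 4.0, 0.0]
error = abs_rel(predicted, ground_truth)
```

Under this definition, an error of about 0.06 means the predicted depth deviates from the true depth by roughly 6% on average.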

257. ❌ MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Authors: Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han, Zhongwei Wan, Ziheng Zhang, Jingxuan Zhang, Jing Xiong, Ziyuan Liu, Yifan Zhang, Hangrui Cao, Chenyang Zhao, Mi Zhang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

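Scoring rationale: The paper focuses on speculative decoding for vision-language models (VLMs), one of the core methods for accelerating large-model inference. It is highly relevant to "Speculative Decoding OR Inference Acceleration" (15 points), which is the paper's central topic. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because VLMs are a kind of large model and the paper extends speculative decoding from text-only LLMs to multimodal settings. Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how speculative decoding performs in vision-language models, builds MMSpec, the first benchmark for multimodal speculative decoding, and proposes ViSkip, a method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.

Abstract (Chinese translation)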

视觉语言模型（Vision-Language Models, VLMs）在多模态任务上表现出色，但由于模型规模庞大且多模态上下文较长，其推理延迟较高。推测解码（Speculative Decoding）作为一种高效的加速技术近期受到关注，但其在视觉语言模型中的行为机制尚未得到充分理解。本文提出MMSpec，这是首个用于评估视觉语言模型中推测解码性能的基准测试。MMSpec涵盖六大任务类别的600个多模态样本，并在统一评估框架下集成了十种代表性推测解码算法。我们的研究揭示了三个关键发现：（1）专为纯文本大语言模型设计的方法在多模态场景中性能下降；（2）视觉感知能力在更大批处理规模下重要性显著提升；（3）仅凭吞吐量加速指标无法可靠反映实际延迟性能。基于这些发现，我们提出ViSkip——一种即插即用的推测解码方法，能够动态适配视觉令牌（vision tokens）的推测过程，并实现了最先进的性能表现。

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.

Keywords: Vision-Language Models, Speculative Decoding, Inference Acceleration, Multimodal Benchmark, MMSpec, ViSkip, Inference Latency, Vision Tokens
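To make the draft-then-verify idea behind speculative decoding concrete, the sketch below shows its simplest greedy variant with two toy next-token functions. It is not ViSkip, and it is not any of the ten algorithms benchmarked in MMSpec; the toy models, the function names, and the draft length `k` are all assumptions for illustration. Practical systems verify all drafted positions in a single forward pass of the target model and often use a sampling-based acceptance rule rather than exact greedy matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference decoding: the target model alone, one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: keep drafted tokens while the target agrees."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(min(k, goal - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed position in order
        #    (done sequentially here; batched into one pass in real systems).
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)      # the target's own token is always what is kept
            if expected != tok:
                break                 # first disagreement: drop the rest of the draft
    return seq

# Hypothetical toy models over token ids 0..6; the draft errs whenever the last token is 5.
def target_model(seq: List[int]) -> int:
    return (3 * seq[-1] + 1) % 7

def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else target_model(seq)

spec_out = speculative_decode(target_model, draft_model, [2], n_new=10, k=4)
ref_out = greedy_decode(target_model, [2], n_new=10)
```

Because every kept token is the target model's own greedy choice, the output matches plain greedy decoding exactly; any speed-up comes only from how many drafted tokens are accepted per verification round.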

258. ❌ Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Authors: Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang, Naixiang Zheng, Guoyuan Wang, Jun Yu, Jiaen Liang, Wei Huang, Shengping Liu, Ximin Zheng Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14976v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

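Scoring rationale: The paper focuses on multimodal fusion for affective computing, specifically emotional mimicry intensity estimation, and proposes a text-anchored fusion framework. All scoring keywords concern large models, deep-learning methodology, or AI-for-science applications, whereas this paper addresses conventional multimodal emotion analysis. It does not touch large models, LLM techniques, training optimization, inference acceleration, or AI agents, and it is not applied to scientific domains such as bioinformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes TAEMI, a text-anchored multimodal fusion framework for robustly estimating emotional mimicry intensity in naturalistic environments. In experiments on the Hume-Vidmimic2 dataset it achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotion dimensions.

Abstract translation

Estimating emotional mimicry intensity in naturalistic environments is a critical and challenging task in affective computing. The main difficulty is effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To this end, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a new multimodal framework designed for the 10th ABAW Competition. Observing that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break with the traditional symmetric fusion paradigm and instead use text transcripts, which encode a stable, time-independent semantic prior, as the central anchor. Specifically, we introduce a text-anchored dual cross-attention mechanism that uses these robust text queries to actively filter frame-level redundancy and align the noisy physical signal streams. In addition, to counter the sharp performance drop caused by unavoidable missing data in unconstrained real-world scenarios, we integrate learnable missing-modality tokens and a modality dropout strategy into training. Extensive experiments on the Hume-Vidmimic2 dataset show that TAEMI captures fine-grained emotional variation and remains robust under non-ideal conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotion dimensions, significantly outperforming existing baselines.

Abstract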

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcript–which inherently encode a stable, time-independent semantic prior–as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.

Keywords: Emotional Mimicry Intensity, Multimodal Fusion, Text-Anchored, Affective Computing, Missing Modality, Cross-Attention, Robust Prediction, Hume-Vidmimic2
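The headline metric in the abstract above is a mean Pearson correlation coefficient over several emotion dimensions. The following is a minimal sketch of how such a metric can be computed. The dictionary layout, the dimension names in the example, and the function names are assumptions for illustration only, not the competition's official evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(predictions, targets):
    """Average the per-dimension Pearson coefficients.

    `predictions` and `targets` map a dimension name to a list of values
    (a hypothetical layout chosen for this sketch).
    """
    scores = [pearson(predictions[name], targets[name]) for name in targets]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = {"dim_a": [0.1, 0.4, 0.8], "dim_b": [0.2, 0.1, 0.9]}
    gold = {"dim_a": [0.0, 0.5, 1.0], "dim_b": [0.3, 0.2, 0.7]}
    print(mean_pearson(preds, gold))
```

Averaging per-dimension coefficients, rather than pooling all values into one correlation, keeps each dimension's scale from dominating the result.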

259. ❌ Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition

Authors: Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

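Scoring rationale: The paper focuses on improving second-order pooling for LiDAR place recognition, involving Voronoi cells, a whitening algorithm, and the Mahalanobis distance. It belongs to computer vision and robot localization. All scoring keywords concern large models, deep-learning methodology, or AI-for-science applications, whereas this paper does not address large models, language models, alignment, fine-tuning, inference acceleration, or agents, and it is not applied to scientific domains such as bioinformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Voronoi-cell-based second-order pooling method that whitens the global descriptor to implicitly measure the Mahalanobis distance. It addresses the numerical instability of second-order pooling in LiDAR place recognition and shows performance gains on the Oxford Robotcar and Wild-Places benchmarks.

Abstract translation

In LiDAR place recognition (LPR), the pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor. Second-order pooling in particular can capture higher-order interactions among local descriptors. However, existing LPR methods still follow conventional implementations and post-normalization, which yields descriptors unsuited to Euclidean distance. Building on a recent interpretation that links NetVLAD to second-order statistics, we propose combining second-order pooling with the inductive bias of Voronoi cells. Our pooling method aggregates local descriptors into a second-order matrix and whitens the global descriptor, so that it implicitly measures the Mahalanobis distance while preserving the cluster property of Voronoi cells, and it applies several techniques to address numerical instability during learning. Experiments on the Oxford Robotcar and Wild-Places benchmarks validate the performance gains, and we analyze the numerical effect of the proposed whitening algorithm.

Abstract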

The pooling layer plays a vital role in aggregating local descriptors into the metrizable global descriptor in the LiDAR Place Recognition (LPR). In particular, the second-order pooling is capable of capturing higher-order interactions among local descriptors. However, its existing methods in the LPR adhere to conventional implementations and post-normalization, and incur the descriptor unsuitable for Euclidean distancing. Based on the recent interpretation that associates NetVLAD with the second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through the experiments conducted on the Oxford Robotcar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.

Keywords: LiDAR Place Recognition, second-order pooling, Voronoi cells, whitening algorithm, Mahalanobis distance, global descriptor, numerical instability, Oxford Robotcar benchmark
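The abstract above states that whitening a descriptor lets Euclidean distance implicitly measure the Mahalanobis distance. The sketch below illustrates only that textbook identity on a small example: with covariance Σ = L Lᵀ (Cholesky), the squared Euclidean distance between L⁻¹x and L⁻¹y equals (x − y)ᵀ Σ⁻¹ (x − y). The paper's actual whitening algorithm and its stabilization techniques are not described in the abstract, so all names and values here are assumptions.

```python
import math


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite matrix."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - partial)
            else:
                lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


def whiten(lower, x):
    """Return z solving lower @ z = x by forward substitution (z = L^-1 x)."""
    z = []
    for i in range(len(x)):
        partial = sum(lower[i][k] * z[k] for k in range(i))
        z.append((x[i] - partial) / lower[i][i])
    return z


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


if __name__ == "__main__":
    sigma = [[4.0, 2.0], [2.0, 3.0]]  # illustrative covariance, not from the paper
    x, y = [3.0, 2.0], [1.0, 1.0]
    L = cholesky(sigma)
    # Squared Mahalanobis distance between x and y under sigma.
    print(squared_euclidean(whiten(L, x), whiten(L, y)))
```

Forward substitution avoids forming Σ⁻¹ explicitly, which is the usual choice when numerical stability matters.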

260. ❌ GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Authors: Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon, Kuk-Jin Yoon Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

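Scoring rationale: The paper focuses on novel view synthesis in computer vision and proposes GeoNVS, a geometry-grounded video diffusion model whose core innovation is the Gaussian Splat Feature Adapter (GS-Adapter) for improving geometric fidelity and camera controllability. All scoring keywords relate to large language models (LLMs) and associated techniques (training, alignment, inference optimization, agents, and so on) or to AI applications in specific scientific fields such as bioinformatics. This paper studies a purely visual task and involves no LLM technique, principle, or application, nor any AI application in biology or chemistry, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GeoNVS, a geometry-grounded video diffusion model that injects 3D geometric constraints in feature space through the Gaussian Splat Feature Adapter (GS-Adapter). It addresses geometric distortion and limited camera controllability in novel view synthesis and achieves state-of-the-art performance across multiple scenes and settings.

Abstract translation

Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across viewpoints. Although recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortion and limited camera controllability. To overcome these challenges, we propose GeoNVS, a geometry-grounded novel-view synthesizer that improves both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our core innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into a 3D Gaussian representation, renders geometry-constrained novel-view features, and adaptively fuses them with the diffusion features to correct geometrically inconsistent representations. Unlike earlier methods that inject geometry at the input level, GS-Adapter operates in feature space and so avoids the view-dependent color noise that degrades structural consistency. Its plug-and-play design gives zero-shot compatibility with a range of feed-forward geometry models without additional training, and it can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, with improvements of 11.3% and 14.9% over SEVA and CameraCtrl respectively, up to a 2x reduction in translation error, and up to a 7x reduction in Chamfer Distance.

Abstract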

Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.

Keywords: Novel View Synthesis, Video Diffusion Models, Geometry Grounded, Gaussian Splatting, 3D Geometric Guidance, Feature Space Adaptation, Camera Controllability, Zero-shot Compatibility
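The abstract above reports geometric accuracy in terms of Chamfer Distance. As a reference point, here is a brute-force sketch of one common symmetric form (mean squared nearest-neighbor distance in both directions). Definitions vary between papers (squared versus unsquared, sum versus mean), and the abstract does not say which variant GeoNVS uses, so treat this as an assumption.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Uses the mean of squared nearest-neighbor distances in each direction,
    which is one common convention and an assumption in this sketch.
    Runs in O(len(points_a) * len(points_b)) time.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def squared_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # For each source point, the squared distance to its nearest neighbor.
        return sum(min(squared_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

For large point clouds a spatial index such as a k-d tree would replace the inner `min`, but the quantity computed stays the same.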

261. ❌ Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset

Authors: Songcheng Du, Yang Zou, Jiaxin Li, Mingxuan Liu, Ying Li, Changjing Shang, Qiang Shen Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14952v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

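Scoring rationale: The paper focuses on pansharpening of thin-cloud-contaminated remote sensing images, proposes an end-to-end unified framework that incorporates physical priors (Pan-TCR), and builds the first real-world dataset (PanTCR-GF2). Its core is computer vision and image processing, specifically the fusion of multispectral and panchromatic remote sensing images together with cloud removal. All keywords relate to large models, deep-learning methodology, or their application in science, but the paper does not involve large models, language models, training or fine-tuning techniques, inference optimization, alignment, agent systems, or model compression. The only marginally related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since remote sensing image processing can be seen as an application of AI to science (Earth observation). However, the paper does not use these terms and does not involve bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

For pansharpening of remote sensing images under thin-cloud contamination, the paper proposes Pan-TCR, a unified end-to-end framework for pansharpening and thin-cloud removal, and builds the first real-world benchmark dataset (PanTCR-GF2). Frequency-decoupled restoration and an interactive inter-frequency consistency module deliver superior performance.

Abstract translation

Pansharpening under thin-cloud conditions is practically important but rarely studied, and it faces the twin challenges of reduced spatial resolution and cloud-induced spectral distortion. Existing methods usually handle cloud removal and pansharpening sequentially, and the lack of joint degradation modeling leads to error accumulation and suboptimal performance. To address these challenges, we propose the Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that decouples the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by a complementary degradation-robust prompt: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) image phase for high-resolution structural enhancement. To keep the two components coherent, we further introduce an interactive inter-frequency consistency (IFC) module, which uses cross-modal refinement to enforce consistency and robustness across frequency cues. In addition, we build the first real-world thin-cloud-contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to support robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradation.

Abstract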

Pansharpening under thin cloudy conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.

Keywords: pansharpening, thin cloud removal, remote sensing images, unified framework, frequency-decoupled restoration, benchmark dataset, multispectral image, panchromatic image

262. ❌ GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM

Authors: Guohua Zhang, Jian Jin, Meiqin Liu, Chao Yao, Weisi Lin, Yao Zhao Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

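Scoring rationale: The paper proposes GT-PCQA, a point cloud quality assessment framework based on multimodal large language models (MLLMs). Its core involves applying MLLMs, instruction tuning, and parameter-efficient fine-tuning (LoRA), so it is highly relevant (10 points) to "Large Language Models OR LLMs OR Foundation Models", "Instruction Tuning OR Alignment OR Value Alignment", and "PEFT OR LoRA OR Parameter-efficient Fine-tuning". The paper is an application of AI in a scientific field (point cloud quality assessment) and has some relevance (8 points) to "AI for Science OR Bioinformatics OR Cheminformatics". It mentions the texture-dominant bias of pre-trained MLLMs, which gives a weak link (5 points) to "Pre-training OR Continual Pre-training OR Domain Adaptation". Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

Existing MLLM methods for point cloud quality assessment (PCQA) suffer from unstable instruction tuning and insufficient sensitivity to geometric structural degradation. The paper proposes GT-PCQA, an MLLM framework based on geometry-texture decoupling and 2D-3D joint training, which achieves competitive performance and strong generalization.

Abstract translation

With the rapid development of multimodal large language models (MLLMs), MLLM-based image quality assessment (IQA) methods have shown good generalization. However, directly extending these methods to point cloud quality assessment (PCQA) remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, because of large-scale image-text pre-training, MLLMs tend to rely on texture-dominant reasoning and are not sensitive enough to the geometric structural degradations that are critical for PCQA. To close these gaps, we propose a novel MLLM-based no-reference PCQA framework, named GT-PCQA, built on two key strategies. First, to achieve stable and effective instruction tuning under scarce PCQA supervision, we propose a 2D-3D joint training strategy. It formulates PCQA as a relative quality comparison problem in order to unify large-scale IQA datasets with limited PCQA datasets, and it adopts a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, we propose a geometry-texture decoupling strategy that combines a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs while increasing sensitivity to geometric structural degradation. Extensive experiments show that GT-PCQA achieves competitive performance and strong generalization.

Abstract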

With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to PCQA remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.

Keywords: Multi-modal Large Language Models, Point Cloud Quality Assessment, Instruction Tuning, LoRA, Geometry-Texture Decoupling, 2D-3D Joint Training, Parameter-efficient Fine-tuning, Generalization
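The abstract above relies on Low-Rank Adaptation (LoRA) for parameter-efficient instruction tuning. The sketch below shows the generic LoRA update, W' = W + (alpha / r) · B A, with a frozen weight W and two small trainable factors. It uses plain Python lists for clarity. The matrix shapes, the `alpha` value, and the function names are illustrative assumptions and do not reflect GT-PCQA's actual configuration.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both as nested lists."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in
    b: trainable factor, shape d_out x r
    a: trainable factor, shape r x d_in
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable values in B and A combined."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0], [2.0]]   # d_out x r with r = 1
    a = [[3.0, 4.0]]     # r x d_in
    print(lora_effective_weight(w, b, a, alpha=2.0))
    print(lora_trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

The parameter count makes the efficiency argument concrete: for a square 4096-wide layer at rank 8, the two factors hold far fewer values than the full weight matrix.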

263. ❌ Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

Authors: Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, Jianbing Shen Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14948v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

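Scoring rationale: The paper focuses on world models for autonomous driving and proposes the WorldDrive framework, which unifies vision and motion representations to bridge scene generation and planning. It is highly relevant (10 points) to the keyword "World Models AND General World Models", because its core contribution is a Trajectory-aware Driving World Model and it emphasizes the use of world models in autonomous driving. However, the paper does not involve large language models (LLMs), methodological innovations in deep learning (such as MoE, scaling laws, or training methods), AI for Science, or any other keyword, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the disconnect between visual scene generation and motion planning in autonomous driving. It proposes the WorldDrive framework, which unifies vision and motion representations and achieves leading planning performance together with high-fidelity action-controlled video generation.

Abstract translation

End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential for learning rich representations by predicting how a driving scene will evolve. However, existing driving world models focus mainly on visual scene representation, and their motion representation is not explicitly designed to be shared with and inherited by the planner, which leaves a gap between the optimization objective of visual scene generation and the requirements of precise motion planning. This paper presents WorldDrive, a holistic framework that couples scene generation with real-time planning by unifying vision and motion representations. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, so that diverse and plausible future scenes can be generated for a specific trajectory. We then transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring that the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction among the motion representation, the visual representation, and ego status is enough to generate high-quality multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representations from the frozen world model to evaluate and select the optimal trajectory in real time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks show that WorldDrive achieves leading planning performance among vision-only methods while retaining high-fidelity action-controlled video generation, providing strong evidence that unifying vision and motion representations is effective for robust autonomous driving.

Abstract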

End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model’s foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.

Keywords: autonomous driving, world model, scene generation, motion planning, trajectory-aware, vision representation, motion representation, real-time planning

264. ❌ FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

Authors: Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu, Yufei Ma, Muyang Sun, Heyu Si, Qi Guo Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

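Scoring rationale: FAR-Drive is a closed-loop video generation framework for autonomous driving that uses a multi-view diffusion transformer and an autoregressive training strategy to address long-horizon consistency and low-latency inference. All keywords relate to large language models (LLMs), whereas this paper studies video generation and autonomous driving simulation and does not involve LLM techniques, training methods, alignment, reasoning, agent systems, or AI-for-science applications. It has only a weak link (5 points) to "Speculative Decoding OR Inference Acceleration", because the paper mentions system-level efficiency optimizations and inference acceleration, though not LLM-specific acceleration techniques. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes the FAR-Drive framework, which uses a multi-view diffusion transformer and a two-stage training strategy to tackle long-horizon consistency, autoregressive degradation, and low-latency inference in closed-loop autonomous driving simulation. It achieves state-of-the-art performance on the nuScenes dataset while maintaining sub-second latency.

Abstract translation

Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable, interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, but most operate in open-loop settings and cannot support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and meeting low-latency inference constraints. This work proposes FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset show that the method achieves the best performance among existing closed-loop autonomous driving simulation approaches while maintaining sub-second latency on a single GPU.

Abstract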

Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.

Keywords: autonomous driving, video generation, closed-loop simulation, multi-view diffusion transformer, autoregressive training, long-horizon consistency, inference acceleration, nuScenes dataset

265. ❌ Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework

Authors: Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14936v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

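Scoring rationale: The paper studies an interactive framework for text-to-image diffusion models, focusing on visual feedback and prompt reconstruction. It does not involve large language models, innovations in core deep-learning techniques, or scientific applications. Every keyword targets large language models and related techniques, so none is relevant to the paper's diffusion-model and computer-vision subject matter.

!!! tip deepseek-chat TL;DR

To address the difficulty users have in expressing visual intent precisely in text-to-image generation, the paper proposes RFD, a training-free, model-agnostic interactive framework that uses visual feedback and probabilistic sampling to significantly improve preference alignment.

Abstract (Chinese translation)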

基于扩散模型的文生图技术已取得显著成功。然而，用户通常具有明确的视觉意图，却难以用语言精确表达，导致提示词模糊和生成图像与意图错位。现有方法难以弥合这一鸿沟，通常依赖于高负荷的文本对话、不透明的黑盒推断或昂贵的微调，无法同时实现低认知负荷、可解释的偏好推断，并保持免训练和模型无关性。为此，我们提出RFD，一种交互式框架，将信息检索中的相关反馈机制适配至扩散模型。在RFD中，用户以隐式的多选视觉反馈替代显式的文本对话，以最小化认知负荷，轻松表达复杂、多维的偏好。为将反馈转化为精确的生成指导，我们构建了专家策划的特征知识库，并引入基于信息论的加权累积偏好分析。这种白盒方法从当前轮次反馈中计算偏好并增量累积，避免了历史交互的简单拼接，防止了长上下文导致的推断退化。此外，RFD采用概率采样机制进行提示词重构，以平衡利用与探索，防止输出同质化。关键的是，RFD完全在外部文本空间中运行，使其作为通用的即插即用方案，严格保持免训练和模型无关性。大量实验表明，RFD能有效捕捉用户的真实视觉意图，在偏好对齐方面显著优于基线方法。

Abstract

Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load, interpretable preference inference, and remain training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user’s true visual intent, significantly outperforming baselines in preference alignment.

Keywords: text-to-image generation, diffusion models, relevance feedback, interactive framework, visual feedback, preference alignment, training-free, model-agnostic
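
The RFD abstract describes preferences that are computed from the current round, accumulated incrementally, and then sampled probabilistically to rebuild the prompt. The sketch below illustrates that general shape only. The additive update, the softmax weighting, the `temperature` parameter, and all function names are assumptions for illustration; the abstract does not specify RFD's actual information-theoretic weighting or its sampling rule.

```python
import math
import random


def accumulate(prefs, round_scores, round_weight=1.0):
    """Return a new preference table with this round's feature scores added in.

    Only the running totals are kept, so earlier rounds are never re-read.
    """
    updated = dict(prefs)
    for feature, score in round_scores.items():
        updated[feature] = updated.get(feature, 0.0) + round_weight * score
    return updated


def softmax(prefs, temperature=1.0):
    """Convert accumulated scores into sampling probabilities (assumed rule)."""
    peak = max(prefs.values())
    exps = {f: math.exp((s - peak) / temperature) for f, s in prefs.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}


def sample_features(prefs, k, rng=None, temperature=1.0):
    """Draw up to k distinct features; high scores are likelier but not certain."""
    rng = rng or random.Random()
    pool = softmax(prefs, temperature)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


def rebuild_prompt(base_prompt, features):
    """Append the sampled feature phrases to the user's base prompt."""
    if not features:
        return base_prompt
    return base_prompt + ", " + ", ".join(features)
```

Sampling instead of always taking the top-scored features is what keeps some exploration in the loop: a lower-scored feature still has a chance of appearing in the next prompt.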

266. ❌ Video-CoE: Reinforcing Video Event Prediction via Chain of Events

Authors: Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

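Scoring rationale: The paper studies multimodal large language models (MLLMs) for video event prediction and proposes the Chain of Events (CoE) paradigm to strengthen logical reasoning. The highly relevant keywords are: 1) 'Large Language Models' (the paper uses MLLMs, which fall under large models); 2) 'Chain of Thought' (CoE closely resembles chain-of-thought reasoning, since both improve performance through multi-step reasoning); 3) 'System 2 Thinking' (CoE aims to strengthen in-depth reasoning, which relates to System 2 thinking). 'SFT' is partly relevant (the paper mentions training protocols), while most other keywords (MoE, quantization, RAG, and so on) are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the weak logical reasoning of multimodal large language models on video event prediction, the paper proposes the Chain of Events paradigm, which builds temporal event chains to strengthen the model's handling of visual content and logical relations, and reports state-of-the-art results on public benchmarks.

Abstract (Chinese translation)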

尽管多模态大语言模型（MLLMs）在各种视频任务中的应用已取得进展，但视频事件预测（VEP）领域仍相对缺乏深入探索。VEP要求模型对视频进行细粒度的时间建模，并建立视频与未来事件之间的逻辑关联，而当前MLLMs在此方面仍面临挑战。本研究首先对当前主流MLLMs在VEP任务上进行了全面评估，揭示了其预测不准确的原因，包括对未来事件预测的逻辑推理能力不足以及对视觉信息利用不充分。为解决这些问题，我们提出了事件链（Chain of Events, CoE）范式，该范式通过构建时序事件链，隐式地促使MLLM聚焦于视频内容及其与未来事件间的逻辑联系，并借助多种训练协议激发模型的推理能力。在公开基准测试上的实验结果表明，我们的方法超越了当前领先的开源及商业MLLMs，在VEP任务上取得了新的最优性能。代码与模型即将公开发布。

Abstract

Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information. To address these challenges, we propose Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model’s reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Codes and models will be released soon.

Keywords: Video Event Prediction, Multimodal Large Language Models, Chain of Events, Logical Reasoning, Temporal Modeling, Visual Information Utilization, State-of-the-art Performance, Benchmark Evaluation

267. ❌ Workflow-Aware Structured Layer Decomposition for Illustration Production

Authors: Tianyu Zhang, Dongchi Li, Keiichi Sawada, Haoran Xie Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

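Scoring rationale: The paper focuses on image editing and decomposition in computer vision, specifically generative image editing for anime illustrations. Its core contribution is a workflow-aware structured layer decomposition framework involving layer semantic embeddings, layer-wise losses, and construction of a high-quality dataset. All scoring keywords concern large language models, core deep-learning techniques, AI for Science, and similar topics, whereas this paper's subject (image layer decomposition and anime illustration editing) is unrelated to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a workflow-aware structured layer decomposition framework for anime illustration production. Using layer semantic embeddings and layer-wise losses, it decomposes illustrations accurately and coherently into line art, flat color, shadow, and highlight layers, and it builds a high-quality dataset to support downstream editing tasks.

Abstract (Chinese translation)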

近期生成式图像编辑方法采用分层表征来缓解栅格图像的固有纠缠性并提升可控性，其通常依赖于基于对象的分割策略。然而，此类方法可能难以捕捉人类创作图像（如动漫插图）的结构化与风格化特性。为解决此问题，我们提出一种面向动漫作品插图制作流程的工作流感知结构化分层分解框架。受动漫制作创作流程启发，我们的方法将插图分解为具有语义意义的生产层级，包括线稿、平涂色块、阴影与高光。为解耦所有这些层级，我们引入了轻量级的层级语义嵌入，为每一层提供具体的任务指导。此外，我们采用了一组分层的损失函数来监督各独立层级的训练过程。为克服真实分层标注数据的缺乏，我们构建了一个模拟标准动漫制作流程的高质量插图数据集。实验表明，使用我们的方法能够实现精确且视觉一致的分层分解。我们相信，所得的分层表征能够进一步支持诸如重新着色、纹理嵌入等下游任务，从而助力内容创作与插图编辑。代码发布于：https://github.com/zty0304/Anime-layer-decomposition

Abstract

Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulated the standard anime production workflow. Experiments demonstrate that the accurate and visually coherent layer decompositions were achieved by using our method. We believe that the resulting layered representation further enables downstream tasks such as recoloring and embedding texture, supporting content creation, and illustration editing. Code is available at: https://github.com/zty0304/Anime-layer-decomposition

Keywords: generative image editing, layer decomposition, anime illustration, workflow-aware, semantic embeddings, layer-wise losses, illustration dataset, content creation

268. ❌ $\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling

Authors: Huanjing Yue, Dawei Li, Shaoxiong Tu, Jingyu Yang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

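Scoring rationale: The paper focuses on HDR video reconstruction in computer vision and proposes a two-stage framework based on a flow adapter and physical motion modeling. It does not touch large language models, innovations in core deep-learning techniques, AI for Science, or any other keyword area, so every keyword is unrelated to its subject.

!!! tip deepseek-chat TL;DR

The paper proposes F²HDR, a two-stage framework that uses a flow adapter and physical motion modeling to address inter-frame alignment and detail recovery when reconstructing HDR video of dynamic scenes, and reports state-of-the-art performance on real-world HDR video benchmarks.

Abstract (Chinese translation)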

从交替曝光的低动态范围（LDR）帧序列中重建高动态范围（HDR）视频仍然极具挑战性，尤其是在动态场景下，跨曝光不一致性与复杂运动使得帧间对齐困难，从而导致重影和细节丢失。现有方法常面临对齐不准确、特征聚合欠佳，以及在运动主导区域重建质量下降的问题。为解决这些挑战，我们提出了 $\text{F}^2\text{HDR}$，一个两阶段的HDR视频重建框架，能够鲁棒地感知帧间运动并在复杂动态场景中恢复精细细节。该框架集成了一个流适配器，用于调整通用光流以实现鲁棒的跨曝光对齐；一个物理运动建模模块，用于识别显著运动区域；以及一个运动感知的细化网络，在聚合互补信息的同时去除重影和噪声。大量实验表明，$\text{F}^2\text{HDR}$ 在真实世界的HDR视频基准测试中取得了最先进的性能，能够在大幅度运动和曝光变化下生成无重影、高保真的结果。

Abstract

Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that $\text{F}^2\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.

Keywords: HDR video reconstruction, optical flow adapter, physical motion modeling, ghosting removal, dynamic scenes, two-stage framework, motion-aware refinement

269. ❌ EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing

Authors: Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, Ke Gu, Jian Zhang, Shusong Xu, Jinwei Chen, Bo Li, Guangtao Zhai Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

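Scoring rationale: The paper's core is the use of a multimodal large language model (MLLM) for image-editing evaluation and reinforcement-learning optimization, which makes it highly relevant to several large-model keywords: 1) it uses an MLLM (a kind of LLM) as the evaluation model (8); 2) it involves alignment with human preferences (10); 3) it uses reinforcement learning (RLHF) to optimize image-editing models (10); 4) it includes a supervised fine-tuning (SFT) stage (8); 5) the scale and quality of the dataset relate to scaling laws (5); 6) image-editing evaluation relates to factuality and truthfulness (5). Other keywords such as MoE, SLMs, RAG, and inference acceleration are not involved (0).

!!! tip deepseek-chat TL;DR

The paper introduces EditHF-1M, a million-scale human preference dataset for image editing, builds the EditHF evaluation model on a multimodal large language model, and uses it as a reward signal to optimize text-guided image-editing models through reinforcement learning, significantly improving editing performance.

Abstract (Chinese translation)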

近期文本引导图像编辑模型取得了显著进展，但许多编辑后的图像仍存在伪影、非预期编辑、内容不美观等问题。尽管已有部分基准和方法被提出用于评估编辑后的图像，可扩展的评估模型仍然缺乏，这限制了面向图像编辑的人类反馈奖励模型的发展。为应对这些挑战，我们首先引入了 EditHF-1M——一个百万规模的图像编辑数据集，包含超过2900万个人类偏好对和14.8万个人类平均意见评分，两者均从三个维度进行评估，即视觉质量、指令对齐和属性保持。基于EditHF-1M，我们提出了 EditHF，一个基于多模态大语言模型的评估模型，用于从图像编辑中提供与人类对齐的反馈。最后，我们引入了 EditHF-Reward，它利用EditHF作为奖励信号，通过强化学习优化文本引导图像编辑模型。大量实验表明，EditHF在人类偏好对齐方面表现优异，并在其他数据集上展现出强大的泛化能力。此外，我们使用EditHF-Reward对Qwen-Image-Edit模型进行微调，实现了显著的性能提升，这证明了EditHF作为奖励模型规模化提升图像编辑能力的作用。数据集与代码均将发布于我们的GitHub仓库：https://github.com/IntMeGroup/EditHF。

Abstract

Recent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected editings, unaesthetic contents. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address the challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback from image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize the text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune the Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale-up the image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF.

Keywords: image editing, human preference feedback, multimodal large language model, reinforcement learning, reward model, text-guided image editing, evaluation model, dataset scaling
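
Preference pairs such as those in EditHF-1M are commonly turned into a training signal for a reward model with a Bradley-Terry pairwise loss. The abstract does not say which objective EditHF is trained with, so the loss below is an assumed, generic choice, and the function names are hypothetical. The scores stand for whatever scalar the evaluation model assigns to an edited image.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred edit outranks the rejected one.

    Equals -log(sigmoid(margin)), written in a form that avoids overflow
    when the margin is large in either direction.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pair_loss(pairs):
    """Average the pairwise loss over (score_preferred, score_rejected) tuples."""
    pairs = list(pairs)
    return sum(bradley_terry_loss(p, r) for p, r in pairs) / len(pairs)
```

The loss shrinks toward zero as the model scores the human-preferred edit well above the rejected one, and grows roughly linearly when the ordering is reversed.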

270. ❌ HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Authors: Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

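Scoring rationale: The paper's core subject is the use of large language models (GPT 5.4 Pro) for mathematical discovery, which falls under 'AI for Science', so that keyword is highly relevant (10). It explicitly studies the mathematical reasoning ability of LLMs, so 'Large Language Models' is highly relevant (10). Solving the problems requires in-depth reasoning and chains of thought, so 'Chain of Thought' and 'System 2 Thinking' are fairly relevant (8 each). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper asks whether large language models can solve open mathematical problems and introduces the HorizonMath benchmark and evaluation framework. It finds two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results (pending expert review).

Abstract (Chinese translation)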

人工智能能否在重要且未解决的数学问题上取得进展？当前的大型语言模型已具备复杂的数学与科学推理能力，但其能否开展创新性研究仍存在广泛争议且探索不足。我们推出HorizonMath基准测试，该测试涵盖计算数学与应用数学中8个领域的100多个以未解决问题为主的题目，并配套一个用于自动验证的开源评估框架。本基准聚焦于一类发现困难、需要实质性数学洞察力，但验证过程计算高效且简单的问题。由于这些问题的解决方案尚未可知，HorizonMath能有效避免数据污染，而目前大多数先进模型在该基准上的得分接近0%。现有的研究级基准测试则依赖于形式化证明验证或人工评审，这两种方式均难以规模化扩展。借助该平台，我们发现GPT 5.4 Pro针对两个问题提出的解决方案优于已发表的最佳结果，可能代表了创新性贡献（有待专家评审）。我们将HorizonMath作为一项开放挑战和持续发展的社区资源公开发布，其中针对未解问题类别的正确解决方案可能构成数学文献中的新成果。

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.

Keywords: Large Language Models, Mathematical Discovery, Benchmark, Automatic Verification, Unsolved Problems, AI for Science, Mathematical Reasoning, GPT 5.4 Pro
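
The benchmark targets problems where finding a solution is hard but checking one is cheap. A small illustration of that asymmetry, using Golomb rulers as a stand-in problem: the abstract does not say that Golomb rulers appear in HorizonMath, and the function names and the "shorter than the best known length" criterion are assumptions chosen for the example. Searching for short rulers is expensive, while verifying a proposed ruler only needs a pass over its pairwise differences.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """True if every pairwise difference between marks is distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def improves_on(candidate, order, best_known_length):
    """Check a claimed improvement: a valid ruler with `order` distinct marks
    whose length is strictly shorter than the best known length."""
    marks = sorted(set(candidate))
    if order < 2 or len(marks) != order or len(marks) != len(candidate):
        return False
    return is_golomb_ruler(marks) and (marks[-1] - marks[0]) < best_known_length
```

A verifier of this kind runs in time quadratic in the number of marks, so a proposed answer can be accepted or rejected automatically without a formal proof or a human reviewer.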

271. ❌ SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval

Authors: Jesper Derehag, Carlos Calva, Timmy Ghiurau Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

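Scoring rationale: The paper studies conversational memory retrieval, optimizing retrieval-augmented generation (RAG) with LLMs and handling long contexts. It is highly relevant to 'Retrieval-Augmented Generation' (10) because its core is a retrieval-augmented generation system; relevant to 'Large Language Models' (8) because the system is built around LLMs; and relevant to 'Context Window Extension' (8) because it processes long conversation histories and optimizes the token budget. Other keywords such as MoE, SFT, and RLHF are not involved and score 0.

!!! tip deepseek-chat TL;DR

The paper shows how intelligent ranking, rather than structuring, can optimize conversational memory retrieval. Its SmartSearch system outperforms existing memory systems on two benchmarks while using 8.5x fewer tokens than full-context baselines (roughly an 88% reduction).

Abstract (Chinese translation)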

近期对话记忆系统在信息摄入阶段大量依赖基于大语言模型的结构化处理，并在查询阶段采用学习式检索策略。我们证明这两者均非必需。SmartSearch采用完全确定性的流程从原始非结构化对话历史中进行检索：通过NER加权的子字符串匹配实现召回，基于规则的多跳实体发现进行扩展，以及融合CrossEncoder与ColBERT的排序阶段——这是唯一的学习组件——在CPU上以约650毫秒运行。在两个基准测试上的理想化分析揭示了编译瓶颈：检索召回率达到98.6%，但若缺乏智能排序，在令牌预算截断后仅有22.5%的关键证据得以保留。通过分数自适应截断技术且无需针对各数据集调优，SmartSearch在LoCoMo基准上达到93.5%的准确率，在LongMemEval-S基准上达到88.4%，在相同评估协议下超越两个基准测试中所有已知记忆系统，同时比全上下文基线少消耗8.5倍的令牌量。

Abstract

Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves from raw, unstructured conversation history using a fully deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage – the only learned component – running on CPU in ~650ms. Oracle analysis on two benchmarks identifies a compilation bottleneck: retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation and no per-dataset tuning, SmartSearch achieves 93.5% on LoCoMo and 88.4% on LongMemEval-S, exceeding all known memory systems under the same evaluation protocol on both benchmarks while using 8.5x fewer tokens than full-context baselines.

Keywords: conversational memory retrieval, LLM-based structuring, retrieval-augmented generation, ranking optimization, token budget efficiency, CrossEncoder+ColBERT rank fusion, long context processing, deterministic pipeline
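
The abstract's central point is that recall is already high and the loss happens when ranked passages are cut to fit the token budget. The sketch below shows one way a rank-fusion stage followed by score-adaptive truncation could look. Reciprocal rank fusion is a swapped-in, well-known fusion rule, since the abstract does not state how the CrossEncoder and ColBERT rankings are combined. The `min_ratio` cutoff, the whitespace token count, and the function names are likewise assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of passage ids into one list of (id, score) pairs.

    Each list contributes 1 / (k + rank) per passage; k=60 is the value
    conventionally used with this rule. Ties are broken by passage id.
    """
    fused = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


def truncate_to_budget(scored, passages, token_budget, min_ratio=0.5):
    """Keep passages in fused order until the budget or the score cutoff is hit.

    scored:    list of (passage_id, score), best first
    passages:  dict mapping passage_id to its text
    min_ratio: stop once a score falls below this fraction of the top score
    """
    if not scored:
        return []
    top_score = scored[0][1]
    kept, used = [], 0
    for pid, score in scored:
        if score < min_ratio * top_score:
            break  # remaining passages are too weak relative to the best one
        cost = len(passages[pid].split())  # crude whitespace token count
        if used + cost > token_budget:
            break  # next passage would exceed the token budget
        kept.append(pid)
        used += cost
    return kept
```

Because the cutoff is relative to the top score, a query with one clearly dominant passage keeps a short context, while a query with many comparably scored passages fills more of the budget.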

272. ❌ Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise

Authors: Naoto Tani, Futoshi Futami Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

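Scoring rationale: This paper studies linear contextual bandits, a topic in classical reinforcement learning and online learning, and focuses on algorithmic robustness and computational efficiency under adversarial corruption and heavy-tailed noise. It does not involve large language models, deep learning, AI-for-science applications, or any technique named in the scoring keywords (MoE, RLHF, RAG, quantization, and so on). All keywords concern large-model techniques, training methods, inference optimization, or AI applications, whereas this is purely classical machine-learning algorithm research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies linear contextual bandits under adversarial corruption and heavy-tailed noise. It proposes a computationally efficient algorithm based on online mirror descent that achieves sublinear regret without prior knowledge of the noise moment bound or the total amount of corruption.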

Abstract translation (Chinese)

本研究针对存在对抗性干扰和重尾噪声的线性上下文赌博机问题展开，其中噪声的$(1+ε)$阶矩有限（$ε\in (0,1]$）。现有同时处理对抗性干扰和重尾噪声的研究依赖于有限方差（即有限二阶矩）假设，且存在计算效率低下的问题。我们提出一种基于在线镜像下降的计算高效算法，该算法能够同时对对抗性干扰和重尾噪声保持鲁棒性。现有算法的计算成本为$\mathcal{O}(t\log T)$，而我们的算法将每轮计算成本降低至$\mathcal{O}(1)$。我们建立了一个加性遗憾界，其中一项取决于噪声的$(1+ε)$阶矩界，另一项取决于干扰总量。特别地，当$ε= 1$时，我们的结果恢复了有限方差假设下的现有保证。在没有干扰的情况下，其结果与重尾噪声线性上下文赌博机的最佳已知速率相匹配。此外，该算法无需预先知道噪声矩界或干扰总量，仍能保证次线性遗憾。

Abstract

We study linear contextual bandits under adversarial corruption and heavy-tailed noise with finite $(1+ε)$-th moments for some $ε\in (0,1]$. Existing work that addresses both adversarial corruption and heavy-tailed noise relies on a finite variance (i.e., finite second-moment) assumption and suffers from computational inefficiency. We propose a computationally efficient algorithm based on online mirror descent that achieves robustness to both adversarial corruption and heavy-tailed noise. While the existing algorithm incurs $\mathcal{O}(t\log T)$ computational cost, our algorithm reduces this to $\mathcal{O}(1)$ per round. We establish an additive regret bound consisting of a term depending on the $(1+ε)$-moment bound of the noise and a term depending on the total amount of corruption. In particular, when $ε= 1$, our result recovers existing guarantees under finite-variance assumptions. When no corruption is present, it matches the best-known rates for linear contextual bandits with heavy-tailed noise. Moreover, the algorithm requires no prior knowledge of the noise moment bound or the total amount of corruption and still guarantees sublinear regret.

Keywords: linear contextual bandits, adversarial corruption, heavy-tailed noise, online mirror descent, computational efficiency, regret bound, robust algorithm, finite moments

273. ❌ Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions

Authors: Quoc Tran-Dinh, Nghia Nguyen-Trung Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

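Scoring rationale: This paper studies variance-reduction techniques for a stochastic optimization algorithm (the forward-reflected-backward splitting method), which belongs to numerical optimization and the optimization theory of machine learning. It is unrelated to nearly all keywords (which cover large-model architecture, training, inference, alignment, and applications), so those score 0. The only plausible link is the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics": the paper runs experiments on policy evaluation in reinforcement learning, which counts as an application of AI in science and engineering, but this is not its core content, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

For possibly nonmonotone stochastic composite inclusions, this paper proposes new variance-reduced forward-reflected-backward splitting methods, designs both unbiased and biased estimators, proves convergence rates and oracle complexities, and validates them experimentally on AUC optimization for classification and policy evaluation in reinforcement learning.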

Abstract translation (Chinese)

本文针对求解一类可能非单调的随机复合包含问题，提出了前向-反射-后向分裂（FRBS）方法的新方差缩减技术。与小型批处理等无偏估计器不同，开发随机有偏变体面临根本性的技术挑战，此前从未在包含问题与不动点问题中得到应用。我们通过设计一个能同时处理无偏与有偏估计器的新框架填补了这一空白。核心思想是为前向-反射方向构建随机方差缩减估计器，并利用其进行迭代更新。首先，我们提出一类无偏方差缩减估计器，证明递增小批量随机梯度下降、无循环随机方差缩减梯度及SAGA估计器均属于此类。针对这些无偏估计器，我们建立了期望残差范数平方的$\mathcal{O}(1/k)$最优迭代收敛率，并证明迭代序列以概率1收敛到解。由此得出：当采用无循环随机方差缩减梯度或SAGA估计器时，在$n$有限和与期望设置下的最优Oracle复杂度分别为$\mathcal{O}(n^{2/3}ε^{-2})$与$\mathcal{O}(ε^{-10/3})$，其中$ε$为目标精度。其次，我们为前向-反射方向引入了一类新的有偏方差缩减估计器，其特例包括SARAH、混合随机梯度下降及混合随机方差缩减梯度。虽然这些有偏估计器仍保持相同收敛率，但其在$n$有限和与期望设置下的Oracle复杂度分别为$\mathcal{O}(n^{3/4}ε^{-2})$与$\mathcal{O}(ε^{-5})$。最后，我们在不平衡分类的AUC优化与强化学习策略评估问题上进行了两组数值实验。

Abstract

This paper develops new variance-reduction techniques for the forward-reflected-backward splitting (FRBS) method to solve a class of possibly nonmonotone stochastic composite inclusions. Unlike unbiased estimators such as mini-batching, developing stochastic biased variants faces a fundamental technical challenge and has not been utilized before for inclusions and fixed-point problems. We fill this gap by designing a new framework that can handle both unbiased and biased estimators. Our main idea is to construct stochastic variance-reduced estimators for the forward-reflected direction and use them to perform iterate updates. First, we propose a class of unbiased variance-reduced estimators and show that increasing mini-batch SGD, loopless-SVRG, and SAGA estimators fall within this class. For these unbiased estimators, we establish a $\mathcal{O}(1/k)$ best-iterate convergence rate for the expected squared residual norm, together with almost-sure convergence of the iterate sequence to a solution. Consequently, we prove that the best oracle complexities for the $n$-finite-sum and expectation settings are $\mathcal{O}(n^{2/3}ε^{-2})$ and $\mathcal{O}(ε^{-10/3})$, respectively, when employing loopless-SVRG or SAGA, where $ε$ is a desired accuracy. Second, we introduce a new class of biased variance-reduced estimators for the forward-reflected direction, which includes SARAH, Hybrid SGD, and Hybrid SVRG as special instances. While the convergence rates remain valid for these biased estimators, the resulting oracle complexities are $\mathcal{O}(n^{3/4}ε^{-2})$ and $\mathcal{O}(ε^{-5})$ for the $n$-finite-sum and expectation settings, respectively. Finally, we conduct two numerical experiments on AUC optimization for imbalanced classification and policy evaluation in reinforcement learning.

Keywords: variance-reduction, forward-reflected-backward splitting, stochastic composite inclusions, unbiased estimators, biased estimators, convergence analysis, AUC optimization, reinforcement learning policy evaluation
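
To make the base method concrete, here is a minimal deterministic sketch of the forward-reflected-backward splitting iteration in its commonly stated form, $x_{k+1} = J_{\eta T}\big(x_k - \eta(2F(x_k) - F(x_{k-1}))\big)$. It is not the paper's variance-reduced algorithm. The toy operator (a bilinear saddle problem), the box constraint, the step size, and all function names are assumptions chosen for illustration.

```python
import math

def forward_op(z):
    """Monotone, 1-Lipschitz operator of the toy saddle problem min_x max_y x*y."""
    x, y = z
    return (y, -x)

def resolvent_box(z, lo=-1.0, hi=1.0):
    """Resolvent of the normal cone of the box [lo, hi]^2, i.e. projection onto it."""
    return tuple(min(max(v, lo), hi) for v in z)

def frbs(z0, step=0.3, iters=500):
    """Deterministic FRBS: z_next = J(z - step * (2 F(z) - F(z_prev)))."""
    z_prev, z = z0, z0
    for _ in range(iters):
        f_cur, f_prev = forward_op(z), forward_op(z_prev)
        # Forward-reflected direction: 2 F(z_k) - F(z_{k-1}).
        direction = tuple(2.0 * a - b for a, b in zip(f_cur, f_prev))
        z_next = resolvent_box(tuple(v - step * d for v, d in zip(z, direction)))
        z_prev, z = z, z_next
    return z

solution = frbs((0.5, 0.5))
residual = math.hypot(*solution)  # distance to the unique solution (0, 0)
```

The step size 0.3 sits below the usual $1/(2L)$ bound for a 1-Lipschitz operator. A stochastic variant in the spirit of the abstract would replace the exact `direction` with a variance-reduced estimate of the same quantity.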

274. ❌ Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

Authors: Jay Sarkar, Vamsi Pavan Rayaprolu, Abhijeet Bhalerao Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

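Scoring rationale: The paper focuses on machine-learning modeling and co-design of storage systems (SSDs), in particular using an interpretable ML algorithm to optimize the error-management subsystem, and is unrelated to most large-model and deep-learning keywords. It has some connection only to "Mechanistic Interpretability OR Explainable AI" (5 points), because the paper mentions a "statistically interpretable and intuitively explainable ML algorithm", but this is a methodological tool rather than the core focus.

!!! tip deepseek-chat TL;DR

This paper proposes an interpretable machine-learning methodology for co-designing the error-management subsystem of solid-state drives (SSDs) together with their memory components, to optimize reliability and performance. The framework evaluates several thousand datacenter SSDs and enables data-driven architectural design.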

Abstract translation (Chinese)

基于NAND或新兴存储器件（SSD）的固态存储架构，其根本设计与优化同时着眼于可靠性与性能。为实现这两个并行目标，需要对存储部件与固件架构的错误管理（EM）算法进行协同设计，以适应高密度与高性能扩展的存储技术。本文提出一种面向系统的机器学习（ML）方法及建模框架，用于协同设计EM子系统，同时考量SSD技术底层存储部件在硅工艺微缩过程中固有的自然变异。该建模通过采用统计可解释且直观可理解的ML算法，分析了NAND存储部件与EM算法在闪存转换抽象层中，与综合测试集（侧重压力测试及JEDEC标准）及仿真负载（YCSB及类似负载）的交互作用。这一可推广的协同设计框架评估了涵盖多代存储技术的数千个数据中心SSD。因此，该建模框架能够实现持续、整体、数据驱动的设计，以推动代际架构进步。我们还进一步证明，该框架支持对EM-负载领域进行表征学习，从而在广泛负载范围内增强架构设计空间的探索能力。

Abstract

Solid-state storage architectures based on NAND or emerging memory devices (SSD) are fundamentally architected and optimized for both reliability and performance. Achieving these simultaneous goals requires co-design of memory components with firmware-architected Error Management (EM) algorithms for density- and performance-scaled memory technologies. We describe a Machine Learning (ML) for systems methodology and modeling for co-designing the EM subsystem together with the natural variance inherent to the scaled silicon process of memory components underlying SSD technology. The modeling analyzes NAND memory components and EM algorithms interacting with a comprehensive suite of synthetic (stress-focused and JEDEC) and emulation (YCSB and similar) workloads across Flash Translation abstraction layers, by leveraging a statistically interpretable and intuitively explainable ML algorithm. The generalizable co-design framework evaluates several thousand datacenter SSDs spanning multiple generations of memory and storage technology. Consequently, the modeling framework enables continuous, holistic, data-driven design towards generational architectural advancements. We additionally demonstrate that the framework enables Representation Learning of the EM-workload domain for enhancement of the architectural design-space across a broad spectrum of workloads.

Keywords: Solid-state storage, Error Management, Machine Learning, Interpretable models, Co-design, SSD architecture, Workload awareness, Representation Learning

275. ❌ Mamba-3: Improved Sequence Modeling using State Space Principles

Authors: Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

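Scoring rationale: Mamba-3 focuses on improving sequence modeling. Taking the state space model (SSM) viewpoint, it proposes three core methodological improvements aimed at the inference efficiency and quality of LLMs. The paper is highly relevant to "Large Language Models" (10 points), since its central goal is to improve LLM inference efficiency. It is highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (10 points), since it directly targets the quadratic compute and linear memory of Transformers and improves on linear models. It is also highly relevant to "Speculative Decoding OR Inference Acceleration" (10 points), since it emphasizes inference efficiency and shows that Mamba-3 advances the performance-efficiency Pareto frontier. Other keywords such as MoE, SLMs, alignment, and RAG are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

Taking the state space model (SSM) viewpoint, Mamba-3 introduces three core methodological improvements that address the quadratic compute and linear memory of Transformer inference, significantly improving inference efficiency while preserving model quality. At the 1.5B scale it improves downstream accuracy and reduces the required state size.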

Abstract translation (Chinese)

扩展推理时计算已成为提升大语言模型（LLM）性能的关键驱动力，使得推理效率与模型质量共同成为模型设计的核心焦点。尽管当前基于Transformer的模型展现出强大的模型质量，但其二次计算复杂度和线性内存需求导致推理成本高昂。这推动了亚二次模型的发展，此类模型降低了线性计算需求并实现了恒定内存占用。然而，许多近期提出的线性模型为追求算法效率而牺牲了模型质量与能力，在状态追踪等任务上表现不佳。此外，其理论上的线性推理在实践中仍存在硬件效率低下的问题。基于推理优先的视角，我们借鉴线性模型的状态空间模型（SSM）观点，引入了三项核心方法改进。我们结合了：（1）源自SSM离散化的更具表达力的递归结构，（2）支持更丰富状态追踪的复数值状态更新规则，以及（3）多输入多输出（MIMO）架构，在不增加解码延迟的前提下提升模型性能。结合架构优化，我们的Mamba-3模型在检索、状态追踪及下游语言建模任务上均取得显著提升。在15亿参数规模下，Mamba-3相比次优模型（Gated DeltaNet）平均下游准确率提升0.6个百分点，其MIMO变体进一步将准确率再提升1.2个百分点，总计提升达1.8个百分点。在不同状态规模的实验中，Mamba-3仅使用前代模型一半的状态规模，即可达到与Mamba-2相当的困惑度。评估结果表明，Mamba-3能够推进性能与效率的帕累托前沿。

Abstract

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with Mamba-3’s MIMO variant further improving accuracy by another 1.2 points for a total 1.8 point gain. Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half of its predecessor’s state size. Our evaluations demonstrate Mamba-3’s ability to advance the performance-efficiency Pareto frontier.

Keywords: Mamba-3, State Space Models, Inference Efficiency, Linear Models, Sequence Modeling, Transformer Alternatives, Performance-Efficiency Pareto Frontier, State Tracking
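
As a rough intuition for why a complex-valued state update can help with state tracking, the toy sketch below runs a one-dimensional linear recurrence whose transition is an input-dependent rotation. It counts ones modulo 3 using a single complex number of memory. This is not Mamba-3's actual recurrence: the update rule, the decoding step, and the function name are assumptions made purely for illustration.

```python
import cmath
import math

def track_count_mod3(bits):
    """Run a one-dimensional complex linear recurrence over a bit sequence.

    h_t = a(x_t) * h_{t-1}, with a(x) = exp(2*pi*i*x/3): each 1 rotates the
    state by 120 degrees and each 0 leaves it unchanged. Memory use is one
    complex number regardless of sequence length.
    """
    h = 1 + 0j
    for x in bits:
        h *= cmath.exp(2j * math.pi * x / 3)
    # Decode the phase back into a residue class in {0, 1, 2}.
    return round(cmath.phase(h) / (2 * math.pi / 3)) % 3

bits = [1, 0, 1, 1, 0, 1, 1]  # five ones
count_mod3 = track_count_mod3(bits)
```

A scalar transition restricted to real values in [0, 1] can only shrink or hold the state, so it cannot represent this kind of cyclic counter. That contrast is what the sketch is meant to show.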

276. ❌ Predictive Uncertainty in Short-Term PV Forecasting under Missing Data: A Multiple Imputation Approach

Authors: Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

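Scoring rationale: This paper focuses on handling missing data and propagating uncertainty in photovoltaic power forecasting, using traditional statistical methods such as multiple imputation and Rubin's rule. It does not involve large models, the technical principles of deep learning, or a specific AI-for-science application, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

This paper studies how missing data affects predictive uncertainty in short-term photovoltaic power forecasting. By combining stochastic multiple imputation with Rubin's rule, it improves the calibration of prediction intervals while maintaining comparable point-prediction accuracy.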

Abstract translation (Chinese)

光伏发电数据中普遍存在缺失值，但其引发的不确定性并未传递至预测分布中。本研究开发了一个框架，通过将随机多重插补与鲁宾规则相结合，将缺失数据的不确定性纳入短期光伏功率预测。该方法具有模型无关性，可与标准机器学习预测器集成。实证结果表明，忽略缺失数据不确定性会导致预测区间过度狭窄。考虑这种不确定性可在保持相当点预测精度的同时改善区间校准效果。这些结果证明了在数据驱动的光伏预测中传递插补不确定性的重要性。

Abstract

Missing values are common in photovoltaic (PV) power data, yet the uncertainty they induce is not propagated into predictive distributions. We develop a framework that incorporates missing-data uncertainty into short-term PV forecasting by combining stochastic multiple imputation with Rubin’s rule. The approach is model-agnostic and can be integrated with standard machine-learning predictors. Empirical results show that ignoring missing-data uncertainty leads to overly narrow prediction intervals. Accounting for this uncertainty improves interval calibration while maintaining comparable point prediction accuracy. These results demonstrate the importance of propagating imputation uncertainty in data-driven PV forecasting.

Keywords: photovoltaic forecasting, missing data, predictive uncertainty, multiple imputation, Rubin’s rule, prediction intervals, interval calibration, machine learning
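
The pooling step that Rubin's rule refers to can be sketched as follows, using the standard textbook form: average the per-imputation estimates, then add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. The function name and the numbers are hypothetical, and this is not the paper's forecasting pipeline.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m per-imputation point estimates and their variances."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical forecasts (and their variances) from three imputed datasets.
forecasts = [4.0, 5.0, 6.0]
forecast_vars = [0.5, 0.5, 0.5]
pooled_mean, pooled_var = pool_rubin(forecasts, forecast_vars)
```

In this example the pooled variance exceeds the within-imputation variance of 0.5. This matches the abstract's observation that ignoring missing-data uncertainty yields intervals that are too narrow.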

277. ❌ Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Authors: Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli, Gherardo Varando Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

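Scoring rationale: This paper studies estimation methods for staged tree models in statistical modeling, using hierarchical clustering on the probability simplex together with divergence measures, and belongs to traditional statistical machine learning. It does not involve any technique, method, or application related to the keywords (large language models, deep learning, AI for science, and so on), so no keyword is relevant.

!!! tip deepseek-chat TL;DR

This paper proposes a new framework for estimating staged tree models based on hierarchical clustering on the probability simplex, and finds that Total Variation divergence combined with Ward.D2 linkage performs best in model fit, structure recovery, and computational efficiency.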

Abstract translation (Chinese)

阶段树模型通过基于阶段的结构纳入上下文特定依赖性，从而扩展了贝叶斯网络的功能。本研究提出了一种在概率单纯形上利用基于单纯形散度的层次聚类来估计阶段树的新框架。我们系统评估了包括总变差（Total Variation）、海林格（Hellinger）、费希尔（Fisher）和卡尼亚达基斯（Kaniadakis）散度在内的多种距离与散度度量，并结合了Ward.D2、平均（average）、完全（complete）和麦克奎蒂（McQuitty）等不同连接方法。模拟实验表明，总变差散度，特别是与Ward.D2连接方法结合时，能够持续生成具有更优模型拟合度、结构还原能力和计算效率的阶段树。我们使用相对贝叶斯信息准则（BIC）和汉明距离（Hamming distance）评估性能。研究结果显示，尽管后向爬山法（Backward Hill Climbing, BHC）能取得有竞争力的结果，但其计算成本显著更高。相比之下，总变差散度与Ward.D2连接方法的组合在达到相近性能的同时，提供了显著更优的计算效率，使其成为大规模或时间敏感任务中更可行的选择。

Abstract

Staged tree models enhance Bayesian networks by incorporating context-specific dependencies through a stage-based structure. In this study, we present a new framework for estimating staged trees using hierarchical clustering on the probability simplex, utilizing simplex-based divergences. We conduct a thorough evaluation of several distance and divergence metrics, including Total Variation, Hellinger, Fisher, and Kaniadakis, alongside various linkage methods such as Ward.D2, average, complete, and McQuitty. We conduct simulation experiments that reveal that Total Variation, especially when combined with Ward.D2 linkage, consistently produces staged trees with better model fit, structure recovery, and computational efficiency. We assess performance using relative Bayesian Information Criterion (BIC) and Hamming distance. Our findings indicate that although Backward Hill Climbing (BHC) delivers competitive outcomes, it incurs a significantly higher computational cost. On the other hand, Total Variation divergence with Ward.D2 linkage achieves similar performance while providing significantly better computational efficiency, making it a more viable option for large-scale or time-sensitive tasks.

Keywords: staged tree models, hierarchical clustering, probability simplex, divergence metrics, Total Variation, Ward.D2 linkage, computational efficiency, model estimation
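
Two of the divergences named in the abstract are easy to compute directly on probability vectors. The sketch below implements Total Variation and Hellinger distance and finds the first pair an agglomerative procedure would merge. The vertex distributions and helper names are hypothetical, and the Ward.D2 linkage itself is not reproduced here.

```python
import math
from itertools import combinations

def total_variation(p, q):
    """Total Variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical conditional distributions at four vertices of an event tree.
vertices = {
    "v1": (0.70, 0.20, 0.10),
    "v2": (0.68, 0.22, 0.10),
    "v3": (0.10, 0.30, 0.60),
    "v4": (0.15, 0.25, 0.60),
}

def closest_pair(points, dist):
    """First agglomerative step: the two vertices with the smallest distance."""
    return min(combinations(sorted(points), 2),
               key=lambda pair: dist(points[pair[0]], points[pair[1]]))

first_merge = closest_pair(vertices, total_variation)
```

Vertices whose distributions end up in the same cluster would be assigned to the same stage. Swapping `total_variation` for `hellinger` in the call changes only the notion of closeness.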

278. ❌ Bridging Local and Global Knowledge: Cascaded Mixture-of-Experts Learning for Near-Shortest Path Routing

Authors: Yung-Fu Chen, Anish Arora Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

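Scoring rationale: The paper's core contribution is a Cascaded Mixture of Experts (Ca-MoE) architecture for graph routing, which is highly relevant to the keyword "Mixture of Experts OR MoE OR Sparse Models" (10 points). The paper is an application of deep learning to scientific computing, but it focuses specifically on graph neural networks and routing algorithms rather than on large language models or typical AI-for-science scenarios, so all other keywords are irrelevant (0 points).

!!! tip deepseek-chat TL;DR

This paper proposes a Cascaded Mixture of Experts (Ca-MoE) architecture for near-shortest path routing in sparse networks. Its tiered experts combine local and global features and improve accuracy in sparse networks by up to 29.1% over single-expert baselines.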

Abstract translation (Chinese)

尽管利用局部特征的深度学习模型在稠密欧几里得图中已展现出实现近乎最优路由的巨大潜力，但它们在稀疏网络中的泛化能力不佳，因为拓扑结构的不规则性需要更广泛的结构感知能力。为解决这一局限，我们训练了一个级联专家混合模型来解决全对近似最短路径路由问题。我们的Ca-MoE是一个模块化的双层架构，支持转发节点选择的决策过程：下层专家依赖局部特征，而上层专家依赖全局特征。它执行自适应推理，仅当下层专家不足以达成足够的决策质量时，才会触发上层专家。由此，计算效率通过仅在拓扑复杂性需要时才提升模型容量来实现，并避免了参数冗余。此外，我们引入了一种在线元学习策略，该策略便于专家独立微调，并采用一种注重稳定性的更新机制，以防止在遇到新图环境时发生灾难性遗忘。实验评估表明，与单专家基线相比，Ca-MoE路由在稀疏网络中的准确率提升了高达29.1%，并且在不同的图密度下，其性能保持在理论上限的1%-6%以内。

Abstract

While deep learning models that leverage local features have demonstrated significant potential for near-optimal routing in dense Euclidean graphs, they struggle to generalize well in sparse networks where topological irregularities require broader structural awareness. To address this limitation, we train a Cascaded Mixture of Experts (Ca-MoE) to solve the all-pairs near-shortest path (APNSP) routing problem. Our Ca-MoE is a modular two-tier architecture that supports the decision-making for forwarder selection with lower-tier experts relying on local features and upper-tier experts relying on global features. It performs adaptive inference wherein the upper-tier experts are triggered only when the lower-tier ones do not suffice to achieve adequate decision quality. Computational efficiency is thus achieved by escalating model capacity only when necessitated by topological complexity, and parameter redundancy is avoided. Furthermore, we incorporate an online meta-learning strategy that facilitates independent expert fine-tuning and utilizes a stability-focused update mechanism to prevent catastrophic forgetting as new graph environments are encountered. Experimental evaluations demonstrate that Ca-MoE routing improves accuracy by up to 29.1% in sparse networks compared to single-expert baselines and maintains performance within 1%-6% of the theoretical upper bound across diverse graph densities.

Keywords: Cascaded Mixture of Experts, near-shortest path routing, sparse networks, adaptive inference, online meta-learning, graph routing, deep learning, topological complexity
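
The adaptive-inference idea in the abstract (consult the upper tier only when the lower tier is not good enough) can be sketched as a simple cascade. The confidence measure used here (margin between the two best local scores), the threshold, and the dictionary-backed "experts" are all assumptions of this sketch, not details taken from the paper.

```python
def cascaded_select(candidates, local_expert, global_expert, threshold=0.2):
    """Pick a forwarder; consult the upper tier only if the lower tier is unsure."""
    local_scores = {c: local_expert(c) for c in candidates}
    ranked = sorted(local_scores, key=local_scores.get, reverse=True)
    # Confidence proxy (assumption): gap between the best and second-best score.
    margin = (local_scores[ranked[0]] - local_scores[ranked[1]]
              if len(ranked) > 1 else float("inf"))
    if margin >= threshold:
        return ranked[0], "lower"
    # Escalate: re-score the candidates with the global-feature expert.
    return max(candidates, key=global_expert), "upper"

neighbors = ["a", "b", "c"]
clear_local = {"a": 0.9, "b": 0.3, "c": 0.1}.get       # lower tier is decisive
unsure_local = {"a": 0.5, "b": 0.45, "c": 0.1}.get     # lower tier is ambiguous
global_scores = {"a": 0.2, "b": 0.8, "c": 0.1}.get     # stand-in upper-tier expert

easy_choice = cascaded_select(neighbors, clear_local, global_scores)
hard_choice = cascaded_select(neighbors, unsure_local, global_scores)
```

The upper-tier expert is evaluated only on the escalation path, which is where the compute saving in a cascade of this kind comes from.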

279. ❌ Vib2ECG: A Paired Chest-Lead SCG-ECG Dataset and Benchmark for ECG Reconstruction

Authors: Guorui Lu, Xiaohui Cai, Todor Stefanov, Qinyu Chen Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	2.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

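Scoring rationale: The paper studies a biomedical AI application, reconstructing electrocardiograms (ECG) from vibration signals, and is unrelated to most large-model keywords (LLM, MoE, RLHF, and so on). It has only a weak link to "Hallucination Mitigation" (2 points), because the paper notes that the model generates spurious waveforms and analyzes directions for mitigation. It is highly relevant to "AI for Science" (8 points), because it is applied AI research in biomedicine and bioinformatics.

!!! tip deepseek-chat TL;DR

This study builds Vib2ECG, the first paired chest-lead vibration-ECG dataset, and shows that a lightweight U-Net can reconstruct multi-lead ECG from low-cost vibration signals. It also analyzes the model's tendency to generate spurious waveforms.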

Abstract translation (Chinese)

十二导联心电图（ECG）对心血管诊断至关重要，但其在日常生活中的长期采集受限于复杂且昂贵的硬件。近期研究尝试从低成本心脏振动信号（如心震图，SCG）重建心电图，然而，由于缺乏配对数据集，现有方法仅限于肢体导联，而临床诊断需要包括胸导联在内的多导联心电图。本研究提出Vib2ECG，首个配对的多通道心电-机械信号数据集，包含从17名受试者采集的完整十二导联心电图，以及在六个胸导联位置通过惯性测量单元（IMU）获取的振动信号。基于此数据集，我们还提供了一个基准测试。实验结果表明，使用轻量化的364K参数U-Net模型，能够根据振动信号重建不同位置的心脏电信号。此外，我们观察到模型存在幻觉现象，即在无对应电活动的区域生成了心电图波形。我们分析了该现象的成因，并提出了可能的缓解方向。本研究通过从IMU传感器采集的低成本振动信号预测胸导联心电图，证明了移动设备友好的心电图监测的可行性。这项工作拓展了心脏振动信号的应用，并为心脏电活动与机械活动随空间位置变化的关系提供了新的见解。

Abstract

Twelve-lead electrocardiography (ECG) is essential for cardiovascular diagnosis, but its long-term acquisition in daily life is constrained by complex and costly hardware. Recent efforts have explored reconstructing ECG from low-cost cardiac vibrational signals such as seismocardiography (SCG); however, due to the lack of a dataset, current methods are limited to limb leads, while clinical diagnosis requires multi-lead ECG, including chest leads. In this work, we propose Vib2ECG, the first paired, multi-channel electro-mechanical cardiac signal dataset, which includes complete twelve-lead ECGs and vibrational signals acquired by inertial measurement units (IMUs) at six chest-lead positions from 17 subjects. Based on this dataset, we also provide a benchmark. Experimental results demonstrate the feasibility of reconstructing electrical cardiac signals at variable locations from vibrational signals using a lightweight 364K-parameter U-Net. Furthermore, we observe a hallucination phenomenon in the model, where ECG waveforms are generated in regions where no corresponding electrical activity is present. We analyze the causes of this phenomenon and propose potential directions for mitigation. This study demonstrates the feasibility of mobile-device-friendly ECG monitoring through chest-lead ECG prediction from low-cost vibrational signals acquired using IMU sensors. It expands the application of cardiac vibrational signals and provides new insights into the spatial relationship between cardiac electrical and mechanical activities with spatial location variation.

关键词: ECG reconstruction, seismocardiography, vibrational signals, chest-lead ECG, dataset, U-Net, hallucination, cardiac monitoring

280. ❌ Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

Authors: Ido Pinto, Yizhak Yisrael Elboher, Haoze Wu, Nina Narodytska, Guy Katz Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0
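
For reference, the arithmetic behind the table above can be sketched as follows. The pass mark follows the formula stated in the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). How each row's score is derived is not stated in the report, so the capped product `weight * relevance` in `keyword_score` is an assumption, and all function names are illustrative.

```python
# Sketch of the digest's scoring arithmetic. The pass-mark formula comes from
# the report's configuration section; the per-keyword rule is an ASSUMPTION.

MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Hypothetical per-keyword score: weight * relevance, capped at the maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows of the table above that have non-zero relevance: (weight, relevance).
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 8.0)]
print(round(pass_mark(27.0), 1))  # 26.6
print(total_score(rows))          # 0.0
```

Under this assumed rule, a weight of 0.0 on every row forces a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this paper.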

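Scoring rationale: The paper centers on applying small language models (SLMs) to program verification, improving performance through data curation and fine-tuning. Highly relevant keywords: 'Small Language Models' (the paper explicitly uses SLMs and demonstrates their advantages; core content, 10/10) and 'Post-training/SFT' (the core method improves SLM performance through supervised fine-tuning, 10/10). Strongly relevant keywords: 'Large Language Models' (LLMs serve as baselines and as tools for data curation, 8/10) and 'AI for Science' (program verification belongs to computer science/formal methods, treated here as a subfield of AI for Science, 8/10). Moderately relevant keyword: 'Scaling Laws AND Data Quality' (the paper stresses the importance of data quality for model performance but does not discuss scaling laws in depth, 5/10). The remaining keywords have no direct connection to the paper's content (program verification, data curation, domain-specific application) and all score 0.

!!! tip deepseek-chat TL;DR

To address the bottleneck of generating inductive loop invariants in program verification, the paper proposes Wonda, a data curation pipeline that improves training-data quality. Small language models (SLMs) fine-tuned on the curated data achieve performance comparable to much larger models, with significantly better invariant correctness and verification efficiency.

Abstract translation

The synthesis of inductive loop invariants is a key bottleneck in automated program verification. Although large language models show potential for easing this problem, they still often fail on hard instances, producing invalid or computationally inefficient invariants. Fine-tuning is a natural way to overcome this limitation, but obtaining high-quality training data for invariant generation remains an open challenge. This paper presents a rigorous data-processing pipeline for extracting high-quality training signals from raw verifier-generated invariants. First, we formalize the properties that a high-quality training invariant should have. Second, we propose the Wonda pipeline, which refines noisy data through normalization based on abstract syntax trees, followed by LLM-driven semantic rewriting and augmentation, with provable quality guarantees. Experiments show that fine-tuning small language models on this refined dataset yields consistent and significant performance gains. Specifically, a fine-tuned 4-billion-parameter model matches the utility of the GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without introducing reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our method doubles the base models' invariant correctness and speedup rates, and raises their Virtual Best Performance rate on verification tasks by up to 14.2%.

Abstract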

The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models, and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.

Keywords: program verification, inductive loop invariants, small language models (SLMs), data curation, fine-tuning, Wonda pipeline, InvBench, virtual best performance (VBP)

281. ❌ Deep Reinforcement Learning for Fano Hypersurfaces

Authors: Marc Truter Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15437v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

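Scoring rationale: The paper uses a deep reinforcement learning algorithm to explore high-dimensional integer lattices and discover Fano hypersurfaces in algebraic geometry. It is an application of AI to science (specifically mathematics/algebraic geometry), so it is somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5/10). However, the paper does not involve large language models (LLMs), innovations in deep learning principles (such as MoE, scaling laws, training methods, inference optimization, or agents), or bioinformatics/cheminformatics, so all other keywords are irrelevant (0/10).

!!! tip deepseek-chat TL;DR

The study uses a deep reinforcement learning algorithm to explore a high-dimensional integer lattice and discovers thousands of previously unknown Fano 4-fold hypersurfaces with terminal singularities, making progress on a long-standing and still incomplete classification problem in algebraic geometry.

Abstract translation

We design a deep reinforcement learning algorithm for exploring a high-dimensional integer lattice with sparse rewards, training a feedforward neural network as a dynamic search heuristic that steers exploration toward reward-dense regions. We apply this method to the discovery of Fano 4-fold hypersurfaces with terminal singularities, objects of central importance in algebraic geometry. Fano varieties with terminal singularities are basic building blocks of algebraic varieties, and explicit examples provide a vital testing ground for developing and generalizing the theory. Despite decades of effort, the combinatorial complexity of the underlying search space has left the classification highly incomplete. Our reinforcement learning approach produces thousands of previously unknown examples, hundreds of which are shown to be out of reach of known search methods.

Abstract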

We design a deep reinforcement learning algorithm to explore a high-dimensional integer lattice with sparse rewards, training a feedforward neural network as a dynamic search heuristic to steer exploration toward reward-dense regions. We apply this to the discovery of Fano 4-fold hypersurfaces with terminal singularities, objects of central importance in algebraic geometry. Fano varieties with terminal singularities are fundamental building blocks of algebraic varieties, and explicit examples serve as a vital testing ground for the development and generalisation of theory. Despite decades of effort, the combinatorial intractability of the underlying search space has left this classification severely incomplete. Our reinforcement learning approach yields thousands of previously unknown examples, hundreds of which we show are inaccessible to known search methods.

Keywords: deep reinforcement learning, Fano hypersurfaces, algebraic geometry, terminal singularities, high-dimensional integer lattice, sparse rewards, search heuristic, combinatorial intractability

282. ❌ Local Urysohn Width: A Topological Complexity Measure for Classification

Authors: Xin Li Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15412v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

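Scoring rationale: The paper studies a topological-geometric complexity measure for classification problems (local Urysohn width) and belongs to purely theoretical machine learning/computational geometry. Its content does not involve large models, deep learning, AI for Science, or any of the techniques named in the scoring keywords. All keywords concern specific areas such as large-model technology, training methods, inference optimization, alignment, and applications, whereas this is foundational theoretical work, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes local Urysohn width, a new measure of the topological-geometric complexity of classification problems, and proves its separation from VC dimension, a topology × geometry scaling law, and a sample complexity lower bound.

Abstract translation

We introduce local Urysohn width, a complexity measure for classification problems on metric spaces. Unlike VC dimension, fat-shattering dimension, and Rademacher complexity, which describe the richness of hypothesis classes, Urysohn width captures the topological-geometric complexity of the classification problem itself: the minimum number of connected, diameter-bounded local experts needed to correctly classify all points within a margin-safe region. We prove four main results. First, a strict hierarchy theorem: for every integer $w \geq 1$, there exists a classification problem on a connected compact metric space (a bouquet of circles with first Betti number $\beta_1 = w$) whose Urysohn width is exactly $w$, showing that topological complexity of the input space forces classifier complexity. Second, a topology × geometry scaling law: width scales as $\Omega(w \cdot L/D_0)$, where $w$ counts the independent loops and $L/D_0$ is the ratio of loop circumference to locality scale. Third, a two-way separation from VC dimension: there are problem families whose width grows without bound while VC dimension is bounded by a constant and, conversely, families whose VC dimension grows without bound while width stays at 1. Fourth, a sample complexity lower bound: any learner that must correctly classify all points in the safe region of a width-$w$ problem needs $\Omega(w \log w)$ samples, independent of VC dimension.

Abstract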

We introduce local Urysohn width, a complexity measure for classification problems on metric spaces. Unlike VC dimension, fat-shattering dimension, and Rademacher complexity, which characterize the richness of hypothesis classes, Urysohn width characterizes the topological-geometric complexity of the classification problem itself: the minimum number of connected, diameter-bounded local experts needed to correctly classify all points within a margin-safe region. We prove four main results. First, a strict hierarchy theorem: for every integer $w \geq 1$, there exists a classification problem on a connected compact metric space (a bouquet of circles with first Betti number $\beta_1 = w$) whose Urysohn width is exactly $w$, establishing that topological complexity of the input space forces classifier complexity. Second, a topology $\times$ geometry scaling law: width scales as $\Omega(w \cdot L/D_0)$, where $w$ counts independent loops and $L/D_0$ is the ratio of loop circumference to locality scale. Third, a two-way separation from VC dimension: there exist problem families where width grows unboundedly while VC dimension is bounded by a constant, and conversely, families where VC dimension grows unboundedly while width remains 1. Fourth, a sample complexity lower bound: any learner that must correctly classify all points in the safe region of a width-$w$ problem needs $\Omega(w \log w)$ samples, independent of VC dimension.

Keywords: local Urysohn width, classification complexity, topological complexity, VC dimension, sample complexity, metric spaces, geometric scaling law, hypothesis classes

283. ❌ Persistence Spheres: a Bi-continuous Linear Representation of Measures for Partial Optimal Transport

Authors: Matteo Pegoraro Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

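Scoring rationale: The paper focuses on the persistence-sphere representation in topological machine learning, involving persistence diagrams, optimal transport distances, and convex geometry. It is unrelated to most keywords (such as LLMs, MoE, alignment, and reasoning), which concern large language models and related techniques, whereas this paper studies a mathematical representation for topological data analysis. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is an application of AI to science (topological machine learning), but it is not core bioinformatics or cheminformatics, so it receives 5/10 (some relevance).

!!! tip deepseek-chat TL;DR

The paper refines the persistence-sphere representation. Using convex geometry and partial optimal transport, it maps measures such as persistence diagrams to functions on the sphere, giving a stable, parameter-free representation that is competitive on machine learning tasks across several data types.

Abstract translation

We improve and extend the persistence spheres introduced in \cite{pegoraro2025persistence}. Persistence spheres map an integrable measure $μ$ on the upper half-plane (including persistence diagrams (PDs), viewed as counting measures) to a function $S(μ)\in C(\mathbb{S}^2)$, and this map is stable with respect to the 1-Wasserstein partial transport distance $\mathrm{POT}_1$. Moreover, to the best of our knowledge, persistence spheres are the first explicitly constructed representation in topological machine learning whose inverse is proven continuous on the image at every compactly supported target. Recent bounded-cardinality bi-Lipschitz embedding results in partial transport spaces are powerful, but they are not given by explicit summary maps of the kind considered here. Our construction is rooted in convex geometry: for positive measures, the ReLU integral in the definition is the support function of the lift zonoid. Building on \cite{pegoraro2025persistence}, we refine the definition to better match the deletion mechanism of $\mathrm{POT}_1$, encoding partial transport through a signed diagonal augmentation. In particular, for an integrable measure $μ$, the uniform norm between $S(0)$ and $S(μ)$ depends only on the persistence of $μ$, with no ad hoc re-weighting, reflecting optimal transport to the diagonal at persistence cost. This yields a parameter-free representation at the level of measures (up to numerical discretization), while leaving room for future extensions in which $μ$ is a smoothed measure derived from PDs, such as persistence intensity functions \citep{wu2024estimation}. Across clustering, regression, and classification tasks involving functional data, time series, graphs, meshes, and point clouds, the updated persistence spheres are competitive with, and often better than, persistence images, persistence landscapes, persistence splines, and sliced Wasserstein kernel baselines.

Abstract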

We improve and extend persistence spheres, introduced in~\cite{pegoraro2025persistence}. Persistence spheres map an integrable measure $μ$ on the upper half-plane, including persistence diagrams (PDs) as counting measures, to a function $S(μ)\in C(\mathbb{S}^2)$, and the map is stable with respect to 1-Wasserstein partial transport distance $\mathrm{POT}_1$. Moreover, to the best of our knowledge, persistence spheres are the first explicit representation used in topological machine learning for which continuity of the inverse on the image is established at every compactly supported target. Recent bounded-cardinality bi-Lipschitz embedding results in partial transport spaces, despite being powerful, are not given by the kind of explicit summary map considered here. Our construction is rooted in convex geometry: for positive measures, the defining ReLU integral is the support function of the lift zonoid. Building on~\cite{pegoraro2025persistence}, we refine the definition to better match the $\mathrm{POT}_1$ deletion mechanism, encoding partial transport via a signed diagonal augmentation. In particular, for integrable $μ$, the uniform norm between $S(0)$ and $S(μ)$ depends only on the persistence of $μ$, without any need of ad-hoc re-weightings, reflecting optimal transport to the diagonal at persistence cost. This yields a parameter-free representation at the level of measures (up to numerical discretization), while accommodating future extensions where $μ$ is a smoothed measure derived from PDs (e.g., persistence intensity functions~\citep{wu2024estimation}). Across clustering, regression, and classification tasks involving functional data, time series, graphs, meshes, and point clouds, the updated persistence spheres are competitive and often improve upon persistence images, persistence landscapes, persistence splines, and sliced Wasserstein kernel baselines.

Keywords: persistence spheres, topological machine learning, partial optimal transport, persistence diagrams, convex geometry, 1-Wasserstein distance, measure representation, functional data analysis

284. ❌ Controlled Langevin Dynamics for Sampling of Feedforward Neural Networks Trained with Minibatches

Authors: Alessandro Zambon, Francesca Caruso, Riccardo Zecchina, Guido Tiana Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15367v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

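Scoring rationale: The paper studies Boltzmann sampling of feed-forward neural networks and proposes a pseudo-Langevin dynamics algorithm that uses minibatches to improve sampling efficiency. Its core is the optimization of sampling algorithms for neural network training, which is foundational deep learning methodology. All scoring keywords focus on large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, quantization, and inference acceleration), large-model applications (such as agents and AI for science), or large-model-specific capabilities (such as chain of thought and hallucination mitigation). The paper does not involve large language models or any of the specific techniques or application areas named in the keywords; it discusses a general sampling method for training feed-forward neural networks rather than large-model technology.

!!! tip deepseek-chat TL;DR

The paper proposes a pseudo-Langevin dynamics algorithm that uses minibatches to sample efficiently from the Boltzmann distribution of feed-forward neural networks, addressing the high computational cost of hybrid Monte Carlo on large datasets. On networks with over one million parameters it scales well and reaches generalization performance comparable to SGD.

Abstract translation

Sampling the parameter space of artificial neural networks according to a Boltzmann distribution reveals the geometry of the low-loss solution space and offers an alternative to conventional loss minimization for training. However, exact sampling methods such as hybrid Monte Carlo (hMC), although formally correct, are too computationally expensive for realistic datasets because they require repeated evaluation of full-batch gradients. This paper proposes a pseudo-Langevin (pL) dynamics that, by using minibatches in a controlled way, enables efficient Boltzmann sampling of feed-forward neural networks trained on large datasets. The method exploits the statistical properties of minibatch gradient noise and adjusts fictitious masses and friction coefficients so that the induced stochastic process samples the target equilibrium distribution efficiently. We validate the method by numerically comparing the equilibrium statistics of pL with those from exact hMC sampling. Performance benchmarks show that hMC quickly loses efficiency as network size grows, whereas the pL scheme maintains high computational diffusion and scales well to networks with more than one million parameters. Finally, we show that sampling at intermediate temperatures gives optimal generalization performance, comparable to stochastic gradient descent (SGD), without a validation set or early stopping. These results establish controlled minibatch Langevin dynamics as a practical and scalable tool for exploring and exploiting the solution space of large neural networks.

Abstract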

Sampling the parameter space of artificial neural networks according to a Boltzmann distribution provides insight into the geometry of low-loss solutions and offers an alternative to conventional loss minimization for training. However, exact sampling methods such as hybrid Monte Carlo (hMC), while formally correct, become computationally prohibitive for realistic datasets because they require repeated evaluation of full-batch gradients. We introduce a pseudo-Langevin (pL) dynamics that enables efficient Boltzmann sampling of feed-forward neural networks trained with large datasets by using minibatches in a controlled manner. The method exploits the statistical properties of minibatch gradient noise and adjusts fictitious masses and friction coefficients to ensure that the induced stochastic process efficiently samples the desired equilibrium distribution. We validate the approach numerically by comparing its equilibrium statistics with those obtained from exact hMC sampling. Performance benchmarks demonstrate that, while hMC rapidly becomes inefficient as network size increases, the pL scheme maintains high computational diffusion and scales favorably to networks with over one million parameters. Finally, we show that sampling at intermediate temperatures yields optimal generalization performance, comparable to SGD, without requiring a validation set or early stopping procedure. These results establish controlled minibatch Langevin dynamics as a practical and scalable tool for exploring and exploiting the solution space of large neural networks.

Keywords: Boltzmann sampling, Langevin dynamics, feed-forward neural networks, minibatch gradients, hybrid Monte Carlo, computational efficiency, generalization performance, large-scale networks
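
To make the idea of Boltzmann sampling with minibatch gradients concrete, here is a minimal sketch on a one-parameter regression problem. It is plain stochastic-gradient Langevin dynamics, not the authors' pL scheme: the abstract says pL additionally adjusts fictitious masses and friction coefficients, which this toy does not attempt. The data, step size, temperature, and all function names are assumptions chosen for illustration.

```python
import math
import random


def make_data(n, true_w, noise, rng):
    """Toy data y = true_w * x + Gaussian noise (hypothetical example data)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def minibatch_grad(w, xs, ys, batch, n):
    """Unbiased estimate of dU/dw for U(w) = 0.5 * sum_i (w * x_i - y_i)^2."""
    g = sum((w * xs[i] - ys[i]) * xs[i] for i in batch)
    return g * n / len(batch)


def sgld(xs, ys, steps, step_size, temperature, batch_size, rng):
    """Overdamped Langevin update driven by minibatch gradients."""
    n = len(xs)
    w = 0.0
    samples = []
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)
        grad = minibatch_grad(w, xs, ys, batch, n)
        # Injected thermal noise with variance 2 * step_size * temperature.
        noise = rng.gauss(0.0, math.sqrt(2.0 * step_size * temperature))
        w = w - step_size * grad + noise
        samples.append(w)
    return samples


rng = random.Random(0)
xs, ys = make_data(n=200, true_w=2.0, noise=0.1, rng=rng)
samples = sgld(xs, ys, steps=5000, step_size=1e-3,
               temperature=0.01, batch_size=20, rng=rng)
kept = samples[1000:]  # discard the initial transient
mean_w = sum(kept) / len(kept)
print(f"sample mean of w: {mean_w:.2f}")
```

In this naive version the minibatch gradient noise adds fluctuations on top of the injected thermal noise, so the sampled distribution is only approximately the Boltzmann distribution at the nominal temperature. Controlling that extra noise is the problem the abstract describes.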

285. ❌ Deep learning and the rate of approximation by flows

Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15363v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

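Scoring rationale: The paper studies the relationship between the approximation capacity of deep residual networks and their depth. Using a continuous dynamical systems framework, it frames the problem as the minimal time horizon needed to approximate a diffeomorphism by flows driven by a given family of vector fields. Its core is the foundational mathematical theory of deep learning (approximation theory, differential geometry, dynamical systems): it examines the mechanism of function approximation (through composition or dynamics) and compares it with linear approximation theory. All scoring keywords focus on practical aspects of large language models (LLMs), such as specific techniques, applications, training methods, inference optimization, alignment, and agent systems, whereas this paper is purely theoretical mathematical analysis and involves no specific large-model architecture, training technique, application area, or engineering practice. The paper is therefore unrelated to every keyword and scores 0.

!!! tip deepseek-chat TL;DR

Within a continuous dynamical systems framework, the paper studies how the ability of deep residual networks to approximate target functions depends on their depth. It identifies the minimal approximation time as a geodesic distance on a sub-Finsler manifold of diffeomorphisms and suggests that function approximation in deep learning differs fundamentally from linear approximation theory.

Abstract translation

We study how the approximation capacity of deep residual networks depends on their depth in a continuous dynamical systems setting. The problem can be stated as quantifying the minimal time horizon needed to approximate a diffeomorphism by flows driven by a given family $\mathcal F$ of vector fields. We prove that this minimal time can be identified as a geodesic distance on a sub-Finsler manifold of diffeomorphisms, whose local geometry is characterized by a variational principle involving $\mathcal F$. This links the efficiency of learning a target relationship to its compatibility with the chosen learning architecture. Furthermore, the results suggest that the key approximation mechanism in deep learning, namely approximating functions through composition or dynamics, differs fundamentally from linear approximation theory: the linear spaces and norm-based rate estimates of the linear theory are replaced by manifolds and geodesic distances.

Abstract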

We investigate the dependence of the approximation capacity of deep residual networks on their depth in a continuous dynamical systems setting. This can be formulated as the general problem of quantifying the minimal time-horizon required to approximate a diffeomorphism by flows driven by a given family $\mathcal F$ of vector fields. We show that this minimal time can be identified as a geodesic distance on a sub-Finsler manifold of diffeomorphisms, where the local geometry is characterised by a variational principle involving $\mathcal F$. This connects the learning efficiency of target relationships to their compatibility with the learning architectural choice. Further, the results suggest that the key approximation mechanism in deep learning, namely the approximation of functions by composition or dynamics, differs in a fundamental way from linear approximation theory, in that linear spaces and norm-based rate estimates are replaced by manifolds and geodesic distances.

Keywords: deep residual networks, approximation capacity, continuous dynamical systems, diffeomorphism, flows, geodesic distance, sub-Finsler manifold, function approximation
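
The link between network depth and a flow's time horizon can be illustrated with the standard forward-Euler reading of a residual block, `x <- x + h * f(x)`. The sketch below is a generic illustration, not the paper's construction: the vector field `f(x) = -x`, the horizon of 1.0, and the function names are assumptions, and it says nothing about the geodesic-distance result itself.

```python
import math


def vector_field(x: float) -> float:
    """Toy vector field f(x) = -x; its exact time-T flow maps x to x * exp(-T)."""
    return -x


def residual_flow(x0: float, horizon: float, depth: int) -> float:
    """Compose `depth` residual blocks x <- x + h * f(x), with h = horizon / depth."""
    h = horizon / depth
    x = x0
    for _ in range(depth):
        x = x + h * vector_field(x)
    return x


exact = 1.0 * math.exp(-1.0)  # exact flow of x0 = 1.0 over a time horizon of 1.0
errors = {}
for depth in (4, 16, 64, 256):
    approx = residual_flow(1.0, horizon=1.0, depth=depth)
    errors[depth] = abs(approx - exact)
    print(depth, errors[depth])
```

For a fixed horizon, adding blocks only refines the discretisation. Reaching a target map that needs a longer horizon requires proportionally more blocks at the same step size, which is the sense in which depth plays the role of time in this setting.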

286. ❌ Active Seriation: Efficient Ordering Recovery with Statistical Guarantees

Authors: James Cheshire, Yann Issartel Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15336v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

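Scoring rationale: The paper 'Active Seriation: Efficient Ordering Recovery with Statistical Guarantees' studies the algorithmic problem of recovering an unknown ordering by adaptively querying pairwise similarities, which belongs to statistics, algorithm design, and information theory. It covers ordering recovery, noisy measurements, Robinson matrices, and statistical guarantees, and does not involve large models, deep learning, the principles of AI techniques, or applications of AI to science. All keywords relate to large-model technology, AI applications, or related methodology, whereas this is purely theoretical algorithmic research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper studies how to recover an unknown ordering efficiently by adaptively querying noisy pairwise similarities. It proposes an active seriation algorithm with statistical guarantees and establishes optimal performance guarantees under a uniform separation condition.

Abstract translation

Active seriation aims to recover the unknown ordering of $n$ items by adaptively querying pairwise similarities. The observations are noisy measurements of the entries of an underlying $n \times n$ permuted Robinson matrix, whose permutation encodes the latent ordering. The framework allows the algorithm to start from partial information about the latent ordering, with seriation from scratch as a special case. We propose an active seriation algorithm that recovers the latent ordering with high probability. Provided that the similarity matrix satisfies a uniform separation condition, we establish optimal performance guarantees for both the probability of error and the number of observations needed for successful recovery.

Abstract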

Active seriation aims at recovering an unknown ordering of $n$ items by adaptively querying pairwise similarities. The observations are noisy measurements of entries of an underlying $n \times n$ permuted Robinson matrix, whose permutation encodes the latent ordering. The framework allows the algorithm to start with partial information on the latent ordering, including seriation from scratch as a special case. We propose an active seriation algorithm that provably recovers the latent ordering with high probability. Under a uniform separation condition on the similarity matrix, optimal performance guarantees are established, both in terms of the probability of error and the number of observations required for successful recovery.

Keywords: Active Seriation, Ordering Recovery, Pairwise Similarities, Robinson Matrix, Statistical Guarantees, Adaptive Querying, Noisy Measurements, Permutation Recovery
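
The objects named in this abstract can be made concrete with a small sketch of a Robinson matrix and its permuted form. It uses the usual definition (for positions i < j < k, the similarity of i and k is no larger than that of i and j or of j and k). The example matrix, the hidden order, and the function names are assumptions, and the code only checks the structure; it is not the paper's recovery algorithm and ignores noise.

```python
def is_robinson(a):
    """True if a[i][k] <= min(a[i][j], a[j][k]) for all i < j < k."""
    n = len(a)
    return all(
        a[i][k] <= min(a[i][j], a[j][k])
        for i in range(n)
        for j in range(i + 1, n)
        for k in range(j + 1, n)
    )


def permute(a, order):
    """Return the matrix whose (p, q) entry is a[order[p]][order[q]]."""
    return [[a[u][v] for v in order] for u in order]


n = 6
# Similarity decays with distance in the latent ordering, so this is Robinson.
robinson = [[n - abs(i - j) for j in range(n)] for i in range(n)]

# Hypothetical latent ordering: order[p] is the item sitting at position p.
order = [2, 0, 4, 1, 5, 3]
pos = [0] * n
for p, item in enumerate(order):
    pos[item] = p

# What an algorithm would query (noise-free here): similarities indexed by item.
observed = permute(robinson, pos)

print(is_robinson(robinson))                  # True
print(is_robinson(observed))                  # False
print(is_robinson(permute(observed, order)))  # True
```

Seriation is the task of finding `order` given only (noisy) entries of `observed`. In the active setting described above, the algorithm chooses adaptively which entries to measure.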

287. ❌ Data Augmentation via Causal-Residual Bootstrapping

Authors: Mateusz Gajewski, Sophia Xiao, Bijan Mazaheri Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15335v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

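Score rationale: The paper studies a data augmentation method based on causal mechanisms, which generates augmented data by permuting the residuals of models built on marginal probability distributions. It belongs to traditional machine learning and statistical learning, and it is entirely unrelated to the scoring keywords, all of which center on large models, deep learning techniques, and their applications.

!!! tip deepseek-chat TL;DR

The paper proposes a data augmentation method based on causal residual bootstrapping, which generates augmented data by permuting the residuals of models built on marginal probability distributions, and shows through theoretical analysis and experiments that the method improves the accuracy of predictive models.

Abstract (Chinese translation)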

数据增强通过对现有数据点进行基于领域知识的修改，将领域知识整合到数据集中。例如，可以通过在不同色调或方向上复制图像来增强图像数据，从而融入图像在这些维度上可能发生变化的知识。Teshima和Sugiyama最近的研究探索了如何整合因果知识（例如，A导致B导致C），直至条件独立性等价类。我们提出了一种适用于加性噪声场景的相关方法，该方法能够整合超出马尔可夫等价类的信息。该方法基于独立机制原理，对基于边缘概率分布构建的模型残差进行置换。基于我们增强数据构建的预测模型展现出更高的准确性，我们在线性高斯场景中为此提供了理论支持。

Abstract

Data augmentation integrates domain knowledge into a dataset by making domain-informed modifications to existing data points. For example, image data can be augmented by duplicating images in different tints or orientations, thereby incorporating the knowledge that images may vary in these dimensions. Recent work by Teshima and Sugiyama has explored the integration of causal knowledge (e.g., A causes B causes C) up to conditional independence equivalence. We suggest a related approach for settings with additive noise that can incorporate information beyond a Markov equivalence class. The approach, built on the principle of independent mechanisms, permutes the residuals of models built on marginal probability distributions. Predictive models built on our augmented data demonstrate improved accuracy, for which we provide theoretical backing in linear Gaussian settings.

Keywords: Data Augmentation, Causal Knowledge, Independent Mechanisms, Residual Permutation, Marginal Probability Distributions, Predictive Models, Linear Gaussian Settings, Additive Noise
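
The residual-permutation idea described in the abstract can be illustrated with a minimal sketch. It assumes a two-variable chain A → B with additive noise, fits B on A by ordinary least squares, and pairs the fitted values with shuffled residuals. The helper names `fit_line` and `augment_by_residual_permutation`, the linear model, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x (pure Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def augment_by_residual_permutation(xs, ys, rng):
    """Return new (x, y) pairs built from fitted values plus shuffled residuals.

    Assumption: under an additive-noise model the noise is independent of x,
    so any residual may be paired with any fitted value.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    shuffled = residuals[:]
    rng.shuffle(shuffled)
    return [(x, a + b * x + e) for x, e in zip(xs, shuffled)]

# Synthetic example data (hypothetical): B = 2 * A + Gaussian noise.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

augmented = augment_by_residual_permutation(xs, ys, rng)
combined_x = xs + [x for x, _ in augmented]
combined_y = ys + [y for _, y in augmented]
slope_original = fit_line(xs, ys)[1]
slope_combined = fit_line(combined_x, combined_y)[1]
```

The augmented pairs keep the same multiset of residuals as the original sample, so the sketch adds new (A, B) combinations without changing the estimated noise distribution.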

288. ❌ A scaled TW-PINN: A physics-informed neural network for traveling wave solutions of reaction-diffusion equations with general coefficients

Authors: Seungwan Han, Kwanghyuk Park, Jiaxi Gu, Jae-Hun Jung Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

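Score rationale: The paper studies the use of physics-informed neural networks (PINNs) to compute traveling wave solutions of reaction-diffusion equations. This falls under AI for Science, so it has some relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5/10). The paper does not address large language models (LLMs), innovations in deep learning principles, model training or fine-tuning methods, inference optimization, or agent systems, so it is entirely unrelated to all other keywords (relevance 0/10).

!!! tip deepseek-chat TL;DR

The paper proposes scaled TW-PINN, an efficient physics-informed neural network framework for computing traveling wave solutions of reaction-diffusion equations with general coefficients, and demonstrates its accuracy, flexibility, and superior performance in numerical experiments.

Abstract (Chinese translation)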

本文提出了一种高效且可推广的物理信息神经网络（Physics-Informed Neural Network，简称PINN）框架，用于计算具有不同反应系数与扩散系数的$n$维反应-扩散方程的行波解。通过引入行波形式的尺度变换，原问题被简化为一个具有单位反应系数与单位扩散系数的一维尺度化反应-扩散方程。这一简化引出了所提出的框架——尺度化TW-PINN，其中针对尺度化方程训练得到的单一PINN求解器可重复用于不同系数选择与空间维度的问题。我们还证明了该PINN求解器对于行波解具有通用逼近性质。在一维与二维情形下的数值实验，以及与现有wave-PINN方法的对比，验证了尺度化TW-PINN的准确性、灵活性及其优越性能。最后，我们探讨了将该框架推广至具有一般初始条件的费希尔（Fisher’s）方程的可能性。

Abstract

We propose an efficient and generalizable physics-informed neural network (PINN) framework for computing traveling wave solutions of $n$-dimensional reaction-diffusion equations with various reaction and diffusion coefficients. By applying a scaling transformation with the traveling wave form, the original problem is reduced to a one-dimensional scaled reaction-diffusion equation with unit reaction and diffusion coefficients. This reduction leads to the proposed framework, termed scaled TW-PINN, in which a single PINN solver trained on the scaled equation is reused for different coefficient choices and spatial dimensions. We also prove a universal approximation property of the proposed PINN solver for traveling wave solutions. Numerical experiments in one and two dimensions, together with a comparison to the existing wave-PINN method, demonstrate the accuracy, flexibility, and superior performance of scaled TW-PINN. Finally, we explore an extension of the framework to the Fisher’s equation with general initial conditions.

Keywords: physics-informed neural network, PINN, traveling wave solutions, reaction-diffusion equations, scaled transformation, numerical experiments, Fisher’s equation
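
The reuse of one scaled solver for arbitrary coefficients can be sketched for the one-dimensional Fisher equation u_t = D u_xx + r u(1 - u). Substituting ξ = sqrt(r/D) x and τ = r t gives the unit-coefficient equation u_τ = u_ξξ + u(1 - u). In the sketch below, a known exact traveling wave of the scaled equation (speed 5/sqrt(6)) stands in for a trained PINN; that substitution, the function names, and the sample coefficients are assumptions made for illustration, not the paper's code.

```python
import math

C_SCALED = 5.0 / math.sqrt(6.0)  # wave speed of the known exact scaled profile

def scaled_profile(z):
    """Exact traveling wave of u_t = u_xx + u(1 - u) with speed 5/sqrt(6).

    Stands in for a solver trained once on the unit-coefficient equation.
    """
    return (1.0 + math.exp(z / math.sqrt(6.0))) ** -2

def solution(x, t, diffusion, reaction):
    """Reuse the scaled profile for u_t = D u_xx + r u(1 - u)."""
    xi = math.sqrt(reaction / diffusion) * x   # scaled space coordinate
    tau = reaction * t                         # scaled time coordinate
    return scaled_profile(xi - C_SCALED * tau)

def pde_residual(x, t, diffusion, reaction, h=1e-3):
    """Central finite-difference residual u_t - D u_xx - r u(1 - u)."""
    u = solution(x, t, diffusion, reaction)
    u_t = (solution(x, t + h, diffusion, reaction)
           - solution(x, t - h, diffusion, reaction)) / (2 * h)
    u_xx = (solution(x + h, t, diffusion, reaction) - 2 * u
            + solution(x - h, t, diffusion, reaction)) / h ** 2
    return u_t - diffusion * u_xx - reaction * u * (1 - u)

# Check the rescaled profile against the original PDE for D = 2, r = 3.
residuals = [abs(pde_residual(x, 0.3, diffusion=2.0, reaction=3.0))
             for x in (-2.0, 0.0, 1.5)]
```

In the original coordinates the wave travels at speed `C_SCALED * sqrt(r * D)`, so changing the coefficients only rescales the profile and its speed; nothing has to be retrained.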

289. ❌ CASHomon Sets: Efficient Rashomon Sets Across Multiple Model Classes and their Hyperparameters

Authors: Fiona Katharina Ewald, Martin Binder, Matthias Feurer, Bernd Bischl, Giuseppe Casalicchio Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15321v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

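Score rationale: The paper studies Rashomon sets and CASHomon sets, which belong to traditional machine learning model selection, hyperparameter optimization, and model interpretability. It does not involve large language models, deep learning principles, or applications of AI in science. The paper is unrelated to the topics the keywords cover (large models, deep learning, and AI for science), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies CASHomon sets spanning multiple model classes and their hyperparameters, proposes the TruVaRImp algorithm to identify these similarly performing alternative models efficiently, and finds that interpretations drawn from a single model class can be limited.

Abstract (Chinese translation)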

拉什蒙集合（Rashomon sets）是指在同一模型类别中，性能与同类参考模型近乎相当的模型集合。它们揭示了存在多种性能优异的替代模型的可能性，这些模型可能支持不同的解释。这使得我们能够选择符合领域知识、隐含约束或用户偏好的模型。然而，目前仅针对少数模型类别存在高效的构建方法。应用机器学习通常需要搜索多个模型类别，且最佳类别往往事先未知。因此，我们在算法选择与超参数优化（CASH）相结合的框架下研究拉什蒙集合，并将其称为CASHomon集合。我们提出了TruVaRImp算法——一种基于模型的主动学习方法，用于具有隐式阈值的水平集估计，并提供了收敛性保证。在合成数据集和真实数据集上，TruVaRImp能够可靠地识别CASHomon集合的成员，其性能与朴素采样、贝叶斯优化、经典及隐式水平集估计方法等基线方法相当或更优。我们对不同模型类别间预测多样性和特征重要性变异性的分析，对仅通过单一模型类别解释数据的常见实践提出了质疑。

Abstract

Rashomon sets are model sets within one model class that perform nearly as well as a reference model from the same model class. They reveal the existence of alternative well-performing models, which may support different interpretations. This enables selecting models that match domain knowledge, hidden constraints, or user preferences. However, efficient construction methods currently exist for only a few model classes. Applied machine learning usually searches many model classes, and the best class is unknown beforehand. We therefore study Rashomon sets in the combined algorithm selection and hyperparameter optimization (CASH) setting and call them CASHomon sets. We propose TruVaRImp, a model-based active learning algorithm for level set estimation with an implicit threshold, and provide convergence guarantees. On synthetic and real-world datasets, TruVaRImp reliably identifies CASHomon sets members and matches or outperforms naive sampling, Bayesian optimization, classical and implicit level set estimation methods, and other baselines. Our analyses of predictive multiplicity and feature-importance variability across model classes question the common practice of interpreting data through a single model class.

Keywords: Rashomon sets, CASHomon sets, algorithm selection, hyperparameter optimization, model-based active learning, level set estimation, predictive multiplicity, feature-importance variability
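
A brute-force version of the set membership test helps make the "implicit threshold" concrete: the cutoff depends on the best loss, which is unknown before the search. The sketch below assumes a relative tolerance `epsilon` and uses made-up evaluations across hypothetical model classes; it is not TruVaRImp, which the abstract describes as a model-based active learning algorithm that avoids evaluating every candidate.

```python
def rashomon_set(candidates, epsilon=0.05):
    """Return candidates whose loss is within (1 + epsilon) of the best loss.

    candidates: list of (model_class, hyperparameters, validation_loss).
    The threshold is implicit: it depends on the best loss found, which is
    not known before the search.
    """
    best_loss = min(loss for _, _, loss in candidates)
    threshold = (1.0 + epsilon) * best_loss
    return [c for c in candidates if c[2] <= threshold]

# Hypothetical evaluations spanning several model classes (made-up numbers).
evaluations = [
    ("random_forest", {"n_trees": 200}, 0.200),
    ("random_forest", {"n_trees": 50}, 0.230),
    ("gradient_boosting", {"depth": 3}, 0.205),
    ("linear_model", {"l2": 0.1}, 0.209),
    ("linear_model", {"l2": 10.0}, 0.310),
]
members = rashomon_set(evaluations, epsilon=0.05)
classes_in_set = sorted({name for name, _, _ in members})
```

In this toy example three different model classes land in the set, which is the situation the abstract highlights: near-optimal models from different classes may support different interpretations.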

290. ❌ A Kolmogorov-Arnold Surrogate Model for Chemical Equilibria: Application to Solid Solutions

Authors: Leonardo Boledi, Dirk Bosbach, Jenna Poonoosamy Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

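Score rationale: The paper studies a Kolmogorov-Arnold surrogate model for chemical equilibria, applied to predicting the solubility of radionuclide-bearing solids in geological waste disposal. Its core is the application of machine learning to scientific computing (chemistry and geology), which falls under AI for Science. It therefore has some relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5/10), because it touches on cheminformatics and scientific AI applications. The paper does not involve large models (LLMs), innovations in deep learning principles, or the other keywords (such as MoE, SFT, and RAG), all of which relate to large models. It uses conventional neural networks (KANs and MLPs) as surrogate models rather than large-model techniques, so all other keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper develops a surrogate model based on Kolmogorov-Arnold networks for predicting chemical equilibria and the solubility of radionuclide-bearing solids in order to accelerate reactive transport simulations. On a cement benchmark it reduces error by about 60% relative to multilayer perceptrons, and it maintains low prediction errors on binary and ternary radium solid solution models.

Abstract (Chinese translation)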

地球化学求解器的计算成本是一个具有挑战性的问题。对于反应性输运模拟而言，其化学计算可能需执行数十亿次，因此减少总计算时间至关重要。现有文献已探索了多种机器学习方法，以确定最有效的数据驱动替代模型。其中，多层感知机因其识别非线性关系的能力而被广泛采用。本研究聚焦于近期提出的科莫哥洛夫-阿诺德网络，该网络以基于可学习样条的函数取代了经典的固定激活函数。该架构以更少的可训练参数实现了更高的精度，在求解偏微分方程领域日益受到关注。首先，我们基于现有水泥体系基准训练了一个替代模型。随后，我们将其应用于核废料地质处置的应用案例，即测定含放射性核素固体的溶解度。据我们所知，本研究首次利用数据驱动替代模型研究放射性核素掺入的共沉淀过程，并考虑了从简单机械混合物到二元(Ba,Ra)SO$_4$及三元(Sr,Ba,Ra)SO$_4$体系非理想固溶体逐渐增加的热力学复杂度。在水泥基准测试中，我们证明科莫哥洛夫-阿诺德架构在绝对误差与相对误差指标上均优于多层感知机，分别降低了62%和59%。在二元及三元镭固溶体模型中，科莫哥洛夫-阿诺德网络的中位预测误差保持在$1\times10^{-3}$附近。这是利用替代模型加速反应性输运模拟、优化深地质废物处置库安全评估研究的第一步。

Abstract

The computational cost of geochemical solvers is a challenging matter. For reactive transport simulations, where chemical calculations are performed up to billions of times, it is crucial to reduce the total computational time. Existing publications have explored various machine-learning approaches to determine the most effective data-driven surrogate model. In particular, multilayer perceptrons are widely employed due to their ability to recognize nonlinear relationships. In this work, we focus on the recent Kolmogorov-Arnold networks, where learnable spline-based functions replace classical fixed activation functions. This architecture has achieved higher accuracy with fewer trainable parameters and has become increasingly popular for solving partial differential equations. First, we train a surrogate model based on an existing cement system benchmark. Then, we move to an application case for the geological disposal of nuclear waste, i.e., the determination of radionuclide-bearing solids solubilities. To the best of our knowledge, this work is the first to investigate co-precipitation with radionuclide incorporation using data-driven surrogate models, considering increasing levels of thermodynamic complexity from simple mechanical mixtures to non-ideal solid solutions of binary (Ba,Ra)SO$_4$ and ternary (Sr,Ba,Ra)SO$_4$ systems. On the cement benchmark, we demonstrate that the Kolmogorov-Arnold architecture outperforms multilayer perceptrons in both absolute and relative error metrics, reducing them by 62% and 59%, respectively. On the binary and ternary radium solid solution models, Kolmogorov-Arnold networks maintain median prediction errors near $1\times10^{-3}$. This is the first step toward employing surrogate models to speed up reactive transport simulations and optimize the safety assessment of deep geological waste repositories.

Keywords: Kolmogorov-Arnold networks, surrogate model, chemical equilibria, solid solutions, reactive transport simulations, geochemical solvers, radionuclide solubility, machine learning
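
The phrase "learnable spline-based functions replace classical fixed activation functions" can be illustrated with a toy forward pass. The sketch assumes piecewise-linear (order-1) basis functions on a uniform knot grid and hand-picked coefficients; published KAN variants typically use higher-order B-splines and trained coefficients, and nothing here reflects the surrogate model built in the paper.

```python
def hat_basis(x, knots):
    """Piecewise-linear (order-1 B-spline) basis values at x on uniform knots."""
    step = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - k) / step) for k in knots]

def edge_function(x, coeffs, knots):
    """Learnable univariate function: a weighted sum of basis functions."""
    return sum(c * b for c, b in zip(coeffs, hat_basis(x, knots)))

def kan_layer(inputs, coeffs, knots):
    """coeffs[j][i] holds the coefficients of the edge from input i to output j.

    Each output is a plain sum of per-edge univariate functions; there is no
    separate weight matrix or fixed activation.
    """
    return [sum(edge_function(x, edge, knots) for x, edge in zip(inputs, row))
            for row in coeffs]

knots = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Coefficients chosen by hand so the two edges interpolate x**2 and |x| at the knots.
square_edge = [k * k for k in knots]
abs_edge = [abs(k) for k in knots]
output = kan_layer([0.5, -0.5], [[square_edge, abs_edge]], knots)
```

In a trained network the coefficient lists would be the trainable parameters, so each edge learns its own nonlinearity instead of sharing one fixed activation.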

291. ❌ xplainfi: Feature Importance and Statistical Inference for Machine Learning in R

Authors: Lukas Burk, Fiona Katharina Ewald, Giuseppe Casalicchio, Marvin N. Wright, Bernd Bischl Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

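Score rationale: The paper introduces xplainfi, an R package for feature importance analysis of machine learning models, focused on feature importance methods and statistical inference tools for traditional machine learning (built on the mlr3 ecosystem). All keywords relate directly to large models, deep learning principles, or scientific AI applications, none of which the paper addresses: it mentions no large language models (LLMs), deep learning architectures, training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or domain-specific scientific applications. The only marginally relevant keyword is "Mechanistic Interpretability OR Explainable AI", because feature importance analysis is a form of model interpretability. However, the paper concentrates on specific statistical methods for traditional (non-deep-learning) machine learning models rather than mechanistic interpretability of large models, so it receives a relevance of only 5/10 (some relevance). All other keywords receive a relevance of 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper presents xplainfi, an R package for global, loss-based feature importance analysis and statistical inference for machine learning models, which fills gaps in existing R tooling for conditional importance methods and statistical inference procedures.

Abstract (Chinese translation)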

我们推出xplainfi——一个基于mlr3生态系统构建的R语言工具包，用于实现基于损失函数的机器学习模型全局特征重要性分析方法。尽管R语言中已存在多种特征重要性方法，但在条件重要性方法及相关统计推断流程方面仍存在显著空白。该工具包实现了置换特征重要性、条件特征重要性、相对特征重要性、留一协变量排除法及其扩展方法，同时包含边际与条件两种形式的沙普利加性全局重要性方法。它提供基于高斯分布、对抗随机森林、条件推断树以及基于敲除采样器的模块化条件抽样架构，支持对连续型与混合型数据进行条件重要性分析。统计推断可通过多种途径实现，包括方差校正置信区间和条件预测影响框架。我们通过多组模拟场景和学习器类型的测试证明，xplainfi生成的重要性评分与现有实现方法保持一致性，同时具备优异的运行时效表现。该工具包已在CRAN平台发布，为研究者和实践者提供了在R语言中进行特征重要性分析与模型解释的综合性工具集。

Abstract

We introduce xplainfi, an R package built on top of the mlr3 ecosystem for global, loss-based feature importance methods for machine learning models. Various feature importance methods exist in R, but significant gaps remain, particularly regarding conditional importance methods and associated statistical inference procedures. The package implements permutation feature importance, conditional feature importance, relative feature importance, leave-one-covariate-out, and generalizations thereof, and both marginal and conditional Shapley additive global importance methods. It provides a modular conditional sampling architecture based on Gaussian distributions, adversarial random forests, conditional inference trees, and knockoff-based samplers, which enable conditional importance analysis for continuous and mixed data. Statistical inference is available through multiple approaches, including variance-corrected confidence intervals and the conditional predictive impact framework. We demonstrate that xplainfi produces importance scores consistent with existing implementations across multiple simulation settings and learner types, while offering competitive runtime performance. The package is available on CRAN and provides researchers and practitioners with a comprehensive toolkit for feature importance analysis and model interpretation in R.

Keywords: feature importance, statistical inference, R package, machine learning, model interpretation, conditional importance, Shapley additive global importance, mlr3 ecosystem
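
Permutation feature importance, the first method listed in the abstract, can be sketched in a few lines: shuffle one feature column, re-score the model, and report the average increase in loss. The sketch is written in Python for consistency with the other examples in this report and does not use or imitate the xplainfi API; the stand-in model, data, and function names are hypothetical.

```python
import random

def mean_squared_error(model, rows, targets):
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, rng, repeats=10):
    """Average increase in loss after shuffling one feature column."""
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [r[:feature] + [v] + r[feature + 1:]
                    for r, v in zip(rows, column)]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return sum(increases) / repeats

rng = random.Random(1)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(300)]
targets = [3.0 * r[0] for r in rows]  # the second feature is irrelevant

def model(row):
    """Stand-in for a fitted learner; it matches the data-generating rule."""
    return 3.0 * row[0]

importance = [permutation_importance(model, rows, targets, j, rng) for j in (0, 1)]
```

This is the marginal variant: shuffling a column breaks its dependence on the other features. The conditional methods mentioned in the abstract instead resample a feature given the remaining ones, which is where the package's conditional samplers come in.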

292. ❌ Enhancing classification accuracy through chaos

Authors: Panos Stinis Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

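Score rationale: The paper proposes a method that uses a chaotic dynamical system to improve classification accuracy, involving data lifting, chaotic evolution, and a softmax classifier. However, all scoring keywords focus on large models, deep learning principles, and their applications (such as LLMs, MoE, RLHF, RAG, and quantization), whereas this paper studies a traditional machine learning classification problem and does not involve specific techniques from large models, deep learning, or AI for Science. The paper has no direct connection to any keyword area, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method that evolves data through a chaotic dynamical system to accelerate training and improve classification accuracy, outperforming a standard softmax classifier.

Abstract (Chinese translation)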

本文提出一种利用混沌提升分类准确率的新方法。具体而言，我们将待分类数据视为向量，首先将其提升至高维空间，随后将其作为混沌动力系统在指定时间区间内演化的初始条件。动力系统的演化状态随后被输入至可训练的softmax分类器，该分类器输出各类别的概率分布。作为概念验证，我们采用中等维度（2至20维）的随机扰动正交向量样本，其类别数量与向量维度相对应，并证明相较于直接在原始向量上运行的标准softmax分类器，以及仅将向量提升至高维空间而不进行演化操作的softmax分类器，本方法不仅能显著加速训练过程，还能有效提高分类准确率。文中同时对混沌增强型分类器的性能提升机制给出了理论解释。

Abstract

We propose a novel approach which exploits chaos to enhance classification accuracy. Specifically, the available data that need to be classified are treated as vectors that are first lifted into a higher-dimensional space and then used as initial conditions for the evolution of a chaotic dynamical system for a prescribed temporal interval. The evolved state of the dynamical system is then fed to a trainable softmax classifier which outputs the probabilities of the various classes. As proof-of-concept, we use samples of randomly perturbed orthogonal vectors of moderate dimension (2 to 20), with a corresponding number of classes equal to the vector dimension, and show how our approach can both significantly accelerate the training process and improve the classification accuracy compared to a standard softmax classifier which operates on the original vectors, as well as a softmax classifier which only lifts the vectors to a higher-dimensional space without evolving them. We also provide an explanation for the improved performance of the chaos-enhanced classifier.

Keywords: chaos, classification accuracy, dynamical system, softmax classifier, training acceleration, higher-dimensional space, initial conditions, orthogonal vectors

293. ❌ Evaluating the Robustness of Reinforcement Learning based Adaptive Traffic Signal Control

Authors: Dickens Kwesiga, Angshuman Guin, Khaled Abdelghany, Michael Hunter Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15283v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

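Score rationale: The paper focuses on applying reinforcement learning to adaptive traffic signal control, covering algorithm design, robustness evaluation, and training efficiency. All scoring keywords relate to large language models, deep learning principles, or AI for Science, whereas the core of this paper is the application of conventional reinforcement learning to traffic engineering. It does not involve any large-model techniques, deep learning innovations, or scientific AI applications such as biomedicine, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The study proposes a reinforcement learning based adaptive traffic signal control algorithm and evaluates its robustness under a range of traffic demand conditions. The results show that it reduces average delay by 11-32% across movements compared with optimized actuated signal control (ASC), and that a model trained on diverse origin-destination patterns generalizes well to unseen demand scenarios.

Abstract (Chinese translation)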

强化学习（RL）因其能够以无模型方式直接从与交通环境的交互中学习控制策略，在自适应交通信号控制领域受到日益增长的关注。然而，在基于RL的信号控制能够投入实际部署之前，仍存在若干挑战。许多现有研究依赖于简化的信号配时结构，训练模型在不同交通需求条件下的鲁棒性仍未得到充分评估，且在交通微观仿真环境中训练RL算法时，运行效率仍面临挑战。本研究提出了一种基于RL的信号控制算法，该算法能够完整表示与现场信号控制器一致的八相位环栅栏结构。该算法在不同交通需求条件下进行训练和评估，并与当前最优实践中的感应式信号控制（ASC）进行基准比较。为评估鲁棒性，实验在多种交通流量和具有不同结构相似度的起讫点（O-D）需求模式下进行。为提高训练效率，研究采用了一种分布式异步训练架构，支持跨多个计算节点的并行仿真。案例研究交叉口的结果表明，所提出的基于RL的信号控制显著优于优化后的ASC，在各转向流向上平均延误降低了11-32%。在单一O-D模式上训练的模型能够很好地泛化到相似的未知需求模式，但在显著不同的需求条件下性能会下降。相比之下，在多样化O-D模式上训练的模型展现出强大的鲁棒性，即使在高度不同的未知需求场景下，其性能也始终优于ASC。

Abstract

Reinforcement learning (RL) has attracted increasing interest for adaptive traffic signal control due to its model-free ability to learn control policies directly from interaction with the traffic environment. However, several challenges remain before RL-based signal control can be considered ready for field deployment. Many existing studies rely on simplified signal timing structures, robustness of trained models under varying traffic demand conditions remains insufficiently evaluated, and runtime efficiency continues to pose challenges when training RL algorithms in traffic microscopic simulation environments. This study formulates an RL-based signal control algorithm capable of representing a full eight-phase ring-barrier configuration consistent with field signal controllers. The algorithm is trained and evaluated under varying traffic demand conditions and benchmarked against state-of-the-practice actuated signal control (ASC). To assess robustness, experiments are conducted across multiple traffic volumes and origin-destination (O-D) demand patterns with varying levels of structural similarity. To improve training efficiency, a distributed asynchronous training architecture is implemented that enables parallel simulation across multiple computing nodes. Results from a case study intersection show that the proposed RL-based signal control significantly outperforms optimized ASC, reducing average delay by 11-32% across movements. A model trained on a single O-D pattern generalizes well to similar unseen demand patterns but degrades under substantially different demand conditions. In contrast, a model trained on diverse O-D patterns demonstrates strong robustness, consistently outperforming ASC even under highly dissimilar unseen demand scenarios.

Keywords: Reinforcement Learning, Adaptive Traffic Signal Control, Robustness Evaluation, Traffic Demand Patterns, Distributed Asynchronous Training, Actuated Signal Control, Traffic Simulation, Delay Reduction

294. ❌ Mechanistic Foundations of Goal-Directed Control

Authors: Alma Lago Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

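Score rationale: The paper focuses on extending the mechanistic interpretability framework to embodied control systems (such as infant motor learning), studying the formation of control circuits, gating mechanisms, and phase transitions. Although the paper concerns explainable AI, it does not involve any large language models (LLMs), innovations in deep learning principles, or applications of large models in different domains. All other keywords (such as LLMs, MoE, Scaling Laws, training methods, inference techniques, and agent systems) are unrelated to the paper's content. Therefore only "Mechanistic Interpretability OR Explainable AI" receives a relevance of 10/10 (core content), and the rest receive 0.

!!! tip deepseek-chat TL;DR

The study extends the mechanistic interpretability framework to embodied control systems. It shows how reactive and prospective control strategies in infant motor learning emerge and compete through causal control circuits and gating mechanisms, and it identifies the context window k as the key parameter governing circuit formation.

Abstract (Chinese translation)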

机制可解释性通过将模型行为分解为相互竞争的算法、识别训练中的相变现象，并对策略转换的时机与原因推导闭式预测，彻底改变了 Transformer 电路的分析范式。然而，该方法目前主要局限于序列预测架构，尚未为具身控制系统提供可类比的机制性解释。本研究将该框架拓展至感觉运动-认知发展领域，以婴儿运动学习作为模型系统。研究表明，基础归纳偏置催生了因果控制电路，其中习得的门控机制会收敛至理论驱动的不确定性阈值。由此产生的动力学过程揭示了仲裁门中存在清晰的相变，其承诺行为可由闭式指数移动平均替代模型很好地描述。我们确定上下文窗口k为支配电路形成的关键参数：低于最小阈值（k≤4）时仲裁机制无法形成；高于该阈值（k≥8）时，门控置信度按log k渐近缩放。二维相图进一步揭示了任务需求依赖的路径仲裁机制，这与“仅当预测误差保持在任务容限窗口内时前瞻性执行才具有优势”的理论预测一致。这些结果共同为反应式与前瞻式控制策略在学习过程中如何形成并竞争提供了机制性解释。更广泛而言，本研究深化了对认知发展的机制性解释，并为设计可解释的具身智能体提供了原则性指导。

Abstract

Mechanistic interpretability has transformed the analysis of transformer circuits by decomposing model behavior into competing algorithms, identifying phase transitions during training, and deriving closed-form predictions for when and why strategies shift. However, this program has remained largely confined to sequence-prediction architectures, leaving embodied control systems without comparable mechanistic accounts. Here we extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system. We show that foundational inductive biases give rise to causal control circuits, with learned gating mechanisms converging toward theoretically motivated uncertainty thresholds. The resulting dynamics reveal a clean phase transition in the arbitration gate whose commitment behavior is well described by a closed-form exponential moving-average surrogate. We identify context window k as the critical parameter governing circuit formation: below a minimum threshold (k$\leq$4) the arbitration mechanism cannot form; above it (k$\geq$8), gate confidence scales asymptotically as log k. A two-dimensional phase diagram further reveals task-demand-dependent route arbitration consistent with the prediction that prospective execution becomes advantageous only when prediction error remains within the task tolerance window. Together, these results provide a mechanistic account of how reactive and prospective control strategies emerge and compete during learning. More broadly, this work sharpens mechanistic accounts of cognitive development and provides principled guidance for the design of interpretable embodied agents.

Keywords: Mechanistic interpretability, Embodied control systems, Sensorimotor-cognitive development, Infant motor learning, Causal control circuits, Phase transition, Context window, Prospective control

295. ❌ Decomposing Probabilistic Scores: Reliability, Information Loss and Uncertainty

Authors: Arthur Charpentier, Agathe Fernandes-Machado Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

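Scoring rationale: The paper develops a theoretical decomposition framework for the calibration of probabilistic scores, which belongs to probabilistic prediction and model evaluation within machine learning. It has no direct connection to any of the keywords, all of which concern large models, deep-learning techniques, or specific AI applications. The paper does not mention large models, deep-learning techniques, or AI applications in a particular scientific field, so every keyword is scored 0.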

!!! tip deepseek-chat TL;DR

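The paper proposes a theoretical framework for decomposing the calibration error of probabilistic scores: expected loss splits into a reliability term and a residual-uncertainty term, with applications to post-hoc recalibration and model aggregation in classification.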

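Abstract (translated)

Calibration is a conditional property that depends on the information retained by the predictor. We establish decomposition identities for arbitrary proper loss functions that make this dependence explicit. At any information level $\mathcal A$, the expected loss of an $\mathcal A$-measurable predictor decomposes into a proper-regret (reliability) term and a conditional-entropy (residual uncertainty) term. For nested levels $\mathcal A\subseteq\mathcal B$, a chain decomposition quantifies the information gain from $\mathcal A$ to $\mathcal B$. Applying this to classification with features $\boldsymbol{X}$ and score $S=s(\boldsymbol{X})$ gives a three-term identity: miscalibration, a grouping term measuring the information lost in going from $\boldsymbol{X}$ to $S$, and irreducible uncertainty at the feature level. We use the framework to analyze post-hoc recalibration, aggregation of calibrated models, and stagewise/boosting constructions, and give explicit forms for the Brier loss and log-loss.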

Abstract

Calibration is a conditional property that depends on the information retained by a predictor. We develop decomposition identities for arbitrary proper losses that make this dependence explicit. At any information level $\mathcal A$, the expected loss of an $\mathcal A$-measurable predictor splits into a proper-regret (reliability) term and a conditional entropy (residual uncertainty) term. For nested levels $\mathcal A\subseteq\mathcal B$, a chain decomposition quantifies the information gain from $\mathcal A$ to $\mathcal B$. Applied to classification with features $\boldsymbol{X}$ and score $S=s(\boldsymbol{X})$, this yields a three-term identity: miscalibration, a grouping term measuring information loss from $\boldsymbol{X}$ to $S$, and irreducible uncertainty at the feature level. We leverage the framework to analyze post-hoc recalibration, aggregation of calibrated models, and stagewise/boosting constructions, with explicit forms for Brier and log-loss.

Keywords: calibration, proper loss, reliability, information loss, uncertainty, decomposition, Brier score, log-loss
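
As an illustration of the three-term identity described in the abstract, the sketch below checks it numerically for the Brier loss on a small discrete population. The terms use the standard Brier forms (squared distance from the score to the calibration curve, squared distance from the calibration curve to the true conditional probability, and Bernoulli variance). That these match the paper's explicit forms is an assumption, and the function name `brier_three_terms` and the toy population are hypothetical.

```python
from collections import defaultdict


def brier_three_terms(population):
    """Split the expected Brier loss into three terms.

    population: list of (weight, p_true, score), one entry per feature value x,
    where weight = P(X = x), p_true = P(Y = 1 | X = x) and score = s(x).
    """
    # Expected Brier loss E[(Y - S)^2], averaging over Y ~ Bernoulli(p_true).
    total = sum(w * (p * (1 - s) ** 2 + (1 - p) * s ** 2) for w, p, s in population)

    # Calibration curve c(s) = E[Y | S = s]: pool feature values sharing a score.
    mass, hits = defaultdict(float), defaultdict(float)
    for w, p, s in population:
        mass[s] += w
        hits[s] += w * p
    curve = {s: hits[s] / mass[s] for s in mass}

    miscalibration = sum(w * (s - curve[s]) ** 2 for w, p, s in population)
    grouping = sum(w * (curve[s] - p) ** 2 for w, p, s in population)
    irreducible = sum(w * p * (1 - p) for w, p, s in population)
    return total, miscalibration, grouping, irreducible


# Hypothetical population: four equally likely feature values, two distinct scores.
toy = [(0.25, 0.1, 0.3), (0.25, 0.5, 0.3), (0.25, 0.6, 0.8), (0.25, 0.9, 0.8)]
total, mis, grp, irr = brier_three_terms(toy)
print(total, mis + grp + irr)
```

In this toy case the score 0.3 is calibrated while 0.8 is not, so the miscalibration term is small and most of the reducible loss sits in the grouping term, reflecting the feature information that the score discards.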

296. ❌ Geometric framework for biological evolution

Authors: Vitaly Vanchurin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

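Scoring rationale: The paper "Geometric framework for biological evolution" studies a geometric framework for evolutionary dynamics, modeling evolution as a learning process on the fitness landscape. Every scoring keyword relates directly to large language models, deep-learning techniques, or applications of AI in science, whereas this paper focuses on theoretical biology and evolutionary dynamics and involves no artificial intelligence, machine learning, or large-model technology. All keyword relevance scores are therefore 0.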

!!! tip deepseek-chat TL;DR

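The paper proposes a covariant geometric framework for the dynamics of biological evolution, shows that the maximum entropy principle yields a fundamental relation between the inverse metric tensor and the covariance matrix, and models evolution as a learning process on the fitness landscape.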

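Abstract (translated)

We propose a generally covariant description of evolutionary dynamics that operates consistently in both genotype and phenotype spaces. We show that the maximum entropy principle yields a fundamental equivalence between the inverse metric tensor and the covariance matrix, so that the Lande equation takes the form of a covariant gradient ascent equation. This shows that evolution can be modeled as a learning process on the fitness landscape, with the specific learning algorithm determined by the functional relation between the metric tensor and the noise covariance produced by the microscopic dynamics. Although the metric (or the inverse genotypic covariance matrix) has been characterized extensively in empirical studies, the noise covariance and its associated observable (the covariance of evolutionary changes) have never been measured directly. This raises a key experimental challenge: determining the specific functional form that relates the metric to the noise covariance.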

Abstract

We develop a generally covariant description of evolutionary dynamics that operates consistently in both genotype and phenotype spaces. We show that the maximum entropy principle yields a fundamental identification between the inverse metric tensor and the covariance matrix, revealing the Lande equation as a covariant gradient ascent equation. This demonstrates that evolution can be modeled as a learning process on the fitness landscape, with the specific learning algorithm determined by the functional relation between the metric tensor and the noise covariance arising from microscopic dynamics. While the metric (or the inverse genotypic covariance matrix) has been extensively characterized empirically, the noise covariance and its associated observable (the covariance of evolutionary changes) have never been directly measured. This poses the experimental challenge of determining the functional form relating metric to noise covariance.

Keywords: evolutionary dynamics, geometric framework, fitness landscape, covariant description, maximum entropy principle, metric tensor, noise covariance, Lande equation
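
To make the "Lande equation as gradient ascent" reading concrete, here is a minimal sketch of the classical multivariate Lande update, in which the change in mean phenotype is the covariance matrix times the selection gradient. The quadratic log-fitness landscape, the optimum `theta`, and the matrix `G` are hypothetical values chosen for illustration and are not taken from the paper.

```python
def lande_step(z, selection_gradient, G):
    """One generation of the Lande update: z <- z + G @ beta(z).

    G is the covariance matrix, which the abstract identifies with the
    inverse metric tensor, so the update is a preconditioned gradient step.
    """
    beta = selection_gradient(z)
    n = len(z)
    return [z[i] + sum(G[i][j] * beta[j] for j in range(n)) for i in range(n)]


# Hypothetical quadratic log-fitness landscape with its optimum at theta.
theta = [1.0, 2.0]


def selection_gradient(z):
    # Gradient of log-fitness -0.5 * ||z - theta||^2 with respect to z.
    return [t - zi for t, zi in zip(theta, z)]


# Hypothetical covariance matrix (symmetric, positive definite).
G = [[0.20, 0.05],
     [0.05, 0.10]]

z = [0.0, 0.0]
for _ in range(500):
    z = lande_step(z, selection_gradient, G)
print(z)
```

Because `G` rescales and rotates the raw gradient, the mean phenotype climbs fastest along directions with the most variation, which is the sense in which the covariance matrix acts like an inverse metric.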

297. ❌ Massive Redundancy in Gradient Transport Enables Sparse Online Learning

Authors: Aur Shalev Merin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15195v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

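Scoring rationale: The paper studies sparse gradient transport in real-time recurrent learning (RTRL). Its core result is that gradient transport is massively redundant: sparse propagation through only 6% of paths recovers 84% of full RTRL's adaptation ability. This is highly relevant to the keyword "Mixture of Experts OR MoE OR Sparse Models" (8 points), because the paper directly studies sparse-model techniques. The other keywords mainly concern specific large-language-model techniques (such as RLHF, RAG, and quantization) or particular application areas (such as AI for Science), whereas this paper focuses on basic gradient computation and sparse optimization, a lower-level deep-learning topic with no direct connection to those keywords.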

!!! tip deepseek-chat TL;DR

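The paper shows that gradient transport in real-time recurrent learning is massively redundant: sparse propagation through only a small number of paths recovers most of the gradient information, and the method's effectiveness and numerical stability are validated on RNNs, LSTMs, and Transformers.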

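Abstract (translated)

Real-time recurrent learning (RTRL) obtains exact online gradients by computing a Jacobian tensor during forward propagation, but at a cost of O(n^4) per step. Earlier work tried structured approximations (such as rank-1 compression, graph-based sparsification, and Kronecker factorization). We find that, under a continuous error signal, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 ± 6% of full RTRL's adaptation ability (across five random seeds), and the absolute count k=4 remains effective from n=64 to n=256 (the path fraction falls from 6% to 1.6% and recovery dips from 84% to 78%), which means the relative computational advantage of sparse RTRL grows with network size. In recurrent neural networks, this recovery is selection-invariant (even adversarial path selection works) and shows a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but nearly isotropic (condition numbers between 2.6 and 6.5), so any random subset gives a directionally representative gradient estimate. On chaotic dynamics (the Lorenz attractor), sparse propagation is more numerically stable than full RTRL (coefficient of variation 13% versus 88%), because subsampling avoids amplifying pathological spectral modes. The redundancy also holds for long short-term memory networks (LSTMs), where k=4 matches full RTRL, and extends to Transformer architectures through sparse gradient transport (50% attention-head sparsity outperforms the dense baseline; 33% is borderline), with the higher sparsity thresholds reflecting functional specialization of attention heads rather than isotropy. In experiments on real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 ± 11% recovery, five seeds), where sparse propagation is again more stable than full RTRL. Without a continuous error signal, Jacobian propagation accumulates numerical drift and degrades every RTRL variant, which is a scope limitation for all forward-mode methods. The conclusion still holds with stochastic gradient descent (SGD) (92 ± 1% recovery), suggesting independence from the choice of optimizer.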

Abstract

Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL’s adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6%, recovery 84 to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.

Keywords: sparse gradient transport, real-time recurrent learning, Jacobian propagation, online learning, redundancy, numerical stability, transformer, LSTM
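
The idea of carrying the RTRL sensitivity tensor through only a few recurrent paths can be sketched for a plain tanh RNN as follows. This is one plausible reading of "k of n paths" (each unit keeps k randomly chosen incoming connections when propagating sensitivities) and is an assumption, not the paper's implementation. The names `rtrl_step`, `run_rtrl`, and `random_paths` are hypothetical.

```python
import math
import random


def rtrl_step(W, h_prev, x, P_prev, paths):
    """One step of h_t = tanh(W h_{t-1} + x_t) with sensitivities P[i][a][b] = dh_i/dW_ab.

    paths[i] lists the units j whose sensitivities are carried into unit i.
    Listing every j gives exact RTRL; a random subset of size k is the sparse variant.
    """
    n = len(W)
    h = [math.tanh(sum(W[i][j] * h_prev[j] for j in range(n)) + x[i]) for i in range(n)]
    P = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        slope = 1.0 - h[i] ** 2  # derivative of tanh at unit i
        for a in range(n):
            for b in range(n):
                # Sensitivity carried over from the previous step (the O(n) inner sum).
                carried = sum(W[i][j] * P_prev[j][a][b] for j in paths[i])
                # Immediate effect of W_ab on unit i's pre-activation.
                direct = h_prev[b] if a == i else 0.0
                P[i][a][b] = slope * (carried + direct)
    return h, P


def run_rtrl(W, xs, paths):
    """Run the recurrence over the inputs xs, starting from a zero state."""
    n = len(W)
    h = [0.0] * n
    P = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in xs:
        h, P = rtrl_step(W, h, x, P, paths)
    return h, P


def random_paths(n, k, rng):
    """Pick k incoming units per unit, uniformly at random without replacement."""
    return [rng.sample(range(n), k) for _ in range(n)]
```

With all n paths the inner sum makes each step cost O(n^4), matching the figure quoted in the abstract; restricting it to k paths reduces the step to O(k n^3) in this sketch.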

298. ❌ PiGRAND: Physics-informed Graph Neural Diffusion for Intelligent Additive Manufacturing

Authors: Benjamin Uhrich, Tim Häntschel, Erhard Rahm Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

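Scoring rationale: PiGRAND applies physics-informed graph neural networks to heat-transport prediction in additive manufacturing. This falls under AI for Science and has some relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it combines machine learning with physical models to address an engineering-science problem. However, the paper offers no innovation in large language models (LLMs), MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, agent systems, model compression, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or other large-model or deep-learning techniques, so all other keywords are scored 0. The weighted total is computed as 5.0 (only one keyword scores 5 points, with weight 1.0).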

!!! tip deepseek-chat TL;DR

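The paper proposes a physics-informed graph neural diffusion framework (PiGRAND) for heat-transport prediction in 3D printing. By combining physical principles with efficient graph learning, it significantly outperforms conventional methods in prediction accuracy and computational performance.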

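Abstract (translated)

A comprehensive understanding of heat transport is essential for optimizing many mechanical and engineering applications, including 3D printing. In recent years, combining machine learning with physics-based models has produced a powerful fusion of numerical methods and data-driven algorithms. This progress is driven by the availability of limited sensor data in engineering and scientific domains, where data collection is costly and some measurements are hard to obtain. To this end, we propose PiGRAND, a physics-informed graph neural diffusion framework. To reduce the computational complexity of graph learning, we developed an efficient graph construction procedure. The method is inspired by the explicit Euler and implicit Crank-Nicolson methods for modeling continuous heat transport, and uses sub-learning models to ensure accurate diffusion between graph nodes. To improve computational performance, the method is combined with efficient transfer learning. We evaluate PiGRAND on thermal imaging data from 3D printing; the results show significant improvements in prediction accuracy and computational performance over conventional graph neural diffusion (GRAND) and physics-informed neural networks (PINNs). These improvements stem from incorporating physical principles, derived from the theoretical study of partial differential equations (PDEs), into the learning model. The PiGRAND code is open-sourced on GitHub: https://github.com/bu32loxa/PiGRAND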

Abstract

A comprehensive understanding of heat transport is essential for optimizing various mechanical and engineering applications, including 3D printing. Recent advances in machine learning, combined with physics-based models, have enabled a powerful fusion of numerical methods and data-driven algorithms. This progress is driven by the availability of limited sensor data in various engineering and scientific domains, where the cost of data collection and the inaccessibility of certain measurements are high. To this end, we present PiGRAND, a Physics-informed graph neural diffusion framework. In order to reduce the computational complexity of graph learning, an efficient graph construction procedure was developed. Our approach is inspired by the explicit Euler and implicit Crank-Nicolson methods for modeling continuous heat transport, leveraging sub-learning models to secure the accurate diffusion across graph nodes. To enhance computational performance, our approach is combined with efficient transfer learning. We evaluate PiGRAND on thermal images from 3D printing, demonstrating significant improvements in prediction accuracy and computational performance compared to traditional graph neural diffusion (GRAND) and physics-informed neural networks (PINNs). These enhancements are attributed to the incorporation of physical principles derived from the theoretical study of partial differential equations (PDEs) into the learning model. The PiGRAND code is open-sourced on GitHub: https://github.com/bu32loxa/PiGRAND

Keywords: Physics-informed learning, Graph neural networks, Heat transport, Additive manufacturing, 3D printing, Partial differential equations, Transfer learning, Computational efficiency
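
The two time-stepping schemes the abstract cites as inspiration can be sketched for plain heat diffusion on a graph, dx/dt = -Lx with L the graph Laplacian. This shows only the textbook explicit Euler and Crank-Nicolson updates; it contains none of PiGRAND's learned components, and the helper names and the three-node path graph are hypothetical.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)] for i in range(n)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def euler_step(L, x, dt):
    """Explicit Euler: x_new = x - dt * L x."""
    return [xi - dt * li for xi, li in zip(x, matvec(L, x))]


def crank_nicolson_step(L, x, dt):
    """Crank-Nicolson: (I + dt/2 L) x_new = (I - dt/2 L) x."""
    n = len(x)
    lhs = [[(1.0 if i == j else 0.0) + 0.5 * dt * L[i][j] for j in range(n)] for i in range(n)]
    rhs = [xi - 0.5 * dt * li for xi, li in zip(x, matvec(L, x))]
    return solve(lhs, rhs)


# Hypothetical example: heat starts on one end of a three-node path graph.
L = laplacian([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
x_euler = x_cn = [1.0, 0.0, 0.0]
for _ in range(500):
    x_euler = euler_step(L, x_euler, 0.1)
    x_cn = crank_nicolson_step(L, x_cn, 0.1)
print(x_euler, x_cn)
```

Explicit Euler is cheap but stable only for small enough step sizes, while Crank-Nicolson requires a linear solve per step and stays stable for larger steps. Both conserve the total heat on the graph.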

299. ❌ The Sampling Complexity of Condorcet Winner Identification in Dueling Bandits

Authors: El Mehdi Saad, Victor Thuot, Nicolas Verzelen Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

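Scoring rationale: The paper studies Condorcet winner identification in stochastic dueling bandits, a topic in classical machine learning and statistical learning that centers on sample-complexity analysis and theoretical proofs for algorithms. Its content does not touch on large models, deep learning, AI for Science, or any technique named in the scoring keywords (such as MoE, RLHF, RAG, or quantization), and it is unrelated to all keywords.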

!!! tip deepseek-chat TL;DR

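The paper studies the sample complexity of identifying the Condorcet winner in stochastic dueling bandits, proposes a new identification procedure that exploits the full gap matrix, and proves its non-asymptotic optimality.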

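Abstract (translated)

We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, that is, an arm that wins each noisy pairwise comparison with probability at least $1/2$. We propose a new identification procedure that makes full use of the complete gap matrix $\Delta_{i,j}=q_{i,j}-\tfrac12$ (where $q_{i,j}$ is the probability that arm $i$ beats arm $j$), rather than only the gaps between the Condorcet winner and the other arms. By exploiting informative comparisons that do not involve the winner, we derive high-probability, instance-dependent sample-complexity guarantees that improve on the best known results (up to logarithmic factors). We complement these results with a new lower-bound analysis which, to our knowledge, is the first for Condorcet-winner identification in stochastic dueling bandits. The lower-bound analysis isolates the intrinsic cost of locating informative entries in the gap matrix and estimating them to the required confidence, which establishes the optimality of our non-asymptotic bounds. Overall, our results reveal new regimes and trade-offs in sample complexity that are not captured by asymptotic analyses based only on the expected budget.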

Abstract

We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, i.e., an arm that wins each noisy pairwise comparison with probability at least $1/2$. We introduce a new identification procedure that exploits the full gap matrix $\Delta_{i,j}=q_{i,j}-\tfrac12$ (where $q_{i,j}$ is the probability that arm $i$ beats arm $j$), rather than only the gaps between the Condorcet winner and the other arms. We derive high-probability, instance-dependent sample-complexity guarantees that (up to logarithmic factors) improve the best known ones by leveraging informative comparisons beyond those involving the winner. We complement these results with new lower bounds which, to our knowledge, are the first for Condorcet-winner identification in stochastic dueling bandits. Our lower-bound analysis isolates the intrinsic cost of locating informative entries in the gap matrix and estimating them to the required confidence, establishing the optimality of our non-asymptotic bounds. Overall, our results reveal new regimes and trade-offs in the sample complexity that are not captured by asymptotic analyses based only on the expected budget.

Keywords: dueling bandits, Condorcet winner, sample complexity, best-arm identification, stochastic bandits, gap matrix, lower bounds, non-asymptotic analysis
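
The definitions in the abstract (the gap matrix and the Condorcet winner) can be written out directly. The sketch below assumes the preference matrix `q` is known exactly, whereas the paper's setting has to estimate it from noisy duels. The function names and the example matrices are hypothetical.

```python
def gap_matrix(q):
    """Gap matrix Delta[i][j] = q[i][j] - 1/2, where q[i][j] = P(arm i beats arm j)."""
    n = len(q)
    return [[q[i][j] - 0.5 for j in range(n)] for i in range(n)]


def condorcet_winner(q):
    """Return the first arm that beats every other arm with probability >= 1/2, else None."""
    gaps = gap_matrix(q)
    n = len(q)
    for i in range(n):
        if all(gaps[i][j] >= 0 for j in range(n) if j != i):
            return i
    return None


# Hypothetical preference matrix: arm 0 beats both other arms.
q_with_winner = [[0.5, 0.6, 0.7],
                 [0.4, 0.5, 0.3],
                 [0.3, 0.7, 0.5]]

# Hypothetical cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0): no winner.
q_cycle = [[0.5, 0.6, 0.4],
           [0.4, 0.5, 0.6],
           [0.6, 0.4, 0.5]]

print(condorcet_winner(q_with_winner), condorcet_winner(q_cycle))
```

In the first matrix the entries between arms 1 and 2 (gaps of magnitude 0.2) are larger than some of the winner's own gaps, which gives a feel for why comparisons that do not involve the winner can still be informative.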

300. ❌ Joint Routing and Model Pruning for Decentralized Federated Learning in Bandwidth-Constrained Multi-Hop Wireless Networks

Authors: Xiaoyu He, Weicai Li, Tiejun Lv, Xi Yu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

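Scoring rationale: The paper studies communication optimization in decentralized federated learning (D-FL) and proposes a joint routing and model-pruning framework that reduces transmission latency and improves model accuracy in bandwidth-constrained multi-hop wireless networks. Its core content covers federated learning, model pruning, routing optimization, and communication constraints, but none of the keyword topics: large-language-model (LLM) techniques, training methods (such as pre-training, fine-tuning, and alignment), inference optimization (such as attention mechanisms and context extension), agent systems, model interpretability, or AI applications in science. All keywords relate to large language models or specific AI subfields, whereas this paper focuses on communication and optimization in federated learning, so every keyword's relevance is 0.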

!!! tip deepseek-chat TL;DR

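To address limited communication resources for decentralized federated learning in multi-hop wireless networks, the paper proposes a joint routing and model-pruning optimization framework that, in simulations, reduces average transmission latency by 27.8% and improves test accuracy by about 12%.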

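Abstract (translated)

Decentralized federated learning (D-FL) enables privacy-preserving training without a central server, but its multi-hop model exchange and aggregation are often limited by communication resource constraints. To address this, we propose a joint routing-and-pruning framework that keeps communication latency within prescribed limits by optimizing routing paths and pruning rates. We analyze how the sum of model biases across all clients affects the convergence bound of D-FL and formulate an optimization problem that maximizes the model retention rate, so as to minimize these biases under communication constraints. Further analysis shows that each client's model retention rate is path-dependent, which reduces the original problem to routing optimization. Building on this finding, we design a routing algorithm that selects latency-efficient transmission paths, so that more parameters can be delivered within the time budget and D-FL convergence improves. Simulations show that, compared with an unpruned system, the proposed framework reduces average transmission latency by 27.8% and improves test accuracy by about 12%. In addition, compared with standard benchmark routing algorithms, the proposed routing method improves accuracy by about 8%.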

Abstract

Decentralized federated learning (D-FL) enables privacy-preserving training without a central server, but multi-hop model exchanges and aggregation are often bottlenecked by communication resource constraints. To address this issue, we propose a joint routing-and-pruning framework that optimizes routing paths and pruning rates to maintain communication latency within prescribed limits. We analyze how the sum of model biases across all clients affects the convergence bound of D-FL and formulate an optimization problem that maximizes the model retention rate to minimize these biases under communication constraints. Further analysis reveals that each client’s model retention rate is path-dependent, which reduces the original problem to a routing optimization. Leveraging this insight, we develop a routing algorithm that selects latency-efficient transmission paths, allowing more parameters to be delivered within the time budget and thereby improving D-FL convergence. Simulations demonstrate that, compared with unpruned systems, the proposed framework reduces average transmission latency by 27.8% and improves testing accuracy by approximately 12%. Furthermore, relative to standard benchmark routing algorithms, the proposed routing method improves accuracy by roughly 8%.

Keywords: Decentralized Federated Learning, Model Pruning, Routing Optimization, Communication Latency, Multi-hop Wireless Networks, Convergence Bound, Bandwidth Constraints, Model Retention Rate
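
The link between routing and retention rate can be illustrated with a toy model. The sketch below assumes, purely for illustration, that sending the full model over a hop takes a fixed time, that transfer time scales linearly with the fraction of parameters kept, and that the route is the lowest-latency path found by Dijkstra's algorithm. None of these modeling choices, nor the names `min_latency` and `retention_rate`, come from the paper.

```python
import heapq


def min_latency(graph, src, dst):
    """Dijkstra over per-hop times (seconds) needed to send the full model.

    graph: dict mapping node -> {neighbor: hop_time}.
    """
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, hop_time in graph.get(node, {}).items():
            candidate = dist + hop_time
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


def retention_rate(graph, src, dst, budget):
    """Largest fraction of parameters deliverable within the time budget (assumed linear model)."""
    latency = min_latency(graph, src, dst)
    if latency == 0.0:
        return 1.0
    return min(1.0, budget / latency)


# Hypothetical network: the two-hop route a -> c -> b is faster than the direct hop a -> b.
graph = {"a": {"b": 4.0, "c": 1.0}, "c": {"b": 1.0}}
print(retention_rate(graph, "a", "b", budget=1.0))
```

Under these assumptions the direct hop would allow only a quarter of the parameters within a one-second budget, while the faster two-hop route allows half, which is the qualitative effect the abstract describes.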

301. ❌ Sequential Transport for Causal Mediation Analysis

Authors: Agathe Fernandes-Machado, Iryna Voitsitska, Arthur Charpentier, Ewen Gallic Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

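Scoring rationale: The paper is a methodological contribution to the statistics of causal mediation analysis. It proposes sequential transport, a distributional framework that combines optimal transport (OT) with a mediator directed acyclic graph (DAG) to construct unit-level mediator counterfactuals and decompose direct and indirect effects. The content belongs entirely to causal inference, statistical modeling, and econometrics, and does not involve large models, deep learning, AI techniques, or applications of AI in science. All scoring keywords concern large-model technology, training methods, inference optimization, AI agents, or AI applications in science and have no direct connection to the paper's topic, so every keyword's relevance is scored 0.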

!!! tip deepseek-chat TL;DR

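The paper proposes sequential transport, a distributional framework that combines optimal transport with a mediator directed acyclic graph for causal mediation analysis. It constructs unit-level mediator counterfactuals, decomposes direct and indirect effects, and is validated in theory and in simulations.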

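Abstract (translated)

We propose sequential transport (ST), a distributional framework for mediation analysis that combines optimal transport (OT) with a mediator directed acyclic graph (DAG). Rather than relying on cross-world counterfactual assumptions, the method constructs unit-level mediator counterfactuals by minimally transporting each mediator (marginally or conditionally) toward its distribution under the alternative treatment, while preserving the causal dependencies encoded by the DAG. For numerical mediators, ST uses monotone (conditional) OT maps based on conditional CDF/quantile estimation; for categorical mediators, it extends naturally through simplex-based transport. Under standard regularity and support conditions, we prove consistency of the estimated transport maps and of the induced unit-level decompositions into mutatis mutandis direct and indirect effects. When the treatment is randomized or ignorable (possibly conditional on covariates), these decompositions have a causal interpretation; otherwise, they provide a principled distributional attribution of between-group differences that is aligned with the mediator structure. Gaussian examples show that ST recovers the classical mediation formulas, and additional simulations confirm good performance in nonlinear and mixed-type settings. An application to the COMPAS dataset shows how ST produces deterministic, DAG-consistent counterfactual mediators and a fine-grained, mediator-level attribution of disparities.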

Abstract

We propose sequential transport (ST), a distributional framework for mediation analysis that combines optimal transport (OT) with a mediator directed acyclic graph (DAG). Instead of relying on cross-world counterfactual assumptions, ST constructs unit-level mediator counterfactuals by minimally transporting each mediator, either marginally or conditionally, toward its distribution under an alternative treatment while preserving the causal dependencies encoded by the DAG. For numerical mediators, ST uses monotone (conditional) OT maps based on conditional CDF/quantile estimators; for categorical mediators, it extends naturally via simplex-based transport. We establish consistency of the estimated transport maps and of the induced unit-level decompositions into mutatis mutandis direct and indirect effects under standard regularity and support conditions. When the treatment is randomized or ignorable (possibly conditional on covariates), these decompositions admit a causal interpretation; otherwise, they provide a principled distributional attribution of differences between groups aligned with the mediator structure. Gaussian examples show that ST recovers classical mediation formulas, while additional simulations confirm good performance in nonlinear and mixed-type settings. An application to the COMPAS dataset illustrates how ST yields deterministic, DAG-consistent counterfactual mediators and a fine-grained mediator-level attribution of disparities.

Keywords: causal mediation analysis, optimal transport, directed acyclic graph, counterfactuals, direct and indirect effects, distributional framework, sequential transport, mediator decomposition
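
For a single numerical mediator, the monotone transport map mentioned in the abstract is the composition of the source CDF with the target quantile function. The sketch below shows that one-dimensional building block with known Gaussian distributions. It omits the conditional, DAG-ordered steps and the CDF/quantile estimation that the paper describes, and the function name and distribution parameters are hypothetical.

```python
from statistics import NormalDist


def monotone_transport(x, source, target):
    """Monotone OT map in one dimension: T(x) = F_target^{-1}(F_source(x)).

    Note: inv_cdf raises an error if the CDF value rounds to exactly 0 or 1,
    which can happen for x extremely far in the tails.
    """
    return target.inv_cdf(source.cdf(x))


# Hypothetical mediator distributions under the two treatment arms.
untreated = NormalDist(mu=0.0, sigma=1.0)
treated = NormalDist(mu=2.0, sigma=3.0)

# Counterfactual mediator value for a unit observed at x = 1.0 when untreated.
counterfactual = monotone_transport(1.0, untreated, treated)
print(counterfactual)
```

For Gaussians the map reduces to the affine form mu_1 + (sigma_1 / sigma_0)(x - mu_0), so a unit one standard deviation above the untreated mean is sent one standard deviation above the treated mean. Each unit keeps its rank, which is why the resulting counterfactual is deterministic.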

302. ❌ Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Authors: Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

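Scoring rationale: The paper studies latent shift and proxy-variable methods for domain adaptation, which places it in causal inference and machine learning. Its core contribution is the introduction of Latent Equivalent Classes (LECs) and the Proximal Quasi-Bayesian Active learning (PQAL) framework for point identification under imperfect proxies. It is unrelated to most keywords (LLMs, MoE, RLHF, RAG, and so on), because those keywords concern specific large-model techniques such as training methods and inference optimization, whereas this is a theoretical methods paper. The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper addresses domain adaptation; since its focus is a theoretical framework rather than large-model pre-training or continual pre-training, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper shows how to point-identify a robust predictor when latent confounders cause distribution shift and the proxies are imperfect, by introducing latent equivalent classes and an active learning framework. The proposed PQAL method outperforms existing methods on synthetic data.

Abstract (Chinese translation)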

当领域间的分布偏移源于同时影响协变量与结果的潜在混杂因子时，解决领域自适应问题变得更具挑战性。现有的基于代理变量的方法处理潜在偏移时，依赖于一个强完备性假设来唯一确定（点识别）一个稳健预测器。完备性要求代理变量必须包含关于潜在混杂因子变异的充分信息。对于不完美的代理变量，从混杂因子到代理变量分布空间的映射是非单射的，多个潜在混杂因子值可能生成相同的代理变量分布。这破坏了完备性假设，且观测数据与多个可能的预测器（集合识别）相一致。为解决此问题，我们引入了潜在等价类（Latent Equivalent Classes, LECs）。LECs被定义为能够诱导相同条件代理变量分布的潜在混杂因子组。我们证明，只要多个领域在如何混合代理变量诱导的LECs以形成稳健预测器方面存在足够差异，该稳健预测器的点识别仍然可以实现。这一领域多样性条件被形式化为混合权重上的跨领域秩条件，该假设远弱于完备性假设。我们提出了近端准贝叶斯主动学习（Proximal Quasi-Bayesian Active learning, PQAL）框架，该框架主动查询满足此秩条件的最小多样化领域集合。PQAL能够高效地恢复点识别的预测器，在不同程度的偏移下表现出鲁棒性，并在合成数据与半合成dSprites数据集上优于先前方法。

Abstract

Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and observed data are consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a minimal set of diverse domains that satisfy this rank condition. PQAL can efficiently recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and the semi-synthetic dSprites dataset.

Keywords: domain adaptation, latent shift, proxy variables, point-identification, robust predictor, latent equivalent classes, active learning, causal inference

303. ❌ Storage and selection of multiple chaotic attractors in minimal reservoir computers

Authors: Francesco Martinuzzi, Holger Kantz Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

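Scoring rationale: The paper studies how minimal topologies in reservoir computing affect the storage and selection of multiple chaotic attractors. It belongs to classical machine learning and dynamical systems and is unrelated to every scoring keyword, all of which center on large models, deep learning techniques, and their applications. It does not touch on large models, deep learning, or AI for Science, and uses none of the associated techniques or concepts.

!!! tip deepseek-chat TL;DR

The paper asks whether reservoir computers with minimal topologies can store multiple chaotic attractors and switch between them on external cues. It finds that they can store multiple attractors but struggle with cue-dependent switching, and that performance does not depend on a particular topology.

Abstract (Chinese translation)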

现代预测建模日益要求单一学习到的动力学基底能在多种机制下运行。从动力学系统视角看，这种能力可分解为多重吸引子的存储能力以及根据情境线索选择相应吸引子的能力。在储层计算领域，多吸引子学习主要依赖于大规模随机连接的储层网络，其假设是随机连接能产生足够丰富的内部动力学。与此同时，近期研究表明，在单一混沌系统预测任务中，极简的确定性储层网络可达到与随机设计相当的性能。那么，在何种条件下极简拓扑结构能够学习多个混沌吸引子？本文发现，极简架构能够成功存储多个混沌吸引子，但这些架构在任务切换方面存在困难——即系统必须根据外部线索在不同吸引子间进行转换。我们基于八个三维混沌系统形成的全部28组无序系统对，测试了存储与选择性能。研究未发现多吸引子性能与储层拓扑结构存在显著相关性。在所考察的十种拓扑结构中，无论是存储能力还是线索依赖的选择能力，均未出现某种拓扑持续优于其他拓扑的情况。我们的结果表明：尽管极简基底具备表征共存吸引子的能力，但它们可能缺乏实现线索驱动转换所需的稳健时序记忆能力。

Abstract

Modern predictive modeling increasingly calls for a single learned dynamical substrate to operate across multiple regimes. From a dynamical-systems viewpoint, this capability decomposes into the storage of multiple attractors and the selection of the appropriate attractor in response to contextual cues. In reservoir computing (RC), multi-attractor learning has largely been pursued using large, randomly wired reservoirs, on the assumption that stochastic connectivity is required to generate sufficiently rich internal dynamics. At the same time, recent work shows that minimal deterministic reservoirs can match random designs for single-system chaotic forecasting. Under which conditions can minimal topologies learn multiple chaotic attractors? In this paper, we find that minimal architectures can successfully store multiple chaotic attractors. However, these same architectures struggle with task switching, in which the system must transition between attractors in response to external cues. We test storage and selection on all 28 unordered system pairs formed from eight three-dimensional chaotic systems. We do not observe a robust dependence of multi-attractor performance on reservoir topology. Over the ten topologies investigated, we find that no single one consistently outperforms the others for either storage or cue-dependent selection. Our results suggest that while minimal substrates possess the representational capacity to model coexisting attractors, they may lack the robust temporal memory required for cued transitions.

Keywords: reservoir computing, chaotic attractors, minimal topologies, task switching, dynamical systems, multi-attractor learning, cue-dependent selection, temporal memory

304. ❌ Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction

Authors: Yanghao Li, Changxin Liu, Yuhao Yi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

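Scoring rationale: The paper focuses on Byzantine robustness and communication-compressed optimization algorithms in distributed machine learning, which places it in distributed optimization. Every scoring keyword centers on large models (LLMs) and related techniques (training, inference, alignment, applications, and so on), while the paper does not involve large models, deep learning, or AI applications in science, and discusses none of the specific techniques named in the keywords. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Byz-DM21, a Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Using a double-momentum mechanism and variance reduction, it converges to ε-stationary points without large batches, and the paper proves its theoretical convergence rates.

Abstract (Chinese translation)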

在协作式分布式学习中，拜占庭鲁棒性是优化算法的一个重要方面。此类分布式算法通常需要传输大量参数，因此通信压缩对于实现高效解决方案至关重要。本文提出Byz-DM21，一种新型的拜占庭鲁棒且通信高效的随机分布式学习算法。我们的核心创新在于基于双动量机制的新型梯度估计器，该设计融合了误差反馈技术的最新进展。利用该估计器，我们设计了标准版与加速版算法，在保持对拜占庭工作节点鲁棒性的同时，无需依赖大批次训练。我们证明Byz-DM21算法具有更小的邻域规模，并能在$\mathcal{O}(\varepsilon^{-4})$次迭代中收敛至$\varepsilon$-平稳点。为进一步提升效率，我们提出分布式变体Byz-VR-DM21，该版本在每个节点引入局部方差缩减技术，逐步消除随机近似带来的方差。我们证明Byz-VR-DM21可在$\mathcal{O}(\varepsilon^{-3})$次迭代中收敛至$\varepsilon$-平稳点。此外，我们将结果拓展至目标函数满足Polyak-Łojasiewicz条件的情形。最后的数值实验验证了所提方法的有效性。

Abstract

In collaborative and distributed learning, Byzantine robustness reflects a major facet of optimization algorithms. Such distributed algorithms are often accompanied by transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism, integrating recent advancements in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate variance from random approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3})$ iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed method.

Keywords: Byzantine-robust distributed learning, communication compression, double-momentum mechanism, variance reduction, stochastic optimization, gradient estimator, convergence analysis, Polyak-Łojasiewicz condition

305. ❌ Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds

Authors: Gustav Conradie, Nicolas Boullé, Jean-Christophe Loiseau, Steven L. Brunton, Matthew J. Colbrook Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

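Scoring rationale: The paper focuses on mathematical methods, error analysis, and a validation framework for Koopman operator theory in nonlinear dynamics, which places it in computational mathematics and dynamical systems. It does not involve large language models, deep learning, the principles of AI techniques, or AI applications in science, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper addresses the validation problem in data-driven Koopman methods. It proposes an a posteriori method for quantifying invariance and projection errors and derives error bounds, providing tools for reliable spectral analysis and forecasting.

Abstract (Chinese translation)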

库普曼算子理论为非线性动力学提供了全局线性表示，并支撑着许多数据驱动方法。然而在实践中，由用户选定字典诱导的有限维特征空间极少具有不变性，因此闭合失效与投影误差会导致虚假特征值、误导性库普曼模态以及过度自信的预测。本文针对数据驱动库普曼方法中的一个核心验证问题展开研究：如何仅利用快照数据量化任意特征空间的不变性及投影误差？如何运用这些诊断工具生成可操作的保证并指导字典优化？我们建立了一套统一的后验方法论，用于判定库普曼近似何时可信，并在其不可信时进行改进。通过计算子空间与其库普曼像之间的主角度，我们量化了库普曼不变性，由此得到主可观测量及主角度分解（PAD）——这是一种基于动力学信息的替代方案，可取代SVD截断并显著提升性能。本文推导了库普曼模态分解与佩龙-弗罗贝尼乌斯模态分解的多步误差界，包括基于再生核希尔伯特空间的逐点保证，并辅以高斯过程期望误差代理模型。所构建的工具箱实现了经过验证的谱分析、具备保证的预测能力以及基于原理的字典与核学习，并在混沌系统、高维基准测试以及包括空腔流动和冥王星-卡戎系统在内的实际数据集中得到验证。

Abstract

Koopman operator theory provides a global linear representation of nonlinear dynamics and underpins many data-driven methods. In practice, however, finite-dimensional feature spaces induced by a user-chosen dictionary are rarely invariant, so closure failures and projection errors lead to spurious eigenvalues, misleading Koopman modes, and overconfident forecasts. This paper addresses a central validation problem in data-driven Koopman methods: how to quantify invariance and projection errors for an arbitrary feature space using only snapshot data, and how to use these diagnostics to produce actionable guarantees and guide dictionary refinement? A unified a posteriori methodology is developed for certifying when a Koopman approximation is trustworthy and improving it when it is not. Koopman invariance is quantified using principal angles between a subspace and its Koopman image, yielding principal observables and a principal angle decomposition (PAD), a dynamics-informed alternative to SVD truncation with significantly improved performance. Multi-step error bounds are derived for Koopman and Perron–Frobenius mode decompositions, including RKHS-based pointwise guarantees, and are complemented by Gaussian process expected error surrogates. The resulting toolbox enables validated spectral analysis, certified forecasting, and principled dictionary and kernel learning, demonstrated on chaotic and high-dimensional benchmarks and real-world datasets, including cavity flow and the Pluto–Charon system.

Keywords: Koopman operator, invariance diagnostics, error bounds, principal angle decomposition, spectral analysis, nonlinear dynamics, data-driven methods, certified forecasting
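
The abstract above quantifies Koopman invariance through principal angles between a feature subspace and its Koopman image. A minimal sketch of that idea follows, assuming NumPy is available. The toy map `step`, the two dictionaries, and the use of the plain Euclidean inner product on sample vectors are illustrative assumptions, not the paper's toolbox or its PAD algorithm.

```python
import numpy as np


def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Assumes both matrices have full column rank.
    """
    Qa, _ = np.linalg.qr(A)          # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)          # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def step(X: np.ndarray, lam: float = 0.9, mu: float = 0.5) -> np.ndarray:
    """Toy map (assumption): span{x1, x2, x1^2} is Koopman-invariant for it."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([lam * x1, mu * x2 + (lam**2 - mu) * x1**2])


def dict_small(X: np.ndarray) -> np.ndarray:
    """Dictionary {x1, x2}: not closed under the dynamics."""
    return X


def dict_closed(X: np.ndarray) -> np.ndarray:
    """Dictionary {x1, x2, x1^2}: closed under the dynamics."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2])


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))   # snapshots x_k
Y = step(X)                                  # snapshots F(x_k)

# Columns of dict(Y) are the Koopman images of the dictionary functions
# evaluated on the data, so these angles measure how far each feature
# space is from being invariant.
angles_small = principal_angles(dict_small(X), dict_small(Y))
angles_closed = principal_angles(dict_closed(X), dict_closed(Y))
```

In this toy setup the closed dictionary yields angles that are zero up to rounding (the `arccos` step limits accuracy to roughly 1e-8), while the two-function dictionary shows one clearly nonzero angle, which flags the closure failure.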

306. ❌ Affordable Precision Agriculture: A Deployment-Oriented Review of Low-Cost, Low-Power Edge AI and TinyML for Resource-Constrained Farming Systems

Authors: Riya Samanta, Bidyut Saha Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15085v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

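Scoring rationale: The paper studies the deployment of Edge AI and TinyML in agriculture, with a focus on low-cost, low-power systems. It is unrelated to the vast majority of the large-model keywords (LLMs, MoE, RLHF, and so on), because those keywords concern large language model architecture, training, alignment, and inference optimization, whereas the paper discusses deployment optimization of conventional machine learning models on embedded devices. The most relevant keyword is 'Quantization OR Model Compression OR Low-bit Weights': the paper states explicitly that quantization is the dominant optimization strategy (used in about 50% of the surveyed works) and discusses model compression techniques, so it receives 10 points (highly relevant). 'AI for Science OR Bioinformatics OR Cheminformatics' has some relevance because the paper involves AI applied to agricultural science, though this is not its core, so it receives 5 points (some relevance). No other keyword is covered.

!!! tip deepseek-chat TL;DR

This review examines the challenges of deploying low-cost, low-power Edge AI and TinyML in resource-constrained farming systems. It finds that quantization is the dominant optimization strategy and proposes a privacy-preserving layered Edge AI architecture to help move from research prototypes to deployable systems.

Abstract (Chinese translation)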

精准农业日益融合人工智能技术，以加强作物监测、灌溉管理和资源利用效率。然而，当前绝大多数系统仍主要基于云端运行，并依赖稳定的网络连接，这阻碍了其在小规模农户及欠发达国家农业体系中的推广应用。基于2023年至2026年的近期文献综述，本文回顾了边缘人工智能在低成本、低功耗农业领域的部署情况，重点关注微型机器学习的发展与应用。一项以硬件为导向、侧重部署的研究显示，系统架构存在显著差异：微控制器级平台（如ESP32、STM32、ATMega）在推理方案中占据主导地位，同时单板计算机与无人机辅助解决方案并存。定量综合分析表明，量化是主流的优化策略——约50%的相关研究采用了量化技术；而结构化剪枝、多目标压缩及硬件感知的神经架构搜索等领域的研究相对不足。此外，资源评估实践尚未统一：尽管模型规模偶有报道，但具体的闪存、内存、乘加运算量、延迟及毫焦级能耗指标往往缺乏完整记录，这影响了研究的可复现性与跨系统比较。为进一步弥合研究原型与可部署系统之间的差距，本文还基于文献提出了一种隐私保护的分层边缘人工智能农业架构，综合提炼了现有研究中涌现的关键系统级设计思路。总体而言，研究结果清晰地揭示了农业人工智能系统正朝着“训练集中化、推理本地化”的非对称架构方向转变。

Abstract

Precision agriculture increasingly integrates artificial intelligence to enhance crop monitoring, irrigation management, and resource efficiency. Nevertheless, the vast majority of current systems are still cloud-based and require reliable connectivity, which hampers adoption in smaller-scale, smallholder farming and in underdeveloped-country systems. Using recent literature reviews, ranging from 2023 to 2026, this review covers deployments of Edge AI, focused on the evolution and acceptance of Tiny Machine Learning, in low-cost and low-powered agriculture. A hardware-targeted, deployment-oriented study has shown pronounced variation in architecture, with microcontroller-class platforms (i.e. ESP32, STM32, ATMega) dominating the inference options, in parallel with single-board computers and UAV-assisted solutions. Quantitative synthesis shows quantization is the dominant optimization strategy, applied in around 50% of the works identified, while structured pruning, multi-objective compression and hardware-aware neural architecture search are relatively under-researched. Also, resource profiling practices are not uniform: while model size is occasionally reported, explicit flash, RAM, MAC, latency and millijoule-level energy metrics are not well documented, hampering reproducibility and cross-system comparison. Moreover, to bridge the gap between research prototypes and deployment-ready systems, the review also presents a literature-informed deployment perspective in the form of a privacy-preserving layered Edge AI architecture for agriculture, synthesizing the key system-level design insights emerging from the surveyed works. Overall, the findings demonstrate a clear architectural shift toward localized inference with centralized training asymmetry.

Keywords: Precision Agriculture, Edge AI, Tiny Machine Learning, Low-cost, Low-power, Quantization, Model Compression, Deployment-oriented
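
To make concrete the quantization step that the review identifies as the dominant optimization strategy, the sketch below applies symmetric per-tensor int8 quantization to a short list of weights. The function names `quantize_int8` and `dequantize` and the choice of a symmetric per-tensor scheme are illustrative assumptions; the abstract does not prescribe a particular scheme.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127].

    Assumption: one scale for the whole tensor and a zero-point fixed at 0.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero (or empty) tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# The rounding error per weight is at most half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of the four bytes of a float32 value, at the cost of a rounding error bounded by `scale / 2`. That trade is what makes the technique attractive on microcontroller-class targets with small flash and RAM budgets.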

307. ❌ Interpretable Classification of Time Series Using Euler Characteristic Surfaces

Authors: Salam Rabindrajit Luwang, Sushovan Majhi, Vishal Mandal, Atish J. Mitra, Md. Nurujjaman, Buddha Nath Sharma Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15079v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

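Scoring rationale: The paper proposes a time series classification method based on Euler Characteristic Surfaces (ECS), an application of topological data analysis to biomedical signal processing. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models or deep learning techniques, whereas this paper uses a conventional machine learning method (AdaBoost) and topological features. Two keywords are relevant: 1. 'Mechanistic Interpretability OR Explainable AI' (5 points): the paper stresses that its method retains full interpretability, which relates to explainable AI but is not the core. 2. 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): the paper is applied to biomedical datasets such as ECG and EEG, which falls under bioinformatics and AI for Science and is its core application setting. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper proposes an interpretable time series classification method based on Euler Characteristic Surfaces. It achieves high accuracy on biomedical ECG and EEG datasets while remaining computationally efficient and fully interpretable.

Abstract (Chinese translation)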

持久同调（Persistent Homology，PH）作为拓扑数据分析的常规方法，其计算成本高昂，在应用于机器学习前需对其特征进行向量化处理，且仅能沿空间轴捕获信息。针对时间序列数据，我们提出欧拉特征曲面（Euler Characteristic Surfaces，ECS）作为一种基于欧拉特征数（$χ$）——这一基本拓扑不变量——的替代性拓扑特征表示。ECS提供了一种计算高效、兼具时空特性且本质离散化的特征表示，可直接作为机器学习模型的输入。我们证明了一个稳定性定理，确保ECS在输入时间序列受到微小扰动时保持稳定。我们首先展示了ECS能有效捕捉Rössler系统中极限环与奇异吸引子之间的非平凡拓扑差异。随后，我们开发了一个基于ECS的分类框架，并将其应用于UCR/UEA档案库中的五个基准生物医学数据集（四个心电图ECG，一个脑电图EEG）。在$\textit{ECG5000}$数据集上，我们的单特征ECS分类器以$O(n+R\cdot T)$的复杂度达到了$98\%$的准确率，而近期一项基于PH的方法报告准确率为$62\%$。通过AdaBoost扩展，准确率提升至$98.6\%$，在保持完全可解释性的同时，与最佳深度学习结果相当。在$\textit{TwoLeadECG}$（$94.1\%$）和$\textit{Epilepsy2}$（$92.6\%$）数据集上也取得了优异的结果。

Abstract

Persistent homology (PH) – the conventional method in topological data analysis – is computationally expensive, requires further vectorization of its signatures before machine learning (ML) can be applied, and captures information along only the spatial axis. For time series data, we propose Euler Characteristic Surfaces (ECS) as an alternative topological signature based on the Euler characteristic ($χ$) – a fundamental topological invariant. The ECS provides a computationally efficient, spatiotemporal, and inherently discretized feature representation that can serve as direct input to ML models. We prove a stability theorem guaranteeing that the ECS remains stable under small perturbations of the input time series. We first demonstrate that ECS effectively captures the nontrivial topological differences between the limit cycle and the strange attractor in the Rössler system. We then develop an ECS-based classification framework and apply it to five benchmark biomedical datasets (four ECG, one EEG) from the UCR/UEA archive. On $\textit{ECG5000}$, our single-feature ECS classifier achieves $98\%$ accuracy with $O(n+R\cdot T)$ complexity, compared to $62\%$ reported by a recent PH-based method. An AdaBoost extension raises accuracy to $98.6\%$, matching the best deep learning results while retaining full interpretability. Strong results are also obtained on $\textit{TwoLeadECG}$ ($94.1\%$) and $\textit{Epilepsy2}$ ($92.6\%$).

Keywords: Euler Characteristic Surfaces, time series classification, topological data analysis, interpretable machine learning, biomedical signal processing, ECG, EEG, AdaBoost
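
For readers unfamiliar with the Euler characteristic as a feature, the sketch below computes a one-parameter Euler characteristic curve for a 1-D series. It is a simplified stand-in under stated assumptions: the series is treated as a path graph filtered by sublevel sets, and the function name `euler_characteristic_curve` is hypothetical. The paper's ECS adds a second, temporal parameter whose construction is not given in the abstract, so this is not the authors' definition.

```python
from typing import List, Sequence


def euler_characteristic_curve(
    series: Sequence[float], thresholds: Sequence[float]
) -> List[int]:
    """Euler characteristic of sublevel sets of a 1-D series on a path graph.

    A sample becomes a vertex once its value is <= r; the edge between
    neighbours i and i+1 appears once both endpoints are present.
    For such a graph, chi = (#vertices) - (#edges).
    """
    curve = []
    for r in thresholds:
        vertices = sum(1 for v in series if v <= r)
        edges = sum(1 for a, b in zip(series, series[1:]) if max(a, b) <= r)
        curve.append(vertices - edges)
    return curve


signal = [0, 2, 1, 3, 0]
curve = euler_characteristic_curve(signal, thresholds=[0, 1, 2, 3])
# curve == [2, 3, 2, 1]
```

Because a sublevel set of a path graph has no cycles, each value of the curve equals the number of connected runs of samples at or below the threshold. The result is an integer-valued, already discretized vector that can be passed to a classifier without a separate vectorization step.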

308. ❌ Generative Semantic HARQ: Latent-Space Text Retransmission and Combining

Authors: Bin Han, Yulin Hu, Hans D. Schotten Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

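Scoring rationale: The paper studies text retransmission and combining in semantic communication, using a Transformer-VAE as the codec, which places it in communications engineering and signal processing. Although it uses a Transformer architecture, its core is communication protocol design, semantic quality metrics, and soft combining strategies rather than innovation in the principles of large models or deep learning, and it does not cover applications of large models in other domains. All of the given keywords center on large-model techniques, training methods, inference optimization, alignment, and applications, and are unrelated to the paper's communications engineering content.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid automatic repeat request (HARQ) framework for semantic text communication, in which a Transformer-VAE codec generates diverse representations in latent space. It evaluates several semantic quality metrics and soft combining strategies, and the results suggest that Weighted-Average or MRC-Inspired combining together with self-consistency-based HARQ triggering gives the best performance.

Abstract (Chinese translation)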

语义通信传递的是意义而非原始比特，但语义层面的可靠性仍是开放挑战。本文提出一种面向文本通信的语义层混合自动重传请求（HARQ）框架，其中基于Transformer的变分自编码器（VAE）编解码器作为轻量级覆盖层运行于传统协议栈之上。该随机编码器在重传过程中能内生地生成多样化的潜在表征——仅通过单一模型即可提供增量知识（IK），无需专门的协议设计。在接收端，软质量估计器触发重传请求，而质量感知组合器在一致的潜在空间内融合接收到的潜在向量。我们在混合系统性偏差与加性噪声的语义失真场景下，系统性地评估了六种语义质量度量指标与四种软组合策略。结果表明：采用加权平均或受MRC启发的组合策略，结合基于自一致性的HARQ触发机制，可获得最佳性能。

Abstract

Semantic communication conveys meaning rather than raw bits, but reliability at the semantic level remains an open challenge. We propose a semantic-level hybrid automatic repeat request (HARQ) framework for text communication, in which a Transformer-variational autoencoder (VAE) codec operates as a lightweight overlay on the conventional protocol stack. The stochastic encoder inherently generates diverse latent representations across retransmissions, providing incremental knowledge (IK) from a single model without dedicated protocol design. On the receiver side, a soft quality estimator triggers retransmissions and a quality-aware combiner merges the received latent vectors within a consistent latent space. We systematically benchmark six semantic quality metrics and four soft combining strategies under hybrid semantic distortion that mixes systematic bias with additive noise. The results suggest combining Weighted-Average or MRC-Inspired combining with self-consistency-based HARQ triggering for the best performance.

Keywords: Semantic Communication, Hybrid Automatic Repeat Request (HARQ), Transformer-VAE, Latent-Space, Text Retransmission, Soft Combining, Semantic Quality Metrics, Incremental Knowledge
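
The soft combining step described in the abstract can be pictured with a small sketch: several noisy copies of one latent vector are merged by a normalized weighted average, and an MRC-style variant weights each copy by its inverse noise variance. The function names `combine_latents` and `mrc_weights`, and the assumption that a per-copy noise variance is known, are illustrative and are not taken from the paper.

```python
from typing import List, Sequence


def combine_latents(
    latents: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of received latent vectors.

    Weights are normalized internally so that they sum to 1.
    """
    if not latents or len(latents) != len(weights):
        raise ValueError("need one weight per received latent vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    dim = len(latents[0])
    return [
        sum(w * z[i] for w, z in zip(weights, latents)) / total
        for i in range(dim)
    ]


def mrc_weights(noise_variances: Sequence[float]) -> List[float]:
    """MRC-style weights: each copy is weighted by its inverse noise variance."""
    return [1.0 / v for v in noise_variances]


# Two received copies of the same latent vector (hypothetical values);
# the second copy is assumed to be three times noisier than the first.
received = [[1.0, 2.0], [3.0, 4.0]]
combined = combine_latents(received, mrc_weights([1.0, 3.0]))
# The normalized weights are 0.75 and 0.25, so combined is about [1.5, 2.5].
```

Passing equal weights reduces this to a plain average. A quality-aware combiner along the lines the abstract describes would presumably derive the weights from a soft quality estimate rather than from a known noise variance.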

309. ❌ Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization

Authors: Hideaki Iiduka Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

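Scoring rationale: The paper studies the convergence of the Muon optimizer for nonconvex Hölder-smooth empirical risk minimization, in particular under heavy-tailed noise. All keywords relate directly to large-model techniques, deep learning applications, or AI for Science applications, whereas this paper focuses on the theoretical analysis of an optimization algorithm and does not involve large-model architecture, training techniques, inference acceleration, alignment, or applications. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proves that the Muon optimizer converges to a stationary point for nonconvex Hölder-smooth empirical risk minimization even under heavy-tailed noise, and that it converges faster than mini-batch SGD.

Abstract (Chinese translation)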

Muon是一种近期提出的优化器，它通过将梯度投影至Stiefel流形来强制参数更新的正交性，从而在大规模深度神经网络中实现稳定高效的训练。同时，已有研究指出实际机器学习中的随机噪声可能呈现重尾特性，这违背了传统的有界方差假设。本文研究了在非凸Hölder平滑经验风险最小化问题中，如何有效处理重尾随机噪声的影响。我们证明，在考虑重尾随机噪声的有界性条件下，Muon能够收敛至经验风险的平稳点。此外，我们证明了Muon相比小批量随机梯度下降法具有更快的收敛速度。

Abstract

Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training in large-scale deep neural networks. Meanwhile, the previously reported results indicated that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk that works well with the heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under the boundedness condition accounting for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.

Keywords: Muon optimizer, nonconvex Hölder-smooth empirical risk minimization, heavy-tailed stochastic noise, Stiefel manifold, gradient projection, convergence analysis, mini-batch SGD comparison
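The abstract describes Muon as enforcing orthogonality in parameter updates by projecting gradients onto the Stiefel manifold. A minimal sketch of that projection follows, assuming an exact SVD-based polar factor. The function names `orthogonalize` and `muon_like_step`, the learning rate, and the omission of momentum are illustrative assumptions, not the paper's algorithm; practical implementations are commonly reported to approximate this step iteratively rather than with a full SVD.

```python
import numpy as np


def orthogonalize(grad: np.ndarray) -> np.ndarray:
    """Return the closest semi-orthogonal matrix to `grad` (its polar factor)."""
    # Thin SVD: grad = U @ diag(s) @ Vt. Dropping the singular values
    # leaves U @ Vt, whose nonzero singular values are all equal to 1.
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def muon_like_step(weight: np.ndarray, grad: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Hypothetical update: step along the orthogonalized gradient (no momentum)."""
    return weight - lr * orthogonalize(grad)


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))   # a tall, full-rank toy gradient
update = orthogonalize(grad)

# For a tall full-rank matrix, the columns of the result are orthonormal.
print(np.allclose(update.T @ update, np.eye(3)))
```

Because every singular value of the update equals 1, the step size no longer depends on the scale of the raw gradient, which is one intuition for why such an update can behave well when the gradient noise is heavy-tailed.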

310. ❌ Spatio-temporal probabilistic forecast using MMAF-guided learning

Authors: Leonardo Bardi, Imma Valentina Curato, Lorenzo Proietti Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

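Rationale: The paper uses stochastic feed-forward neural networks and MMAF-guided learning for spatio-temporal probabilistic forecasting, which is a conventional deep learning application (probabilistic forecasting) rather than large language model (LLM) research. It does not cover any of the keyword topics, such as large-model techniques, architectures, training methods, inference optimization, alignment, or agent systems, nor the specific AI for Science applications listed (bioinformatics, cheminformatics). All keywords concern large-model techniques or specific scientific AI applications and are unrelated to this paper's spatio-temporal forecasting work.

!!! tip deepseek-chat TL;DR

The paper proposes a spatio-temporal probabilistic forecasting method based on MMAF-guided learning and stochastic feed-forward neural networks. On synthetic and real data it produces forecasts that stay calibrated across multiple time horizons, with performance comparable to, and in some cases better than, convolutional or diffusion models.

Abstract (Chinese translation)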

我们采用具有高斯分布权重的随机前馈神经网络来确定时空栅格数据集的概率预测。该网络通过MMAF引导学习进行训练，这是一种广义贝叶斯方法，其中观测数据通过嵌入进行预处理，该嵌入旨在生成低维表示以捕捉数据的依赖性与因果结构。嵌入的设计基于理论指导，其假设观测数据由具有有限二阶矩的时空奥恩斯坦-乌伦贝克过程生成。在推理模式下，训练完成的网络通过在不同时间跨度应用不同初始条件来生成集合预测。在合成数据与真实数据上进行的实验表明，我们的预测在多个时间跨度上均保持校准性。此外，我们证明在此类数据上，简单的前馈神经网络架构能够实现与概率预测任务中常用的卷积或扩散深度学习架构相当、甚至在某些情况下更优的性能。

Abstract

We employ stochastic feed-forward neural networks with Gaussian-distributed weights to determine a probabilistic forecast for spatio-temporal raster datasets. The networks are trained using MMAF-guided learning, a generalized Bayesian methodology in which the observed data are preprocessed using an embedding designed to produce a low-dimensional representation that captures their dependence and causal structure. The design of the embedding is theory-guided by the assumption that a spatio-temporal Ornstein-Uhlenbeck process with finite second-order moments generates the observed data. The trained networks, in inference mode, are then used to generate ensemble forecasts by applying different initial conditions at different horizons. Experiments conducted on both synthetic and real data demonstrate that our forecasts remain calibrated across multiple time horizons. Moreover, we show that on such data, simple feed-forward architectures can achieve performance comparable to, and in some cases better than, convolutional or diffusion deep learning architectures used in probabilistic forecasting tasks.

Keywords: probabilistic forecast, spatio-temporal, stochastic feed-forward neural networks, MMAF-guided learning, Ornstein-Uhlenbeck process, ensemble forecasts, calibrated forecasts, deep learning architectures

311. ❌ CrossADR: enhancing adverse drug reactions prediction for combination pharmacotherapy with cross-layer feature integration and cross-level associative learning

Authors: Y. Cheung Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15047v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

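Rationale: CrossADR uses a graph neural network (a gated-residual-flow graph neural network) to predict adverse drug reactions (ADRs) for drug combinations, which places it in bioinformatics and computational biomedicine. Against the keyword list: (1) It is highly relevant only to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points), because it explicitly applies AI methods to a biomedical problem (ADR prediction, drug-protein interactions). (2) It is partly relevant to "Mechanistic Interpretability OR Explainable AI" (5 points), because the abstract mentions "interpretable computational methods" and "high-resolution insights" but does not go into specific explainable-AI techniques. (3) All other keywords concern large language model (LLM) techniques (such as MoE, RLHF, RAG), model optimization (such as quantization and attention mechanisms), or agent systems. The paper neither uses nor mentions any large-model technique and relies entirely on conventional graph neural network methods, so these keywords score 0.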
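The table above pairs non-zero relevance (5.0/10 and 10.0/10) with a score of 0.0, which is what a weighted sum gives when every listed weight is 0.0. The sketch below illustrates this. The passing threshold `5.0 + 0.8 × total weight`, the total weight of 27.0, and the maximum of 15 points per keyword come from the report's configuration section; the per-keyword formula in `keyword_score` is an assumption, since the report does not state it.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # from the report's scoring settings


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED formula: weight * (relevance / 10) * max score per keyword."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


# (weight, relevance) for the two rows with non-zero relevance in the table.
rows = {
    "Mechanistic Interpretability OR Explainable AI": (0.0, 5.0),
    "AI for Science OR Bioinformatics OR Cheminformatics": (0.0, 10.0),
}

total = sum(keyword_score(w, r) for w, r in rows.values())
threshold = round(passing_score(27.0), 1)
print(total, threshold, total >= threshold)
```

Under this assumed formula, a weight of 1.0 on the AI for Science row alone would contribute 15 points, still below the 26.6 threshold.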

!!! tip deepseek-chat TL;DR

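The study proposes the CrossADR framework, which uses a graph neural network with cross-layer feature integration and cross-level associative learning to predict organ-level adverse reactions for drug combinations. It achieves state-of-the-art performance on a newly constructed large-scale dataset and provides high-resolution insights for clinical decision-making.

Abstract (Chinese translation)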

联合药物治疗虽能带来显著疗效优势，但也伴随着较高的药物不良反应风险。利用可解释的计算方法准确预测药物不良反应，对于临床安全管理、药物研发及精准医疗至关重要。然而，由于药物组合的搜索空间巨大且生理反应复杂，药物不良反应的管理仍面临挑战。现有的基于图结构的模型往往难以有效整合多尺度生物信息，且常依赖固定的关联矩阵，这限制了其捕捉动态器官水平依赖关系及跨数据集泛化的能力。本文提出CrossADR——一种通过跨层特征整合与跨层级关联学习实现器官水平药物不良反应预测的分层框架。该框架采用门控残差流图神经网络融合多尺度分子特征，并利用可学习的药物不良反应嵌入空间动态捕捉跨15个器官系统的潜在生物相关性。在新构建的CrossADR数据集（涵盖1,376种药物及94.6万种独特组合）上的系统评估表明，CrossADR在80种不同实验场景中均保持领先性能，并能提供药物相关蛋白质相互作用及通路的高分辨率解析。总体而言，CrossADR是一个实现跨尺度生物医学信息整合、跨层特征融合以及跨层级关联学习的稳健工具，可有效应用于临床决策中以预防药物不良反应。

Abstract

Combination pharmacotherapy offers substantial therapeutic advantages but also poses substantial risks of adverse drug reactions (ADRs). The accurate prediction of ADRs with interpretable computational methods is crucial for clinical safety management, drug development, and precision medicine. However, managing ADRs remains a challenge due to the vast search space of drug combinations and the complexity of physiological responses. Current graph-based architectures often struggle to effectively integrate multi-scale biological information and frequently rely on fixed association matrices, which limits their ability to capture dynamic organ-level dependencies and generalize across diverse datasets. Here we propose CrossADR, a hierarchical framework for organ-level ADR prediction through cross-layer feature integration and cross-level associative learning. It incorporates a gated-residual-flow graph neural network to fuse multi-scale molecular features and utilizes a learnable ADR embedding space to dynamically capture latent biological correlations across 15 organ systems. Systematic evaluation on the newly constructed CrossADR-Dataset, covering 1,376 drugs and 946,000 unique combinations, demonstrates that CrossADR consistently achieves state-of-the-art performance across 80 distinct experimental scenarios and provides high-resolution insights into drug-related protein-protein interactions and pathways. Overall, CrossADR represents a robust tool for cross-scale biomedical information integration, cross-layer feature integration as well as cross-level associative learning, and can be effectively utilized to prevent ADRs in clinical decision-making.

Keywords: adverse drug reactions prediction, combination pharmacotherapy, graph neural network, cross-layer feature integration, cross-level associative learning, organ-level prediction, biomedical information integration, drug-protein interactions

312. ❌ A convolutional autoencoder and neural ODE framework for surrogate modeling of transient counterflow flames

Authors: Mert Yakup Baykan, Weitao Liu, Thorsten Zirwes, Andreas Kronenburg, Hong G. Im, Dong-hyuk Shin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

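Rationale: The paper presents a reduced-order modeling framework based on a convolutional autoencoder and a neural ODE for simulating transient 2D counterflow flames, which belongs to computational fluid dynamics and combustion science. Its core is the application of deep learning to scientific computing, specifically reduced-order (surrogate) modeling. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partly relevant: the paper applies AI to science (combustion science) but is not bioinformatics or cheminformatics, hence 5 points. All other keywords concern large language models (LLMs) and related techniques (MoE, alignment, reasoning, agents, and so on). The paper involves no language model or text processing, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a convolutional autoencoder neural ODE (CAE-NODE) framework for building a reduced-order model of transient 2D counterflow flames. It accurately predicts the entire transient process, including ignition and flame propagation, with relative errors below about 2% for major species.

Abstract (Chinese translation)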

本文提出了一种新型卷积自编码器神经常微分方程框架，用于构建瞬态二维逆流火焰的降阶模型。该框架将均相反应系统中AE-NODE方法扩展至空间分辨流动问题。通过卷积层提取多维场的空间相关性，CAE能够将高保真二维快照数据（256x256网格，21个变量）压缩超过10万倍，从而自主构建物理一致的六维连续潜流形。随后训练NODE来描述该非线性流形上的连续时间动力学，使得模型能够从初始条件出发通过时间前向积分，预测火焰的完整瞬态演化过程。结果表明，该网络能精确捕捉包括点火、火焰传播及向非预混状态渐变在内的整个瞬态过程，主要组分的相对误差低于约2%。本研究首次揭示了CAE-NODE框架在多维反应流非定常动力学代理建模方面的潜力。

Abstract

A novel convolutional autoencoder neural ODE (CAE-NODE) framework is proposed for a reduced-order model (ROM) of transient 2D counterflow flames, as an extension of AE-NODE methods in homogeneous reactive systems to spatially resolved flows. The spatial correlations of the multidimensional fields are extracted by the convolutional layers, allowing CAE to autonomously construct a physically consistent 6D continuous latent manifold by compressing high-fidelity 2D snapshots (256x256 grid, 21 variables) by over 100,000 times. The NODE is subsequently trained to describe the continuous-time dynamics on the non-linear manifold, enabling the prediction of the full temporal evolution of the flames by integrating forward in time from an initial condition. The results demonstrate that the network can accurately capture the entire transient process, including ignition, flame propagation, and the gradual transition to a non-premixed condition, with relative errors less than ~2% for major species. This study, for the first time, highlights the potential of CAE-NODE for surrogate modeling of unsteady dynamics of multi-dimensional reacting flows.

Keywords: convolutional autoencoder, neural ODE, reduced-order model, transient counterflow flames, surrogate modeling, reacting flows, latent manifold, temporal evolution
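The abstract's claim of compressing each snapshot "by over 100,000 times" can be checked with simple arithmetic from the figures it gives (a 256x256 grid, 21 variables, a 6D latent space). The sketch assumes the ratio counts scalar values per snapshot against latent coordinates; the abstract does not spell out how the ratio is defined.

```python
grid_points = 256 * 256   # spatial grid from the abstract
variables = 21            # fields stored at each grid point
latent_dim = 6            # dimension of the latent manifold

# Scalars in one high-fidelity snapshot versus scalars in its latent code.
values_per_snapshot = grid_points * variables
compression_ratio = values_per_snapshot / latent_dim

print(values_per_snapshot, compression_ratio)
```

Under that assumption each snapshot holds 1,376,256 values, and the ratio is about 229,000, consistent with the stated lower bound of 100,000.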

313. ❌ Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion

Authors: Sonia Laguna, Jorge da Silva Goncalves, Moritz Vandenhirtz, Alain Ryser, Irene Cannistraci, Julia E. Vogt Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

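Rationale: The paper studies machine unlearning and proposes MUNKEY, an unlearning-by-design method based on key deletion that uses a memory-augmented transformer. Although it involves a transformer model, its focus is a different area, namely a model-design paradigm for machine unlearning, rather than the core topics the keywords target (large-model principles, training methods, inference optimization, alignment, applications). No keyword is directly related to the paper's subject, so all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MUNKEY, an unlearning-by-design approach to machine unlearning that decouples instance-specific memorization from model weights. This enables zero-shot forgetting without weight updates or access to the original data, and it outperforms existing post-hoc methods on multiple datasets.

Abstract (Chinese translation)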

机器遗忘正迅速成为一项实际需求，这主要受到隐私法规、数据错误以及需要移除有害或损坏训练样本的驱动。尽管如此，现有方法大多仅从后处理视角解决该问题，试图通过参数更新来消除目标训练样本的影响，而这类更新通常需要访问完整训练数据。这与实际部署场景存在脱节——在真实场景中，遗忘请求是可预见的，这揭示了后处理方法的根本局限。我们提出“设计式遗忘”这一新范式，其核心是直接训练模型，使遗忘成为其固有能力。我们通过基于密钥删除的机器遗忘方法（MUNKEY）具体实现这一理念：该方法采用记忆增强型Transformer架构，将实例特定的记忆与模型权重解耦。在此框架中，遗忘操作对应于删除实例标识密钥，无需权重更新或访问原始样本及标签即可实现直接零样本遗忘。在自然图像基准测试、细粒度识别和医疗数据集上的实验表明，MUNKEY在所有后处理基线方法中均表现更优。我们的研究证实，设计式遗忘能够实现快速、面向部署的遗忘，同时保持模型的预测性能。

Abstract

Machine unlearning is rapidly becoming a practical requirement, driven by privacy regulations, data errors, and the need to remove harmful or corrupted training samples. Despite this, most existing methods tackle the problem purely from a post-hoc perspective. They attempt to erase the influence of targeted training samples through parameter updates that typically require access to the full training data. This creates a mismatch with real deployment scenarios where unlearning requests can be anticipated, revealing a fundamental limitation of post-hoc approaches. We propose unlearning by design, a novel paradigm in which models are directly trained to support forgetting as an inherent capability. We instantiate this idea with Machine UNlearning via KEY deletion (MUNKEY), a memory-augmented transformer that decouples instance-specific memorization from model weights. Here, unlearning corresponds to removing the instance-identifying key, enabling direct zero-shot forgetting without weight updates or access to the original samples or labels. Across natural image benchmarks, fine-grained recognition, and medical datasets, MUNKEY outperforms all post-hoc baselines. Our results establish that unlearning by design enables fast, deployment-oriented unlearning while preserving predictive performance.

Keywords: Machine Unlearning, Unlearning by Design, MUNKEY, Transformer, Key Deletion, Zero-shot Forgetting, Memory Augmented Model, Post-hoc Methods
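The abstract's central idea is that instance-specific information lives in a keyed memory outside the model weights, so forgetting an instance means deleting its key. The toy sketch below illustrates only that idea with a nearest-key lookup. The `KeyedMemory` class, its methods, and the sample identifiers are hypothetical; this is not the MUNKEY architecture, which the abstract describes as a memory-augmented transformer.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedMemory:
    """Toy external memory mapping an instance id to (key vector, label)."""
    entries: dict = field(default_factory=dict)

    def write(self, instance_id, key, label):
        self.entries[instance_id] = (tuple(key), label)

    def forget(self, instance_id):
        # Unlearning here is just deleting the key; nothing is retrained.
        self.entries.pop(instance_id, None)

    def lookup(self, query):
        """Return the label of the closest stored key, or None if empty."""
        if not self.entries:
            return None

        def sq_dist(key):
            return sum((q - k) ** 2 for q, k in zip(query, key))

        _, label = min(self.entries.values(), key=lambda e: sq_dist(e[0]))
        return label


memory = KeyedMemory()
memory.write("img-001", [0.0, 1.0], "cat")
memory.write("img-002", [1.0, 0.0], "dog")

before = memory.lookup([0.1, 0.9])   # closest key belongs to img-001
memory.forget("img-001")             # deletion request for img-001
after = memory.lookup([0.1, 0.9])    # img-001 can no longer be retrieved
print(before, after)
```

The deletion needs neither the original sample nor its label, only the instance identifier, which mirrors the deployment property the abstract emphasizes.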

314. ❌ MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers

Authors: Jérémy Morlier, Robin Geens, Stef Cuyckens, Arne Symons, Marian Verhelst, Vincent Gripon, Mathieu Léonardon Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

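Rationale: MONET develops a hardware-software co-design framework for the neural network training phase, covering training workload modeling, training optimization on heterogeneous dataflow accelerators, layer-fusion configuration, and activation checkpointing trade-offs. The scoring keywords all target specific directions such as large-model principles, training methods, inference optimization, alignment techniques, and application domains, whereas the core of this paper is a general neural network training modeling framework (validated on ResNet-18 and a small GPT-2) that does not examine any specific large-model technique, training method, or application domain in depth. All keywords are therefore judged unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper presents MONET, a framework for modeling neural network training on heterogeneous dataflow accelerators. It addresses the inability of existing tools to capture the memory footprint and backpropagation complexity of training, and it uses the framework to find better hardware architectures and to explore layer-fusion and activation-checkpointing configurations.

Abstract (Chinese translation)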

尽管硬件-软件协同设计已显著提升了神经网络推理的效率，但对训练阶段的建模仍是一个关键且尚未充分探索的挑战。训练工作负载带来了独特的约束，特别是在内存占用和反向传播复杂性方面，这是现有以推理为中心的工具所无法捕捉的。本文介绍了MONET，这是一个专为异构数据流加速器上的神经网络训练建模而设计的框架。MONET建立在Stream之上——Stream是一个经过实验验证的、用于对具有层融合功能的异构数据流加速器上的神经网络推理进行建模的框架。利用MONET，我们探索了ResNet-18和一个小型GPT-2的设计空间，展示了该框架为训练工作流建模并寻找更优硬件架构的能力。随后，我们进一步研究了在神经网络训练中因设计空间扩大而变得更加复杂的问题，例如确定最佳的层融合配置。此外，我们借助遗传算法，使用我们的框架来探索激活检查点技术中有价值的权衡取舍。我们的研究结果凸显了采用整体性硬件-软件协同设计方法对于实现可扩展且高效的深度学习部署的重要性。

Abstract

While hardware-software co-design has significantly improved the efficiency of neural network inference, modeling the training phase remains a critical yet underexplored challenge. Training workloads impose distinct constraints, particularly regarding memory footprint and backpropagation complexity, which existing inference-focused tools fail to capture. This paper introduces MONET, a framework designed to model the training of neural networks on heterogeneous dataflow accelerators. MONET builds upon Stream, an experimentally verified framework that models the inference of neural networks on heterogeneous dataflow accelerators with layer fusion. Using MONET, we explore the design space of ResNet-18 and a small GPT-2, demonstrating the framework’s capability to model training workflows and find better hardware architectures. We then further examine problems that become more complex in neural network training due to the larger design space, such as determining the best layer-fusion configuration. Additionally, we use our framework to find interesting trade-offs in activation checkpointing, with the help of a genetic algorithm. Our findings highlight the importance of a holistic approach to hardware-software co-design for scalable and efficient deep learning deployment.

Keywords: neural network training, hardware-software co-design, heterogeneous dataflow accelerators, layer fusion, activation checkpointing, training workload modeling, design space exploration, genetic algorithm optimization
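The activation checkpointing trade-off mentioned in the abstract can be illustrated with a deliberately simple cost model. The sketch assumes a chain of equally sized layers, one checkpoint stored per segment, and each non-checkpointed activation recomputed once during the backward pass. The function `checkpoint_cost` and this uniform model are illustrative assumptions; the paper explores the trade-off with a genetic algorithm on a hardware model, whereas the sketch just enumerates segment sizes.

```python
import math


def checkpoint_cost(n_layers: int, segment: int):
    """Return (peak stored activations, recomputed layers) for a segment size."""
    checkpoints = math.ceil(n_layers / segment)   # one saved input per segment
    peak = checkpoints + (segment - 1)            # plus one segment rebuilt at a time
    recomputed = n_layers - checkpoints           # activations rebuilt in backward
    return peak, recomputed


n_layers = 16
costs = {k: checkpoint_cost(n_layers, k) for k in range(1, n_layers + 1)}

# Segment size that minimizes peak memory under this toy model.
best_segment = min(costs, key=lambda k: costs[k][0])
print(best_segment, costs[best_segment])
```

With 16 layers the minimum falls at a segment size of 4, the square root of the depth: peak memory drops from 16 stored activations to 7 at the price of 12 recomputed layers.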

315. ❌ Lightweight User-Personalization Method for Closed Split Computing

Authors: Yuya Okada, Takayuki Nishio Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14958v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

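Rationale: The paper studies SALT, a lightweight adaptation method for Split Computing systems, focusing on edge computing, distributed inference, user personalization, communication robustness, and privacy protection. All scoring keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), whereas this paper applies a conventional convolutional neural network (ResNet-18) to image classification and does not involve any LLM technique, architecture, or application scenario. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper addresses the degradation of inference performance in closed Split Computing systems caused by user-specific data distribution shifts, unreliable communication, and privacy-oriented perturbations. It proposes SALT, a lightweight adaptation framework that adds a compact client-side adapter to refine intermediate representations. On CIFAR-10, SALT raises personalized accuracy from 88.1% to 93.8% while significantly reducing training cost.

Abstract (Chinese translation)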

分割计算通过将深度神经网络划分为边缘端头部和服务器端尾部，实现边缘设备与云端的协同推理，从而降低延迟并限制原始输入数据的暴露。然而在实际部署中，由于用户特定的数据分布偏移、不可靠的通信以及面向隐私的扰动，推理性能往往会出现下降，特别是在模型架构和参数不可访问的封闭环境中。为应对这一挑战，我们提出SALT（分割自适应轻量调优），一种面向封闭式分割计算系统的轻量自适应框架。SALT引入了一个紧凑的客户端适配器，用于优化由冻结头部网络产生的中间表示，从而在不修改头部或尾部网络、不增加通信开销的情况下实现有效的模型自适应。通过仅修改训练条件，SALT支持多种自适应目标，包括用户个性化、通信鲁棒性和隐私感知推理。在CIFAR-10和CIFAR-100数据集上使用ResNet-18进行的实验表明，SALT相比传统的重新训练和微调方法实现了更高的准确率，同时显著降低了训练成本。在CIFAR-10数据集上，SALT将个性化准确率从88.1%提升至93.8%，同时训练延迟降低超过60%。在75%数据包丢失情况下，SALT仍保持90%以上的准确率；在噪声注入条件下（sigma = 1.0时）也保持了较高准确率（约88%）。这些结果表明，SALT为现实世界的分割计算系统提供了一个高效实用的自适应框架。

Abstract

Split Computing enables collaborative inference between edge devices and the cloud by partitioning a deep neural network into an edge-side head and a server-side tail, reducing latency and limiting exposure of raw input data. However, inference performance often degrades in practical deployments due to user-specific data distribution shifts, unreliable communication, and privacy-oriented perturbations, especially in closed environments where model architectures and parameters are inaccessible. To address this challenge, we propose SALT (Split-Adaptive Lightweight Tuning), a lightweight adaptation framework for closed Split Computing systems. SALT introduces a compact client-side adapter that refines intermediate representations produced by a frozen head network, enabling effective model adaptation without modifying the head or tail networks or increasing communication overhead. By modifying only the training conditions, SALT supports multiple adaptation objectives, including user personalization, communication robustness, and privacy-aware inference. Experiments using ResNet-18 on CIFAR-10 and CIFAR-100 show that SALT achieves higher accuracy than conventional retraining and fine-tuning while significantly reducing training cost. On CIFAR-10, SALT improves personalized accuracy from 88.1% to 93.8% while reducing training latency by more than 60%. SALT also maintains over 90% accuracy under 75% packet loss and preserves high accuracy (about 88% at sigma = 1.0) under noise injection. These results demonstrate that SALT provides an efficient and practical adaptation framework for real-world Split Computing systems.

Keywords: Split Computing, Lightweight Adaptation, User Personalization, Edge Computing, Privacy-aware Inference, Communication Robustness, SALT Framework, ResNet-18

316. ❌ SFedHIFI: Fire Rate-Based Heterogeneous Information Fusion for Spiking Federated Learning

Authors: Ran Tao, Qiugang Zhan, Shantian Yang, Xiurui Xie, Qi Tian, Guisong Liu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14956v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

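Rationale: The paper studies heterogeneity in Spiking Federated Learning and proposes the SFedHIFI framework, which uses spiking neural networks (SNNs) and fire rate-based heterogeneous information fusion. All keywords relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas this paper focuses on the intersection of spiking neural networks and federated learning and does not involve the techniques the keywords describe, such as LLMs, MoE, quantization, inference acceleration, alignment, or RAG. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

Targeting system heterogeneity among resource-constrained clients in real-world settings, the paper proposes the SFedHIFI framework, which enables heterogeneous spiking federated learning through fire rate-based heterogeneous information fusion and channel-wise matrix decomposition. It outperforms the baselines on three public benchmarks and achieves significant energy savings with only a marginal loss in accuracy.

Abstract (Chinese translation)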

脉冲联邦学习（Spiking Federated Learning, SFL）凭借脉冲神经网络（Spiking Neural Networks, SNNs）的能效优势已被广泛研究。然而，现有SFL方法要求模型同构，并假设所有客户端均拥有充足的计算资源，导致部分资源受限的客户端被排除在外。为解决现实场景中普遍存在的系统异构性问题，构建异构SFL系统至关重要，该系统允许客户端根据本地资源自适应部署不同规模的模型。为此，我们提出SFedHIFI——一种基于发放率的异构信息融合脉冲联邦学习新框架。具体而言，SFedHIFI采用通道级矩阵分解技术，在资源异构的客户端上部署自适应复杂度的SNN模型。在此基础上，所提出的异构信息融合模块实现了不同宽度模型间的跨尺度聚合，从而提升对多样化本地知识的利用效率。在三个公开基准数据集上的大量实验表明，SFedHIFI能有效实现异构SFL，其性能持续优于全部三种基线方法。与基于人工神经网络（ANN）的联邦学习相比，该方法在仅牺牲微小精度损失的前提下实现了显著的节能效果。

Abstract

Spiking Federated Learning (SFL) has been widely studied with the energy efficiency of Spiking Neural Networks (SNNs). However, existing SFL methods require model homogeneity and assume all clients have sufficient computational resources, resulting in the exclusion of some resource-constrained clients. To address the prevalent system heterogeneity in real-world scenarios, enabling heterogeneous SFL systems that allow clients to adaptively deploy models of different scales based on their local resources is crucial. To this end, we introduce SFedHIFI, a novel Spiking Federated Learning framework with Fire Rate-Based Heterogeneous Information Fusion. Specifically, SFedHIFI employs channel-wise matrix decomposition to deploy SNN models of adaptive complexity on clients with heterogeneous resources. Building on this, the proposed heterogeneous information fusion module enables cross-scale aggregation among models of different widths, thereby enhancing the utilization of diverse local knowledge. Extensive experiments on three public benchmarks demonstrate that SFedHIFI can effectively enable heterogeneous SFL, consistently outperforming all three baseline methods. Compared with ANN-based FL, it achieves significant energy savings with only a marginal trade-off in accuracy.

Keywords: Spiking Federated Learning, Heterogeneous Information Fusion, Spiking Neural Networks, Channel-wise Matrix Decomposition, Energy Efficiency, Model Heterogeneity, Cross-scale Aggregation, Resource-constrained Clients

317. ❌ Spiking Layer-Adaptive Magnitude-based Pruning

Authors: Junqiao Wang, Zhehang Ye, Yuqi Ouyang Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

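Scoring rationale: The paper studies a pruning method (SLAMP) for spiking neural networks (SNNs), which falls under neural network compression and efficient inference. It is unrelated to the vast majority of keywords (those covering large language models, training and alignment, reasoning techniques, agents, and so on). It is only partially related to 'Quantization OR Model Compression OR Low-bit Weights' (5 points): pruning is an important model compression technique, but the paper focuses on SNN-specific pruning and does not directly discuss quantization or low-bit weights.

!!! tip deepseek-chat TL;DR

The paper addresses the dense connectivity and high spiking operation costs that limit SNN deployment. It proposes SLAMP, a theory-guided layer-adaptive magnitude pruning framework that controls output distortion across layers and timesteps, and reports substantial reductions in connections and spiking operations on multiple datasets while preserving accuracy.

Abstract (Chinese translation)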

脉冲神经网络（SNNs）虽能提供高能效计算，但其部署受限于密集连接与高昂的脉冲操作开销。现有基于幅度的剪枝策略若直接应用于SNNs，往往忽略时间累积效应、非均匀时间步贡献以及膜电位稳定性，易导致严重的性能下降。本文提出脉冲层自适应幅度剪枝（Spiking Layer-Adaptive Magnitude-based Pruning, SLAMP），这是一个理论指导的剪枝框架，通过显式控制各层及各时间步的最坏情况输出失真，将层自适应幅度剪枝推广至时序SNNs。SLAMP将稀疏度分配建模为时序失真约束优化问题，生成具有时间感知的层重要性评分，其在单时间步极限下可退化为传统的层自适应剪枝。本文推导出一种高效的两阶段流程，结合时序评分估计、全局稀疏度分配以及幅度剪枝与重训练以恢复稳定性。在CIFAR10、CIFAR100以及基于事件的CIFAR10-DVS数据集上的实验表明，SLAMP在保持精度的同时，显著减少了连接数与脉冲操作量，为实现高效且可部署的SNN推理提供了可能。

Abstract

Spiking Neural Networks (SNNs) provide energy-efficient computation but their deployment is constrained by dense connectivity and high spiking operation costs. Existing magnitude-based pruning strategies, when naively applied to SNNs, fail to account for temporal accumulation, non-uniform timestep contributions, and membrane stability, often leading to severe performance degradation. This paper proposes Spiking Layer-Adaptive Magnitude-based Pruning (SLAMP), a theory-guided pruning framework that generalizes layer-adaptive magnitude pruning to temporal SNNs by explicitly controlling worst-case output distortion across layers and timesteps. SLAMP formulates sparsity allocation as a temporal distortion-constrained optimization problem, yielding time-aware layer importance scores that reduce to conventional layer-adaptive pruning in single-timestep limit. An efficient two-stage procedure is derived, combining temporal score estimation, global sparsity allocation, and magnitude pruning with retraining for stability recovery. Experiments on CIFAR10, CIFAR100, and the event-based CIFAR10-DVS datasets demonstrate that SLAMP achieves substantial connectivity and spiking operation reductions while preserving accuracy, enabling efficient and deployable SNN inference.
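
As a point of reference for the magnitude pruning step mentioned in the abstract, here is a minimal NumPy sketch of plain layer-wise magnitude pruning. It is not SLAMP: the per-layer sparsity values are hypothetical placeholders for whatever the paper's temporal distortion-constrained allocation would produce, and the layer names and shapes are invented for illustration.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of one layer.

    weights  : array of layer weights
    sparsity : fraction of entries to remove, in [0, 1)
    """
    k = int(round(sparsity * weights.size))  # number of weights to drop
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # Threshold is the k-th smallest absolute value in the layer.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
# Hypothetical layers and per-layer sparsity allocation (assumptions).
layers = {"conv1": rng.normal(size=(16, 27)), "fc": rng.normal(size=(10, 64))}
layer_sparsity = {"conv1": 0.5, "fc": 0.8}

pruned = {
    name: magnitude_prune(w, layer_sparsity[name])[0]
    for name, w in layers.items()
}
for name, w in pruned.items():
    print(name, "kept", np.count_nonzero(w), "of", w.size)
```

A layer-adaptive method differs from this sketch mainly in how `layer_sparsity` is chosen; the abstract describes that choice as a global allocation driven by time-aware layer importance scores.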

Keywords: Spiking Neural Networks, SNNs, Pruning, Model Compression, Energy-efficient Inference, Layer-adaptive Pruning, Temporal Distortion, Sparsity Allocation

318. ❌ Ultra-Early Prediction of Tipping Points: Integrating Dynamical Measures with Reservoir Computing

Authors: Xin Li, Qunxi Zhu, Chengli Zhao, Bolin Zhao, Xue Zhang, Xiaojun Duan, Wei Lin Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

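Scoring rationale: The paper studies the prediction of tipping points in complex dynamical systems (such as climate and ecosystems) and proposes a model-free framework that combines stability measures of dynamical systems with reservoir computing. Its core content is an application of AI to science (specifically, complex-systems analysis), so it is partially related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), since it applies machine learning to a scientific prediction problem. However, the paper does not involve large models (LLMs), the principles of deep learning techniques, or any of the specific techniques named in the other keywords (MoE, SFT, RAG, quantization, and so on), so all other keywords score 0. Because it uses reservoir computing, a lightweight machine learning method, rather than large models or deep learning, it does not meet the core requirement stated in the research background: 'applications of large models and deep learning in science' or 'innovations in the principles of large-model and deep learning techniques'.

!!! tip deepseek-chat TL;DR

The paper proposes a model-free framework that combines dynamical stability measures with reservoir computing to achieve ultra-early prediction of tipping points in complex dynamical systems from observational time series data. It is validated on synthetic systems and real-world datasets (including the Atlantic Meridional Overturning Circulation system), where it outperforms the baselines.

Abstract (Chinese translation)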

复杂动力系统——如气候、生态系统和经济系统——可能发生灾难性且潜在不可逆的状态转变，这种转变通常由环境参数漂移和随机扰动触发。这些被称为临界点（tipping points）的关键阈值，提出了一个兼具理论与实际意义的预测难题，但至今仍未得到充分解决。为此，我们提出了一种无模型框架，该框架仅利用观测时间序列数据，将表征动力系统稳定性与敏感性的动力学指标与一种轻量级机器学习技术——储备池计算（Reservoir Computing, RC）相结合。该框架包含两个阶段：第一阶段利用RC从分段窗口化的观测数据中稳健地学习局部复杂动力学；第二阶段则通过分析学习得到的自主RC动力学，借助包括雅可比矩阵主导特征值、最大弗洛凯乘子（Floquet multiplier）和最大李雅普诺夫指数（Lyapunov exponent）在内的动力学指标，以精确检测临界点的早期预警信号。此外，当这些动力学指标呈现趋势性变化模式时，对其外推可实现在临界转变发生前的超早期预测。我们对所提方法进行了严格的理论分析，并在多个代表性合成系统与八个真实世界数据集上进行了广泛的数值评估，同时定量预测了大西洋经向翻转环流系统的临界转变时间。实验结果表明，我们的框架在综合评估中展现出优于基线方法的优势，尤其在动力学可解释性、预测稳定性与鲁棒性以及超早期预测能力方面。

Abstract

Complex dynamical systems, such as climate, ecosystems, and economics, can undergo catastrophic and potentially irreversible regime changes, often triggered by environmental parameter drift and stochastic disturbances. These critical thresholds, known as tipping points, pose a prediction problem of both theoretical and practical significance, yet remain largely unresolved. To address this, we articulate a model-free framework that integrates the measures characterizing the stability and sensitivity of dynamical systems with the reservoir computing (RC), a lightweight machine learning technique, using only observational time series data. The framework consists of two stages. The first stage involves using RC to robustly learn local complex dynamics from observational data segmented into windows. The second stage focuses on accurately detecting early warning signals of tipping points by analyzing the learned autonomous RC dynamics through dynamical measures, including the dominant eigenvalue of the Jacobian matrix, the maximum Floquet multiplier, and the maximum Lyapunov exponent. Furthermore, when these dynamical measures exhibit trend-like patterns, their extrapolation enables ultra-early prediction of tipping points significantly prior to the occurrence of critical transitions. We conduct a rigorous theoretical analysis of the proposed method and perform extensive numerical evaluations on a series of representative synthetic systems and eight real-world datasets, as well as quantitatively predict the tipping time of the Atlantic Meridional Overturning Circulation system. Experimental results demonstrate that our framework exhibits advantages over the baselines in comprehensive evaluations, particularly in terms of dynamical interpretability, prediction stability and robustness, and ultra-early prediction capability.
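
The two stages described in the abstract can be illustrated, very loosely, with a toy echo state network in NumPy. This is a sketch under stated assumptions, not the authors' implementation: the sine-wave input, the hyperparameters (`n_res`, `washout`, `ridge`, spectral radius 0.9) and the choice to evaluate the Jacobian at a single state are all placeholders. Only one of the three measures named in the abstract, the dominant Jacobian eigenvalue, is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 100, 50, 1e-6  # hypothetical hyperparameters

# Toy observation window: a sine wave standing in for real data.
u = np.sin(0.2 * np.arange(600))

# Fixed random reservoir, rescaled to spectral radius 0.9.
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.uniform(-0.5, 0.5, size=n_res)

# Stage 1: drive the reservoir with the input and record its states.
x = np.zeros(n_res)
states = np.zeros((len(u) - 1, n_res))
for t in range(len(u) - 1):
    x = np.tanh(W @ x + w_in * u[t])
    states[t] = x

# Ridge-regression readout: the state after u[t] predicts u[t + 1].
X, y = states[washout:], u[washout + 1:]
w_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
rmse = np.sqrt(np.mean((X @ w_out - y) ** 2))

# Stage 2: feeding the prediction back as input gives the autonomous map
# x -> tanh((W + w_in w_out^T) x). Its Jacobian, using the last driven
# state x as the post-activation value, is
# diag(1 - x^2) @ (W + outer(w_in, w_out)).
J = (1.0 - x ** 2)[:, None] * (W + np.outer(w_in, w_out))
dominant = np.max(np.abs(np.linalg.eigvals(J)))

print(f"training RMSE: {rmse:.2e}, dominant |eigenvalue|: {dominant:.3f}")
```

In a windowed setting one would repeat this fit per window and track how `dominant` changes from window to window; how the paper extrapolates such trends is not specified in the abstract, so that step is left out here.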

Keywords: tipping points, reservoir computing, dynamical systems, early warning signals, time series prediction, complex systems, model-free framework, ultra-early prediction

319. ❌ Intelligent Control of Differential Drive Robots Subject to Unmodeled Dynamics with EKF-based State Estimation

Authors: Amos Alwala, Yuchen Hu, Gabriel da Silva Lima, Wallace Moreira Bessa Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

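Scoring rationale: The paper focuses on robot control, studying intelligent control and state estimation for differential drive robots using adaptive neural networks (ANN) and an extended Kalman filter (EKF). Its content is entirely unrelated to all scoring keywords (which center on large models, the principles of deep learning techniques, and their applications in science): it does not touch on large models, language models, alignment, fine-tuning, inference acceleration, AI for Science, or similar topics. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a unified control and state estimation framework that combines adaptive neural networks with an extended Kalman filter to solve trajectory tracking for differential drive robots in dynamic, uncertain environments. Experiments show that the method significantly reduces linear and angular velocity errors compared with the baseline.

Abstract (Chinese translation)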

在动态与不确定环境中运行的差速驱动机器人（Differential Drive Robot, DDR），其可靠控制与状态估计仍面临挑战，尤其当系统动力学部分未知且传感器测量易出现退化时。本研究提出一种统一控制与状态估计框架，该框架结合了基于李雅普诺夫的非线性控制器、自适应神经网络（Adaptive Neural Network, ANN）以及基于扩展卡尔曼滤波（Extended Kalman Filter, EKF）的多传感器融合方法。所提出的控制器利用神经网络的通用逼近特性实时建模未知非线性动态。通过在线自适应方案，对选为ANN架构的径向基函数（Radial Basis Function, RBF）权重进行更新。学习到的动态模型被整合到反馈线性化（Feedback Linearization, FBL）控制律中，并通过类李雅普诺夫稳定性分析，在轨迹跟踪任务中为闭环系统的稳定性与渐近收敛性提供了理论保证。为实现鲁棒的状态估计，EKF融合了来自单目相机、二维激光雷达和轮式编码器的惯性测量单元（Inertial Measurement Unit, IMU）数据与里程计信息。融合后的状态估计驱动智能控制器，即使在漂移、车轮打滑、传感器噪声与故障条件下仍能保持稳定性能。通过Gazebo仿真与真实DDR平台实验验证了该方法的有效性：与基准FBL方法相比，所提方法在线速度和角速度跟踪误差上分别降低了53.91%和29.0%，显著提升了速度跟踪性能。

Abstract

Reliable control and state estimation of differential drive robots (DDR) operating in dynamic and uncertain environments remains a challenge, particularly when system dynamics are partially unknown and sensor measurements are prone to degradation. This work introduces a unified control and state estimation framework that combines a Lyapunov-based nonlinear controller and Adaptive Neural Networks (ANN) with Extended Kalman Filter (EKF)-based multi-sensor fusion. The proposed controller leverages the universal approximation property of neural networks to model unknown nonlinearities in real time. An online adaptation scheme updates the weights of the radial basis function (RBF), the architecture chosen for the ANN. The learned dynamics are integrated into a feedback linearization (FBL) control law, for which theoretical guarantees of closed-loop stability and asymptotic convergence in a trajectory-tracking task are established through a Lyapunov-like stability analysis. To ensure robust state estimation, the EKF fuses inertial measurement unit (IMU) and odometry from monocular, 2D-LiDAR and wheel encoders. The fused state estimate drives the intelligent controller, ensuring consistent performance even under drift, wheel slip, sensor noise and failure. Gazebo simulations and real-world experiments are done using DDR, demonstrating the effectiveness of the approach in terms of improved velocity tracking performance with reduction in linear and angular velocity errors up to $53.91\%$ and $29.0\%$ in comparison to the baseline FBL.
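
For readers unfamiliar with the EKF part of the pipeline, the following is a minimal predict/update cycle for a differential drive (unicycle) pose in NumPy. It is an illustrative sketch only: the state layout `[px, py, theta]`, the position-only measurement, and the noise covariances `Q` and `R` are assumptions made here, not the sensor model or tuning used in the paper.

```python
import numpy as np

def ekf_predict(x, P, u, Q, dt):
    """Propagate pose [px, py, theta] with the unicycle motion model."""
    v, w = u  # linear and angular velocity commands
    px, py, th = x
    x_pred = np.array([px + v * np.cos(th) * dt,
                       py + v * np.sin(th) * dt,
                       th + w * dt])
    # Jacobian of the motion model with respect to the state.
    F = np.array([[1.0, 0.0, -v * np.sin(th) * dt],
                  [0.0, 1.0,  v * np.cos(th) * dt],
                  [0.0, 0.0, 1.0]])
    return x_pred, F @ P @ F.T + Q

def ekf_update(x, P, z, R):
    """Correct the pose with a position measurement z = [px, py]."""
    H = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])  # measurement model is linear here
    innovation = z - H @ x
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ innovation
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new

# Hypothetical initial belief and noise levels (assumptions).
x, P = np.zeros(3), np.eye(3) * 0.1
Q, R = np.eye(3) * 1e-3, np.eye(2) * 1e-2

x, P = ekf_predict(x, P, u=(1.0, 0.0), Q=Q, dt=0.1)
x, P = ekf_update(x, P, z=np.array([0.12, 0.0]), R=R)
print(x)
```

After the update, the estimated `px` lies between the predicted value (0.1) and the measured value (0.12), weighted by the relative confidence encoded in `P` and `R`.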

Keywords: differential drive robots, adaptive neural networks, extended Kalman filter, state estimation, trajectory tracking, feedback linearization, Lyapunov stability, multi-sensor fusion

320. ❌ Masked BRep Autoencoder via Hierarchical Graph Transformer

Authors: Yifei Li, Kang Wu, Wenming Wu, Xiaoming Fu Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

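Scoring rationale: The paper proposes a self-supervised learning framework for computer-aided design (CAD) models, mainly involving graph neural networks, the Transformer architecture, and self-supervised learning. It is unrelated to most large language model (LLM) keywords, which target large-model techniques in natural language processing. It is nevertheless partially related to the following keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the paper uses self-supervised pre-training to learn representations of CAD models and then adapts to downstream tasks, which relates to pre-training and domain adaptation. 2) 'Post-training OR Supervised Fine-tuning OR SFT' (5 points): after pre-training, the paper trains task-specific networks on a small amount of labeled data, similar to supervised fine-tuning. 3) 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): the paper applies AI to CAD model analysis, an application of AI in science and engineering that is highly related to 'AI for Science'. Other keywords such as LLM, MoE, alignment, and RAG are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The paper proposes a masked BRep autoencoder framework based on a hierarchical graph Transformer that learns representations from CAD models through self-supervised learning and achieves high performance on downstream tasks with only a small amount of labeled data.

Abstract (Chinese translation)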

本文提出了一种新颖的自监督学习框架，能够自动从输入的计算机辅助设计（CAD）模型中学习表征，以应用于下游任务，包括零件分类、建模分割和加工特征识别。为训练我们的网络，我们构建了一个大规模、无标签的边界表示（BRep）模型数据集。我们算法的成功依赖于两个关键组件。其一是掩码图自编码器，它通过重建随机掩码的BRep几何结构与属性来进行表征学习，以增强泛化能力。其二是分层图Transformer架构，该架构通过跨尺度互注意力模块来建模长程几何依赖关系，并通过图神经网络模块来聚合局部拓扑信息，从而优雅地融合了全局与局部学习。在完成自编码器的预训练后，我们将其解码器替换为针对特定任务、使用少量标注数据进行训练的网络，以执行下游任务。我们在多种任务上进行了实验，即使仅使用少量标注数据，也取得了优异的性能，这证明了我们模型的实用性与泛化能力。与其他方法相比，在相同训练数据量下，我们的模型在下游任务上表现显著更优，尤其是在训练数据极为有限的情况下。

Abstract

We introduce a novel self-supervised learning framework that automatically learns representations from input computer-aided design (CAD) models for downstream tasks, including part classification, modeling segmentation, and machining feature recognition. To train our network, we construct a large-scale, unlabeled dataset of boundary representation (BRep) models. The success of our algorithm relies on two key components. The first is a masked graph autoencoder that reconstructs randomly masked geometries and attributes of BReps for representation learning to enhance the generalization. The second is a hierarchical graph Transformer architecture that elegantly fuses global and local learning by a cross-scale mutual attention block to model long-range geometric dependencies and a graph neural network block to aggregate local topological information. After training the autoencoder, we replace its decoder with a task-specific network trained on a small amount of labeled data for downstream tasks. We conduct experiments on various tasks and achieve high performance, even with a small amount of labeled data, demonstrating the practicality and generalizability of our model. Compared to other methods, our model performs significantly better on downstream tasks with the same amount of training data, particularly when the training data is very limited.
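
The masked-reconstruction objective mentioned in the abstract can be sketched in a few lines of NumPy. This is a generic illustration, not the paper's model: the node count, feature width, 50% mask ratio, and zero placeholder for masked nodes are all assumptions, and no encoder or decoder is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical BRep graph: 12 face nodes with 8 attributes each.
features = rng.normal(size=(12, 8))

# Randomly choose half of the nodes to mask (assumed mask ratio).
mask_ratio = 0.5
n_masked = int(mask_ratio * len(features))
masked_idx = rng.choice(len(features), size=n_masked, replace=False)

# The encoder would see this corrupted input; masked nodes are
# replaced by a zero placeholder.
corrupted = features.copy()
corrupted[masked_idx] = 0.0

def reconstruction_loss(predicted, target, masked_idx):
    """Mean squared error computed on the masked nodes only."""
    diff = predicted[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

# A perfect reconstruction scores 0; echoing the corrupted input does not.
print(reconstruction_loss(features, features, masked_idx))
print(reconstruction_loss(corrupted, features, masked_idx))
```

Restricting the loss to masked nodes is what forces a model to infer the hidden geometry from the visible neighbours instead of copying its input.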

Keywords: self-supervised learning, graph Transformer, CAD models, BRep autoencoder, representation learning, downstream tasks, hierarchical graph, masked reconstruction

321. ❌ A multiscale discrete-to-continuum framework for structured population models

Authors: Eleonora Agostinelli, Keith L. Chambers, Helen M. Byrne, Mohit P. Dalwadi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

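Scoring rationale: The paper focuses on mathematical modeling and computational biology, proposing a discrete-to-continuum multiscale framework for structured population models and applying it to a lipid-structured model of early atherosclerosis. All keywords relate directly to large language models, the principles of deep learning techniques, or AI applications, whereas this paper is pure mathematics and computational biology and involves no AI, machine learning, or large-model techniques. The only marginally related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves mathematical models in bioinformatics/computational biology, though it does not use AI methods; it is therefore given 5 points (partially related). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The study proposes a multiscale discrete-to-continuum framework for systematically deriving continuum approximations of structured population models. It resolves the ambiguities of conventional upscaling methods regarding truncation order, uniform validity, and boundary conditions, and its validity is verified on a lipid-structured model of early atherosclerosis.

Abstract (Chinese translation)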

生物种群数学模型通常采用离散结构类来捕捉个体间的性状差异（如年龄、尺寸、表型、细胞内状态）。将这些离散模型升尺度为连续描述可提升解析处理能力与数值解的可扩展性。然而，仅基于泰勒展开的传统升尺度方法可能在截断阶数、一致有效性及边界条件方面引入模糊性。为此，本文提出一种离散多尺度框架，以系统性地推导结构化种群模型的连续近似。通过将多尺度方法与匹配渐近展开法应用于离散系统，我们识别了结构空间中适合连续描述的区域，并推导出相应的偏微分方程。主导阶动力学在主体区域表现为非线性平流方程，而在前沿波与停滞点附近的小内层区域则呈现平流-扩散过程。对于连续描述本质上不适用的区域，我们进一步推导了离散边界层描述。最后，我们以早期动脉粥样硬化的简单脂质结构模型为例演示该方法，并验证离散与连续描述间的一致性。所提出的多尺度框架可应用于其他具有离散结构的异质系统，从而获得具有渐近一致边界条件的恰当升尺度动力学。

Abstract

Mathematical models of biological populations commonly use discrete structure classes to capture trait variation among individuals (e.g. age, size, phenotype, intracellular state). Upscaling these discrete models into continuum descriptions can improve analytical tractability and scalability of numerical solutions. Common upscaling approaches based solely on Taylor expansions may, however, introduce ambiguities in truncation order, uniform validity and boundary conditions. To address this, here we introduce a discrete multiscale framework to systematically derive continuum approximations of structured population models. Using the method of multiple scales and matched asymptotic expansions applied to discrete systems, we identify regions of structure space for which a continuum representation is appropriate and derive the corresponding partial differential equations. The leading-order dynamics are given by a nonlinear advection equation in the bulk domain and advection-diffusion processes in small inner layers about the leading wavefronts and stagnation point. We further derive discrete boundary layer descriptions for regions where a continuum representation is fundamentally inappropriate. Finally, we demonstrate the method on a simple lipid-structured model for early atherosclerosis and verify consistency between the discrete and continuum descriptions. The multiscale framework we present can be applied to other heterogeneous systems with discrete structure in order to obtain appropriate upscaled dynamics with asymptotically consistent boundary conditions.

Keywords: structured population models, discrete-to-continuum framework, multiscale analysis, asymptotic expansions, partial differential equations, lipid-structured model, atherosclerosis, boundary conditions

322. ❌ Fold-CP: A Context Parallelism Framework for Biomolecular Modeling

Authors: Dejun Lin, Simon Chu, Vishanth Iyer, Youhan Lee, John St John, Kevin Boyd, Brian Roland, Xiaowei Ren, Guoqing Zhou, Zhonglin Cao, Polina Binder, Yuliya Zhautouskaya, Jakub Zakrzewski, Maximilian Stadler, Kyle Gion, Yuxing Peng, Xi Chen, Tianjing Zhang, Philipp Junk, Michelle Dimon, Paweł Gniewek, Fabian Ortega, McKinley Polen, Ivan Grubisic, Ali Bashir, Graham Holt, Danny Kovtun, Matthias Grass, Luca Naef, Rui Wang, Jian Peng, Anthony Costa, Saee Paliwal, Eddie Calleja, Timur Rvachov, Neha Tadimeti, Roy Tal, Emine Kucukbenli Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

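Scoring rationale: The paper focuses on developing a parallel computing framework for biomolecular modeling. Its core contribution is the Fold-CP framework, which uses context parallelism to overcome the hardware memory limits of models such as AlphaFold 3, making structure prediction of large biomolecular assemblies possible. The paper is unrelated to the vast majority of keywords (those covering the principles of large-model techniques, training methods, inference optimization, alignment, and so on), because these keywords target language models and general AI techniques, whereas this paper is a domain-specific computing framework. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper clearly belongs to AI applications in bioinformatics (biomolecular structure prediction) and has significant scientific value, so it is given 10 points (highly related).

!!! tip deepseek-chat TL;DR

The paper proposes the Fold-CP framework, which uses context parallelism to overcome memory limits in biomolecular modeling, making it possible to predict structures of large biomolecular assemblies exceeding 30,000 residues. It has been applied to scoring a protein complex database and to folding a disease-relevant complex.

Abstract (Chinese translation)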

理解细胞机制需要对大型生物分子组装体进行原子尺度重构。然而，预测此类系统的结构一直受到AlphaFold 3等模型硬件内存需求的限制，导致在单个GPU上可处理的残基数存在数千的实际上限。本文介绍NVIDIA BioNeMo Fold-CP——一种上下文并行框架，该框架通过将共折叠模型的推理和训练流程分布到多个GPU上，突破了这一限制。我们采用Boltz模型作为开源参考架构，并实现了定制化的多维原语，能够高效并行化密集三角更新与窗口批处理局部注意力中不规则且数据依赖的模式。我们的方法实现了高效的内存扩展：对于分布在P个GPU上的N个标记输入，单设备内存需求按$O(N^2/P)$比例缩放，从而在64个NVIDIA B300 GPU上实现了超过30,000个残基的组装体结构预测。我们通过成功的开发者用例证明了该方法的科学实用性：Fold-CP实现了对哺乳动物蛋白质复合物综合资源数据库超过90%条目的评分，并完成了与疾病相关的PI4KA脂质激酶复合物（结合本质无序区域）的完整折叠而无需截断处理。通过为具有完整全局上下文的大规模系统建模提供可扩展路径，Fold-CP标志着向实现虚拟细胞目标迈出了重要一步。

Abstract

Understanding cellular machinery requires atomic-scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by hardware memory requirements of models like AlphaFold 3, imposing a practical ceiling of a few thousand residues that can be processed on a single GPU. Here we present NVIDIA BioNeMo Fold-CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co-folding models across multiple GPUs. We use the Boltz models as open source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data-dependent pattern of window-batched local attention. Our approach achieves efficient memory scaling; for an N-token input distributed across P GPUs, per-device memory scales as $O(N^2/P)$, enabling the structure prediction of assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold-CP enabled the scoring of over 90% of Comprehensive Resource of Mammalian protein complexes (CORUM) database, as well as folding of disease-relevant PI4KA lipid kinase complex bound to an intrinsically disordered region without cropping. By providing a scalable pathway for modeling massive systems with full global context, Fold-CP represents a significant step toward the realization of a virtual cell.
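
To make the $O(N^2/P)$ scaling in the abstract concrete, here is a back-of-the-envelope calculation in Python. It counts a single $N \times N \times C$ pair tensor split evenly over $P$ GPUs; the channel width of 128 and 2 bytes per value are assumptions chosen for illustration, and a real model holds many more activations than this one tensor.

```python
def pair_memory_gib(n_tokens, n_gpus, channels=128, bytes_per_value=2):
    """Per-device GiB for one N x N x C pair tensor sharded evenly over P GPUs.

    channels and bytes_per_value are illustrative assumptions, not
    figures taken from the paper.
    """
    total_bytes = n_tokens ** 2 * channels * bytes_per_value
    return total_bytes / n_gpus / 2 ** 30

single = pair_memory_gib(30_000, 1)    # everything on one device
sharded = pair_memory_gib(30_000, 64)  # split across 64 devices

print(f"1 GPU:   {single:.1f} GiB")
print(f"64 GPUs: {sharded:.2f} GiB per device")
```

Under these assumptions the single tensor alone is on the order of hundreds of GiB on one device and a few GiB per device when sharded 64 ways, which illustrates why the quadratic term dominates at tens of thousands of tokens.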

Keywords: biomolecular modeling, context parallelism, AlphaFold 3, GPU memory scaling, protein structure prediction, large assemblies, parallel computing, bioinformatics

323. ❌ Countershading coloration in blue shark skin emerges from hierarchically organized and spatially tuned photonic architectures inside skin denticles

Authors: Viktoriia Kamska, Emeline Raguin, Bodo D. Wilts, Luca Bertinetti, Chiara Micheletti, Clemens Schmitt, Shahrouz Amini, Maria Murace, Frederik H. Mollen, Michael Blumer, Maite Erauskin Extramiana, Ruien Hu, Stefan Redl, Mason N. Dean Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.13937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

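Scoring rationale: The paper studies the physical mechanism of countershading coloration in blue shark skin, which belongs to biophysics and materials science and is entirely unrelated to the vast majority of large-model and deep learning keywords. The only possibly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study analyzes the physical properties of a biological system (the blue shark) and falls under 'AI for Science' in a broad sense; since the paper itself uses no AI methods, it is given only 5 points (partially related).

!!! tip deepseek-chat TL;DR

The study shows that countershading coloration in blue shark skin does not arise from dermal chromatophores but from a previously unrecognized photonic architecture inside the individual dermal denticles that cover the skin; this architecture produces the color variation through hierarchical cellular and nanocrystal organization.

Abstract (Chinese translation)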

蓝鲨（Prionace glauca）呈现出显著的背腹颜色梯度，从背部鲜艳的蓝色过渡到腹部的银白色，这种模式被广泛解释为远洋反荫蔽。尽管其生态意义重大，但这种色彩形成的物理基础仍未得到解决。本文揭示，该色彩系统并非像大多数脊椎动物那样源于真皮色素细胞，而是源自覆盖皮肤的单个真皮小齿髓腔内一种先前未被识别的光子结构。光学成像显示小齿冠部存在离散的颜色区域，而外部小齿形态在不同颜色区域间保持相似。通过光谱学、显微计算机断层扫描、组织学及相关电子显微镜技术，我们证明颜色变化是由相互耦合的微米级和纳米级结构共同调控的。在蓝色小齿中，虹彩细胞和黑色素细胞在扩大的、局限于冠部的髓腔内形成一个密集镶嵌的反射-吸收系统。过渡区小齿表现出部分细胞分层结构，而白色小齿则缺乏黑色素细胞，仅包含反射细胞。在纳米尺度上，有序排列的嘌呤晶体堆栈产生窄带蓝色反射，而无序组装则产生宽带白色散射。综上所述，这些结果表明小齿作为受机械保护的光学“像素”，其层级化的细胞和纳米晶体结构共同产生了鲨鱼的反荫蔽体色。

Abstract

The blue shark (Prionace glauca) exhibits a striking dorsoventral color gradient, transitioning from vibrant blue dorsally to silver and white ventrally, a pattern widely interpreted as pelagic countershading. Despite its ecological significance, the physical basis of this coloration remains unresolved. Here we show that this color system does not arise from dermal chromatophores, as in most vertebrates, but from a previously unrecognised photonic architecture housed within the pulp cavity of individual dermal denticles that cover the skin. Optical imaging reveals discrete color domains within denticle crowns, while external denticle morphology remains similar across color zones. Using spectroscopy, micro-computed tomography, histology, and correlative electron microscopy, we demonstrate that color variation is organized across coupled micro- and nanoscale architectures. In blue denticles, iridophores and melanophores form a densely packed tessellated reflector-absorber system within an expanded crown-restricted pulp cavity. Transition-zone denticles exhibit partial cellular layering, whereas white denticles lack melanophores and contain only reflective cells. At the nanoscale, ordered purine-crystal stacks generate narrowband blue reflection, whereas disordered assemblies produce broadband white scattering. Together, these results reveal denticles as mechanically protected optical “pixels” whose hierarchical cellular and nanocrystal organization generates the shark’s countershaded coloration.

Keywords: blue shark, countershading coloration, dermal denticles, photonic architecture, iridophores, melanophores, purine-crystal stacks, hierarchical organization

324. ❌ Reproducible Orchestration of Best Practices for Reaction Path Optimization with the Nudged Elastic Band

Authors: Rohit Goswami Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

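Scoring rationale: The paper belongs to computational chemistry. It develops an automated Snakemake workflow for reaction path optimization that combines a machine learning potential (PET-MAD) with established computational chemistry software (eOn). Of all the keywords, only “AI for Science OR Bioinformatics OR Cheminformatics” is partly relevant (5 points), because the paper applies AI to scientific computing (specifically cheminformatics and computational chemistry), but it does not involve large models, innovation in deep learning principles, or other large-model-specific techniques. The remaining keywords concern large models, deep learning principles, training methods, inference optimization, agents, and similar topics, none of which the paper addresses, so each is scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the errors and reproducibility problems that manual pre-processing steps cause in Nudged Elastic Band calculations. It develops a fully automated Snakemake workflow that incorporates a machine learning potential, automates the steps from endpoint optimization to path generation, and validates the workflow on the HCN isomerization reaction.

Abstract (translated)

The Nudged Elastic Band (NEB) method is the standard approach for finding minimum energy paths and transition states on potential energy surfaces. Practical NEB calculations require several pre-processing steps: endpoint energy minimization, structural alignment, and initial path generation. These steps are usually handled with ad hoc scripts or manual intervention, which easily introduces errors and hinders reproducibility. We present a fully automated, open-source Snakemake workflow for small gas-phase molecules that couples a modern machine learning potential (PET-MAD) with the eOn saddle point search software. Every step of the calculation lifecycle, from model retrieval and endpoint preparation through path initialization and band optimization, is encoded as an explicit dependency graph. The workflow resolves all software dependencies through conda-forge, ensuring consistent execution across platforms. Validation on the HCN to HNC isomerization shows that the automated pipeline reproduces the known single-barrier energy profile and product energy without manual intervention.

Abstract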

The nudged elastic band (NEB) method is the standard approach for finding minimum energy paths and transition states on potential energy surfaces. Practical NEB calculations require several pre-processing steps: endpoint minimization, structural alignment, and initial path generation. These steps are typically handled by ad-hoc scripts or manual intervention, introducing errors and hindering reproducibility. We present a fully automated, open-source Snakemake workflow for small gas phase molecules that couples modern machine learning potentials (PET-MAD) to the eOn saddle point search software. Each step of the calculation lifecycle is encoded as an explicit dependency graph, from model retrieval and endpoint preparation through path initialization and band optimization. The workflow resolves all software dependencies from conda-forge, ensuring identical execution across platforms. Validation on the HCN to HNC isomerization demonstrates that the automated pipeline recovers the known single-barrier energy profile and product energy without manual intervention.

Keywords: Nudged Elastic Band, reaction path optimization, Snakemake workflow, machine learning potentials, reproducibility, automated pipeline, transition states, computational chemistry
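
To make the band optimization step in this entry concrete, the following is a minimal sketch of a plain NEB relaxation on a two-dimensional analytic surface. It is not the paper's workflow: the toy potential, the function names (`energy`, `gradient`, `neb`), the spring constant, and the step size are all assumptions chosen for illustration, and the sketch uses simple steepest descent without a climbing image, PET-MAD, eOn, or Snakemake.

```python
import numpy as np

def energy(p):
    """Toy double-well surface (an assumption): minima at (-1, 0) and (1, 0),
    saddle at (0, 0.5) with barrier height 1."""
    x, y = p
    return (x**2 - 1.0)**2 + 5.0 * (y - 0.5 * (1.0 - x**2))**2

def gradient(p):
    """Analytic gradient of the toy surface."""
    x, y = p
    u = y - 0.5 * (1.0 - x**2)
    return np.array([4.0 * x * (x**2 - 1.0) + 10.0 * u * x, 10.0 * u])

def neb(start, end, n_images=9, k_spring=1.0, step=0.01, n_steps=3000):
    """Relax a straight-line initial path with the basic NEB force projection."""
    path = np.linspace(start, end, n_images)  # initial path generation
    for _ in range(n_steps):
        new_path = path.copy()
        for i in range(1, n_images - 1):  # endpoints stay fixed
            # Local tangent estimated from the two neighbouring images
            tangent = path[i + 1] - path[i - 1]
            tangent /= np.linalg.norm(tangent)
            # Keep only the component of the true force perpendicular to the path
            true_force = -gradient(path[i])
            perpendicular = true_force - np.dot(true_force, tangent) * tangent
            # Spring force acts along the tangent and evens out image spacing
            gap_next = np.linalg.norm(path[i + 1] - path[i])
            gap_prev = np.linalg.norm(path[i] - path[i - 1])
            spring = k_spring * (gap_next - gap_prev) * tangent
            new_path[i] = path[i] + step * (perpendicular + spring)
        path = new_path
    return path

relaxed = neb(np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
barrier = max(energy(p) for p in relaxed)
print(f"estimated barrier: {barrier:.3f}")
```

On this toy surface the highest image of the relaxed band should sit near the saddle point, so the printed barrier approximates the known barrier height of the surface. A production setup would replace `energy` and `gradient` with calls to the chosen potential.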

325. ❌ Design Space of Self–Consistent Electrostatic Machine Learning Interatomic Potentials

Authors: William J. Baldwin, Ilyes Batatia, Martin Vondrák, Johannes T. Margraf, Gábor Csányi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14700v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

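Scoring rationale: The paper studies how machine learning interatomic potentials (MLIPs) model electrostatic effects, an AI application in scientific computing and materials science. Its content is unrelated to most of the keywords, which cover large language model techniques, training methods, inference optimization, alignment, and similar topics in natural language processing. The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”: the paper falls within AI for Science, so it receives 10 points (highly relevant).

!!! tip deepseek-chat TL;DR

The paper examines the limitations of self-consistent electrostatic models in machine learning interatomic potentials, proposes a theoretical framework based on density functional theory that unifies existing models, and uses test cases on metal-water interfaces and charged vacancies in silicon dioxide to show that more sophisticated self-consistent models are needed where existing methods fail.

Abstract (translated)

Machine learning interatomic potentials (MLIPs) have become widely used tools in atomistic simulation. For much of the field’s history, the most common architectures were based on short-range atomic energy contributions, and the locality assumption persists in many modern foundation models. Although this approach has enabled efficient and accurate modeling in many applications, it has intrinsic limitations for systems in which long-range electrostatic interactions, charge transfer, or induced polarization play a central role. A growing body of work has proposed extensions that incorporate electrostatic effects, ranging from locally predicted atomic charges to self-consistent models. Although these models have succeeded in specific cases, their underlying assumptions and fundamental limitations are not yet well understood. In this work, we present a framework for treating electrostatics in MLIPs that views existing models as coarse-grained approximations to density functional theory (DFT). This perspective makes the approximations explicit, clarifies the physical meaning of the learned quantities, and reveals connections and equivalences among several previously proposed models. Using this formalism, we identify the key design choices that define a broader design space of self-consistent electrostatic MLIPs. We implement several salient points in this space using the MACE architecture and a shared representation of the charge density, which allows controlled comparison of different approaches. Finally, we evaluate these models on two instructive test cases: metal-water interfaces, which probe the contrasting electrostatic responses of conducting and insulating systems, and charged vacancies in silicon dioxide. Our results expose the limitations of existing methods and show that more expressive self-consistent models are needed to resolve their failures.

Abstract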

Machine learning interatomic potentials (MLIPs) have become widely used tools in atomistic simulations. For much of the history of this field, the most commonly employed architectures were based on short-ranged atomic energy contributions, and the assumption of locality still persists in many modern foundation models. While this approach has enabled efficient and accurate modelling for many use cases, it poses intrinsic limitations for systems where long-range electrostatics, charge transfer, or induced polarization play a central role. A growing body of work has proposed extensions that incorporate electrostatic effects, ranging from locally predicted atomic charges to self-consistent models. While these models have demonstrated success for specific examples, their underlying assumptions, and fundamental limitations are not yet well understood. In this work, we present a framework for treating electrostatics in MLIPs by viewing existing models as coarse-grained approximations to density functional theory (DFT). This perspective makes explicit the approximations involved, clarifies the physical meaning of the learned quantities, and reveals connections and equivalences between several previously proposed models. Using this formalism, we identify key design choices that define a broader design space of self-consistent electrostatic MLIPs. We implement salient points in this space using the MACE architecture and a shared representation of the charge density, enabling controlled comparisons between different approaches. Finally, we evaluate these models on two instructive test cases: metal-water interfaces, which probe the contrasting electrostatic response of conducting and insulating systems, and charged vacancies in silicon dioxide. Our results highlight the limitations of existing approaches and demonstrate how more expressive self-consistent models are needed to resolve failures.

Keywords: machine learning interatomic potentials, electrostatics, self-consistent models, density functional theory, MACE architecture, charge density, metal-water interfaces, charged vacancies

326. ❌ Stochastic Collision Theory of Magnetism in Radical Fluids

Authors: Yoshiaki Uchida, Ryohei Kishi Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

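Scoring rationale: The paper studies a quantum master equation model of radical solutions and the resulting magnetic mechanism, which places it in theoretical physics and soft matter physics. All keywords concern large models, deep learning, AI techniques, and their applications, none of which relate to the paper’s physical theory, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

By building a quantum master equation model, the paper explains the microscopic mechanism by which random molecular collisions in radical solutions produce an effective ferromagnetic coupling and enhance magnetization, revealing magnetic behavior that departs from conventional theories.

Abstract (translated)

How microscopic stochastic events give rise to macroscopic deterministic properties is a fundamental question in physics. We address it by building a quantum master equation model of concentrated radical solutions, in which random molecular collisions govern the system’s magnetic properties. Our theory reveals a simple mechanism: the first-order exchange contribution averages to zero over collisions, while the second-order term survives as an effective ferromagnetic coupling that enhances magnetization. The model captures the experimentally observed trends in magnetic behavior that deviate from conventional theories. Because the mechanism arises from statistical averaging, it may apply to a broader range of soft matter phenomena, including liquid crystals.

Abstract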

How stochastic, microscopic events generate deterministic, macroscopic properties is a fundamental question in physics. We address this question by developing a quantum master equation model for concentrated radical solutions, where random molecular collisions govern the magnetic properties of the system. Our theory reveals a simple mechanism: the first-order exchange contribution averages to zero over collisions, while the second-order term survives as an effective ferromagnetic coupling that enhances magnetization. The model captures the experimentally observed trends in magnetic behavior that deviate from conventional theories. Because the mechanism arises from statistical averaging, it may apply to a broader class of soft matter phenomena, including liquid crystals.

Keywords: stochastic collision theory, magnetism, radical fluids, quantum master equation, ferromagnetic coupling, magnetization, soft matter, molecular collisions

327. ❌ Acrylamide Conformers: A Revision of Published Density Functional Theory Studies

Authors: William Scott, Estela Blaisten-Barojas Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.14675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

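Scoring rationale: The paper is a pure computational chemistry study that uses density functional theory (DFT) to investigate the conformers of acrylamide. It involves no large models, deep learning, AI techniques, or related methods, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

Using high-precision density functional theory calculations, the study settles the disagreement over the number of stable acrylamide conformers, confirming three stable conformers (one planar structure and two mirror-image three-dimensional structures) and the three transition state barriers between them.

Abstract (translated)

Acrylamide (PubChem identifier CID=6579) is reported to have four stable conformers, which contradicts several journal publications that describe only two or three. This revision summary aims to clarify the discrepancy. Through very high precision density functional theory (DFT) calculations, we verify three stable conformers and the three transition state barriers between them, and confirm them with our own DFT calculations. The most stable conformer is a planar molecular structure, commonly called “sys” or “trans” in the literature. A slightly less stable configuration, called “skew”, corresponds to two three-dimensional structures that are degenerate in energy but are mirror images of each other. The paper summarizes vibrational spectra, partial atomic charges, Cartesian coordinates, and intrinsic reaction coordinate paths, and recalculates them with DFT at the wB97XD/Def2TZVPP level for the three stable acrylamide isomers: the lowest-energy sys/trans structure and the two mirror-image skew structures.

Abstract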

Acrylamide, with PubChem identifier CID=6579, is broadcasted to have four stable conformers contrasting with several journal publications characterizing only two or three. In this revision summary the discrepancy is clarified. Through very high precision density functional theory (DFT) calculations, three stable conformers and the three transition state barriers existing between them are verified to exist and validated with our own DFT calculations. The most stable conformer is a planar molecular structure termed “sys” or “trans” in the literature. Meanwhile, a less stable structure termed “skew” pertains to two 3-dimensional structures that are energy-degenerate, but differ in their structure for being mirrored images of each other. Vibrational spectra, partial atomic charges, Cartesian coordinates, and Intrinsic Reaction Coordinate paths are summarized and recalculated with DFT at the wB97XD/Def2TZVPP level for the three stable acrylamide isomers: the sys/trans lowest in energy structure, and the two skew mirrored structures.

Keywords: Acrylamide, Conformers, Density Functional Theory, DFT calculations, Stable conformers, Transition state barriers, Vibrational spectra, Intrinsic Reaction Coordinate

328. ❌ Excited Pfaffians: Generalized Neural Wave Functions Across Structure and State

Authors: Nicholas Gao, Till Grutschus, Frank Noé, Stephan Günnemann Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14515v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

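Scoring rationale: The paper belongs to quantum chemistry and computational physics. It studies neural-network wave functions for variational Monte Carlo, in particular the Excited Pfaffians architecture and the Multi-State Importance Sampling method for excited-state calculations. Its content is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), which target large language model techniques in natural language processing. The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”: the study applies AI for Science to computational chemistry and physics, and shares a scientific-computing character with bioinformatics and cheminformatics, so it receives 10 points (highly relevant, core content).

!!! tip deepseek-chat TL;DR

The paper proposes a neural-network wave function architecture called Excited Pfaffians and a method called Multi-State Importance Sampling for computing multiple excited states of quantum systems efficiently and accurately, and demonstrates their performance on the carbon dimer and the beryllium atom.

Abstract (translated)

Neural-network wave functions in variational Monte Carlo (VMC) have been highly successful at accurately representing ground and excited states. However, reaching sufficient numerical accuracy in the overlaps between states requires the number of Monte Carlo samples to grow with the number of states, which raises the computational cost. This paper presents a nearly constant sample-size approach, Multi-State Importance Sampling (MSIS), which uses samples from all states to estimate pairwise overlaps. To evaluate all states on all samples efficiently, we introduce Excited Pfaffians. Inspired by the Hartree-Fock method, this architecture represents multiple states within a single neural network. Excited Pfaffians can also serve as generalized wave functions, allowing a single model to represent multi-state potential energy surfaces. On the carbon dimer, we match the accuracy of natural excited states, which scale as $O(N_s^4)$, while training more than 200 times faster and modeling 50% more states. Our favorable scaling makes it possible, for the first time, to determine all distinct energy levels of the beryllium atom with neural networks. Finally, we show that a single wave function can represent the excited states of different molecules.

Abstract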

Neural-network wave functions in Variational Monte Carlo (VMC) have achieved great success in accurately representing both ground and excited states. However, achieving sufficient numerical accuracy in state overlaps requires increasing the number of Monte Carlo samples, and consequently the computational cost, with the number of states. We present a nearly constant sample-size approach, Multi-State Importance Sampling (MSIS), that leverages samples from all states to estimate pairwise overlap. To efficiently evaluate all states for all samples, we introduce Excited Pfaffians. Inspired by Hartree-Fock, this architecture represents many states within a single neural network. Excited Pfaffians also serve as generalized wave functions, allowing a single model to represent multi-state potential energy surfaces. On the carbon dimer, we match the $O(N_s^4)$-scaling natural excited states while training $>200\times$ faster and modeling 50% more states. Our favorable scaling enables us to be the first to use neural networks to find all distinct energy levels of the beryllium atom. Finally, we demonstrate that a single wave function can represent excited states across various molecules.

Keywords: Neural-network wave functions, Variational Monte Carlo, Excited states, Excited Pfaffians, Multi-State Importance Sampling, Potential energy surfaces, Quantum chemistry, Computational physics
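
The abstract in this entry describes estimating pairwise overlaps from samples pooled across all states. The sketch below illustrates that general idea with textbook mixture importance sampling on one-dimensional Gaussian wave functions, where the exact overlaps are known in closed form. It is not the paper's MSIS estimator: the toy wave functions and the names `psi` and `mixture_overlaps` are assumptions made for illustration.

```python
import numpy as np

def psi(x, center):
    """Normalized 1D Gaussian wave function centered at `center` (toy assumption)."""
    return np.pi ** -0.25 * np.exp(-0.5 * (x - center) ** 2)

def mixture_overlaps(centers, n_samples=200_000, seed=0):
    """Estimate all pairwise overlaps <psi_i|psi_j> from one pooled sample set."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    n_states = len(centers)
    # Draw every sample from the density |psi_k|^2 of a randomly chosen state k.
    # For these Gaussians, |psi_k|^2 is a normal distribution with variance 1/2.
    which = rng.integers(n_states, size=n_samples)
    x = rng.normal(loc=centers[which], scale=np.sqrt(0.5))
    # Evaluate every state on every sample: shape (n_states, n_samples)
    amplitudes = np.stack([psi(x, c) for c in centers])
    # Density of the pooled samples is the equal-weight mixture of |psi_k|^2
    mixture_density = np.mean(amplitudes ** 2, axis=0)
    # S_ij is the sample mean of psi_i * psi_j / mixture_density
    weighted = amplitudes / np.sqrt(mixture_density)
    return weighted @ weighted.T / n_samples

centers = [0.0, 1.0, 2.5]
estimated = mixture_overlaps(centers)
exact = np.exp(-0.25 * np.subtract.outer(centers, centers) ** 2)
print(np.round(estimated, 3))
print(np.round(exact, 3))
```

Because each sample contributes to every pair of states, the number of samples does not have to grow with the number of pairs, which is the property the abstract highlights.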

329. ❌ Explicit, Machine-Learned Two-Body Potentials for Molecular Simulations

Authors: Kham Lek Chaton, Eric D. Boittier, Mike Devereux, Markus Meuwly Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

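Scoring rationale: The paper develops machine learning potentials for molecular simulation, an application of AI in science (specifically chemistry and biophysics). Its content is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It is only partly related to the last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”: applying machine learning to molecular simulation falls within AI for Science, but it is not a core bioinformatics or cheminformatics application, so it receives 5 points (partly relevant).

!!! tip deepseek-chat TL;DR

The study proposes a hybrid machine-learning/molecular mechanics (ML/MM) pairwise potential for large, heterogeneous condensed-phase systems. It describes intermolecular interactions by combining the PhysNet ML method with a classical MM force field, validates its accuracy on test systems such as dichloromethane and acetone, and points out the limitations of the pairwise approach for systems with significant many-body effects.

Abstract (translated)

This paper introduces a new pairwise hybrid machine-learning/molecular mechanics (ML/MM) potential designed for large, heterogeneous condensed-phase systems. In this potential, the PhysNet machine learning method describes monomers and short-range dimer interactions, while a classical molecular mechanics force field describes pairwise interactions beyond a set switching distance. The models are fitted to MP2-level dimer and pairwise cluster energies, and the quality of each model is assessed at different switching distances and with molecular mechanics approaches that do or do not include a detailed distributed-charge electrostatic description. A basic implementation applied to a small model system demonstrates that the approach is applicable to molecular dynamics simulations. Dichloromethane and acetone serve as test systems, demonstrating the accuracy of the approach in describing pairwise reference data and also revealing the limitations of the pairwise approximation for systems that show significant many-body effects in the condensed phase, which lays the groundwork for introducing a general many-body correction in future work.

Abstract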

A new pairwise hybrid machine-learning/molecular mechanics (ML/MM) potential is introduced that is conceived for application to large, heterogeneous condensed-phase systems. The PhysNet ML method describes monomers and short-range dimer interactions, while a classical MM force field describes pairwise interactions beyond a defined switching distance. Models are fitted to MP2 dimer and pairwise cluster energies, and the quality of each model is assessed at different switching distances and using MM approaches with and without detailed distributed charge electrostatics. The applicability of the approach to molecular dynamics simulations is demonstrated for a basic implementation applied to a small model system. Dichloromethane and acetone are used as test systems to demonstrate the accuracy of the approach in describing pairwise reference data, and also to highlight the limitations of the pairwise approach for systems that exhibit significant many-body effects in condensed phase, paving the way for the addition of a general many-body correction in future work.

Keywords: machine-learning potential, molecular mechanics, ML/MM, pairwise interactions, molecular dynamics simulations, condensed-phase systems, many-body effects, PhysNet
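
The hand-over between a short-range model and a long-range force field at a switching distance can be sketched as follows. Everything here is assumed for illustration: the abstract does not specify the switching function, so a smoothstep blend is used, a Morse-like curve stands in for the ML model, a Lennard-Jones term stands in for the MM force field, and all parameter values are arbitrary.

```python
import numpy as np

def switch(r, r_on, r_off):
    """Smoothstep weight: 1 for r <= r_on, 0 for r >= r_off (assumed form)."""
    t = np.clip((r - r_on) / (r_off - r_on), 0.0, 1.0)
    return 1.0 - t * t * (3.0 - 2.0 * t)

def short_range_model(r):
    """Stand-in for the ML pair energy: a Morse-like curve, not PhysNet."""
    depth, width, r_min = 0.5, 1.2, 3.5
    return depth * (np.exp(-2.0 * width * (r - r_min)) - 2.0 * np.exp(-width * (r - r_min)))

def long_range_model(r):
    """Stand-in for the MM pair energy: a Lennard-Jones term."""
    epsilon, sigma = 0.2, 3.4
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def hybrid_pair_energy(r, r_on=5.0, r_off=6.0):
    """Blend the two models so the energy is continuous across the switching region."""
    s = switch(r, r_on, r_off)
    return s * short_range_model(r) + (1.0 - s) * long_range_model(r)

for distance in (4.0, 5.5, 7.0):
    print(distance, hybrid_pair_energy(distance))
```

Below `r_on` the hybrid energy equals the short-range model, above `r_off` it equals the long-range model, and in between it interpolates smoothly. The width of the switching region is a design choice that trades smoothness of the forces against how far the more expensive short-range model must be evaluated.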

330. ❌ Auto-WHATMD : Automated Wasserstein-based High-dimensional feature extraction Analysis of Trajectories from Molecular Dynamics

Authors: Sosuke Asano, Ikki Yasuda, Katsuhiro Endo, Yoshinori Hirano, Kenji Yasuoka Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

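Scoring rationale: The paper analyzes molecular dynamics trajectories and proposes a feature extraction method (auto-WHATMD) based on the optimal transport distance and simulated annealing for identifying the key residues that distinguish different protein-ligand systems. It belongs to computational biology and bioinformatics and involves an application of AI in science, but it is unrelated to all other keywords (large models, fine-tuning, inference optimization, agents, and so on). It is only partly related to “AI for Science OR Bioinformatics OR Cheminformatics”: it applies AI methods to a bioinformatics problem but does not use core large-model or deep learning techniques, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The paper proposes an automated method called auto-WHATMD that uses the optimal transport distance and simulated annealing to extract key residue features from high-dimensional molecular dynamics trajectories. It identifies the informative residues that distinguish different protein-ligand systems and shows that they correlate with ligand-binding affinity.

Abstract (translated)

Comparing multiple protein systems that differ in, for example, the bound ligand or mutations, and understanding the effects of those differences, is one goal of molecular dynamics simulation. Representing these systems with a small number of features enables quantitative comparison. However, because molecular dynamics trajectories are high-dimensional spatiotemporal data, the choice of key features depends on domain expertise and sometimes introduces subjective assumptions. This paper presents a method that uses the optimal transport distance to compare high-dimensional trajectory data and employs simulated annealing to identify the residues that best distinguish multiple systems. We name the algorithm auto-WHATMD (automated Wasserstein-based High-dimensional feature extraction Analysis for Trajectories of Molecular Dynamics). We applied auto-WHATMD to multiple protein-ligand systems of bromodomain 4 with different ligands and identified the most discriminative residues in the loop region. Moreover, even a small number of selected residues was enough to capture the correlation with ligand-binding affinity, indicating that auto-WHATMD effectively selects the most informative residues. Our method can be used to determine key residues efficiently and to design features for multiple analogous systems.

Abstract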

Comparing multiple protein systems with variation such as different binding ligands or mutations, and understanding their effects is one of the objectives in molecular dynamics simulations. Representation of these systems by a few features enables quantitative comparison. However, because molecular dynamics simulation trajectories are high-dimensional spatiotemporal data, selection of key features relies on domain expertise, sometimes introducing arbitrary assumptions. Here, we present an approach that uses the optimal transport distance to compare high-dimensional trajectory data, and employs simulated annealing to identify the residues that best distinguish multiple systems. We term this algorithm auto-WHATMD (automated Wasserstein-based High-dimensional feature extraction Analysis for Trajectories of Molecular Dynamics). We applied auto-WHATMD to multiple protein-ligand systems of bromodomain 4 with different ligands, identifying the most discriminative residues in the loop region. Moreover, even a few selected residues were sufficient to capture the correlation with ligand-binding affinities, indicating that auto-WHATMD effectively prioritizes the most informative residues. Our approach can be used to efficiently determine key residues and design features for multiple analogous systems.

Keywords: molecular dynamics, trajectory analysis, feature extraction, optimal transport, Wasserstein distance, protein-ligand systems, simulated annealing, bioinformatics
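
The combination of an optimal transport distance with simulated annealing described in this entry can be illustrated with a deliberately simplified sketch. It is not the auto-WHATMD algorithm: it scores each residue with a one-dimensional Wasserstein-1 distance between two synthetic systems and anneals over fixed-size residue subsets to maximize the summed score. The synthetic data, the additive objective, and the names `wasserstein_1d`, `residue_scores`, and `anneal_select` are all assumptions.

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 distance between two equal-sized 1D samples via the sorted-sample formula."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def residue_scores(traj_a, traj_b):
    """Per-residue distance between two (n_frames, n_residues) feature arrays."""
    return np.array([wasserstein_1d(traj_a[:, j], traj_b[:, j])
                     for j in range(traj_a.shape[1])])

def anneal_select(scores, k, n_steps=2000, t_start=1.0, cooling=0.995, seed=0):
    """Simulated annealing over k-residue subsets, maximizing the summed score."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    current = list(rng.choice(n, size=k, replace=False))
    current_val = scores[current].sum()
    best, best_val = list(current), current_val
    temp = t_start
    for _ in range(n_steps):
        # Propose swapping one selected residue for an unselected one
        pos = rng.integers(k)
        candidates = [j for j in range(n) if j not in current]
        trial = list(current)
        trial[pos] = candidates[rng.integers(len(candidates))]
        trial_val = scores[trial].sum()
        delta = trial_val - current_val
        # Metropolis rule: always accept improvements, sometimes accept losses
        if delta >= 0 or rng.random() < np.exp(delta / temp):
            current, current_val = trial, trial_val
            if current_val > best_val:
                best, best_val = list(current), current_val
        temp *= cooling
    return sorted(int(j) for j in best)

# Synthetic stand-in for two systems: 500 frames, 20 residues, three of them shifted
data_rng = np.random.default_rng(1)
system_a = data_rng.normal(size=(500, 20))
system_b = data_rng.normal(size=(500, 20))
system_b[:, [2, 7, 11]] += 2.0
print(anneal_select(residue_scores(system_a, system_b), k=3))
```

With an additive objective like this one, sorting the scores would give the same answer directly. Annealing becomes worthwhile when the subset is scored jointly, for example with a multivariate transport distance over the selected residues, which is closer to the setting the abstract describes.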

331. ❌ Carbon black and hydrogen production from methane pyrolysis: measured and modeled insights from integrated gas and particle diagnostics in shock tubes

Authors: Gibson Clark, Mohammad Adib, Chengze Li, Taylor M. Rault, Jesse W. Streicher, Enoch Dames, M. Reza Kholghy, Ronald K. Hanson Journal/Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.14314v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

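Scoring rationale: The paper studies carbon black and hydrogen formation during methane pyrolysis, which places it in chemical engineering and combustion science. It is unrelated to all of the large-model and deep learning keywords and only weakly related to the last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”: the study involves scientific computing and chemical process modeling but does not use AI methods, so it receives 5 points (partly relevant).

!!! tip deepseek-chat TL;DR

Through shock tube experiments and simulations, the paper studies carbon black and hydrogen formation during methane pyrolysis and provides integrated benchmark data on gas-phase kinetics, particle formation, and nanostructure evolution for improving models of their production.

Abstract (translated)

Methane (CH4) pyrolysis is a promising route for co-producing hydrogen (H2) and carbon black (CB) while avoiding the emissions associated with steam-methane reforming and the furnace black process. Developing models of pyrolytic carbon black synthesis requires simultaneous experimental observations of gas chemistry, particle formation, and morphological evolution. This study combines experiments and simulations to investigate the pyrolysis of 5% CH4/argon mixtures behind reflected shock waves, at post-reflected-shock temperatures (T5) of 1850-2450 K and a pressure P5 of about 4.5 atm. Laser absorption diagnostics quantified the mole fractions of CH4, C2H4, and C2H2, while two-wavelength extinction (633 and 1064 nm) resolved time-dependent particle formation and the evolution of optical maturity with temperature. The simulations reproduce the small-molecule species distributions well, but predictions of polycyclic aromatic hydrocarbons (PAHs) still differ widely among models. Coupled gas-particle simulations capture accurate volume fraction (fv) trends and the influence of gas dynamics, but underpredict induction times at high T5. Samples collected at the shock tube endwall were analyzed by transmission electron microscopy (TEM) to quantify primary particle size distributions and nanostructure arrangement. Image segmentation and manual measurements show that growth in primary particle size (dp) diminishes as T5 increases, while graphitic nanostructure generally increases. By constraining gas-phase kinetics, PAH-driven particle inception, particle dynamics, and particle maturity, this study provides an integrated benchmark for improving models of carbon black and hydrogen production from methane pyrolysis. The results emphasize that accurately partitioning mass between particle number and particle size is an important constraint for future model development.

Abstract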

Methane (CH4) pyrolysis is a promising route to co-produce hydrogen (H2) and carbon black (CB) while avoiding emissions associated with steam-methane reforming and furnace black processes. Model development of pyrolytic CB synthesis requires experimental observations of concurrent gas chemistry, particulate formation, and morphology. This work presents a combined experimental and modeling study of CH4 pyrolysis behind reflected shock waves in 5% CH4/Argon mixtures at post-reflected shock temperatures (T5) of 1850-2450 K and P5 around 4.5 atm. Laser absorption diagnostics quantified CH4, C2H4, and C2H2 mole fractions, while multiwavelength extinction (633 and 1064 nm) resolved time-dependent particle formation and the temperature-dependent evolution of optical maturity. Simulations reproduce small-molecule speciation well, but large variations in predicted polycyclic aromatic hydrocarbons (PAHs) persist among models. Coupled gas-particle simulations capture accurate volume fraction (fv) trends and the influence of gas dynamics but underpredict induction times at high T5. Samples collected at the shock tube endwall were analyzed by transmission electron microscopy (TEM) to quantify primary particle size distributions and nanostructure arrangement. Image segmentation and manual measurements showed reduced primary particle size growth (dp) with increasing T5, while graphitic nanostructure generally increased. This study provides an integrated benchmark for improving models of CB and H2 production from CH4 pyrolysis by constraining gas-phase kinetics, PAH-driven inception, particle dynamics, and particle maturity. The results highlight that accurate partitioning of mass between particle number and particle size is an important constraint for further model development.

Keywords: methane pyrolysis, carbon black, hydrogen production, shock tube, gas-particle diagnostics, polycyclic aromatic hydrocarbons, transmission electron microscopy, model development

332. ❌ The Python Simulations of Chemistry Framework: 10 years of an open-source quantum chemistry project

Authors: Qiming Sun, Matthew R Hermes, Xiaojie Wu, Huanchen Zhai, Xing Zhang, Abdelrahman M. Ahmed, Juan José Aucar, Oliver J. Backhouse, Samragni Banerjee, Peng Bao, Nikolay A. Bogdanov, Kyle Bystrom, Frédéric Chapoton, Ning-Yuan Chen, Ivan Yu. Chernyshov, Helen S. Clifford, Sander Cohen-Janes, Zhi-Hao Cui, Nike Dattani, Linus Bjarne Dittmer, Sebastian Ehlert, Janus Juul Eriksen, Francesco A. Evangelista, Simon A. Ewing, Ardavan Farahvash, Kevin Focke, Yang Gao, Kevin E. Gasperich, Nathan Gillispie, Jonas Greiner, Matthew R. Hennefarth, Jan Hermann, Christopher Hillenbrand, Joonatan Huhtasalo, Basil Ibrahim, Bhavnesh Jangid, Alireza Nejati Javaremi, Andrew J. Jenkins, Yu Jin, Daniel S. King, Derk Pieter Kooi, Henrik R. Larsson, Bryan Tak Gwong Lau, Seunghoon Lee, Susi Lehtola, Chenghan Li, Hao Li, Jiachen Li, Rui Li, Shuhang Li, Aleksandr O. Lykhin, Nastasia Mauger, Pablo del Mazo-Sevillano, Jonathan Moussa, Kousuke Nakano, Verena A. Neufeld, Linqing Peng, Hung Q. Pham, Peter Pinski, Pavel Pokhilko, Zhichen Pu, Yubing Qian, Stephen Jon Quiton, Wanja T. Schulze, Thais R. Scott, Aniruddha Seal, James E. T. Smith, Kori E. Smyser, Terrence Stahl, Chong Sun, Kevin J. Sung, Egor Trushin, Shiv Upadhyay, Ethan A. Vo, Thijs Vogels, Shirong Wang, Tai Wang, Xiao Wang, Xubo Wang, Yuanheng Wang, Mark Williamson, Junjie Yang, Hong-Zhou Ye, Chia-Nan Yeh, Haiyang Yu, Jincheng Yu, Victor Wen-zhe Yu, Chaoqun Zhang, Dayou Zhang, Zijun Zhao, Zehao Zhou, Andrew J. Zhu, Tianyu Zhu, Timothy C. Berkelbach, Laura Gagliardi, Sandeep Sharma, Alexander Sokolov, Garnet Kin-Lic Chan Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.14155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
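
The verdict for this entry follows from the report's scoring settings, where the pass mark is 5.0 + 0.8 × total weight (26.6 for the 27 configured keywords). A minimal Python sketch of that check is given here. The per-row rule `weight * relevance` is an assumption: the report does not state it, and it is only consistent with the rows shown (a weight of 0.0 yields a score of 0.0 even at relevance 5.0/10). The names `pass_mark`, `row_score`, and `total_score` are hypothetical and are not taken from the tool that generated the report.

```python
# Pass mark from the report's scoring settings: 5.0 + 0.8 * total weight.
TOTAL_CONFIGURED_WEIGHT = 27.0  # 27 keywords at weight 1.0 each (report configuration)


def pass_mark(total_weight: float) -> float:
    """Pass mark as defined in the report's scoring settings."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: a row's score is weight * relevance (not stated in the report)."""
    return weight * relevance


def total_score(rows) -> float:
    """Sum the per-row scores of (weight, relevance) pairs."""
    return sum(row_score(w, r) for w, r in rows)


# The table above: 26 rows at relevance 0.0 and one row at 5.0, all with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
score = total_score(rows)
passed = score >= pass_mark(TOTAL_CONFIGURED_WEIGHT)
print(f"{score:.1f} / {pass_mark(TOTAL_CONFIGURED_WEIGHT):.1f} -> {'pass' if passed else 'fail'}")
```

Under this assumed rule, a relevance of 5.0/10 would contribute to the total only if the row's weight were non-zero, which may explain why the entry scores 0.0 despite the partial relevance noted in the rationale.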

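Scoring rationale: This paper is a review of the PySCF quantum chemistry framework, an open-source platform for electronic structure theory and quantum chemical method development. It is unrelated to most of the keywords, which target large language models and deep learning (models, training techniques, inference optimization, and so on), whereas the paper concerns conventional quantum chemistry methods and software. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': quantum chemistry calculations belong to scientific computing and have some connection to cheminformatics, but the paper does not involve artificial intelligence or machine learning methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper reviews the development of the PySCF quantum chemistry framework over the past decade, covering new modules, methodological improvements, infrastructure changes, and performance benchmarks, and presents its progress as an open-source platform for electronic structure theory.

Abstract (Chinese translation)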

过去十年间，基于Python的化学模拟框架（PySCF）已发展成为一个广泛应用于电子结构理论和量子化学方法开发的开源平台。本文回顾了自2020年上一轮综述以来的主要进展，涵盖新模块与方法论、基础设施改进以及性能基准测试。

Abstract

Over the past decade, the Python-based Simulations of Chemistry Framework (PySCF) has developed into a widely used open-source platform for electronic structure theory and quantum chemical method development. This article reviews the major advances since the previous overview in 2020, covering new modules and methodology, infrastructure changes, and performance benchmarks.

Keywords: PySCF, quantum chemistry, electronic structure theory, open-source platform, method development, performance benchmarks, Python framework

333. ❌ Universal method of selective detection of a wide range of pollutants in liquids using conductance quantization

Authors: O. Pospelov, A. Herus, A. Savytskyi, V. Vakula, M. Sakhnenko, N. Kalashnyk, E. Faulques, G. Kamarchuk Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.14140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

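Scoring rationale: This paper studies the use of quantum point-contact sensors for detecting pollutants in liquids. It belongs to sensor technology and quantum physics and is unrelated to nearly all of the large-model and deep-learning keywords. The only possibly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work involves environmental monitoring and chemical detection, which fall under scientific applications; however, the paper does not explicitly use AI methods, so it receives 5 points (some relevance). All other keywords, which concern large-model techniques, training methods, inference optimization, AI agents, and similar topics, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a universal method based on conductance quantization that uses quantum point-contact sensors to detect pollutants in liquids, such as heavy metal ions and organic solvents, achieving trace detection down to the ppb level.

Abstract (Chinese translation)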

现代传感器技术研究的主要目标，是为复杂分子系统的快速分析开发创新的检测方法。本研究表明，基于电导量子化的选择性检测量子机制，可有效用于创建一种通用方法，以检测液态介质中的多种物质，包括重金属和有机溶剂。该方法的有效性通过量子点接触传感器得以验证：此类传感器利用枝状杨森点接触，在循环切换效应期间经历量子转变。实验证明，这些传感器能够检测液态介质中宽浓度范围（包括低至十亿分之几的痕量水平）的铜、锌和铅离子，并能识别有机溶剂（如乙酸）。创新性量子检测原理的应用，为开发全面的下一代设备阵列铺平了道路，为先进的环境监测应用提供了前景广阔的解决方案。

Abstract

The primary objective of research in modern sensor technologies is to develop innovative detection methods for the rapid analysis of complex molecular systems. The present work demonstrates that the quantum mechanism of selective detection, based on conductance quantization, can be effectively employed to create a universal method for detecting a broad spectrum of agents in liquid media, including heavy metals and organic solvents. The efficacy of this approach is illustrated through the use of quantum point-contact sensors, which utilize dendritic Yanson point contacts undergoing quantum transformations during the cyclic switchover effect. These sensors have proven capable of detecting copper, zinc, and lead ions in liquid media across a wide range of concentrations, including trace levels as low as a few parts per billion (ppb). Furthermore, they can identify organic solvents, as demonstrated with acetic acid. The use of innovative quantum detection principles paves the way for the development of a comprehensive array of next-generation devices, offering promising solutions for advanced environmental monitoring applications.

Keywords: conductance quantization, quantum point-contact sensors, pollutant detection, heavy metals, organic solvents, environmental monitoring, Yanson point contacts, trace detection

334. ❌ Nonadiabatic rare events from transition-path sampling of MASH trajectories

Authors: Danial Ghamari, Jeremy O. Richardson Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.14102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

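Scoring rationale: This paper studies a molecular dynamics simulation method for rare nonadiabatic events (combining MASH with transition-path sampling). It belongs to computational chemistry and molecular simulation and is unrelated to all of the large-model and deep-learning keywords (scored 0). The only possibly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the method could be applied to simulating molecular processes in bioinformatics or cheminformatics; however, the paper itself does not explicitly mention AI or deep learning and is purely a computational science method, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a framework that combines the MASH surface-hopping method with transition-path sampling to efficiently simulate rare nonadiabatic reactions in molecular processes, and applies it to the spin-boson model to analyze reaction mechanisms and statistical and dynamical properties.

Abstract (Chinese translation)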

稀有非绝热反应是许多重要分子过程的关键组成部分，但通过直接动力学模拟捕获这些反应具有挑战性。本文中，我们将新近发展的映射表面跳跃方法（MASH）与过渡路径采样相结合，构建了一个高效模拟此类稀有事件的框架。该框架的可行性源于MASH轨迹具有马尔可夫性、时间可逆性并遵循刘维尔定理。这一组合方法能够生成非绝热反应路径，且不对底层动力学施加偏置。所得的路径系综支持对反应机理进行详细分析，并可揭示包括速率常数在内的统计与动力学性质。我们将此方法应用于研究处于热平衡状态的自旋-玻色子模型，并覆盖了广泛的透热耦合强度范围。研究结果表明，该方法为探究稀有非绝热过程提供了一个实用且系统化的工具，其适用范围可能超出暴力模拟所能及的范畴。

Abstract

Rare nonadiabatic reactions are a key component of many important molecular processes but are challenging to capture with direct dynamical simulations. In this paper, we combine our recently developed mapping approach to surface hopping (MASH) with transition-path sampling to create a framework to efficiently simulate these rare events. This is possible because MASH trajectories are Markovian, time-reversible and obey Liouville’s theorem. The combined approach generates nonadiabatic reactive pathways without biasing the underlying dynamics. The resulting ensemble allows for a detailed analysis of reaction mechanisms and the unraveling of statistical and dynamical properties, including rate constants. We apply the method to study a spin-boson model in thermal equilibrium over a wide range of diabatic coupling strengths. Our results demonstrate how this approach provides a practical and systematic tool for investigating rare nonadiabatic processes, potentially beyond the reach of brute-force simulations.

Keywords: nonadiabatic reactions, transition-path sampling, MASH trajectories, rare events, molecular dynamics, spin-boson model, reaction mechanisms, rate constants

335. ❌ Systematically Improvable Numerical Atomic Orbital Basis Using Contracted Truncated Spherical Waves

Authors: Yike Huang, Zuxin Jin, Linfeng Zhang, Mohan Chen, Rui Chen, Ling Li Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.13995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

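Scoring rationale: This paper is in computational materials science. It proposes a new method for constructing numerical atomic orbital basis sets from truncated spherical waves, for solving the Kohn-Sham equation in density functional theory. It is unrelated to nearly all of the keywords (large models, deep learning, training techniques, inference optimization, agents, and so on), which belong to AI and machine learning, whereas the paper is purely a methodological study in computational physics and materials science. The only possibly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work is scientific computing (within the broad scope of AI for Science); however, the paper uses no AI/ML techniques and relies on conventional first-principles methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a new scheme for constructing numerical atomic orbital basis sets from truncated spherical waves for density functional theory calculations. The scheme is systematically improvable, offers better transferability, and achieves satisfactory accuracy for a range of properties of molecular and bulk systems.

Abstract (Chinese translation)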

为在密度泛函理论框架内求解Kohn-Sham方程，我们提出了一种通过截断球面波（Truncated Spherical Waves, TSWs）收缩构建数值原子轨道（Numerical Atomic Orbital, NAO）基组的方法。该收缩方案通过最小化残差空间中动能算符的迹，推广了泄漏最小化方案[M. Chen et al., J. Phys. Condens. Matter 22, 445501 (2010); P. Lin et al., Phys. Rev. B 103, 235131 (2021)]。除了继承先前方案的系统可改进性外，使用TSW而非平面波作为展开基组，能更有效地连接参考态与NAO，并消除了周期镜像间的虚假相互作用，从而通过纳入广泛的参考态实现了更好的可转移性。基准测试表明，所构建的NAO基组对于分子和体相体系的各种性质均能达到令人满意的精度，包括总能量、键长、原子化能、晶格常数、内聚能、带隙以及能级排列。通过纳入未占据态，该方法在描述导带时展现出的改进可转移性被证明是有效且显著的。

Abstract

To solve the Kohn-Sham equation within the framework of density functional theory, we develop a scheme to construct numerical atomic orbital (NAO) basis sets by contracting truncated spherical waves (TSWs). The contraction minimizes the trace of the kinetic operator in the residual space, generalizing the spillage minimizing scheme [M. Chen et al., J. Phys. Condens. Matter 22, 445501 (2010); P. Lin et al., Phys. Rev. B 103, 235131 (2021)]. In addition to the systematic improvability inherited from previous schemes, the use of TSW instead of plane waves as the expansion basis bridges reference states and NAOs more effectively, and eliminates spurious interactions between periodic images, thereby enabling better transferability through the inclusion of extensive reference states. Benchmarks demonstrate that the constructed NAO achieves satisfactory precision for various properties of both molecules and bulk systems, including total energy, bond length, atomization energy, lattice constant, cohesive energy, band gap, and energy-level alignment. By incorporating unoccupied states, the improved transferability in describing the conduction band is demonstrated to be effective and substantial.

Keywords: numerical atomic orbital, truncated spherical waves, density functional theory, Kohn-Sham equation, basis set construction, systematic improvability, transferability, computational materials science

336. ❌ A Primary Unified Geometric Framework of Molecular Reaction Dynamics Based on the Variational Principle

Authors: Xingyu Zhang, Jinke Yu, Qingyong Meng Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.13923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

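Scoring rationale: The paper mainly studies a geometric framework for molecular reaction dynamics based on the variational principle, which belongs to theoretical physical chemistry and computational chemistry. The abstract mentions "artificial intelligence (AI) techniques to build the potential energy surface (PES)" only once, indicating that AI serves as an auxiliary tool for building the potential energy surface; this gives some relevance to "AI for Science" (5 points). However, the core content (variational principle, Schrödinger equation, geometric framework, Hamiltonian, and so on) is unrelated to all other large-model and deep-learning keywords (LLMs, MoE, training methods, inference optimization, alignment, agents, and so on) (0 points). The paper does not address the principles of large-model techniques or any applied innovation in them.

!!! tip deepseek-chat TL;DR

The paper proposes a unified geometric framework for molecular reaction dynamics based on the variational principle, unifying the variational approaches to electronic structure and quantum dynamics by introducing the geometric phase, and discusses extensions of the theory from an optimization perspective.

Abstract (Chinese translation)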

本研究提出了一种基于变分原理的分子反应动力学几何框架，其中必须求解薛定谔方程以“观察”反应如何发生。首先，通过讨论最小作用量原理和山路定理给出数学基础。其次，我们探讨了物理基础，包括推导动能算符的等效原理以及在一般时空中构建势能面的人工智能技术。此外，我们简化了分子系统在弯曲时空中的电磁相互作用，从而能够在非零曲率时空中构建核哈密顿量。这表明通过曲率引入规范场的可能性，例如锥形交叉附近核动能算符中的附加项。第三，单粒子近似为通过变分原理求解薛定谔方程提供了有力的拟设。因此，可以为电子结构或量子动力学构建变分方法。在本工作中，基于先前讨论（《物理化学化学物理》第27卷（2025年），第20397页），我们通过几何描述统一了二者，其中几何相位被自然地引入。最后，基于本理论的优化特性，我们还从优化视角对本理论进行了进一步探讨，包括两条基本假设、生成式人工智能技术、微扰的作用以及优化中的马尔可夫过程。

Abstract

This work describes a geometric framework on molecular reaction dynamics based on the variational principle, where the Schrödinger equation must be solved to "see" how a reaction occurs. First, the mathematical preliminaries are given by discussing the principle of least action and the mountain pass theorem. Second, we discuss the physical preliminaries, including the principle of equivalence for deriving the kinetic energy operator (KEO) and artificial intelligence (AI) techniques to build the potential energy surface (PES) in general spacetime. Moreover, we simplified electromagnetic interactions in curved spacetime within the molecular system and consequently, we are able to construct the nuclear Hamiltonian in nonzero curvature spacetime. This indicates the possibility of introducing gauge fields through the curvature, such as an additional term in the nuclear KEO near a conical intersection. Third, the single-particle approximation provides a powerful ansatz to solve the Schrödinger equation by the variational principle. Thus, one can formulate the variational approaches for either electronic structure or quantum dynamics. In this work, based on previous discussions (Phys. Chem. Chem. Phys. 27 (2025), 20397) we unified them by a geometric description, where the geometric phase is naturally introduced. Finally, due to the optimization characteristic of the present theory, further discussions on the present theory from an optimization perspective are also given, including two postulates, generative AI techniques, the role of perturbation, and Markov processes in optimization.

Keywords: molecular reaction dynamics, variational principle, geometric framework, Schrödinger equation, potential energy surface, nuclear Hamiltonian, geometric phase, optimization

337. ❌ Revealing Hydroxide Ion Transport Mechanisms in Commercial Anion-Exchange Membranes at Nano-Scale from Machine-learned Interatomic Potential Simulations

Authors: Jonas Hänseroth, Muhammad Nawaz Qaisrani, Mostafa Moradi, Karl Skadell, Christian Dreßler Journal/Source: arxiv Published: 2026-03-14 arXiv link: http://arxiv.org/abs/2603.13705v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

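Scoring rationale: This paper studies the transport mechanism of hydroxide ions in anion-exchange membranes using molecular dynamics simulations with machine-learned interatomic potentials, an application of AI in science (specifically chemistry and materials science). It is unrelated to nearly all of the keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on) and has some relevance only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it applies machine learning to a scientific problem (materials simulation); it is not core bioinformatics or cheminformatics, hence 5 points.

!!! tip deepseek-chat TL;DR

Using molecular dynamics simulations with machine-learned interatomic potentials, the study reveals the nanoscale transport mechanism of hydroxide ions in a commercial anion-exchange membrane. It finds that increasing water content forms a hydrogen-bond network that promotes long-range proton transfer, providing atomic-scale insight for optimizing membrane design to improve the efficiency of green hydrogen production.

Abstract (Chinese translation)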

阴离子交换膜中的氢氧根离子传输从根本上限制了碱性水电解制绿氢的效率，但由于模拟离子动力学存在计算挑战，其原子尺度的传输机制仍不甚明晰。鉴于阴离子交换膜能够利用丰富的催化剂实现碱性电解，同时避免使用全氟烷基和多氟烷基材料，深入理解这些体系中氢氧根的传输机制对于推进可持续制氢至关重要。本文中，我们通过采用经微调的机器学习原子间势函数进行大规模分子动力学模拟，对一种商用膜中数十纳秒时间尺度及十几纳米空间尺度内的氢氧根迁移行为提供了原子层面的解析。我们发现，增加水含量会使孤立的水簇转变为连通的氢键网络，从而实现长程质子转移。在干燥条件下，氢氧根离子被束缚在带正电的基团附近，传输严重受阻；而充分水合的膜则表现出延展的质子迁移，其扩散系数接近稀水溶液中的数值。模拟结果再现了扩散系数与活化能的实验变化趋势。我们的研究建立了纳米尺度结构与宏观传输之间的直接联系。除机理认知外，所提出的模拟框架能够实现膜化学与结构的预测性、模拟引导的优化，为理性设计更高效的绿氢技术用阴离子交换膜开辟了道路。

Abstract

Hydroxide ion transport in anion-exchange membranes fundamentally limits the efficiency of alkaline water electrolysis for green hydrogen production, yet the atomic-scale transport mechanisms remain poorly understood due to the computational challenges associated with modeling ion dynamics. Given that anion-exchange membranes enable alkaline electrolysis with abundant catalysts while avoiding perfluoroalkyl and polyfluoroalkyl materials, a deeper mechanistic understanding of hydroxide transport in these systems is essential for advancing sustainable hydrogen production. Here, we show that large-scale molecular dynamics simulations with fine-tuned machine-learned interatomic potentials provide atomistic insight into hydroxide mobility in a commercial membrane over tens of nanoseconds and over ten nanometer. We find that increasing water content transforms isolated water clusters into a connected hydrogen-bond network that enables long-range proton transfer. Under dry conditions hydroxide ions are trapped near positively charged groups and transport is strongly hindered, whereas well-hydrated membranes exhibit extended proton migration and diffusion coefficients approaching those of dilute aqueous solutions. The simulations reproduce experimental trends in diffusion and activation energies. Our results establish a direct link between nano-scale structure and macroscopic transport. Beyond mechanistic insight, the presented simulation framework enables predictive, simulation-guided optimization of membrane chemistry and architecture, opening a pathway toward the rational design of more efficient anion-exchange membranes for green hydrogen technologies.

Keywords: hydroxide ion transport, anion-exchange membranes, machine-learned interatomic potentials, molecular dynamics simulations, green hydrogen production, proton transfer, water content, nano-scale structure

Token usage

Total: 1,073,167 tokens (input 728,109 / output 345,058)

Model	Input	Output	Total
deepseek-chat	597,591	330,612	928,203
glm-4.7	130,518	14,446	144,964
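
The totals in this table can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch that re-adds the per-model counts copied from the table and compares them with the reported totals; the helper name `summarize` and the dictionary layout are illustrative choices, not part of the reporting tool.

```python
# Per-model token counts copied from the table above: (input, output, total).
usage = {
    "deepseek-chat": (597_591, 330_612, 928_203),
    "glm-4.7": (130_518, 14_446, 144_964),
}

# Totals as stated in the summary line of the report.
reported = {"input": 728_109, "output": 345_058, "total": 1_073_167}


def summarize(per_model):
    """Add up input and output tokens across all models."""
    total_in = sum(i for i, _, _ in per_model.values())
    total_out = sum(o for _, o, _ in per_model.values())
    return {"input": total_in, "output": total_out, "total": total_in + total_out}


computed = summarize(usage)

# Each row's total should equal its own input + output.
rows_consistent = all(i + o == t for i, o, t in usage.values())

print("totals match report:", computed == reported)
print("rows consistent:", rows_consistent)
```

Both checks hold for the figures shown, so the summary line and the per-model rows agree with each other.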

