📊 arXiv Research Report (2026-03-20)

Generated: 2026-03-20 09:29:48
Data source: arXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing-score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6 (5.0 + 0.8 × 27.0)
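The scoring settings above can be checked with a short sketch. It assumes, based on the score tables in this report, that a keyword's score is its weight times its relevance (0 to 10), capped at the per-keyword maximum; the function names are illustrative and not part of the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum of per-keyword scores; rows are (weight, relevance) pairs.

    Assumption: score = weight * relevance, capped at max_per_keyword.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


weights = [1.0] * 27                      # 27 keywords, each with weight 1.0
threshold = passing_score(sum(weights))   # 5.0 + 0.8 * 27.0

# Non-zero relevance values from the CODMAS score table in this report.
codmas_rows = [(1.0, r) for r in (10, 8, 8, 8, 10, 8, 10)]
codmas_total = paper_score(codmas_rows)

print(round(threshold, 1), codmas_total, codmas_total >= threshold)
```

With the report's configuration this reproduces the 26.6 threshold and the 62.0 total listed for the first paper.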

📈 Paper statistics

Total fetched: 302 papers
Passing papers: 9 (3.0%)
Deep analyses: 4

⭐ Detailed analysis of passing papers

1. CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

Authors: Che-Ming Chang, Prashanth Vijayaraghavan, Ashutosh Jadhav, Charles Mackin, Vandana Mukherjee, Hsinyu Tsai, Ehsan Degan
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.17204v1

Score: 62.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

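Scoring rationale: CODMAS proposes an LLM-based multi-agent collaborative framework for automated RTL code optimization. The core relevant keywords are: (1) “Large Language Models” (highly relevant; the paper explicitly uses LLMs); (2) “LLM Agents” and “Multi-agent Systems” (highly relevant; the framework has multiple agents working together); (3) “Chain of Thought”, “System 2 Thinking”, and “Self-Correction” (partly relevant; the framework involves stepwise reasoning, deep reflection, and reconciliation of deviations); (4) “Tool Use” (partly relevant; agents that generate and evaluate code can be viewed as using tools). Other keywords such as MoE, quantization, and RAG are unrelated to the paper.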

!!! tip deepseek-chat TL;DR

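The paper proposes CODMAS, an LLM-based multi-agent collaborative system that automates RTL code optimization. It achieves a roughly 25% reduction in critical path delay on pipelining tasks and a roughly 22% reduction in power on clock-gating tasks.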

Abstract (Chinese translation)

优化寄存器传输级（RTL）代码是电子设计自动化（EDA）中提升功耗、性能和面积（PPA）的关键步骤。本文提出CODMAS（基于辩证多智能体系统的协同优化框架），该框架结合结构化辩证推理、领域感知的代码生成与确定性评估，以实现RTL优化的自动化。CODMAS的核心包含两个辩证智能体：其一为“阐述者”（Articulator），其设计灵感源于橡皮鸭调试法，负责阐述逐步转换计划并揭示潜在假设；其二为“假设协作者”（Hypothesis Partner），负责预测结果并调和预期行为与实际行为之间的偏差，以指导针对性改进。这些智能体驱动一个领域专用编码智能体（Domain-Specific Coding Agent, DCA）生成具备架构感知能力的Verilog代码修改，并引导一个代码评估智能体（Code Evaluation Agent, CEA）验证语法、功能及PPA指标。我们同时提出了RTLOPT基准测试集，包含120组针对流水线与时钟门控转换的Verilog三元组（未优化版本、优化版本、测试平台）。在专有及开源大语言模型上，CODMAS在流水线优化中实现了关键路径延迟降低约25%，在时钟门控优化中实现了功耗降低约22%，同时相较于强提示方法与智能体基线，显著减少了功能错误与编译失败案例。这些结果表明，结构化的多智能体推理能够显著增强自动化RTL优化能力，并有望扩展至更复杂的设计与更广泛的优化任务。

Abstract

Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

Keywords: Multi-agent Systems, LLM Agents, RTL Optimization, Electronic Design Automation, Dialectic Reasoning, Code Generation, Verilog, Power Performance Area

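Deep analysis:

CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

Summary:

The paper addresses the challenge of optimizing Register Transfer Level (RTL) code in Electronic Design Automation (EDA) and proposes the CODMAS framework. The framework combines structured dialectic reasoning, domain-aware code generation, and deterministic evaluation to automate RTL optimization. At its core are two dialectic agents: the Articulator drafts transformation plans and exposes latent assumptions, and the Hypothesis Partner predicts outcomes and reconciles deviations. Together they direct a Domain-Specific Coding Agent (DCA) that generates Verilog code, which a Code Evaluation Agent (CEA) then verifies. The paper also introduces the RTLOPT benchmark of 120 Verilog triples. Experiments show that CODMAS reduces critical path delay by about 25% in pipelining and power by about 22% in clock gating, while also reducing functional and compilation failures.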

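Innovations:

Proposes the CODMAS multi-agent framework, which brings dialectic reasoning (Articulator and Hypothesis Partner) into the RTL optimization loop and separates design articulation from hypothesis generation.
Builds the RTLOPT benchmark of 120 verified Verilog triples (unoptimized, optimized, testbench), filling a gap in evaluation data for RTL optimization.
Implements a closed feedback loop that combines domain-specific knowledge injection with deterministic evaluation (syntax, functionality, PPA), improving performance while preserving functional correctness.
Validates structured multi-agent reasoning for automated hardware design optimization, outperforming single-agent and strong-prompting baselines.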

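Method

!!! info

The paper adopts a multi-agent collaboration methodology. It first builds the RTLOPT dataset for evaluation. The CODMAS framework comprises four agents: the Articulator (planning), the Hypothesis Partner (prediction and diagnosis), the DCA (code generation), and the CEA (code evaluation). The workflow is iterative: the Articulator drafts a plan -> the Hypothesis Partner predicts the outcome -> the DCA generates code -> the CEA runs simulation and synthesis with Icarus Verilog and ABC -> the results are fed back to the agents for correction.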
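The iterative workflow described above can be sketched as a plain control loop. This is a minimal illustration, not the authors' implementation: the four agents are passed in as ordinary callables, the stub agents below are hypothetical stand-ins for LLM calls and EDA tools, and a single delay number stands in for the full set of PPA metrics.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    compiles: bool
    functional: bool
    delay_ns: float  # stand-in for the PPA metrics reported by the CEA


def optimize(rtl, articulate, hypothesize, generate, evaluate, max_rounds=3):
    """Plan -> predict -> generate -> evaluate -> feed back, keeping the best valid design."""
    best_code, best_eval = rtl, evaluate(rtl)
    feedback = ""
    for _ in range(max_rounds):
        plan = articulate(best_code, feedback)   # Articulator: stepwise plan
        predicted = hypothesize(plan)            # Hypothesis Partner: expected outcome
        candidate = generate(best_code, plan)    # DCA: candidate Verilog edit
        result = evaluate(candidate)             # CEA: deterministic checks
        if result.compiles and result.functional and result.delay_ns < best_eval.delay_ns:
            best_code, best_eval = candidate, result
        # Deviation between prediction and measurement guides the next round.
        feedback = f"predicted {predicted:.2f} ns, measured {result.delay_ns:.2f} ns"
    return best_code, best_eval


# Hypothetical stub agents so the loop can run without an LLM or EDA tools.
def stub_articulate(code, feedback):
    return f"insert one pipeline stage ({feedback or 'first attempt'})"

def stub_hypothesize(plan):
    return 7.5

def stub_generate(code, plan):
    return code + "\n// pipelined"

def stub_evaluate(code):
    return Evaluation(True, True, 10.0 / (1 + code.count("// pipelined")))


final_code, final_eval = optimize(
    "module m; endmodule",
    stub_articulate, stub_hypothesize, stub_generate, stub_evaluate,
    max_rounds=2,
)
print(final_eval)
```

Keeping the best valid design across rounds means a candidate that fails to compile or fails the testbench never replaces a working one, which matches the role the report assigns to deterministic evaluation.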

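Key results:

In pipelining optimization, critical path delay is reduced by about 25%.
In clock-gating optimization, power is reduced by about 22%.
Functional and compilation failure rates are lower than those of strong-prompting and single-agent baselines.
Separating the dialectic roles (Articulator and Hypothesis Partner) is shown to be necessary for convergence and for the performance gains.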

Tech stack: large language models (LLMs, e.g., DeepSeek), multi-agent system, dialectic reasoning, Icarus Verilog (simulator), ABC (synthesis tool), Verilog (hardware description language), PPA evaluation (power, performance, area)

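Strengths

A novel multi-agent architecture that turns the idea of rubber-duck debugging into an agent role, making the reasoning more interpretable and structured.
A high-quality benchmark dataset, RTLOPT, that supports reproducible research.
High practical value: it targets a real pain point in EDA and delivers quantified gains on PPA metrics.
A closed feedback loop that pairs LLM generation with deterministic verification by conventional EDA tools, improving robustness.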

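Limitations

The dataset is relatively small (120 triples), which may limit generalization to large or highly complex designs.
The work currently covers only pipelining and clock gating; other RTL optimizations (such as retiming and resource sharing) remain to be addressed.
Evaluation depends on external EDA tools (Icarus Verilog, ABC) and may be limited by their accuracy or speed.
Multi-agent interaction may incur high compute cost and inference latency.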

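Relevance to the research direction:

The paper is highly relevant. It falls under “applications of large models in different domains” (specifically EDA and chip design) and also touches on “innovation in the technical principles of large models” (multi-agent collaboration, dialectic reasoning framework). It demonstrates a concrete application of LLMs in scientific computing and engineering design, with high technical novelty and practical value.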

2. Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

Authors: Mohsen Arjmandi
Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17683v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

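Scoring rationale: The paper's core contribution is Sensi, an LLM game-agent architecture, and it covers test-time learning, curriculum learning, multi-agent coordination, context-window control, and hallucination mitigation. It is highly relevant to “LLM Agents” and “Large Language Models” (10 points each) and partly relevant to agent-related subtopics, namely “Multi-agent Systems”, “Self-Correction”, “Context Window Extension”, “Chain of Thought”, “System 2 Thinking”, “Hallucination Mitigation”, and “In-context Learning” (5 points each). Other keywords such as MoE, quantization, and AI for Science are not covered (0 points).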

!!! tip deepseek-chat TL;DR

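The paper proposes Sensi, an LLM agent architecture that uses curriculum learning and a two-player design separating perception from action to address inefficient test-time learning in game environments. It achieves a 50-94x gain in sample efficiency but finds that perception-layer hallucination is the main bottleneck.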

Abstract (Chinese translation)

部署于未知环境的大语言模型（LLM）智能体必须在测试时学习任务结构，但现有方法需要数千次交互才能形成有效假设。本文提出Sensi——一种面向ARC-AGI-3游戏挑战的LLM智能体架构，通过三重机制实现结构化测试时学习：（1）将感知与行动分离的双智能体架构，（2）由外部状态机管理的课程学习系统，（3）作为控制平面的数据库，使智能体的上下文窗口可通过编程方式引导。我们进一步引入配备动态生成评估标准的LLM-as-judge（大语言模型作为裁判）组件，用以判定智能体何时对当前主题学习充分并推进至下一阶段。通过两个迭代版本报告实验结果：Sensi v1仅采用双智能体架构即解决2个游戏关卡，而Sensi v2增加课程学习后虽未解决任何关卡（0个），但仅用约32次行动尝试即完成全部学习课程，其样本效率达到需1600-3000次尝试的同类系统的50-94倍。我们精确诊断其失效模式为源于感知层的自洽幻觉级联，证明架构瓶颈已从学习效率转移至感知 grounding（感知锚定）——这成为一个更易处理的问题。

Abstract

Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agent's context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels - but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding - a more tractable problem.

Keywords: LLM agents, test-time learning, curriculum learning, two-player architecture, context window control, sample efficiency, hallucination mitigation, game-playing

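Deep analysis:

Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

Summary:

The paper addresses the low learning efficiency of LLM agents in unknown environments and proposes the Sensi architecture. A two-player design (observer and actor) separates perception from action, complemented by a curriculum-based learning system, an external state machine, and a “database-as-control-plane” mechanism. Sensi v1 solves 2 game levels with highly reproducible results. Sensi v2 solves no levels but completes its learning curriculum in about 32 attempts, a 50-94x improvement in sample efficiency over existing systems. The study also diagnoses the failure as a self-consistent hallucination cascade originating in the perception layer, indicating that the bottleneck has shifted from learning efficiency to perceptual grounding.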

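Innovations:

Two-player architecture: separates perception (observer) from action (actor), with the two communicating through a structured list of hypotheses, which decouples the cognitive roles.
Curriculum learning system: an external state machine manages an ordered queue of learning goals and accumulates verified knowledge as facts, giving the learning process structure.
Database-as-control-plane: the agent's cognitive state is stored in a SQLite database, so the context window can be steered and populated programmatically.
LLM-as-judge: learning progress is assessed against dynamically generated rubrics, with measurement and scoring performed by separate LLM calls.
Sample-efficiency gain: a 50-94x improvement in sample efficiency, together with a precise diagnosis of the perception-layer hallucination cascade as the failure mode.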

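Method

!!! info

The paper uses an LLM-based agent architecture tested in the ARC-AGI-3 environment. The main technical steps are: 1. build a two-player system that handles observation and decision-making separately; 2. design a curriculum learning mechanism in which a state machine controls learning progress; 3. use a SQLite database as the control plane that manages the agent's memory and context; 4. apply an LLM-as-judge evaluation mechanism with dynamically generated rubrics; 5. use ChatGPT 5.1 as the backbone model for reasoning and decision-making.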
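The database-as-control-plane idea can be sketched with Python's built-in `sqlite3` module. This is an illustrative assumption about how such a design might look, not Sensi's actual schema: the table names, column names, topics, and helper functions are all hypothetical, and the LLM-as-judge verdict is reduced to a boolean.

```python
import sqlite3


def init_db() -> sqlite3.Connection:
    """Create an in-memory database holding the curriculum queue and verified facts."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE curriculum (position INTEGER PRIMARY KEY, topic TEXT, done INTEGER DEFAULT 0);
        CREATE TABLE facts (topic TEXT, statement TEXT);
        """
    )
    return db


def current_topic(db):
    """State machine: the current goal is the first unfinished curriculum entry."""
    row = db.execute(
        "SELECT topic FROM curriculum WHERE done = 0 ORDER BY position LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def record_fact(db, topic, statement):
    db.execute("INSERT INTO facts VALUES (?, ?)", (topic, statement))


def advance(db, judge_passed: bool):
    """Mark the current topic as learned only when the judge approves."""
    topic = current_topic(db)
    if topic is not None and judge_passed:
        db.execute("UPDATE curriculum SET done = 1 WHERE topic = ?", (topic,))


def build_context(db) -> str:
    """Assemble the prompt context from database state rather than from chat history."""
    facts = [s for (s,) in db.execute("SELECT statement FROM facts ORDER BY rowid")]
    lines = ["Verified facts:"] + [f"- {fact}" for fact in facts]
    lines.append(f"Current learning goal: {current_topic(db) or 'curriculum complete'}")
    return "\n".join(lines)


db = init_db()
db.executemany(
    "INSERT INTO curriculum (position, topic) VALUES (?, ?)",
    [(1, "controls"), (2, "win condition")],
)
record_fact(db, "controls", "Action 1 moves the player up")
advance(db, judge_passed=True)
context = build_context(db)
print(context)
```

Because the prompt is rebuilt from the database on every turn, changing a row changes what the agent sees next, which is one plausible reading of "programmatically steerable" in the abstract.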

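Key results:

Sensi v1 solves 2 game levels, and pass@10 equals pass@1, indicating highly reproducible behavior.
Sensi v2 completes its entire learning curriculum in about 32 action attempts, 50-94x more sample-efficient than baseline systems that need 1600-3000 attempts.
Sensi v2 solves no levels (0 solved levels), although it completes the learning process.
The failure mode is diagnosed as a self-consistent hallucination cascade in the perception layer: errors in visual difference detection propagate and yield a game model that is wrong but internally consistent.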
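The 50-94x range follows directly from the attempt counts quoted above, as this small check shows (the variable names are illustrative):

```python
sensi_attempts = 32                 # attempts Sensi v2 needs to finish its curriculum
baseline_attempts = (1600, 3000)    # attempts reported for comparable systems

speedups = [attempts / sensi_attempts for attempts in baseline_attempts]
print(speedups)  # [50.0, 93.75]
```

The upper bound of 93.75 is rounded to 94 in the paper's headline figure. The ratio measures interaction count only, not solved levels.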

Tech stack: ChatGPT 5.1 (backbone LLM), SQLite (database-as-control-plane), DSPy (prompt programming framework), POMDP (partially observable Markov decision process), LLM-as-judge, state machine, frame differencing

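Strengths

Very high sample efficiency: far fewer game interactions are needed for learning than with existing methods.
Architectural novelty: the database-as-control-plane and curriculum learning make the LLM agent's behavior more modular and controllable.
Thorough failure analysis: the paper reports negative results honestly and pinpoints perceptual grounding as the current bottleneck, giving follow-up work a clear direction.
Reproducibility: v1 produces deterministic results, reducing the effect of the LLM's inherent randomness.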

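Limitations

Low task completion: Sensi v2 learns efficiently but solves no game levels, so actual task performance is poor.
Fragile perception layer: the architecture depends heavily on the LLM's visual perception, and hallucinations in that layer cause the whole system to fail.
System complexity: each turn needs up to 5 LLM calls, so compute cost and latency may be high despite the sample efficiency.
Dependence on a specific model: the experiments rely on ChatGPT 5.1, whose visual reasoning limits directly cap system performance.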

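Relevance to the research direction:

The paper is highly relevant to “innovation in the technical principles of large models”. It proposes new designs for test-time learning, in-context learning, and agent architecture (two-player design, database-as-control-plane). Although the application is game playing (ARC-AGI-3), its core ideas for improving sample efficiency and structuring learning are a useful reference for applying large models in unknown environments such as scientific discovery. The paper is novel and technically deep, which meets the criteria for a high score.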

3. Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarit

Authors: Himadri Samanta
Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17765v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

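Scoring rationale: The paper proposes a retrieval-augmented generation (RAG) system for drafting chest X-ray impressions, aimed at the hallucination and missing clinical grounding of generative models in medicine. It is therefore highly relevant to “Retrieval-Augmented Generation” (10 points) and to “AI for Science” (10 points), since it applies large models in a biomedical domain (radiology). The paper mentions using large language models (LLMs) for report generation, so it is relevant to “Large Language Models” (8 points). The system retrieves historical reports to keep outputs factually consistent, which directly targets “Hallucination Mitigation” (8 points). Its outputs are interpretable with traceable citations, giving partial relevance to “Explainable AI” (5 points). Other keywords such as MoE, SFT, RLHF, and quantization are not mentioned in the abstract or are unrelated to the core content, so they score 0.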

!!! tip deepseek-chat TL;DR

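The study addresses hallucination and missing clinical grounding in automated radiology report generation. It proposes a multimodal retrieval-augmented generation (RAG) system that combines image-text embeddings with case-based similarity retrieval to draft chest X-ray impressions backed by citations. Experiments show that multimodal fusion improves retrieval over image-only retrieval, with Recall@5 above 0.95 on clinically relevant findings, and that the cited outputs are more trustworthy.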

Abstract (Chinese translation)

随着深度学习与大语言模型的兴起，自动化放射学报告生成日益受到关注。然而，完全生成式方法常出现幻觉问题且缺乏临床依据，限制了其在实际工作流程中的可靠性。本研究提出一种多模态检索增强生成（RAG）系统，用于胸部X光影像印象的基于依据的草稿生成。该系统结合对比式图文嵌入、基于病例的相似性检索与引用约束的草稿生成机制，确保与历史放射学报告的事实对齐。我们使用MIMIC-CXR数据集的精选子集构建多模态检索数据库：图像嵌入通过CLIP编码器生成，文本嵌入则源自结构化的印象章节。通过FAISS索引实现可扩展的最近邻检索，构建了融合相似度框架。检索到的病例用于构建基于依据的提示，以生成印象草稿，并通过安全机制强制实施引用覆盖和基于置信度的拒绝生成。实验结果表明，与纯图像检索相比，多模态融合显著提升了检索性能，在临床相关发现上Recall@5超过0.95。基于依据的草稿生成流程可产生具有明确引用溯源性的可解释输出，相较于传统生成方法显著提升了可信度。本研究凸显了检索增强多模态系统在可靠临床决策支持与放射学工作流程增强方面的潜力。

Abstract

Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports. A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal. Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation.

Keywords: retrieval-augmented generation, radiology report generation, multimodal retrieval, hallucination mitigation, clinical grounding, chest radiograph, MIMIC-CXR, citation-constrained generation
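The fused image-text retrieval and the Recall@5 metric mentioned in the abstract can be illustrated with a small pure-Python sketch. Several things here are assumptions, since the abstract does not give them: the fusion rule (a weighted sum of cosine similarities with weight `alpha`), the availability of both an image and a text embedding for the query, and the toy vectors. A brute-force scan stands in for the FAISS index, and the vectors stand in for CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fused_top_k(query_img, query_txt, cases, k=5, alpha=0.5):
    """Rank cases by alpha * image similarity + (1 - alpha) * text similarity.

    cases is a list of (case_id, image_vector, text_vector) tuples.
    """
    scored = []
    for case_id, img_vec, txt_vec in cases:
        score = alpha * cosine(query_img, img_vec) + (1 - alpha) * cosine(query_txt, txt_vec)
        scored.append((score, case_id))
    scored.sort(reverse=True)
    return [case_id for _, case_id in scored[:k]]


def recall_at_k(retrieved, relevant):
    """Fraction of relevant case ids that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Toy 2-D vectors; real embeddings would have hundreds of dimensions.
cases = [
    ("a", [1.0, 0.0], [1.0, 0.0]),
    ("b", [0.0, 1.0], [0.0, 1.0]),
    ("c", [1.0, 1.0], [1.0, 0.0]),
]
top = fused_top_k([1.0, 0.0], [1.0, 0.0], cases, k=2)
recall = recall_at_k(top, relevant=["a", "c"])
print(top, recall)
```

In this toy example case "c" is only a moderate image match but an exact text match, so fusion ranks it above "b"; this is the kind of effect the abstract attributes to multimodal fusion.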

4. EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17808v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

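Scoring rationale: The paper proposes the EVA framework, which aligns video world models through reinforcement-learning post-training so that they generate physically executable robot actions. Core relevant keywords: (1) “Post-training” OR “Supervised Fine-tuning” OR “SFT” (10 points): the paper explicitly uses a reinforcement-learning post-training framework. (2) “RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO” (10 points): it uses reinforcement learning (RL) for alignment, which the scorer treats as an RLHF-related technique. (3) “World Models” AND “General World Models” (10 points): the paper focuses on video world models for robot control. (4) “Instruction Tuning” OR “Alignment” OR “Value Alignment” (8 points): it involves model alignment to improve executability. Other keywords such as LLMs, MoE, and RAG are unrelated to the paper and score 0.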

!!! tip deepseek-chat TL;DR

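The paper addresses the mismatch between the visual rollouts generated by video world models and the actions a robot can physically execute. It proposes EVA, a reinforcement-learning post-training framework that aligns the model using an inverse-dynamics reward, reducing embodiment-specific artifacts and improving downstream task success.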

Abstract (Chinese translation)

视频生成模型正日益被用作机器人的世界模型，其中模型根据当前观测和任务指令生成未来的视觉推演序列，而逆动力学模型则将生成的帧转换为可执行的机器人动作。然而，当前的视频世界模型缺乏显式的可执行性约束。这导致视觉连贯的推演序列仍可能违反刚体运动学一致性，在通过逆动力学模型解码时产生不稳定或不可行的控制指令。我们将这种视觉生成与物理可执行控制之间的不匹配称为可执行性差距。虽然可通过拒绝采样等推理时技术缓解该差距，但由于视频生成成本高昂，此类方法效率低下。本文利用可执行性差距作为训练信号，提出可执行视频对齐——一种用于对齐视频世界模型的强化学习后训练框架。该框架在真实机器人轨迹上训练逆动力学模型，并将其重新用作奖励模型，通过生成视频所诱导的动作序列来评估视频质量：奖励由速度、加速度和急动度度量的平滑运动，同时惩罚违反本体约束的动作。值得注意的是，即使生成视频存在严重视觉伪影，该奖励仍能提供有效信息，因为此类伪影通常对应不稳定或越界的动作。在RoboTwin基准测试和真实双臂机器人上的实验表明，该方法能减少生成推演中特定于本体的伪影，并提升下游任务执行成功率。

Abstract

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.

Keywords: video world models, reinforcement learning, post-training, alignment, inverse dynamics, robot actions, executability gap, EVA
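The abstract describes a reward that favors smooth motion (measured by velocity, acceleration, and jerk) and penalizes actions outside embodiment limits. A minimal sketch of such a reward is shown below. The exact functional form, the weights, the time step, and the use of joint limits as the embodiment constraint are all assumptions made for illustration; the action sequence is taken as given rather than decoded from video by an inverse dynamics model.

```python
def finite_diff(seq, dt):
    """First-order finite difference of a sequence of equal-length vectors."""
    return [[(b - a) / dt for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]


def mean_sq(seq):
    """Mean squared magnitude over all entries; 0.0 for an empty sequence."""
    values = [x * x for row in seq for x in row]
    return sum(values) / len(values) if values else 0.0


def executability_reward(actions, lower, upper, dt=0.1,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Higher is better: penalize non-smooth motion and out-of-bound actions."""
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    smooth_cost = w_vel * mean_sq(vel) + w_acc * mean_sq(acc) + w_jerk * mean_sq(jerk)
    # Total distance by which any action component leaves its [lower, upper] range.
    violation = sum(
        max(lo - x, 0.0) + max(x - hi, 0.0)
        for row in actions
        for x, lo, hi in zip(row, lower, upper)
    )
    return -(smooth_cost + w_bound * violation)


limits_lo, limits_hi = [-1.0], [1.0]
smooth = [[0.0], [0.1], [0.2], [0.3], [0.4]]
jerky = [[0.0], [0.4], [-0.1], [0.5], [0.0]]
out_of_bounds = [[0.0], [0.1], [0.2], [0.3], [1.5]]

r_smooth = executability_reward(smooth, limits_lo, limits_hi)
r_jerky = executability_reward(jerky, limits_lo, limits_hi)
r_oob = executability_reward(out_of_bounds, limits_lo, limits_hi)
print(r_smooth > r_jerky, r_smooth > r_oob)
```

Scoring the induced actions rather than the pixels is what lets the reward stay informative when the video has visual artifacts, as the abstract notes: artifacts tend to show up as jerky or out-of-range actions.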

Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen
Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17372v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

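Scoring rationale: The paper studies jailbreak attacks and safety alignment in vision-language models (VLMs). It is highly relevant to “Large Language Models” (VLMs are a kind of large model). “Instruction Tuning” OR “Alignment” OR “Value Alignment” is the core topic (the paper studies why safety alignment fails). “Mechanistic Interpretability” OR “Explainable AI” is the core method (the jailbreak mechanism is explained through representation-space analysis). “Hallucination Mitigation” OR “Factuality” OR “Truthfulness” is partly relevant (the work touches on safety and truthfulness). The remaining keywords are not covered by the paper.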

!!! tip deepseek-chat TL;DR

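The paper finds that jailbreaks of vision-language models stem from a representation shift induced by the visual modality, and proposes a defense that improves model safety by removing the jailbreak-related component of that shift.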

Abstract (Chinese translation)

大型视觉语言模型（VLMs）在整合视觉模态后，其安全对齐性常出现弱化现象。即使文本提示包含明确的有害意图，添加图像仍会显著提高越狱成功率。本文发现，在表征空间中，VLMs能够清晰区分良性输入与有害输入。此外，即使在有害输入中，越狱样本也会形成一种与拒绝样本明显分离的独特内部状态。这些观察表明，越狱行为并非源于模型未能识别有害意图，而是视觉模态将表征推向特定的越狱状态，从而导致未能触发拒绝机制。为量化这一转变，我们识别出一个越狱方向，并将图像诱导的表征偏移沿此方向的分量定义为越狱相关偏移。分析表明，越狱相关偏移能可靠地表征越狱行为，为多样化的越狱场景提供了统一解释。最后，我们提出一种防御方法，通过在推理时移除越狱相关偏移（JRS-Rem）来增强VLM的安全性。实验证明，JRS-Rem能在多种场景下提供强效防御，同时保持良性任务上的性能。

Abstract

Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.

Keywords: Vision-Language Models, Jailbreak Attacks, Safety Alignment, Representation Shift, Defense Method, Visual Modality, Internal State Analysis, JRS-Rem
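The abstract defines the jailbreak-related shift as the component of the image-induced representation shift along a jailbreak direction, and the defense removes that component at inference time. The vector arithmetic can be sketched as follows. Two parts are assumptions not stated in the abstract: the direction is estimated here as the normalized difference between the mean jailbreak state and the mean refusal state, and the image-induced shift is taken as the hidden state with the image minus the hidden state for the text alone. The toy 2-D vectors are for illustration only.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def jailbreak_direction(jailbreak_states, refusal_states):
    """Assumed estimator: normalized difference of class means."""
    mj, mr = mean_vector(jailbreak_states), mean_vector(refusal_states)
    return unit([a - b for a, b in zip(mj, mr)])


def remove_jailbreak_shift(h_with_image, h_text_only, direction):
    """Subtract the component of the image-induced shift along the direction.

    Returns the corrected hidden state and the scalar jailbreak-related shift.
    """
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    jrs = sum(s * d for s, d in zip(shift, direction))
    corrected = [h - jrs * d for h, d in zip(h_with_image, direction)]
    return corrected, jrs


direction = jailbreak_direction(
    jailbreak_states=[[2.0, 0.0], [2.0, 0.0]],
    refusal_states=[[0.0, 0.0], [0.0, 0.0]],
)
corrected, jrs = remove_jailbreak_shift([3.0, 1.0], [1.0, 1.0], direction)
print(direction, jrs, corrected)  # [1.0, 0.0] 2.0 [1.0, 1.0]
```

Only the component along the direction is removed, so any part of the image-induced shift orthogonal to it is left in place, which is consistent with the abstract's claim that benign-task performance is preserved.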

6. Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text

Authors: Federico Albanese, Pablo Ronco, Nicolás D’Ippolito
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.17217v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

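Scoring rationale: The paper proposes a text anonymization framework built on local LLMs, whose core step is using an LLM to replace personally identifiable information in text. It is therefore highly relevant to “Large Language Models” (10 points). The paper fine-tunes with BERT+LoRA, which relates to “PEFT/LoRA” (8 points). It evaluates the effect of placing an anonymization layer in front of a Q&A agent, which relates to “LLM Agents” (8 points). Its use of local LLMs gives partial relevance to “Small Language Models/On-device AI” (5 points). Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper (0 points).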

!!! tip deepseek-chat TL;DR

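The paper proposes a privacy-preserving text anonymization framework based on local large language models. It replaces personally identifiable information with type-consistent surrogates, protecting privacy while preserving fluency and semantic utility, and outperforms existing methods in the evaluation.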

Abstract (Chinese translation)

负责任地使用人工智能要求我们在不损害数据实用性的前提下保护敏感信息，这一需求在大语言模型时代变得尤为迫切。我们通过一种本地部署的、基于大语言模型的替换流程来应对这一挑战，该流程通过将个人可识别信息替换为符合类型特征的逼真替代内容来实现文本匿名化。该方法完全在组织边界内使用本地大语言模型执行，既能防止数据外泄，又能保持文本的流畅性和任务相关语义。

我们在基于对话的行为数据集上进行了系统性、多指标、跨技术的评估，以行业标准方案（微软Presidio和谷歌DLP）以及前沿方法（ZSTS，包括仅删除版本及删除加替换版本）作为基准。我们的评估方案通过一项面向全生命周期的标准，综合衡量了隐私性、语义效用及隐私保护下的可训练性——该标准通过对经脱敏处理的文本微调紧凑编码器（BERT+LoRA）获得。此外，我们通过在应答大语言模型前插入本地匿名化层，并评估其回答质量，来检验智能问答代理的性能。这一中间化的类型保持替换阶段确保敏感内容不会暴露给第三方API，从而在保障机密性的前提下实现问答代理的负责任部署。

我们的方法在隐私性、主题偏移控制、事实效用保持及可训练性损失方面均达到最优水平，在隐私-效用-可训练性的综合评估边界上超越了基于规则的方法、命名实体识别基准模型以及各版本ZSTS方法。这些结果表明，基于本地大语言模型的替换技术能够生成既符合负责任使用要求又具备操作价值的匿名语料：既适用于智能代理流程的安全部署，也适用于下游微调任务且性能衰减有限。

Abstract

Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic Q&A performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q&A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy–utility–trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation.

Keywords: LLM-driven, privacy-preserving, anonymization, on-premise, PII substitution, agentic Q&A, LoRA fine-tuning, semantic utility

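In-depth analysis:

Anonymous by Construction: An LLM-Driven Framework for Privacy-Preserving Text

Summary:

To address the tension between protecting sensitive information and preserving data utility in the era of large models, the paper proposes an on-premise, LLM-driven substitution pipeline for anonymization. The method uses locally deployed open-source LLMs and prompt engineering to replace personally identifiable information (PII) in text with realistic, type-consistent fictitious surrogates rather than simply deleting it. A systematic evaluation on the ABCD dataset shows that the method outperforms existing industry standards (such as Microsoft Presidio and Google DLP) and the state-of-the-art ZSTS method on privacy protection (PII recall), semantic preservation (sentiment, topic), agentic Q&A performance, and model trainability. It achieves the best balance between privacy and utility and offers an effective option for responsible AI deployment.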

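Innovations:

Proposes an "anonymous by construction" on-premise LLM substitution architecture. Running the models within organizational boundaries removes the risk of data leaving the organization and satisfies strict privacy regulations.
Designs a type-preserving substitution strategy that uses prompt engineering to replace PII with realistic fictitious data of the same type instead of traditional redaction, which preserves fluency and task-relevant semantics.
Establishes a multi-dimensional evaluation protocol that jointly measures privacy protection, semantic utility, agentic Q&A quality, and model trainability under privacy (LoRA fine-tuning).
Achieves plug-and-play generality: with pretrained multilingual LLMs and few-shot prompting, it supports immediate cross-lingual and cross-domain deployment without task-specific training.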

Method

!!! info

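The paper uses locally deployed open-source large models (GPT-oss 20B and DeepSeek-r1 7B) as the core engine. Purpose-built prompts instruct the model to detect PII (names, email addresses, numbers, addresses, and so on) word by word and replace it with fictitious data of the same type. Zero-temperature decoding is used to keep the substitution deterministic. For evaluation, the Action-Based Conversation Dataset (ABCD) is used to compare against baselines including Microsoft Presidio, Google DLP, and ZSTS, with combined tests covering PII recall, sentiment consistency, topic distance, Q&A quality, and fine-tuning performance with BERT+LoRA.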
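
The type-preserving substitution step can be illustrated with a minimal sketch. In the paper a local LLM, driven by prompts, detects and replaces PII; here a pair of regular expressions and fixed surrogate pools stand in for the model, so `PATTERNS`, `SURROGATES`, and `anonymize` are hypothetical names used only for illustration. The sketch shows the two properties the method description emphasizes: a surrogate has the same type as the original, and the mapping is deterministic, so repeated mentions receive the same replacement.

```python
import re

# Assumption: regex detectors stand in for the paper's prompted local LLM.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}
# Assumption: fixed surrogate pools; an LLM would generate realistic ones.
SURROGATES = {
    "EMAIL": ["alex.reed@example.com", "sam.lee@example.org"],
    "PHONE": ["555-010-0001", "555-010-0002"],
}


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected PII span with a surrogate of the same type."""
    mapping: dict[str, str] = {}
    counters = {kind: 0 for kind in PATTERNS}

    def substitute(kind: str, match: re.Match) -> str:
        original = match.group(0)
        if original not in mapping:
            # First sighting: take the next surrogate from the pool of this type.
            pool = SURROGATES[kind]
            mapping[original] = pool[counters[kind] % len(pool)]
            counters[kind] += 1
        # Repeated mentions reuse the same surrogate, keeping the text coherent.
        return mapping[original]

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: substitute(k, m), text)
    return text, mapping


if __name__ == "__main__":
    sample = "Contact jane.doe@corp.com or call 212-555-0198 today"
    print(anonymize(sample)[0])
```

Keeping the original-to-surrogate mapping inside the trusted boundary is a design choice of this sketch; the paper's abstract only states that sensitive content is not exposed to third-party APIs.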

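Key results:

The method reaches state-of-the-art privacy protection, with very high PII recall, and effectively removes the risk of leaking sensitive information.
Compared with the baselines, it shows minimal topical drift and strong factual utility, with well-preserved sentiment consistency and highly readable text.
Fine-tuning a model on the sanitized text incurs low trainability loss, showing that the anonymized data is suitable for downstream model training.
In the agentic Q&A scenario, inserting the anonymization layer does not significantly affect the LLM's answer quality, so third-party APIs can be used without exposing sensitive content.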

Tech stack: GPT-oss (20B), DeepSeek-r1 (7B), BERT (for the ZSTS baseline), BERT + LoRA (for the trainability evaluation), Microsoft Presidio, Google DLP, Few-shot Prompting, Zero-shot Text Sanitization, Recall, Embedding Similarity

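Strengths

Very strong privacy compliance: the on-premise processing architecture is fully in line with regulations such as GDPR and HIPAA and addresses the pain point of data leaving the organization.
Excellent preservation of data utility: generative substitution, rather than deletion, retains as much of the text's semantic structure and context as possible, which benefits downstream tasks.
A comprehensive and pragmatic evaluation: beyond traditional privacy and semantic metrics, it introduces a "trainability under privacy" evaluation that provides key evidence for engineering adoption.
Flexible and general deployment: no domain-specific fine-tuning is needed, and prompt engineering alone adapts the system quickly to different languages and business scenarios.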

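Limitations

High compute cost: running inference with local LLMs of 20B or 7B parameters brings higher hardware cost and latency than lightweight rule-based methods such as regular-expression matching.
Potential hallucination risk: despite the prompt constraints, a generative model may still produce fictitious information that does not fit the context, which affects factual accuracy.
Limited evaluation scope: validation was carried out mainly on a task-oriented dialogue dataset (ABCD); effectiveness on open-domain dialogue or long documents needs further study.
Dependence on determinism: although zero-temperature decoding is used, small changes in model architecture or prompt wording may still lead to inconsistent outputs.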

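Relevance to the research direction:

The paper closely matches the user's interest in innovations in the technical principles of large models and in their application areas. It uses the generative ability of LLMs to solve the loss of semantics that traditional NLP methods suffer when protecting privacy, which makes it an innovative application of large-model technology. It involves core techniques such as prompt engineering, on-premise deployment, and model fine-tuning (LoRA), and the proposed framework is general enough to apply widely in highly sensitive scientific or commercial fields such as healthcare and finance. Although it does not deal directly with biomedical AI, the balance between data privacy and utility that it addresses is key to deploying AI in practice, so the work is innovative and has high reference value.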

7. Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Authors: Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17872v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

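Scoring rationale: The paper's core topic is mitigating LLM hallucinations, using a domain-grounded tiered retrieval and verification architecture (a RAG variant) to improve factual accuracy. It is therefore highly relevant to "Large Language Models" (the paper explicitly studies LLMs), "Retrieval-Augmented Generation" (the core method builds on a RAG architecture), and "Hallucination Mitigation" (it tackles hallucination directly), each scoring 10 points. Other keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems, and model compression, are not covered by the paper and score 0.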
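
The total in the score table above can be checked with a short sketch. The passing formula (5.0 + 0.8 × total weight) is taken from the report's configuration section. The rule that each row contributes weight × relevance is an assumption inferred from the table, where every weight is 1.0. The function names are illustrative.

```python
PASS_BASE = 5.0        # constant term of the report's passing-score formula
PASS_PER_WEIGHT = 0.8  # multiplier applied to the total keyword weight


def passing_score(weights):
    """Passing threshold: 5.0 + 0.8 * total weight."""
    return PASS_BASE + PASS_PER_WEIGHT * sum(weights)


def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


if __name__ == "__main__":
    weights = [1.0] * 27  # 27 keywords, each with weight 1.0
    # Paper 7: three keywords at relevance 10, the remaining 24 at 0.
    rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24
    score, threshold = total_score(rows), passing_score(weights)
    print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```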

!!! tip deepseek-chat TL;DR

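The paper proposes a domain-grounded tiered retrieval and verification architecture that mitigates LLM hallucinations through a four-phase, self-regulating pipeline (intrinsic verification, adaptive search routing, corrective document grading, and extrinsic regeneration), and it markedly improves factual accuracy and reliability across several benchmarks.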

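Abstract translation

Large language models (LLMs) have reached unprecedented fluency, yet they remain prone to "hallucinations", that is, generating content that is factually wrong or ungrounded. This limitation is especially critical in high-stakes domains where reliability matters most. We propose a domain-grounded tiered retrieval and verification architecture that aims to systematically intercept factual errors by turning LLMs from stochastic pattern-matchers into verified truth-seekers. The framework uses a four-phase, self-regulating pipeline implemented with LangGraph: (I) intrinsic verification with early-exit logic to optimize compute, (II) adaptive search routing through a domain detector that targets subject-specific archives, (III) corrective document grading (CRAG) to filter out irrelevant context, and (IV) extrinsic regeneration followed by atomic claim-level verification. The system was evaluated on 650 queries drawn from five different benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results show that the pipeline consistently outperforms zero-shot baselines in every setting. Win rates peaked at 83.7% on TimeQA v2 and 78.0% on MMLU Global Facts, confirming its effectiveness in domains that demand fine-grained temporal and numerical precision. On factual-answer rows, groundedness scores stayed between 78.8% and 86.4%. Although the architecture provides a strong fail-safe against misinformation, a persistent failure mode, "False-Premise Overclaiming", was identified. These findings give a detailed empirical characterization of multi-stage retrieval-augmented generation (RAG) behavior and indicate that future work should prioritize pre-retrieval "answerability" nodes to further close the reliability gap in conversational AI.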

Abstract

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to “hallucinations” - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of “False-Premise Overclaiming” was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval “answerability” nodes to further bridge the reliability gap in conversational AI.

Keywords: LLM Hallucinations, Domain-Grounded Retrieval, Tiered Retrieval, Retrieval-Augmented Generation, Factual Verification, LangGraph, CRAG, Groundedness

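In-depth analysis:

Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Summary:

To address the tendency of large language models (LLMs) to "hallucinate" when generating content, the paper proposes a domain-grounded tiered retrieval and verification architecture. The study aims to turn LLMs from stochastic pattern-matchers into verified truth-seekers through a four-phase, self-regulating pipeline. The method comprises intrinsic verification with early-exit logic, adaptive search routing with a domain detector, corrective document grading (CRAG), and extrinsic regeneration with atomic claim-level verification. Experiments on five benchmarks, including TimeQA v2 and FreshQA v2, show that the pipeline outperforms zero-shot baselines in every setting, with win rates of up to 83.7% and stable groundedness scores. Although the architecture intercepts factual errors effectively, it still exhibits the "False-Premise Overclaiming" failure mode, and the authors suggest that future work prioritize pre-retrieval "answerability" nodes.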

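Innovations:

Proposes a four-phase, self-regulating verification pipeline that combines intrinsic and extrinsic checkpoints to intercept factual errors systematically.
Introduces "intrinsic verification with early-exit logic", which uses an internal confidence score to improve compute efficiency and avoid unnecessary retrieval overhead.
Designs "adaptive search routing", which uses a domain detector to retrieve from domain-specific archives and so improves retrieval precision.
Implements "corrective document grading (CRAG)" and atomic claim-level verification, so that every claim in the generated content is backed by evidence.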

Method

!!! info

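The paper follows a four-phase approach implemented with LangGraph. First, the system tries to answer from its internal parametric memory (the zero-shot baseline) and performs intrinsic verification by decomposing the answer into atomic claims; if confidence is high, the early-exit mechanism fires. If the system is uncertain, it moves to the second phase, where a domain detector routes the query to a domain-specific archive for retrieval. The third phase uses CRAG to filter out irrelevant context. Finally, the system performs extrinsic regeneration and verifies each generated atomic claim one by one.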
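
The control flow of the four phases can be sketched in plain Python. This is not the authors' LangGraph implementation: every component is passed in as a callable, and the names (`tiered_answer`, `confidence`, `route`, `is_relevant`) and the `threshold` value are assumptions made for illustration. The final claim-level verification is omitted to keep the sketch short.

```python
from typing import Callable, List


def tiered_answer(
    query: str,
    draft: Callable[[str], str],
    confidence: Callable[[str, str], float],
    route: Callable[[str], str],
    retrieve: Callable[[str, str], List[str]],
    is_relevant: Callable[[str, str], bool],
    regenerate: Callable[[str, List[str]], str],
    threshold: float = 0.8,  # hypothetical early-exit cutoff
) -> dict:
    # Phase I: answer from parametric memory and verify it intrinsically.
    answer = draft(query)
    if confidence(query, answer) >= threshold:
        # Early exit: no retrieval cost is paid for confident answers.
        return {"answer": answer, "phase": "early_exit", "evidence": []}

    # Phase II: the domain detector picks an archive, then we retrieve from it.
    domain = route(query)
    documents = retrieve(domain, query)

    # Phase III: corrective grading keeps only documents judged relevant.
    evidence = [doc for doc in documents if is_relevant(query, doc)]

    # Phase IV: regenerate the answer from the retained evidence.
    return {
        "answer": regenerate(query, evidence),
        "phase": "regenerated",
        "evidence": evidence,
    }
```

Passing the components as functions keeps the sketch testable with stubs; a graph framework would express the same branching as conditional edges between nodes.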

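Key results:

On the TimeQA v2 benchmark, the system reaches a win rate of 83.7%, showing an advantage in temporal precision.
On the MMLU Global Facts benchmark, the system reaches a win rate of 78.0%.
On factual-answer rows, groundedness scores remain stable between 78.8% and 86.4%.
A persistent failure mode, called "False-Premise Overclaiming", is identified: the model answers questions built on a false premise with excessive confidence.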

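Tech stack: LangGraph (used to implement the self-regulating pipeline), zero-shot baseline models, corrective retrieval-augmented generation (CRAG), atomic claim decomposition, domain detector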

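Strengths

Efficiency: the early-exit logic reduces unnecessary API calls and compute.
Robustness: the multi-phase verification mechanism markedly improves factual accuracy and groundedness scores.
Adaptivity: the system adjusts its verification effort dynamically according to internal confidence and retrieves from domain-specific sources.
Coverage: it spans the whole process from checking intrinsic knowledge to extrinsic retrieval-based verification, which mitigates different types of hallucination.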

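Limitations

A specific failure mode: "False-Premise Overclaiming", in which the system may answer questions built on a false premise with excessive confidence.
System complexity: the four-phase pipeline may increase system complexity and response latency.
Dependence on sources: even with domain detection, retrieval quality may still depend on the quality and coverage of the external archives.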

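Relevance to the research direction:

This paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning" and focuses on a core weakness of LLMs: hallucination. The proposed tiered retrieval and verification architecture improves on existing RAG techniques and counts as an innovation at the level of large-model technical principles. Although the paper mentions high-stakes domains such as medicine and law as application background, its core contribution is the technical architecture itself, which meets the user's requirements for "new technology" and "strong novelty".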

8. Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

Authors: Andor Diera, Ansgar Scherp Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17624v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

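Scoring rationale: The paper's core topic is how large language models (LLMs) encode semantic relations, which is directly and highly relevant to keyword 1 (LLMs) and keyword 23 (Mechanistic Interpretability). It uses sparse autoencoders (SAE) in its analysis, which gives it some connection to keyword 2 (Sparse Models). It studies models of different sizes (including the 70M-parameter Pythia), which gives it some connection to keyword 3 (Small Language Models). The paper does not cover other keywords such as training methods, reasoning techniques, or application areas, so those keywords score 0.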

!!! tip deepseek-chat TL;DR

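The paper studies how large language models (LLMs) encode semantic relations (such as synonymy, antonymy, and hypernymy/hyponymy). Using linear probing and mechanistic interpretability techniques (such as sparse autoencoders), it finds a directional asymmetry in how the hierarchical relations are encoded, relation signals that are strongest in the mid-layers, and probe-level results that are causal in the larger model (Llama 3.1).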

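Abstract translation

Understanding whether large language models (LLMs) capture structured meaning requires looking at how they represent relations between concepts. This study examines three models of increasing size, Pythia-70M, GPT-2, and Llama 3.1 8B, and focuses on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to representing them. Our results reveal a directional asymmetry in the hierarchical relations: hypernymy is encoded redundantly and is hard to suppress, whereas hyponymy depends on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but show a stable profile: they peak in the mid-layers and are stronger in the post-residual/MLP pathways than in the attention pathway. Task difficulty is consistent across models (antonymy is easiest, synonymy hardest). Probe-level causality depends on model capacity: on Llama 3.1, SAE-guided patching reliably shifts these signals, while on the smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs and provide a reproducible framework for linking sparse features to probe-level causal evidence.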

Abstract

Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.

Keywords: Large Language Models, Semantic Relations, Mechanistic Interpretability, Sparse Autoencoders, Linear Probing, Activation Patching, Model Scaling, Causal Analysis
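
The idea of linear probing mentioned in the abstract can be shown with a very small sketch. The paper does not specify its probe in the abstract, so the difference-of-means classifier below is a simplified stand-in for a trained linear probe, and the two-dimensional "activations" are made-up toy data rather than real hidden states. The point is only that a probe is a linear readout: a direction and a bias applied to a representation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_probe(positive: List[Vector], negative: List[Vector]) -> Tuple[List[float], float]:
    """Fit a difference-of-means linear probe (simplified stand-in)."""
    mu_pos, mu_neg = mean(positive), mean(negative)
    # The probe direction points from the negative class mean to the positive one.
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    # The decision boundary passes through the midpoint of the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict(probe: Tuple[List[float], float], x: Vector) -> bool:
    """Return True if x falls on the positive side of the probe."""
    direction, bias = probe
    return sum(d * v for d, v in zip(direction, x)) + bias > 0


if __name__ == "__main__":
    # Toy activations: positive = word pairs in the relation, negative = unrelated pairs.
    probe = fit_probe([[1.0, 0.2], [0.8, -0.1]], [[-1.0, 0.1], [-0.9, 0.0]])
    print(predict(probe, [0.7, 0.3]), predict(probe, [-0.6, 0.3]))
```

In a layer-wise study, one such probe would be fitted per layer and pathway, and held-out accuracy compared across them.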

9. On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

Authors: David Restrepo, Miguel L Martins, Chenwei Wu, Luis Filipe Nakayama, Diego M Lopez, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17246v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

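Scoring rationale: The paper studies the modality gap in medical vision-language models (VLMs) and its effect on downstream performance, and proposes a lightweight post-hoc mechanism to control it. The core of the paper is highly relevant to medical AI applications (keyword 27, 10 points). It also involves analysis of pretrained models (keyword 5, 5 points) and a post-training control method (keyword 6, 8 points), and it discusses the interpretability of model representations to some extent (keyword 23, 5 points). The paper does not cover other technical topics such as pure language models, reasoning, alignment, efficient fine-tuning, retrieval augmentation, context extension, attention optimization, agents, quantization, acceleration, hallucination mitigation, world models, model merging, or in-context learning, so those keywords score 0.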

!!! tip deepseek-chat TL;DR

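The paper studies how the modality gap in medical vision-language models affects downstream performance. It finds that moderately adjusting the gap with a lightweight post-hoc mechanism can improve task performance and that medical datasets are more sensitive to this adjustment.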

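Abstract translation

Vision-language models (VLMs) show a characteristic "cone effect" in representation space: nonlinear encoders map embeddings into highly concentrated regions, which leads to the cross-modal separation known as the modality gap. Although this phenomenon has been widely observed, its practical effect on supervised multimodal learning, especially in medicine, is still unclear. This study proposes a lightweight post-hoc mechanism that keeps the pretrained VLM encoders frozen while continuously controlling the degree of cross-modal separation through a single hyperparameter λ. This lets us analyze systematically how the modality gap affects downstream multimodal performance without expensive retraining. In a supervised multimodal setting, we evaluate general-purpose models (CLIP, SigLIP) and medically specialized models (BioMedCLIP, MedSigLIP) on a range of medical and natural datasets. The results consistently show that reducing an excessive modality gap improves downstream performance and that medical datasets are more sensitive to gap modulation. Fully eliminating the gap, however, is not always the best choice: an intermediate, task-dependent degree of separation often gives the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be minimized universally.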

Abstract

Vision-Language Models (VLMs) exhibit a characteristic “cone effect” in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal setting. Results consistently show that reducing excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.

Keywords: Vision-Language Models, Modality Gap, Medical AI, Post-hoc Mechanism, Cross-modal Separation, Supervised Multimodal Learning, BioMedCLIP, MedSigLIP
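
One way to picture a single-parameter control over the modality gap is sketched below. The abstract does not describe the paper's actual mechanism, so this is an assumption throughout. The gap is measured as the distance between the image and text centroids, and text embeddings are shifted toward the image centroid by a fraction λ. With this hypothetical convention, λ = 0 leaves the embeddings unchanged and λ = 1 closes the centroid gap. No re-normalization is applied, and the encoders themselves are never touched.

```python
import math
from typing import List

Matrix = List[List[float]]


def centroid(vectors: Matrix) -> List[float]:
    """Component-wise mean of the embeddings of one modality."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def modality_gap(image_emb: Matrix, text_emb: Matrix) -> float:
    """Euclidean distance between the two modality centroids."""
    c_img, c_txt = centroid(image_emb), centroid(text_emb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


def shift_text(image_emb: Matrix, text_emb: Matrix, lam: float) -> Matrix:
    """Move text embeddings toward the image centroid by a fraction lam."""
    c_img, c_txt = centroid(image_emb), centroid(text_emb)
    delta = [a - b for a, b in zip(c_img, c_txt)]  # gap vector, text -> image
    return [[v + lam * d for v, d in zip(vec, delta)] for vec in text_emb]


if __name__ == "__main__":
    images = [[1.0, 0.0], [1.0, 2.0]]
    texts = [[-1.0, 0.0], [-1.0, 2.0]]
    for lam in (0.0, 0.5, 1.0):
        print(lam, modality_gap(images, shift_text(images, texts, lam)))
```

Sweeping λ and training the downstream classifier at each value is what would expose the intermediate, task-dependent optimum the abstract reports.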

📋 All papers

1. ✅ CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

Score: 62.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

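The paper proposes CODMAS, a framework that automates RTL code optimization with an LLM-based multi-agent collaborative system, achieving about 25% lower critical path delay on pipelining tasks and about 22% lower power on clock-gating tasks.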

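Abstract translation

Optimizing register-transfer level (RTL) code is a key step in electronic design automation (EDA) for improving power, performance, and area (PPA). This paper presents CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning, domain-aware code generation, and deterministic evaluation to automate RTL optimization. At its core are two dialectic agents. The first, the Articulator, is inspired by rubber-duck debugging; it spells out step-by-step transformation plans and surfaces hidden assumptions. The second, the Hypothesis Partner, predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents drive a Domain-Specific Coding Agent (DCA) that produces architecture-aware Verilog edits and direct a Code Evaluation Agent (CEA) that verifies syntax, functionality, and PPA metrics. We also introduce the RTLOPT benchmark, which contains 120 Verilog triples (unoptimized version, optimized version, testbench) for pipelining and clock-gating transformations. Across proprietary and open-source LLMs, CODMAS reduces critical path delay by about 25% for pipelining and power by about 22% for clock gating, while markedly reducing functional errors and compilation failures compared with strong prompting and agentic baselines. These results indicate that structured multi-agent reasoning can substantially strengthen automated RTL optimization and may extend to more complex designs and a wider range of optimization tasks.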

Abstract

Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

Keywords: Multi-agent Systems, LLM Agents, RTL Optimization, Electronic Design Automation, Dialectic Reasoning, Code Generation, Verilog, Power Performance Area

2. ✅ Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

Authors: Mohsen Arjmandi Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17683v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

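Scoring rationale: The paper's core topic is Sensi, an LLM game-agent architecture, and it touches on test-time learning, curriculum learning, multi-agent coordination, context-window control, and hallucination mitigation. It is highly relevant to "LLM Agents" and "Large Language Models" (10 points each). Subtopics related to LLM agents, namely "Multi-agent Systems", "Self-Correction", "Context Window Extension", "Chain of Thought", "System 2 Thinking", "Hallucination Mitigation", and "In-context Learning", have some connection (5 points each). Other keywords such as MoE, quantization, and AI for science applications are not covered (0 points).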

!!! tip deepseek-chat TL;DR

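The paper proposes Sensi, an LLM agent architecture that uses curriculum learning and a two-player separation design to address inefficient test-time learning in game environments. It achieves a 50-94x gain in sample efficiency but finds that hallucination in the perception layer is the main bottleneck.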

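Abstract translation

Large language model (LLM) agents deployed in unknown environments have to learn the task structure at test time, but existing approaches need thousands of interactions to form useful hypotheses. This paper presents Sensi, an LLM agent architecture for the ARC-AGI-3 game challenge that achieves structured test-time learning through three mechanisms: (1) a two-player architecture that separates perception from action, (2) a curriculum learning system managed by an external state machine, and (3) a database that serves as a control plane, which makes the agent's context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics, which decides when the agent has learned enough about the current topic to move on to the next stage. Results are reported for two iterations: Sensi v1 solves 2 game levels with the two-player architecture alone, while Sensi v2, which adds curriculum learning, solves no levels (0) but completes its whole learning curriculum in about 32 action attempts, a sample efficiency 50-94 times that of comparable systems that need 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade that originates in the perception layer, which shows that the architectural bottleneck has moved from learning efficiency to perceptual grounding, a more tractable problem.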

Abstract

Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agent's context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels - but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding - a more tractable problem.

Keywords: LLM agents, test-time learning, curriculum learning, two-player architecture, context window control, sample efficiency, hallucination mitigation, game-playing
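
The 50-94x sample-efficiency figure follows directly from the attempt counts quoted in the abstract, as this short check shows. The function name `efficiency_gain` is illustrative; the numbers are the ones reported in the abstract.

```python
SENSI_V2_ATTEMPTS = 32            # approximate attempts reported for Sensi v2
BASELINE_ATTEMPTS = (1600, 3000)  # attempt range reported for comparable systems


def efficiency_gain(baseline_attempts: int, own_attempts: int) -> float:
    """Ratio of baseline attempts to own attempts (higher means more efficient)."""
    return baseline_attempts / own_attempts


if __name__ == "__main__":
    low, high = (efficiency_gain(b, SENSI_V2_ATTEMPTS) for b in BASELINE_ATTEMPTS)
    print(f"{low:.2f}x to {high:.2f}x")
```

The lower bound is exactly 50 and the upper bound is 93.75, which the abstract rounds to 94.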

3. ✅ Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search

Authors: Himadri Samanta Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17765v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

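To address hallucination and the lack of clinical grounding in automated radiology report generation, the study proposes a system based on multimodal retrieval-augmented generation (RAG). It combines image-text embeddings with case-based similarity retrieval to draft chest X-ray impressions backed by citations, and experiments show that the approach markedly improves retrieval performance and makes the output more trustworthy.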

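Abstract translation

With the rise of deep learning and large language models, automated radiology report generation has drawn growing attention. Fully generative approaches, however, often hallucinate and lack clinical grounding, which limits their reliability in real workflows. This study proposes a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest X-ray impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and a citation-constrained drafting mechanism to ensure factual alignment with historical radiology reports. We build a multimodal retrieval database from a curated subset of the MIMIC-CXR dataset: image embeddings are produced by CLIP encoders, and text embeddings come from structured impression sections. A fused-similarity framework is built on FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases are used to construct grounded prompts for drafting impressions, and safety mechanisms enforce citation coverage and confidence-based refusal. Experiments show that multimodal fusion markedly improves retrieval performance over image-only retrieval, with Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, which makes them considerably more trustworthy than those of conventional generative approaches. The study highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and for augmenting radiology workflows.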

Abstract

Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports. A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal. Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation.
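
The retrieval side of this pipeline can be sketched with a brute-force search in place of FAISS. The fusion rule (a weighted sum of image and text cosine similarities), the weight `alpha`, and the refusal threshold `min_score` are assumptions made for illustration; the abstract does not state the paper's actual fusion formula or thresholds.

```python
import math
from typing import Dict, List, Optional, Sequence, Set, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def fused_top_k(
    query_img: Vector,
    query_txt: Vector,
    cases: Dict[str, Tuple[Vector, Vector]],
    alpha: float = 0.5,  # hypothetical weight on image similarity
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank stored cases by a weighted sum of image and text similarity."""
    scored = [
        (alpha * cosine(query_img, img) + (1 - alpha) * cosine(query_txt, txt), case_id)
        for case_id, (img, txt) in cases.items()
    ]
    scored.sort(reverse=True)
    return [(case_id, score) for score, case_id in scored[:k]]


def citations_or_refuse(
    hits: List[Tuple[str, float]], min_score: float = 0.7
) -> Optional[List[str]]:
    """Return case ids to cite, or None when the best match is too weak."""
    if not hits or hits[0][1] < min_score:
        return None  # confidence-based refusal: no grounded draft is produced
    return [case_id for case_id, _ in hits]


def recall_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of relevant cases that appear in the retrieved list."""
    return len(set(retrieved) & relevant) / len(relevant)
```

A vector index would replace the linear scan in `fused_top_k` at scale; the refusal check stays the same because it only inspects the returned scores.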

4. ✅ EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17808v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
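The arithmetic behind the score line can be checked with a short sketch. It uses the pass-mark formula from the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0) and assumes that each row's score is weight × relevance, which is what the table columns suggest. The function names and the `>=` pass rule are assumptions for illustration.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows of the EVA table above: post-training, alignment, RLHF, world models.
eva_rows = [(1.0, 10.0), (1.0, 8.0), (1.0, 10.0), (1.0, 10.0)]

score = paper_score(eva_rows)      # 10 + 8 + 10 + 10
threshold = pass_threshold(27.0)   # 5.0 + 0.8 * 27.0
passed = score >= threshold        # assumption: reaching the threshold counts as a pass
```

The same check applied to the other tables in this section reproduces their reported totals.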

!!! tip deepseek-chat TL;DR

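The paper addresses the mismatch between the visual rollouts generated by video world models and physically executable robot actions. It proposes EVA, a reinforcement-learning post-training framework that aligns the model using an inverse-dynamics reward, reducing embodiment-specific artifacts and improving downstream task success.

Abstract (Chinese translation)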

视频生成模型正日益被用作机器人的世界模型，其中模型根据当前观测和任务指令生成未来的视觉推演序列，而逆动力学模型则将生成的帧转换为可执行的机器人动作。然而，当前的视频世界模型缺乏显式的可执行性约束。这导致视觉连贯的推演序列仍可能违反刚体运动学一致性，在通过逆动力学模型解码时产生不稳定或不可行的控制指令。我们将这种视觉生成与物理可执行控制之间的不匹配称为可执行性差距。虽然可通过拒绝采样等推理时技术缓解该差距，但由于视频生成成本高昂，此类方法效率低下。本文利用可执行性差距作为训练信号，提出可执行视频对齐——一种用于对齐视频世界模型的强化学习后训练框架。该框架在真实机器人轨迹上训练逆动力学模型，并将其重新用作奖励模型，通过生成视频所诱导的动作序列来评估视频质量：奖励由速度、加速度和急动度度量的平滑运动，同时惩罚违反本体约束的动作。值得注意的是，即使生成视频存在严重视觉伪影，该奖励仍能提供有效信息，因为此类伪影通常对应不稳定或越界的动作。在RoboTwin基准测试和真实双臂机器人上的实验表明，该方法能减少生成推演中特定于本体的伪影，并提升下游任务执行成功率。

摘要 (Abstract)

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
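To make the reward described above concrete, here is a minimal sketch of a smoothness-plus-bounds reward over a decoded action sequence. The abstract does not give the exact formulation, so the finite-difference estimates, the weights, the joint limits, and the function names are all assumptions. The inverse dynamics model itself is not modeled: the input is taken to be a single joint's position sequence that an IDM has already decoded from the generated video.

```python
def diff(seq, dt):
    """Finite-difference derivative of a uniformly sampled sequence."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]


def mean_sq(seq):
    """Mean of squared values; 0.0 for an empty sequence."""
    return sum(x * x for x in seq) / len(seq) if seq else 0.0


def executability_reward(positions, dt=0.1, low=-1.0, high=1.0,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Higher is better: penalize large velocity/acceleration/jerk and out-of-bound steps."""
    if not positions:
        raise ValueError("positions must be non-empty")
    vel = diff(positions, dt)
    acc = diff(vel, dt)
    jerk = diff(acc, dt)
    smooth_cost = (w_vel * mean_sq(vel) + w_acc * mean_sq(acc)
                   + w_jerk * mean_sq(jerk))
    # Fraction of time steps that violate the (hypothetical) joint limits.
    violations = sum(1 for p in positions if p < low or p > high)
    bound_cost = w_bound * violations / len(positions)
    return -(smooth_cost + bound_cost)


smooth = [0.0, 0.1, 0.2, 0.3, 0.4]    # steady motion inside the limits
jittery = [0.0, 0.4, -0.3, 1.5, 0.4]  # oscillating motion with one out-of-bound step
r_smooth = executability_reward(smooth)
r_jittery = executability_reward(jittery)
```

The jittery trajectory receives a much lower reward than the smooth one, which is the property the abstract relies on: visual artifacts that decode into unstable or out-of-bound actions are penalized without inspecting the pixels directly.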

Keywords: video world models, reinforcement learning, post-training, alignment, inverse dynamics, robot actions, executability gap, EVA

Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17372v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

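The paper finds that jailbreaks of vision-language models stem from a representation shift induced by the visual modality, and proposes a defense that improves safety by removing the jailbreak-related shift.

Abstract (Chinese translation)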

大型视觉语言模型（VLMs）在整合视觉模态后，其安全对齐性常出现弱化现象。即使文本提示包含明确的有害意图，添加图像仍会显著提高越狱成功率。本文发现，在表征空间中，VLMs能够清晰区分良性输入与有害输入。此外，即使在有害输入中，越狱样本也会形成一种与拒绝样本明显分离的独特内部状态。这些观察表明，越狱行为并非源于模型未能识别有害意图，而是视觉模态将表征推向特定的越狱状态，从而导致未能触发拒绝机制。为量化这一转变，我们识别出一个越狱方向，并将图像诱导的表征偏移沿此方向的分量定义为越狱相关偏移。分析表明，越狱相关偏移能可靠地表征越狱行为，为多样化的越狱场景提供了统一解释。最后，我们提出一种防御方法，通过在推理时移除越狱相关偏移（JRS-Rem）来增强VLM的安全性。实验证明，JRS-Rem能在多种场景下提供强效防御，同时保持良性任务上的性能。

摘要 (Abstract)

Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
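The definition of the jailbreak-related shift as "the component of the image-induced representation shift along the jailbreak direction" amounts to a vector projection, which can be sketched as below. This is an illustration only: how the direction is estimated, which layer is edited, and whether the full component is removed or only part of it are not stated in the abstract, so the direction is passed in as a given and all names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def remove_jailbreak_shift(h_with_image, h_text_only, direction):
    """Subtract the component of the image-induced shift along `direction`.

    Returns the edited hidden state and the scalar jailbreak-related shift.
    """
    d = unit(direction)
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    jrs = dot(shift, d)  # scalar projection of the shift onto the direction
    cleaned = [h - jrs * di for h, di in zip(h_with_image, d)]
    return cleaned, jrs


# Toy 2-D hidden states; the second axis plays the role of the jailbreak direction.
h_text = [1.0, 0.0]
h_img = [1.5, 3.0]
cleaned, jrs = remove_jailbreak_shift(h_img, h_text, direction=[0.0, 2.0])
```

After the edit, the remaining shift `cleaned - h_text` has no component along the jailbreak direction, while the orthogonal part of the image-induced shift is left untouched.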

Keywords: Vision-Language Models, Jailbreak Attacks, Safety Alignment, Representation Shift, Defense Method, Visual Modality, Internal State Analysis, JRS-Rem

6. ✅ Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text

Authors: Federico Albanese, Pablo Ronco, Nicolás D’Ippolito Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17217v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

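The paper proposes a privacy-preserving text anonymization framework based on local LLMs. It replaces personally identifiable information with type-consistent surrogates, preserving fluency and semantic utility while protecting privacy, and outperforms existing methods in its evaluation.

Abstract (Chinese translation)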

负责任地使用人工智能要求我们在不损害数据实用性的前提下保护敏感信息，这一需求在大语言模型时代变得尤为迫切。我们通过一种本地部署的、基于大语言模型的替换流程来应对这一挑战，该流程通过将个人可识别信息替换为符合类型特征的逼真替代内容来实现文本匿名化。该方法完全在组织边界内使用本地大语言模型执行，既能防止数据外泄，又能保持文本的流畅性和任务相关语义。

我们在基于对话的行为数据集上进行了系统性、多指标、跨技术的评估，以行业标准方案（微软Presidio和谷歌DLP）以及前沿方法（ZSTS，包括仅删除版本及删除加替换版本）作为基准。我们的评估方案通过一项面向全生命周期的标准，综合衡量了隐私性、语义效用及隐私保护下的可训练性——该标准通过对经脱敏处理的文本微调紧凑编码器（BERT+LoRA）获得。此外，我们通过在应答大语言模型前插入本地匿名化层，并评估其回答质量，来检验智能问答代理的性能。这一中间化的类型保持替换阶段确保敏感内容不会暴露给第三方API，从而在保障机密性的前提下实现问答代理的负责任部署。

我们的方法在隐私性、主题偏移控制、事实效用保持及可训练性损失方面均达到最优水平，在隐私-效用-可训练性的综合评估边界上超越了基于规则的方法、命名实体识别基准模型以及各版本ZSTS方法。这些结果表明，基于本地大语言模型的替换技术能够生成既符合负责任使用要求又具备操作价值的匿名语料：既适用于智能代理流程的安全部署，也适用于下游微调任务且性能衰减有限。

摘要 (Abstract)

Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic Q&A performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q&A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy–utility–trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation.
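The idea of type-consistent substitution can be illustrated with a small sketch. It deliberately swaps the paper's local LLM for a fixed lookup table, and a naive string search stands in for PII detection. The surrogate pools, the `find_spans` helper, and the span format are hypothetical and serve only to show the two properties the abstract emphasizes: a surrogate has the same type as the original, and repeated mentions map to the same surrogate.

```python
# Hypothetical surrogate pools, one per PII type.
SURROGATES = {
    "PERSON": ["Alex Morgan", "Jamie Lee"],
    "EMAIL": ["user1@example.com", "user2@example.com"],
}


def find_spans(text, needle, pii_type):
    """Stand-in detector: every occurrence of `needle` becomes a (start, end, type) span."""
    spans, start = [], text.find(needle)
    while start != -1:
        spans.append((start, start + len(needle), pii_type))
        start = text.find(needle, start + len(needle))
    return spans


def anonymize(text, spans):
    """Replace non-overlapping PII spans with surrogates of the same type."""
    mapping = {}                         # (type, original) -> surrogate
    used = {t: 0 for t in SURROGATES}    # next pool index per type
    out, cursor = [], 0
    for start, end, pii_type in sorted(spans):
        key = (pii_type, text[start:end])
        if key not in mapping:
            pool = SURROGATES[pii_type]
            # Pools are cycled, so distinct originals can collide once a pool is exhausted.
            mapping[key] = pool[used[pii_type] % len(pool)]
            used[pii_type] += 1
        out.append(text[cursor:start])
        out.append(mapping[key])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Maria Silva wrote to maria@corp.test. Maria Silva called later."
spans = (find_spans(text, "Maria Silva", "PERSON")
         + find_spans(text, "maria@corp.test", "EMAIL"))
sanitized = anonymize(text, spans)
```

Both mentions of the person map to the same surrogate name, so coreference in the sanitized text is preserved. An LLM-driven version would generate context-appropriate surrogates instead of drawing from a fixed pool.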

Keywords: LLM-driven, privacy-preserving, anonymization, on-premise, PII substitution, agentic Q&A, LoRA fine-tuning, semantic utility

7. ✅ Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Authors: Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17872v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

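The paper proposes a domain-grounded tiered retrieval and verification architecture that mitigates LLM hallucinations through a four-phase self-regulating pipeline (intrinsic verification, adaptive search routing, corrective document grading, and extrinsic regeneration), consistently outperforming zero-shot baselines across five benchmarks.

Abstract (Chinese translation)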

大型语言模型（LLMs）在流畅性方面取得了前所未有的成就，但仍易产生“幻觉”——即生成事实错误或无依据的内容。这一局限在可靠性至关重要的高风险领域中尤为关键。我们提出了一种领域锚定的分层检索与验证架构，旨在通过将LLMs从随机模式匹配器转变为经过验证的真相探寻者，系统性地拦截事实错误。该框架采用基于LangGraph实现的四阶段自调节流程：（I）利用早期退出逻辑进行内在验证以优化计算资源，（II）通过领域检测器实现自适应搜索路由以定向检索特定主题档案库，（III）采用纠正性文档分级（CRAG）过滤无关上下文，（IV）进行外部再生及原子化声明级验证。该系统在来自五个不同基准测试的650条查询中进行了评估：TimeQA v2、FreshQA v2、HaluEval General、MMLU Global Facts和TruthfulQA。实证结果表明，该流程在所有环境中均持续优于零样本基线。在TimeQA v2中胜率峰值达83.7%，在MMLU Global Facts中达78.0%，证实了其在需要精细时间和数值精度的领域具有高效性。在事实性答案行中，锚定分数稳定保持在78.8%至86.4%之间。尽管该架构为错误信息提供了强大的故障防护机制，但仍识别出“虚假前提过度断言”这一持续存在的失效模式。这些发现为多阶段检索增强生成（RAG）行为提供了详细的实证特征描述，并表明未来工作应优先发展检索前“可答性”节点，以进一步弥合对话式人工智能的可靠性差距。

摘要 (Abstract)

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to “hallucinations” - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of “False-Premise Overclaiming” was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval “answerability” nodes to further bridge the reliability gap in conversational AI.
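The control flow of the four phases can be sketched in plain Python. This is not the authors' LangGraph implementation: every component is injected as a callable, the parameter names are hypothetical, and the final atomic claim-level verification step is omitted for brevity.

```python
def run_pipeline(query, draft_answer, is_verified, detect_domain,
                 search, grade, regenerate):
    """Four-phase flow with early exit; all components are caller-supplied stubs."""
    # Phase I: intrinsic verification with early exit.
    draft = draft_answer(query)
    if is_verified(query, draft):
        return {"answer": draft, "phase": "early_exit", "sources": []}
    # Phase II: adaptive search routing via a domain detector.
    domain = detect_domain(query)
    documents = search(query, domain)
    # Phase III: corrective document grading filters irrelevant context.
    kept = [doc for doc in documents if grade(query, doc)]
    # Phase IV: extrinsic regeneration from the retained documents.
    answer = regenerate(query, kept)
    return {"answer": answer, "phase": "regenerated", "sources": kept}


# Toy stubs that make the two paths observable.
stubs = dict(
    draft_answer=lambda q: "draft",
    is_verified=lambda q, a: "capital" in q,
    detect_domain=lambda q: "history",
    search=lambda q, dom: [f"{dom}: relevant note", f"{dom}: off-topic"],
    grade=lambda q, doc: "relevant" in doc,
    regenerate=lambda q, docs: f"answer grounded in {len(docs)} document(s)",
)
fast = run_pipeline("What is the capital of France?", **stubs)
slow = run_pipeline("When was the treaty signed?", **stubs)
```

The first query exits after Phase I, which is where the compute saving comes from. The second goes through routing, grading, and regeneration, keeping only the document that passes the grader.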

Keywords: LLM Hallucinations, Domain-Grounded Retrieval, Tiered Retrieval, Retrieval-Augmented Generation, Factual Verification, LangGraph, CRAG, Groundedness

8. ✅ Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

Authors: Andor Diera, Ansgar Scherp Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17624v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

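The paper studies how LLMs encode semantic relations (synonymy, antonymy, hypernymy, hyponymy). Using linear probing and mechanistic interpretability techniques such as sparse autoencoders, it finds a directional asymmetry between hypernymy and hyponymy, relation signals that peak in the mid-layers, and probe-level causal effects that are reliable only on the larger model (Llama 3.1 8B).

Abstract (Chinese translation)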

理解大型语言模型（LLM）是否捕捉结构化意义，需要考察它们如何表征概念间的关系。本研究考察了三个规模递增的模型：Pythia-70M、GPT-2 和 Llama 3.1 8B，聚焦于四种语义关系：同义关系、反义关系、上位关系（hypernymy）和下位关系（hyponymy）。我们结合线性探测与机制可解释性技术，包括稀疏自编码器（SAE）和激活修补，以识别这些关系在何处被编码，以及特定特征如何对其表征做出贡献。我们的结果揭示了层级关系中的方向性不对称：上位关系被冗余编码且难以抑制，而下位关系则依赖于紧凑的特征，这些特征更容易因消融而受到破坏。更广泛地说，关系信号是弥散的，但表现出稳定的分布模式：它们在中层达到峰值，并且在残差后/MLP通路中比在注意力通路中更强。不同模型间的任务难度保持一致（反义关系最容易，同义关系最难）。探测层面的因果性取决于模型能力：在 Llama 3.1 上，SAE 引导的修补能可靠地改变这些信号，而在较小模型上，这种改变则微弱或不稳定。我们的结果阐明了语义关系在 LLM 内部何处以及如何可靠地表征，并提供了一个可复现的框架，用于将稀疏特征与探测层面的因果证据联系起来。

摘要 (Abstract)

Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
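For readers unfamiliar with linear probing, the following sketch shows the basic recipe: freeze the model, collect activations for labeled examples, and fit a linear classifier on top. It is a toy illustration with hand-made 2-D "activations" and a plain logistic-regression loop. The data, hyperparameters, and function names are assumptions and do not reflect the authors' setup.

```python
import math


def train_probe(xs, ys, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose sign of w.x + b matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)


# Toy activations: label 1 = word pair is in the relation, 0 = it is not.
acts = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.1],
        [-1.0, 0.1], [-0.8, -0.2], [-1.1, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_probe(acts, labels)
accuracy = probe_accuracy(weights, bias, acts, labels)
```

Repeating this per layer and per pathway (residual stream, MLP output, attention output) gives the kind of layer-wise profile the abstract describes. In practice the probe would be evaluated on held-out pairs rather than on its training data.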

Keywords: Large Language Models, Semantic Relations, Mechanistic Interpretability, Sparse Autoencoders, Linear Probing, Activation Patching, Model Scaling, Causal Analysis

9. ✅ On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

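The paper studies how the modality gap in medical vision-language models affects downstream performance. It finds that moderately adjusting the gap with a lightweight post-hoc mechanism improves task performance, and that medical datasets are more sensitive to this adjustment.

Abstract (Chinese translation)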

视觉语言模型（VLMs）在表征空间中呈现出一种典型的“锥体效应”，即非线性编码器将嵌入向量映射到高度集中的区域，这导致了被称为模态间隙的跨模态分离现象。尽管该现象已被广泛观测到，但其在监督式多模态学习（尤其在医学领域）中的实际影响尚不明确。本研究提出一种轻量级事后处理机制，该机制保持预训练的VLM编码器冻结，同时通过单一超参数{λ}持续调控跨模态分离程度。这使得我们能够在不进行昂贵重新训练的情况下，系统分析模态间隙如何影响下游多模态性能。我们在监督式多模态设置下，评估了通用模型（CLIP、SigLIP）和医学专用模型（BioMedCLIP、MedSigLIP）在多种医学与自然数据集上的表现。结果一致表明，减小过大的模态间隙能提升下游性能，且医学数据集对间隙调节表现出更强的敏感性；然而，完全消除间隙并非总是最优选择，中等程度、依赖具体任务的分离状态往往能产生最佳结果。这些发现将模态间隙定位为多模态表征的一种可调节属性，而非一个应当被普遍最小化的量。

摘要 (Abstract)

Vision-Language Models (VLMs) exhibit a characteristic “cone effect” in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal setting. Results consistently show that reducing an excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.
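One simple way to control cross-modal separation with a single parameter on frozen embeddings is to translate each modality along the vector between the two modality centroids. The sketch below shows that idea only. The abstract does not specify the paper's mechanism, so the centroid-shift rule, the symmetric split between modalities, and the function names are assumptions, and re-normalization of the shifted embeddings is omitted.

```python
import math


def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def adjust_gap(img_embs, txt_embs, lam):
    """Shift both modalities toward each other along the centroid gap.

    lam = 0 leaves the embeddings unchanged; lam = 1 makes the centroids coincide.
    """
    gap = [i - t for i, t in zip(mean_vec(img_embs), mean_vec(txt_embs))]
    new_img = [[x - 0.5 * lam * g for x, g in zip(v, gap)] for v in img_embs]
    new_txt = [[x + 0.5 * lam * g for x, g in zip(v, gap)] for v in txt_embs]
    return new_img, new_txt


def gap_norm(img_embs, txt_embs):
    """Euclidean distance between the two modality centroids."""
    diff = [i - t for i, t in zip(mean_vec(img_embs), mean_vec(txt_embs))]
    return math.sqrt(sum(d * d for d in diff))


img = [[2.0, 0.0], [4.0, 0.0]]
txt = [[0.0, 1.0], [0.0, 3.0]]
original_gap = gap_norm(img, txt)
half_gap = gap_norm(*adjust_gap(img, txt, lam=0.5))
closed_gap = gap_norm(*adjust_gap(img, txt, lam=1.0))
```

Sweeping `lam` between 0 and 1 and training the downstream classifier at each value is the kind of experiment that would reveal whether an intermediate separation works best, as the abstract reports.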

Keywords: Vision-Language Models, Modality Gap, Medical AI, Post-hoc Mechanism, Cross-modal Separation, Supervised Multimodal Learning, BioMedCLIP, MedSigLIP

10. ❌ UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Authors: Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17476v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

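Scoring rationale: UniSAFE focuses on safety evaluation of unified multimodal models (UMMs), which falls under large-model applications. Relevance to the keywords: (1) "Large Language Models" and related terms (5 points): UMMs are typically built on large models, but the paper does not discuss their technical principles in depth. (2) "Instruction Tuning" and related terms (8 points): the paper emphasizes "safety alignment", which directly concerns model alignment and safety tuning. (3) "Hallucination Mitigation" and related terms (8 points): the paper evaluates safety risks such as the generation of policy-violating content, which relates to factuality and truthfulness. (4) "Mechanistic Interpretability" and related terms (5 points): the benchmark is used to explain model vulnerabilities, which has some connection to explainable AI. The remaining keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the system-level safety risks of unified multimodal models (UMMs), the paper introduces UniSAFE, the first comprehensive safety evaluation benchmark. It finds that current UMMs show elevated safety violations in multi-image composition and multi-turn settings, and that image-output tasks are more vulnerable than text-output tasks.

Abstract (Chinese translation)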

统一多模态模型（UMMs）具备强大的跨模态能力，但也带来了单任务模型中未曾观察到的新型安全风险。尽管此类模型已崭露头角，现有的安全基准测试仍分散于不同任务与模态之间，限制了对复杂系统级漏洞的全面评估。为填补这一空白，我们提出了UniSAFE——首个针对UMMs系统级安全性的综合基准测试，涵盖7种输入/输出模态组合，横跨传统任务与新型多模态上下文图像生成场景。UniSAFE采用共享目标设计，将常见风险场景映射至特定任务的输入/输出配置中，从而实现对安全失效的跨任务可控比较。该基准包含6,802个精心构建的测试实例，我们利用其对15个前沿的专有及开源UMMs进行评估。研究结果揭示了当前UMMs普遍存在的严重漏洞，包括在多图像组合与多轮对话场景中安全违规率显著升高，且图像输出任务的安全脆弱性持续高于文本输出任务。这些发现凸显了加强UMMs系统级安全对齐的迫切需求。我们的代码与数据已公开于https://github.com/segyulee/UniSAFE。

摘要 (Abstract)

Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE

Keywords: Unified Multimodal Models, safety evaluation, benchmark, system-level vulnerabilities, multimodal-context image generation, safety alignment, cross-modality capabilities

11. ❌ Ruyi2.5 Technical Report

Authors: Huan Song, Shuyu Tian, Qingfei Zhao, Wenhao Hong, Jiang Liu, Ting Long, Jiawei Shao, Xuelong Li Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17311v1

Score: 21.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

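Scoring rationale: The paper presents Ruyi2.5, a multimodal familial model built on the AI Flow framework. It uses a shared-backbone architecture to co-train models of different scales, and it develops Ruyi2.5-Camera, a privacy-preserving camera service system in which an edge model performs feature mapping and a cloud model performs deep behavior reasoning. It also proposes BPPO to accelerate reinforcement learning fine-tuning. Keyword relevance: (1) the paper concerns multimodal models, which fall within the scope of large models, so it is fairly relevant to “Large Language Models” and related terms (8 points); (2) the model supports deployment at different scales, and the edge model is partly relevant to “Small Language Models” (5 points); (3) building the model involves pre-training and domain adaptation, which is relevant to “Pre-training” and related terms (8 points). Other keywords such as MoE, SFT, RLHF, RAG, and quantization are not reflected in the abstract and score 0.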
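
The arithmetic behind the score line can be sketched as follows. This is a minimal illustration, assuming each keyword score equals weight × relevance (which is consistent with the rows of the table above) and using the pass-mark formula 5.0 + 0.8 × total weight from the report configuration. The function names `pass_mark` and `paper_score` are hypothetical and are not taken from the report generator.

```python
def pass_mark(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(relevance_by_keyword: dict[str, float], weight: float = 1.0) -> float:
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * r for r in relevance_by_keyword.values())


# Non-zero rows for Ruyi2.5; the remaining 24 keywords contribute 0.0.
ruyi_relevance = {"LLMs": 8.0, "SLMs": 5.0, "Pre-training": 8.0}

score = paper_score(ruyi_relevance)
threshold = pass_mark(total_weight=27.0)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0 each, the threshold comes to 26.6, so a paper needs strong relevance on roughly three keyword groups to pass.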

!!! tip deepseek-chat TL;DR

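The paper presents the Ruyi2.5 multimodal familial model and its privacy-preserving camera application, Ruyi2.5-Camera. It co-trains models of different scales through a shared-backbone architecture and introduces BPPO to accelerate reinforcement learning fine-tuning. Ruyi2.5 matches Qwen3-VL on general multimodal benchmarks, and Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.

Abstract translation (Chinese)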

我们推出基于AI Flow框架构建的多模态家族模型Ruyi2.5。该模型将Ruyi2的“一次训练，多处部署”范式扩展至多模态领域，构建了一个共享主干架构，在统一流程中协同训练不同规模的模型，确保所有部署层级间的语义一致性。基于Ruyi2.5，我们进一步开发了隐私保护相机服务系统Ruyi2.5-Camera，该系统将Ruyi2.5-Camera实例化为两阶段识别流程：边缘模型采用信息瓶颈引导的不可逆特征映射，在源头对原始帧进行去身份化处理；云端模型则执行深度行为推理。为加速强化学习微调过程，我们进一步提出二元前缀策略优化方法，该方法通过二元响应选择减少样本冗余，并将梯度更新聚焦于响应前缀，相比GRPO实现了2至3倍的训练加速。实验表明，Ruyi2.5在通用多模态基准测试中与Qwen3-VL表现相当，而Ruyi2.5-Camera在隐私受限的监控任务中显著优于Qwen3-VL。

摘要 (Abstract)

We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2’s “Train Once, Deploy Many” paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, which instantiates Ruyi2.5-Camera into a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.

Keywords: multimodal model, shared-backbone architecture, privacy-preserving camera system, edge-cloud pipeline, reinforcement learning fine-tuning, Binary Prefix Policy Optimization, AI Flow framework, Train Once Deploy Many

12. ❌ Text-to-Stage: Spatial Layouts from Long-form Narratives

Authors: Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock, Yuliang Li, W. Owen Brimijoin, Vamsi Krishna Ithapu, Ishwarya Ananthabhotla Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17832v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

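Scoring rationale: The core topic of the paper is the spatial reasoning ability of language models, trained with SFT and RL. It is therefore highly relevant to “Large Language Models” and “Post-training” (10 points each), since it explicitly uses an LLM and trains it with rejection SFT (Best-of-N sampling) and RL (GRPO). Other keywords such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and Quantization are either not mentioned in the abstract or unrelated to the topic of the paper, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a language model can infer stage layouts (scenes, character positions, movements, and room types) from long-form narrative text that lacks explicit spatial cues. By combining rejection SFT with RL from verifiable rewards, it improves over base models on a corpus of classical English literature, and the results agree with an LLM-as-a-judge and with subjective human preferences.

Abstract translation (Chinese)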

本研究探讨了语言模型从非结构化文本中展现空间推理的能力，旨在模拟人类认知并自动化一个对众多下游媒体应用有益的过程。具体而言，我们聚焦于“叙事到剧本”任务：从缺乏明确空间、位置或关系线索的文本中，推断出舞台剧本布局（包括场景、说话者位置、移动轨迹及房间类型）。我们进而引入一套受戏剧学启发的确定性评估体系，并提出一种训练与推理方案，该方案结合了基于Best-of-N采样的拒绝式监督微调，以及通过GRPO实现的可验证奖励强化学习。在仅包含古典英语文学文本的语料库上进行的实验表明，该方法在多项指标（角色归属、空间合理性和移动经济性）上均优于基础模型，并且与基于大语言模型的自动评估及人类主观偏好保持一致。

摘要 (Abstract)

In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.

Keywords: language model, spatial reasoning, narrative-to-play, stage-play layouts, rejection SFT, RL, GRPO, LLM-as-a-judge
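
The abstract names rejection SFT with Best-of-N sampling but does not give the recipe in detail. The sketch below shows the general technique only: sample N candidates per prompt, score each with a deterministic checker, and keep the best one for the SFT set if it clears a threshold. The callables `generate` and `score`, the threshold `min_score`, and the toy generator and scorer are all hypothetical stand-ins, not the paper's model or its dramaturgy-inspired evaluation suite.

```python
import random
from typing import Callable, Optional


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
    min_score: float = 0.5,
) -> Optional[tuple[str, float]]:
    """Sample n candidates and keep the best one, or reject the prompt.

    Returns (best_response, best_score), or None when even the best
    candidate scores below min_score.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_score else None


def build_sft_dataset(prompts, generate, score, n=8, min_score=0.5):
    """Collect the (prompt, response) pairs that survive rejection."""
    dataset = []
    for p in prompts:
        kept = best_of_n(p, generate, score, n=n, min_score=min_score)
        if kept is not None:
            dataset.append((p, kept[0]))
    return dataset


# Toy stand-ins (hypothetical): a random "layout" generator and a
# deterministic scorer that rewards layouts naming a room type.
rng = random.Random(0)
ROOMS = ["kitchen", "parlor", "", ""]


def toy_generate(prompt: str) -> str:
    return f"scene: {rng.choice(ROOMS)}"


def toy_score(prompt: str, response: str) -> float:
    return 1.0 if response.split(": ")[1] else 0.0


data = build_sft_dataset(["chapter one text", "chapter two text"], toy_generate, toy_score, n=4)
```

A deterministic scorer matters here because the same checker can later serve as the verifiable reward in the RL stage.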

13. ❌ Discovering Decoupled Functional Modules in Large Language Models

Authors: Yanke Yu, Jin Li, Ying Sun, Ping Li, Zhefeng Wang, Yi Zheng Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17823v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

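Scoring rationale: The paper studies how functional modules are organized inside large language models (LLMs), which places it within LLM interpretability research. It is highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), since LLMs are the direct object of study. It is also highly relevant to “Mechanistic Interpretability” OR “Explainable AI” (10 points), since its goal is to understand and explain the internal workings of LLMs by discovering functional modules, a core task of explainable AI. The paper does not address the specific techniques described by the other keywords (such as MoE, training methods, inference acceleration, or application domains), so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to discover and disentangle functional modules inside large language models (LLMs). It proposes ULCMOD, an unsupervised framework that identifies semantically coherent, hierarchically organized modules, providing a new tool for LLM interpretability research.

Abstract translation (Chinese)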

理解大型语言模型（LLM）的内部功能组织对于提升其可信度与性能至关重要。然而，LLM如何将不同功能组织到模块中，目前仍鲜有探索。为填补这一空白，我们提出了一个功能模块发现问题，并设计了一种无监督的LLM跨层模块发现（ULCMOD）框架。该框架能够将整个LLM中的大量神经元同时解耦为多个模块，同时发现与这些模块相关的输入样本主题。我们的框架引入了一种新颖的目标函数和一种高效的迭代解耦（IterD）算法。大量实验表明，我们的方法能够发现高质量、解耦的模块，这些模块能捕捉更具意义的语义信息，并在多种下游任务中取得优越性能。此外，我们的定性分析表明，所发现的模块具有语义连贯性，对应着可解释的特定功能，并在LLM内部呈现出清晰的空间与层次化组织。本研究为解释LLM的功能模块提供了一种新工具，填补了LLM可解释性研究中的一个关键空白。

摘要 (Abstract)

Understanding the internal functional organization of Large Language Models (LLMs) is crucial for improving their trustworthiness and performance. However, how LLMs organize different functions into modules remains highly unexplored. To bridge this gap, we formulate a functional module discovery problem and propose an Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD) framework that simultaneously disentangles the large set of neurons in the entire LLM into modules while discovering the topics of input samples related to these modules. Our framework introduces a novel objective function and an efficient Iterative Decoupling (IterD) algorithm. Extensive experiments show that our method discovers high-quality, disentangled modules that capture more meaningful semantic information and achieve superior performance in various downstream tasks. Moreover, our qualitative analysis reveals that the discovered modules show semantic coherence, correspond to interpretable specializations, and a clear spatial and hierarchical organization within the LLM. Our work provides a novel tool for interpreting the functional modules of LLMs, filling a critical blank in LLM’s interpretability research.

Keywords: Large Language Models, LLMs, Functional Modules, Interpretability, Unsupervised Discovery, Module Disentanglement, Neuron Analysis, Mechanistic Interpretability

14. ❌ Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Authors: Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17541v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

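Scoring rationale: The object of study is multimodal large language models (MLLMs), so the paper is highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points). Its core topic is video supervised fine-tuning (Video-SFT), a concrete application of supervised fine-tuning (SFT) to the video modality, so it is highly relevant to “Post-training” OR “Supervised Fine-tuning” OR “SFT” (10 points). The paper mainly examines how Video-SFT affects visual capabilities (spatial and temporal understanding) and studies a Hybrid-Frame strategy to mitigate the image-video trade-off. It does not address the techniques described by the other keywords (such as MoE, quantization, RAG, alignment, reasoning methods, agents, or AI-for-science applications), so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper studies how video supervised fine-tuning (Video-SFT) affects the visual capabilities of multimodal large language models. Video-SFT improves video understanding but often yields limited gains or even degradation on static image understanding. The paper also studies an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates this trade-off.

Abstract translation (Chinese)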

多模态大语言模型（MLLMs）通常通过多阶段训练进行构建，其中基于视频的监督微调（Video-SFT）是提升视觉理解能力的关键步骤。然而，其对视觉能力细粒度演变的影响，尤其是空间理解与时间理解之间的平衡，目前仍不甚明晰。本文系统研究了Video-SFT如何重塑MLLMs的视觉能力。在不同模型架构、参数规模和帧采样设置下，我们观察到一个一致的模式：Video-SFT能稳定提升视频任务性能，但在静态图像基准测试中往往仅带来有限增益，甚至导致性能下降。我们进一步揭示，这种权衡与时间预算密切相关：增加采样帧数通常能改善视频性能，但并不能稳定提升静态图像性能。基于这一发现，我们研究了一种指令感知的混合帧策略（Hybrid-Frame），该策略能自适应分配帧数，并在一定程度上缓解图像与视频任务之间的权衡。我们的结果表明，Video-SFT对MLLMs而言并非免费午餐，在联合图像-视频训练中保持空间理解能力仍是一个核心挑战。

摘要 (Abstract)

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.

Keywords: Multimodal Large Language Models (MLLMs), Video Supervised Fine-tuning (Video-SFT), Spatial Understanding, Temporal Understanding, Image-Video Trade-off, Frame Sampling, Hybrid-Frame Strategy, Visual Capabilities

15. ❌ Topology-Guided Biomechanical Profiling: A White-Box Framework for Opportunistic Screening of Spinal Instability on Routine CT

Authors: Zanting Ye, Xuanbin Wu, Guoqing Zhong, Shengyuan Liu, Jiashuai Liu, Ge Song, Zhisong Wang, Jing Hao, Xiaolong Niu, Yefeng Zheng, Yu Zhang, Lijun Lu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16963v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

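Scoring rationale: The paper proposes TGBP, a medical image analysis framework for screening spinal instability, which is an application of AI in biomedicine. The abstract explicitly states that the framework integrates a large language model (LLM) module, so it is somewhat relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points). The framework emphasizes its “auditable white-box” and “interpretable” properties, which relates to “Mechanistic Interpretability” OR “Explainable AI” (5 points). At its core, the paper applies medical image analysis in the bioinformatics/biomedical domain, so it is highly relevant to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (10 points). The paper does not cover the technical principles of large models, training methods, inference optimization, agents, or other details described by the remaining keywords, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes TGBP, an interpretable white-box framework for automated screening of spinal instability on routine CT. Combining deterministic geometric innovations with an integrated LLM module, it achieved 90.2% accuracy in 3-tier stability triage in multi-center validation, and in a blinded reader study it significantly outperformed medical oncologists on complex structural features.

Abstract translation (Chinese)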

常规肿瘤学计算机断层扫描（CT）为筛查脊柱不稳定性提供了理想机会，但由于脊柱肿瘤不稳定评分（SINS）所需的复杂几何推理，预防性稳定窗口常被错过。转移性骨溶解从根本上阻碍了SINS的自动化，其引发的拓扑模糊性干扰了标准分割方法与黑盒人工智能。我们提出拓扑引导生物力学分析框架（Topology-Guided Biomechanical Profiling, TGBP），这是一个可审计的白盒框架，将解剖感知与结构推理解耦。TGBP将SINS评估锚定于两项确定性几何创新：（i）采用椎管参照分区法以解决后外侧边界模糊问题；（ii）通过基于协方差的定向包围盒（oriented bounding boxes, OBB）进行上下文感知形态测量归一化，以量化椎体塌陷程度。结合辅助影像组学与大语言模型（LLM）模块，TGBP实现了端到端、可解释的SINS评估。在多中心、多癌种队列（$N=482$）验证中，TGBP在三层级稳定性分诊中达到90.2%的准确率。在一项盲法阅片研究（$N=30$）中，TGBP在复杂结构特征评估上显著优于肿瘤内科医生（$\kappa=0.857$ vs. $0.570$），并防止了总分估算中的误差累积（$\kappa=0.625$ vs. $0.207$），从而实现了专家级机会性筛查的普及化。

摘要 (Abstract)

Routine oncologic computed tomography (CT) presents an ideal opportunity for screening spinal instability, yet prophylactic stabilization windows are frequently missed due to the complex geometric reasoning required by the Spinal Instability Neoplastic Score (SINS). Automating SINS is fundamentally hindered by metastatic osteolysis, which induces topological ambiguity that confounds standard segmentation and black-box AI. We propose Topology-Guided Biomechanical Profiling (TGBP), an auditable white-box framework decoupling anatomical perception from structural reasoning. TGBP anchors SINS assessment on two deterministic geometric innovations: (i) canal-referenced partitioning to resolve posterolateral boundary ambiguity, and (ii) context-aware morphometric normalization via covariance-based oriented bounding boxes (OBB) to quantify vertebral collapse. Integrated with auxiliary radiomic and large language model (LLM) modules, TGBP provides an end-to-end, interpretable SINS evaluation. Validated on a multi-center, multi-cancer cohort ($N=482$), TGBP achieved 90.2% accuracy in 3-tier stability triage. In a blinded reader study ($N=30$), TGBP significantly outperformed medical oncologists on complex structural features ($\kappa=0.857$ vs. $0.570$) and prevented compounding errors in Total Score estimation ($\kappa=0.625$ vs. $0.207$), democratizing expert-level opportunistic screening.

Keywords: Spinal Instability Screening, Computed Tomography (CT), White-Box Framework, Topology-Guided Biomechanical Profiling, Large Language Model (LLM), Interpretable AI, Medical Imaging Analysis, Oncologic Radiology
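
The abstract mentions covariance-based oriented bounding boxes (OBB) without giving the construction. The sketch below illustrates the general idea in a simplified 2D setting: the box axes are the principal directions of the point covariance, and the box sides are the ranges of the points projected onto those axes. It is an assumption-laden illustration, not TGBP's implementation; a real CT pipeline would work on 3D voxel coordinates, and the function name `oriented_bounding_box` is hypothetical.

```python
import math


def oriented_bounding_box(points):
    """Covariance-based (PCA) oriented bounding box for 2D points.

    Returns (center, angle, extent_major, extent_minor). The angle is the
    direction of the principal axis in radians; the extents are the full
    side lengths along the principal and the orthogonal axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix.
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form direction of the leading eigenvector of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    ux, uy = math.cos(angle), math.sin(angle)  # principal axis
    vx, vy = -uy, ux                           # orthogonal axis
    # Project the centered points onto both axes and take the ranges.
    pu = [(x - mx) * ux + (y - my) * uy for x, y in points]
    pv = [(x - mx) * vx + (y - my) * vy for x, y in points]
    mid_u = 0.5 * (max(pu) + min(pu))
    mid_v = 0.5 * (max(pv) + min(pv))
    center = (mx + mid_u * ux + mid_v * vx, my + mid_u * uy + mid_v * vy)
    return center, angle, max(pu) - min(pu), max(pv) - min(pv)


# Usage: a 4 x 2 rectangle rotated by 30 degrees and shifted to (5, -3).
theta = math.radians(30)
c, s = math.cos(theta), math.sin(theta)
corners = [(2, 1), (2, -1), (-2, -1), (-2, 1)]
rotated = [(x * c - y * s + 5, x * s + y * c - 3) for x, y in corners]
center, angle, major, minor = oriented_bounding_box(rotated)
```

Because the box follows the orientation of the shape, a ratio such as minor over major extent stays comparable across differently tilted structures, which is one plausible reason to normalize morphometry this way.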

16. ❌ Evaluating Ill-Defined Tasks in Large Language Models

Authors: Yi Zhou, Basel Shbita Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17067v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

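Scoring rationale: The core topic of the paper is the evaluation of large language models (LLMs) on ill-defined tasks, so it is highly relevant to “Large Language Models” (10 points). It covers the evaluation of instruction following, which is somewhat relevant to “Instruction Tuning” (5 points), but it does not discuss alignment techniques in depth. Other keywords such as MoE, SLMs, training methods, reasoning techniques, agent systems, and AI-for-science applications are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper analyzes the limitations of current LLM evaluation on ill-defined tasks, shows that existing benchmarks and metrics fail to provide reliable diagnostic signals, and calls for more robust, interpretable evaluation designs.

Abstract translation (Chinese)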

当前针对大语言模型（LLM）的评估多聚焦于本质上定义不明确的任务，这类任务通常具有模糊的输入输出空间与不清晰的成功标准。本文分析了为何现有的评估基准与指标未能为此类任务提供可靠或具有诊断意义的模型能力信号。我们通过两个案例研究展开探讨：在复杂指令遵循（CIF）任务中，我们识别出反复出现的问题，包括对真实世界指令复杂度的覆盖有限、对指令表述的敏感性、评估指标不一致且不可比，以及基于LLM的评判器引入的不稳定性；在自然语言转Mermaid序列图（NL2Mermaid）任务中，我们展示了多维度评估标准如何能够提供超越综合分数的、可指导实践的具体洞见。这些案例共同表明，当前的评估方法常常混淆不同的失败模式，导致所得分数不稳定、缺乏诊断性且难以指导改进。我们的研究结果揭示了现有针对定义不明确任务的评估实践存在根本性局限，并呼吁设计更具鲁棒性和可解释性的评估方案。

摘要 (Abstract)

Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.

Keywords: Large Language Models, Evaluation, Ill-defined Tasks, Complex Instruction Following, Benchmarks, Metrics, Diagnostic Signals, Evaluation Design

Authors: Michel Schimpf, Julian Voigt, Thomas Bohné Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17887v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

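Scoring rationale: The paper studies the effect of AI-assisted goal setting on goal progress. Its core is the use of a large language model (LLM) as a career coach (the AI career coach “Leon”, powered by Claude Sonnet), which makes it an application of large models in psychology and behavioral science. The paper explicitly mentions LLM-based chatbots, so it is highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points). The other keywords mainly concern the technical principles of large models (such as MoE, Scaling Laws, training methods, and inference optimization) or applications in specific scientific fields (such as bioinformatics). The paper does not cover these technical details or fields, so their relevance is 0.

!!! tip deepseek-chat TL;DR

In a randomized controlled trial, the study finds that an LLM-based AI career coach (Leon) significantly improved short-term goal progress relative to a no-support control. Compared with structured written reflection, it did not significantly improve overall goal progress, but it increased perceived social accountability, which mediated its effect relative to the questionnaire.

Abstract translation (Chinese)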

如何大规模帮助人们识别并追求具有个人意义的职业目标，仍是应用心理学领域的一个关键挑战。职业辅导能够提升目标质量与达成率，但其成本高昂且可及性有限，制约了普及范围。基于大语言模型（LLM）的聊天机器人提供了一种可扩展的替代方案，然而其支持目标追求的心理机制尚未得到验证。本文报告了一项预先注册的三臂随机对照试验（N = 517），比较了人工智能职业教练（“Leon”，基于Claude Sonnet模型驱动）、一份内容高度匹配的书面结构化反思问卷，以及无干预对照组在两周随访期的目标进展。结果显示，人工智能聊天机器人带来的目标进展显著高于对照组（d = 0.33, p = .016）。与书面反思条件相比，人工智能并未显著提升整体目标进展，但它增强了参与者感知到的社会问责感。在预先注册的中介模型中，感知到的问责感中介了人工智能相对于问卷条件对目标进展的影响（间接效应 = 0.15, 95% CI [0.04, 0.31]），而自我一致性则未显示中介作用。这些发现表明，人工智能辅助的目标设定能够改善短期目标进展，且相较于结构化自我反思，其最明确的附加价值在于增强了感知到的问责感。

摘要 (Abstract)

Helping people identify and pursue personally meaningful career goals at scale remains a key challenge in applied psychology. Career coaching can improve goal quality and attainment, but its cost and limited availability restrict access. Large language model (LLM)-based chatbots offer a scalable alternative, yet the psychological mechanisms by which they might support goal pursuit remain untested. Here we report a preregistered three-arm randomised controlled trial (N = 517) comparing an AI career coach (“Leon,” powered by Claude Sonnet), a matched structured written questionnaire covering closely matched reflective topics, and a no-support control on goal progress at a two-week follow-up. The AI chatbot produced significantly higher goal progress than the control (d = 0.33, p = .016). Compared with the written-reflection condition, the AI did not significantly improve overall goal progress, but it increased perceived social accountability. In the preregistered mediation model, perceived accountability mediated the AI-over-questionnaire effect on goal progress (indirect effect = 0.15, 95% CI [0.04, 0.31]), whereas self-concordance did not. These findings suggest that AI-assisted goal setting can improve short-term goal progress, and that its clearest added value over structured self-reflection lies in increasing felt accountability.

Keywords: AI career coach, large language model, LLM-based chatbot, goal progress, social accountability, randomized controlled trial, Claude Sonnet, career coaching
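
For readers unfamiliar with the effect size reported in the abstract (d = 0.33), the sketch below computes Cohen's d using the common pooled-standard-deviation definition. Whether the trial used exactly this variant is an assumption, and the ratings in the example are made up for illustration; they are not data from the study.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical goal-progress ratings, not data from the trial.
ai_coach = [5, 6, 4, 7, 5, 6]
no_support = [4, 5, 4, 6, 4, 5]
d = cohens_d(ai_coach, no_support)
```

By the usual rule of thumb, values around 0.2 are read as small and around 0.5 as medium, so the reported 0.33 sits between the two.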

18. ❌ A Contextual Help Browser Extension to Assist Digital Illiterate Internet Users

Authors: Christos Koutsiaris Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17592v1

Score: 8.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

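Rationale: The paper presents a browser extension that combines a dictionary with an OpenAI large language model (LLM) to give users with low digital literacy contextual help on technical terms. Because the LLM (OpenAI's ChatGPT) is central to generating definitions, the paper is highly relevant to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” and scores 8: it applies an LLM but does not examine LLM internals or propose LLM innovations. The remaining keywords concern specific technical directions (model internals, training methods, inference optimization, agent systems, scientific applications). The paper uses the LLM only as an off-the-shelf component in an application and addresses none of these, so their relevance is 0.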
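
The 8.0 / 26.6 result can be reproduced from the scoring settings in the report's configuration section (pass mark = 5.0 + 0.8 × total weight, 27 keywords of weight 1.0 each). The sketch below is only an illustration: it assumes each row's score is weight × relevance, which matches the table rows but is not stated explicitly, and the function names are hypothetical rather than taken from the report generator.

```python
# Hypothetical helpers; names are illustrative, not from the report generator.

def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark = base + factor * total weight (formula from the report's configuration)."""
    return base + factor * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed row formula)."""
    return sum(weight * relevance for weight, relevance in rows)


# Paper 18: one keyword at relevance 8.0, the other 26 at 0.0, all weights 1.0.
rows = [(1.0, 8.0)] + [(1.0, 0.0)] * 26
score = total_score(rows)
threshold = pass_mark(sum(weight for weight, _ in rows))
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under the same assumed row formula, a row with weight 0.0 contributes 0.0 regardless of its relevance.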

!!! tip deepseek-chat TL;DR

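The paper presents a browser extension that combines a dictionary with an OpenAI large language model to help users with low digital literacy understand technical terms on web pages. In a 25-participant study, 92% of participants reported improved understanding and 96% confirmed time savings over manual web searches.

Abstract translation (Chinese)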

本文阐述了一款浏览器扩展程序的设计、实现与评估。该扩展能为用户在网页上悬停于技术缩写词时提供情境化帮助。它结合了精选技术词典与OpenAI大语言模型（LLM），通过轻量级工具提示浮层提供按需定义。系统采用双层人工智能（AI）处理流程：首先通过谷歌云自然语言处理（NLP）分类API与OpenAI的ChatGPT对访问页面进行技术相关性判定，随后才激活提示逻辑，从而降低误判率。一项针对25名中低数字素养用户的混合方法研究评估了该工具对阅读理解与信息检索时间的影响。结果显示：92%的参与者表示对技术术语的理解得到提升，96%的参与者确认其节省了手动网页搜索时间，所有参与者均认为工具提示未造成干扰。词典定义的平均生成时间为2135毫秒，AI生成定义需16429毫秒，而手动搜索每个缩写词的平均耗时则为17200毫秒。本研究展示了一种弥合数字素养鸿沟的实用实时方案，并为将情境化帮助扩展至医学、法律、金融等其他领域指明了方向。

摘要 (Abstract)

This paper describes the design, implementation, and evaluation of a browser extension that provides contextual help to users who hover over technological acronyms and abbreviations on web pages. The extension combines a curated technical dictionary with OpenAI’s large language model (LLM) to deliver on-demand definitions through lightweight tooltip overlays. A dual-layer artificial intelligence (AI) pipeline, comprising Google Cloud’s Natural Language Processing (NLP) taxonomy API and OpenAI’s ChatGPT, classifies each visited page as technology-related before activating the tooltip logic, thereby reducing false-positive detections. A mixed-methods study with 25 participants evaluated the tool’s effect on reading comprehension and information-retrieval time among users with low to intermediate digital literacy. Results show that 92% of participants reported improved understanding of technical terms, 96% confirmed time savings over manual web searches, and all participants found the tooltips non-disruptive. Dictionary-based definitions were appended in an average of 2135 ms, compared to 16429 ms for AI-generated definitions and a mean manual search time of 17200 ms per acronym. The work demonstrates a practical, real-time approach to bridging the digital literacy gap and points toward extending contextual help to other domains such as medicine, law, and finance.

Keywords: browser extension, contextual help, digital literacy, large language model, LLM, technical acronyms, tooltip overlays, OpenAI ChatGPT
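
The lookup flow described in the abstract (page classification first, then dictionary or AI-generated definitions) can be sketched roughly as follows. This is an assumption-laden illustration, not the extension's code: the abstract does not say the dictionary is consulted before the LLM, the dictionary entries are placeholders, and `lookup_definition` and `fake_llm` are hypothetical names. No real OpenAI or Google Cloud call is made; the model is replaced by a stub.

```python
from typing import Callable, Optional, Tuple

# Hypothetical curated dictionary; the entries are placeholders.
CURATED_DICTIONARY = {
    "API": "Application Programming Interface",
    "DNS": "Domain Name System",
}


def lookup_definition(
    term: str,
    page_is_technical: bool,
    llm_define: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Return (source, definition), or None when tooltips are inactive."""
    # Gate: tooltips only activate on pages classified as technology-related.
    if not page_is_technical:
        return None
    key = term.strip().upper()
    # Assumed ordering: try the fast curated dictionary before the slower model.
    if key in CURATED_DICTIONARY:
        return ("dictionary", CURATED_DICTIONARY[key])
    return ("llm", llm_define(term))


def fake_llm(term: str) -> str:
    """Stub standing in for a model call."""
    return f"(model-generated definition of {term})"


print(lookup_definition("api", True, fake_llm))
print(lookup_definition("GPU", True, fake_llm))
print(lookup_definition("API", False, fake_llm))
```

On the averages reported in the abstract, the dictionary path saves 17200 − 2135 = 15065 ms per acronym over a manual search, while the AI-generated path saves 17200 − 16429 = 771 ms.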

19. ❌ The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Authors: Yigit Ekin, Yossi Gandelsman
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17998v1

Score: 8.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

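Rationale: The paper proposes an image-editing method based on text-embedding interpolation. Its core step uses a large language model (LLM) to automatically construct debiased contrastive prompt pairs, from which a steering vector is computed in the text-encoder space. It is therefore highly relevant only to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points), since the LLM serves as a tool for generating prompts. The other keywords cover model architecture, training methods, inference techniques, alignment, compression, and scientific applications, none of which the paper addresses, so they score 0.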

!!! tip deepseek-chat TL;DR

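The paper studies training-free, continuous, controllable image editing. It uses a large language model to build prompt pairs automatically and interpolates in the text-embedding space, giving a lightweight method whose continuous editing is comparable to training-based approaches.

Abstract translation (Chinese)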

我们提出一种无需训练的框架，用于在测试时对文本条件生成模型进行连续且可控的图像编辑。与先前依赖额外训练或人工干预的方法不同，我们发现只需在文本嵌入空间中进行简单引导即可实现平滑的编辑控制。给定一个目标概念（例如增强照片真实感或改变面部表情），我们使用大型语言模型自动构建一组小型去偏对比提示词对，并据此在生成器的文本编码器空间中计算一个引导向量。随后，我们将该向量直接添加到输入提示的表征中，以沿目标语义轴控制生成过程。为实现连续控制，我们提出一种弹性范围搜索程序，可自动识别引导强度的有效区间，避免引导不足（未编辑）和过度引导（改变其他属性）。在该区间内添加同一向量的缩放版本即可产生平滑连续的编辑效果。由于我们的方法仅修改文本表征，其自然可泛化至多种文本条件模态，包括图像和视频生成。为量化引导的连续性，我们引入一种新的评估指标，用于衡量不同编辑强度下语义变化的均匀性。通过比较各方法的连续编辑行为，我们发现尽管本方法设计简单轻量，其性能仍可与基于训练的方法相媲美，并优于其他无需训练的方法。

摘要 (Abstract)

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator’s text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.

Keywords: text embedding interpolation, continuous image editing, training-free framework, large language model, text-conditioned generative models, steering vector, elastic range search, semantic control
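
The steering step in the abstract (compute a vector from contrastive prompt pairs, then add scaled copies of it to the prompt representation) can be illustrated with a toy sketch. The abstract does not say how the vector is computed, so the mean of paired embedding differences used here is an assumption. The two-dimensional embeddings and the strength values are made up, and the elastic range search that selects the valid strength interval is not implemented.

```python
def steering_vector(positive: list[list[float]], negative: list[list[float]]) -> list[float]:
    """Mean difference between paired embeddings (assumed construction)."""
    dim = len(positive[0])
    pairs = list(zip(positive, negative))
    return [sum(p[i] - n[i] for p, n in pairs) / len(pairs) for i in range(dim)]


def steer(prompt_embedding: list[float], direction: list[float], strength: float) -> list[float]:
    """Add a scaled copy of the steering vector to the prompt embedding."""
    return [x + strength * d for x, d in zip(prompt_embedding, direction)]


# Toy embeddings for two contrastive prompt pairs (placeholder values).
positive = [[1.0, 0.0], [3.0, 2.0]]
negative = [[0.0, 0.0], [1.0, 2.0]]
direction = steering_vector(positive, negative)

# Sweep hypothetical strengths; the paper's range search would pick this interval.
edits = [steer([0.0, 1.0], direction, s) for s in (0.0, 1.0, 2.0)]
print(direction, edits)
```

Because each edit differs from the next by the same multiple of one vector, the sweep moves along a single line in embedding space, which is what makes the edit strength a continuous control.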

20. ❌ AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Authors: Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.18000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

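Rationale: The paper proposes AgentFactory, a self-evolving system for LLM-based agents that accumulates capabilities by saving successful task solutions as executable subagent code rather than as textual experience. It is highly relevant to “LLM Agents” (10 points) and involves a “Self-Improvement” mechanism (10 points). “Tool Use” and “Multi-agent Systems” are partly relevant (5 points each), because subagents are invoked and coordinated as tools. Other keywords, such as MoE, quantization, and inference acceleration, are not addressed and score 0.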

!!! tip deepseek-chat TL;DR

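The study addresses the unreliable task re-execution of LLM-based agents in complex scenarios. It proposes AgentFactory, a framework that saves successful task solutions as executable subagent code, enabling continuous capability accumulation and self-evolution.

Abstract translation (Chinese)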

构建基于大语言模型（LLM）的智能体正变得日益重要。当前基于LLM的智能体自我进化研究主要将成功经验记录为文本提示或反思，这难以在复杂场景中可靠保证任务的高效重复执行。我们提出AgentFactory，一种新的自我进化范式，它将成功的任务解决方案保存为可执行的子智能体代码而非文本经验。关键在于，这些子智能体会根据执行反馈持续优化，随着处理更多任务而变得日益稳健和高效。保存的子智能体是纯Python代码并配有标准化文档，使其能够在任何支持Python的系统中移植。我们证明，AgentFactory实现了持续的能力积累：其可执行子智能体库随时间不断增长和完善，逐步减少处理类似任务所需的工作量，且无需人工干预。我们的实现已在https://github.com/zzatpku/AgentFactory开源，演示视频可在https://youtu.be/iKSsuAXJHW0查看。

摘要 (Abstract)

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.

Keywords: LLM-based agents, self-evolution, executable subagents, continuous capability accumulation, task re-execution, Python code, agent framework, autonomous improvement
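
The idea of keeping solutions as executable Python code with documentation, and refining them after feedback, can be illustrated with a toy store. This is not AgentFactory's API: `SubagentLibrary`, its methods, and the convention that each subagent exposes a `run` function are all hypothetical, and refinement is reduced to overwriting an entry by hand.

```python
class SubagentLibrary:
    """Hypothetical in-memory store of subagents kept as Python source text."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, str]] = {}

    def save(self, name: str, source: str, doc: str) -> None:
        # Saving under an existing name overwrites it, standing in for refinement.
        self._entries[name] = {"source": source, "doc": doc}

    def run(self, name: str, *args):
        # Execute the stored source and call its run() entry point (assumed convention).
        namespace: dict = {}
        exec(self._entries[name]["source"], namespace)
        return namespace["run"](*args)


library = SubagentLibrary()

# First version: splits on single spaces, so repeated spaces inflate the count.
library.save("word_count", "def run(text):\n    return len(text.split(' '))\n", "Count words.")
first = library.run("word_count", "a  b")

# Refined version after observing the miscount.
library.save("word_count", "def run(text):\n    return len(text.split())\n", "Count words; tolerates repeated whitespace.")
second = library.run("word_count", "a  b")
print(first, second)
```

`exec` runs arbitrary code, so a sketch like this is only safe with trusted source; a real system would need isolation around execution.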

21. ❌ Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.18002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

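Rationale: The paper proposes Loc3R-VLM, a framework that strengthens the 3D understanding of 2D vision-language models (VLMs) through global layout reconstruction and explicit situation modeling. It is highly relevant to “Large Language Models” (10 points), since a VLM extends an LLM. It relates to “Pre-training” and “Post-training” (5 points each) because it extracts priors from a pre-trained 3D foundation model and involves fine-tuning. It relates to “Chain of Thought” and “System 2 Thinking” (5 points each) through its emphasis on 3D spatial reasoning and in-depth understanding, and to “World Models” (5 points) through its scene-structure representation. Other keywords, such as MoE, SLMs, and RAG, are not addressed and score 0.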

!!! tip deepseek-chat TL;DR

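The paper addresses the weaknesses of multimodal large language models in spatial understanding and viewpoint-aware reasoning. It introduces Loc3R-VLM, which uses global layout reconstruction and explicit situation modeling to reach state-of-the-art performance in language-based localization and to outperform existing 2D- and video-based approaches on 3D question-answering benchmarks.

Abstract translation (Chinese)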

多模态大语言模型（MLLMs）在连接视觉与语言方面取得了显著进展，但在空间理解和视点感知推理方面仍面临挑战。近期研究致力于通过几何线索增强输入表征，而非明确教导模型进行三维空间推理。本文提出Loc3R-VLM框架，该框架使二维视觉语言模型能够从单目视频输入中获取先进的三维理解能力。受人类空间认知启发，Loc3R-VLM依赖两个联合目标：通过全局布局重建构建场景结构的整体表征，以及通过显式情境建模锚定自我中心视角。这些目标提供了直接的空间监督，将感知与语言共同置于三维语境中。为确保几何一致性和度量尺度对齐，我们利用从预训练三维基础模型中提取的轻量级相机位姿先验。Loc3R-VLM在基于语言的定位任务中达到最先进性能，并在情境化及通用三维问答基准测试中超越现有基于二维和视频的方法，证明我们的空间监督框架能够实现强大的三维理解能力。项目页面：https://kevinqu7.github.io/loc3r-vlm

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

Keywords: Multimodal Large Language Models, 3D understanding, spatial reasoning, vision-language models, monocular video, global layout reconstruction, situation modeling, language-based localization

22. ❌ Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.18004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

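Rationale: The paper proposes STTS, a spatio-temporal token pruning method for video vision-language models (Video VLMs), whose core contribution is improving the computational efficiency of the LLM in video tasks. It is therefore highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points), because it directly studies how to make the LLM more efficient within a video VLM architecture. It is also relevant to “Quantization OR Model Compression OR Low-bit Weights” and “Speculative Decoding OR Inference Acceleration” (8 points each), because token pruning is a form of model compression and inference acceleration, although the paper does not cover quantization or speculative decoding specifically. Other keywords, such as MoE, SFT, RAG, and CoT, are unrelated to the paper (0 points).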

!!! tip deepseek-chat TL;DR

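The paper proposes Spatio-Temporal Token Scoring (STTS), a method for unified pruning of vision tokens in video vision-language models. It improves training and inference efficiency by 62% with only a 0.7% drop in average performance across 13 video QA tasks.

Abstract translation (Chinese)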

令牌剪枝对于提升视觉语言模型（VLMs）的计算效率至关重要，尤其在视频任务中，时间冗余现象普遍存在。先前的方法通常仅在视觉变换器（ViT）内部进行令牌剪枝，专门用于动作识别和物体分割等单模态感知任务，而未适配下游视觉语言任务；或者仅在大型语言模型（LLM）内部剪枝，而保持ViT输出完整，这往往需要复杂的文本条件令牌选择机制。本文提出时空令牌评分（STTS），这是一个简单轻量的模块，可在无需文本条件或令牌合并的情况下，对ViT和LLM中的视觉令牌进行剪枝，并完全兼容端到端训练。通过辅助损失学习时间维度的评分，并借助LLM下游梯度学习空间维度的评分，辅以我们高效的打包算法，STTS在整个架构中剪枝50%的视觉令牌，在训练和推理阶段实现62%的效率提升，同时在13项长短视频问答任务中平均性能仅下降0.7%。随着每视频采样帧数的增加，效率增益进一步提升。在长视频问答任务中应用测试时缩放技术，相比基线模型可额外获得0.5-1%的性能提升。总体而言，STTS代表了一种新颖、简单而有效的技术，实现了统一、全架构范围的视觉令牌剪枝。

摘要 (Abstract)

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.

Keywords: token pruning, vision-language models, video-based tasks, computational efficiency, spatio-temporal token scoring, end-to-end training, inference acceleration, model compression
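
Score-based token pruning at a 50% ratio can be illustrated with a short sketch. In STTS the scores are learned (temporally through an auxiliary loss, spatially through LLM gradients); here they are simply given as numbers. Top-k selection by score is an assumed selection rule, not one stated in the abstract, and `prune_tokens` is a hypothetical name.

```python
def prune_tokens(scores: list[float], keep_ratio: float = 0.5) -> list[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay a valid sequence.
    return sorted(ranked[:keep])


# Placeholder scores for four vision tokens.
scores = [0.1, 0.9, 0.4, 0.7]
kept = prune_tokens(scores)
print(kept)
```

Keeping the surviving indices in positional order matters because downstream layers still rely on the spatial and temporal arrangement of the tokens that remain.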

23. ❌ Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Authors: Amine Lbath
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

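Rationale: The paper focuses on automated benchmark generation and dataset creation for software vulnerability detection. It touches on learning-based vulnerability detection and adversarial co-evolution, but does not mention large models, deep-learning internals, or concrete AI for Science applications. All keywords concern large-model techniques, deep-learning principles, or AI applications in specific scientific fields, whereas this paper belongs to software engineering security and has no direct connection to them.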

!!! tip deepseek-chat TL;DR

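The study proposes an automated benchmark generation method that injects vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability exploits, producing precisely labeled datasets for repository-level vulnerability detection agents. It also explores adversarial co-evolution between injection and detection agents to improve robustness.

Abstract translation (Chinese)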

软件漏洞的数量持续增长，且在实践中仍难以检测。尽管基于学习的漏洞检测技术已取得进展，但现有基准测试主要围绕函数层面展开，未能捕捉真实、可执行、跨过程调用的实际场景。近期出现的仓库级安全基准测试证明了真实环境的重要性，但其人工构建方式限制了规模。本博士研究提出一种自动化基准测试生成器，它能将真实漏洞注入现实世界的代码仓库，并合成可复现的漏洞验证（PoV）利用代码，从而为训练和评估仓库级漏洞检测智能体提供精确标注的数据集。我们进一步研究了漏洞注入与检测智能体之间的对抗性协同进化循环，以提升其在真实约束下的鲁棒性。

摘要 (Abstract)

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

Keywords: software vulnerability detection, automated benchmark generator, repo-level datasets, proof-of-vulnerability exploits, adversarial co-evolution, learning-based detection, repository-level security, vulnerability injection

24. ❌ Specification-Aware Distribution Shaping for Robotics Foundation Models

Authors: Sadık Bera Yüksel, Derya Aksaray
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

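Rationale: The paper focuses on robotics foundation models, which fall under Foundation Models, so it is highly relevant to the first keyword (8 points). It uses a pretrained robotics foundation model, which relates to pre-training (5 points). Other keywords, such as MoE, SLMs, SFT, RLHF, RAG, inference acceleration, and hallucination mitigation, are not addressed and score 0. Although the paper concerns robotics, it does not clearly fall under the bioinformatics or cheminformatics side of AI for Science, so that keyword also scores 0.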

!!! tip deepseek-chat TL;DR

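The paper addresses the lack of formal safety guarantees when robotics foundation models execute natural-language instructions. It proposes a specification-aware action distribution optimization framework that enforces Signal Temporal Logic constraints without modifying model parameters, and validates it in simulation.

Abstract translation (Chinese)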

机器人基础模型在执行多样化任务与跨环境自然语言指令方面已展现出强大能力。然而，这些模型在很大程度上仍依赖于数据驱动，在部署过程中缺乏对安全性及时间相关规约满足性的形式化保证。实践中，机器人常需遵循包含丰富时空要求的操作约束，例如限时目标访问、序列化任务目标以及持续性安全条件。本研究提出一种规约感知的动作分布优化框架，该框架能在不修改预训练机器人基础模型参数的前提下，在执行过程中强制实施广泛类型的信号时序逻辑（Signal Temporal Logic，STL）约束。在每一个决策步骤中，该方法通过基于前向动力学传播对剩余时间范围进行推演，计算出满足严格STL可行性约束的最小修改动作分布。我们在仿真环境中使用先进的机器人基础模型，通过多种环境与复杂规约验证了所提出框架的有效性。

摘要 (Abstract)

Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.

Keywords: Robotics Foundation Models, Specification-Aware, Signal Temporal Logic, Action Distribution Optimization, Safety Guarantees, Pre-trained Models, Forward Dynamics Propagation, Simulation Validation
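
A toy version of "minimally modify the action distribution so that a hard feasibility constraint holds" is sketched below under heavy assumptions that the abstract does not state: a discrete action set, a 1-D robot whose forward dynamics are `x + action`, a single time-bounded goal-visit requirement, and KL divergence as the measure of modification. Under those assumptions, the closest distribution that puts zero mass on infeasible actions is the original one restricted to feasible actions and renormalized. All names are hypothetical.

```python
def is_feasible(position: int, action: int, goal: int, steps_left: int) -> bool:
    """Can the goal still be reached in time after taking this action?

    Forward dynamics (assumed): next position = position + action, and each
    later step moves at most one cell, so the goal must lie within
    steps_left - 1 cells of the next position.
    """
    return abs((position + action) - goal) <= steps_left - 1


def shape_distribution(probs: dict[int, float], position: int, goal: int, steps_left: int) -> dict[int, float]:
    """Zero out infeasible actions and renormalize the remaining mass."""
    feasible = {a: is_feasible(position, a, goal, steps_left) for a in probs}
    mass = sum(p for a, p in probs.items() if feasible[a])
    if mass == 0.0:
        raise ValueError("no feasible action has probability mass")
    return {a: (p / mass if feasible[a] else 0.0) for a, p in probs.items()}


# Placeholder policy output over actions {-1, 0, +1}; the goal is 2 cells away.
policy = {-1: 0.5, 0: 0.3, 1: 0.2}
relaxed = shape_distribution(policy, position=0, goal=2, steps_left=3)
tight = shape_distribution(policy, position=0, goal=2, steps_left=2)
print(relaxed, tight)
```

With three steps left, only moving away from the goal is ruled out and the remaining probabilities keep their ratio; with two steps left, moving toward the goal is the only feasible action and takes all the mass.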

25. ❌ TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Authors: Pepe Alonso
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17973v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

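Rationale: The paper studies regressions introduced by AI coding agents. It develops the TDAD tool and benchmark methodology, which use AST-based code-test graph construction and weighted impact analysis to reduce test regressions. The most relevant keywords are LLM Agents (highly relevant, since AI agents are the paper's core subject), Retrieval-Augmented Generation (relevant, since the paper describes a GraphRAG workflow), Self-Correction (relevant, since the paper includes an agent self-improvement loop), and Tool Use (relevant, since the paper studies deployment as an agent skill). LLMs is partly relevant (the paper uses Qwen models), but most keywords (such as MoE, Scaling Laws, and PEFT) have no direct connection to the paper.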

!!! tip deepseek-chat TL;DR

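AI coding agents often introduce regressions, breaking previously passing tests, when they resolve software issues. The paper proposes TDAD (Test-Driven Agentic Development), a tool and methodology that uses AST-based code-test graph construction and weighted impact analysis, reducing test-level regressions by 70% and raising the resolution rate from 24% to 32%.

Abstract translation (Chinese)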

AI编码代理能够解决现实软件问题，但常引发回归缺陷，导致先前通过的测试失败。现有基准几乎只关注解决率，对回归行为的研究严重不足。本文提出TDAD（测试驱动智能体开发），这是一种开源工具与基准方法，它结合基于抽象语法树（AST）的代码-测试图构建与加权影响分析，以揭示最可能受代码变更影响的测试。通过在SWE-bench Verified数据集上使用两个本地模型（Qwen3-Coder 30B测试100个实例，Qwen3.5-35B-A3B测试25个实例）进行评估，TDAD的GraphRAG工作流将测试级回归降低了70%（从6.08%降至1.82%），当作为智能体技能部署时，解决率从24%提升至32%。一个意外发现是，仅采用测试驱动开发（TDD）提示反而增加了回归率（9.94%），这表明较小模型从上下文信息（应验证哪些测试）中获得的收益大于从流程指令（如何执行TDD）中获得的收益。在10个实例的子集上，自主自动改进循环将解决率从12%提升至60%，且回归率为0%。这些发现表明，在AI智能体工具设计中，呈现上下文信息优于规定流程工作流。所有代码、数据与日志已公开于https://github.com/pepealonso95/TDAD。

摘要 (Abstract)

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD’s GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.

Keywords: AI coding agents, test-driven development, regression reduction, graph-based impact analysis, agentic workflow, GraphRAG, autonomous improvement, code-test graph
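
The idea of an AST-based code-test graph with weighted impact analysis can be illustrated with a small sketch. This is not TDAD's implementation: the helper names (`build_code_test_graph`, `rank_impacted_tests`), the per-function weights, and the rule "a test is linked to every name it calls directly" are all assumptions made for illustration, using only Python's standard `ast` module.

```python
import ast

def called_names(func_node):
    """Return the names of functions called anywhere inside func_node."""
    names = set()
    for node in ast.walk(func_node):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names

def build_code_test_graph(test_source):
    """Map each top-level test function to the set of names it calls."""
    tree = ast.parse(test_source)
    return {
        node.name: called_names(node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    }

def rank_impacted_tests(graph, changed, weights=None):
    """Score each test by the summed weight of the changed functions it calls."""
    weights = weights or {}
    scores = {}
    for test, callees in graph.items():
        score = sum(weights.get(name, 1.0) for name in callees & changed)
        if score > 0:
            scores[test] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical test file used only for this example.
TEST_SOURCE = '''
def test_parse():
    assert parse("1") == 1

def test_render():
    assert render(parse("1")) == "1"

def test_unrelated():
    assert helper() is None
'''

graph = build_code_test_graph(TEST_SOURCE)
ranking = rank_impacted_tests(graph, {"parse", "render"}, {"render": 2.0})
print(ranking)
```

In this toy setup, a change touching `parse` and `render` surfaces `test_render` ahead of `test_parse` and leaves `test_unrelated` out, which is the kind of "which tests to verify" context the abstract describes.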

26. ❌ VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17948v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

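Scoring rationale: VideoAtlas tackles the long-context challenge in video understanding and proposes a hierarchical grid representation (VideoAtlas) together with the Video-RLM architecture. Relevance by keyword: (1) Highly relevant (10): 'Context Window Extension OR Long Context LLMs', because the paper's core problem is long video context and it extends RLMs to the visual domain. (2) Strongly relevant (8): 'Large Language Models OR LLMs OR Foundation Models', because the work builds on language models (RLMs); 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination', because Video-RLM uses a Master-Worker multi-agent architecture for coordinated exploration. (3) All other keywords (0): the paper does not cover MoE, SLMs, training techniques, inference optimization, or AI-for-science applications; it focuses on video representation and a navigation paradigm.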
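
The score tables in this report list a weight of 0.0 for every keyword, which would explain why papers with high relevance values still total 0.0. The sketch below reproduces the pass threshold from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0) and shows the effect of zero weights. The per-keyword rule used here (weight × relevance, capped at the configured maximum of 15) is an assumption; the report does not state how individual scores are computed.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass threshold as stated in the report configuration."""
    return base + slope * total_weight

def total_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores.

    ASSUMPTION: score = weight * relevance, capped at max_per_keyword.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)

# (weight, relevance) pairs for the non-zero relevance rows of the VideoAtlas table.
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 8.0), (0.0, 8.0)]

threshold = pass_threshold(27.0)
score = total_score(rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Under that assumption, restoring the configured weight of 1.0 would give this paper 34.0, above the 26.6 threshold, so the zero weights rather than the relevance values drive the failing result.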

!!! tip deepseek-chat TL;DR

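The paper addresses the representation-fidelity and long-context challenges of video understanding. It proposes VideoAtlas, a hierarchical grid representation, and Video-RLM, a multi-agent architecture, achieving logarithmic compute growth with video duration and robust performance when scaling from 1-hour to 10-hour benchmarks.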

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent’s memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Framing VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid’s structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.

Keywords: Video Understanding, Long-context Video, Hierarchical Grid Representation, Recursive Language Models (RLMs), Multi-agent Architecture, Logarithmic Compute, Video-RLM, Structured Environment Navigation
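
The claim that access depth grows only logarithmically with video length follows from the hierarchy itself: if every zoom step splits a region into a fixed number of cells, the number of steps needed to isolate one frame is the smallest d with cells^d ≥ frames. The sketch below works this out for illustration. The sampling rate of 1 frame per second and the 8×8 grid are assumed values, not figures from the paper.

```python
def access_depth(num_frames, cells_per_grid):
    """Smallest number of zoom steps d such that cells_per_grid**d >= num_frames."""
    if num_frames < 1 or cells_per_grid < 2:
        raise ValueError("need at least one frame and at least two cells per grid")
    depth, reachable = 0, 1
    while reachable < num_frames:
        reachable *= cells_per_grid
        depth += 1
    return depth

FPS = 1          # assumed sampling rate (frames per second)
GRID_CELLS = 64  # assumed 8x8 grid at every level

one_hour = access_depth(1 * 3600 * FPS, GRID_CELLS)
ten_hours = access_depth(10 * 3600 * FPS, GRID_CELLS)
print(one_hour, ten_hours)
```

With these assumed values a tenfold increase in duration adds a single zoom level, which is the intuition behind the logarithmic scaling reported in the abstract.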

27. ❌ IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Authors: Priyaranjan Pattnayak, Sanchari Chowdhuri Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

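Scoring rationale: The paper's core contribution is a safety evaluation of large language models (LLMs) across 12 Indic languages, which bears directly on LLMs and on safety alignment, so those two keywords score 10. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, inference acceleration, quantization and compression, and AI for Science, are neither mentioned in the abstract nor relevant to it, so they score 0.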

!!! tip deepseek-chat TL;DR

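This paper presents the first systematic evaluation of the safety behavior of 10 leading LLMs across 12 Indic languages. It finds significant safety drift, with cross-language agreement of only 12.8%, shows that safety alignment transfers unevenly across languages, and releases the IndicSafe benchmark to support culturally informed safety evaluation.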

Abstract

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.

Keywords: Large Language Models, LLM Safety, Multilingual Evaluation, Safety Alignment, Indic Languages, Cultural Grounding, Benchmark, Safety Generalization
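
To make the metrics named in the abstract more concrete, the sketch below computes one plausible form of prompt-level entropy and cross-language agreement. The definitions are assumptions for illustration (Shannon entropy over the safety labels one prompt receives across languages, and the fraction of prompts labeled identically in every language); the paper's exact formulas may differ, and the sample judgments are invented.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the labels one prompt receives across languages."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cross_language_agreement(judgments):
    """Fraction of prompts whose label is identical in every language."""
    unanimous = sum(1 for labels in judgments.values() if len(set(labels)) == 1)
    return unanimous / len(judgments)

# Invented judgments: one label per language for each prompt.
judgments = {
    "prompt_a": ["SAFE", "SAFE", "SAFE", "SAFE"],
    "prompt_b": ["SAFE", "UNSAFE", "SAFE", "UNSAFE"],
}

for prompt, labels in judgments.items():
    print(prompt, round(label_entropy(labels), 3))
print("agreement:", cross_language_agreement(judgments))
```

A prompt judged the same way in every language has zero entropy, while an even split between two labels gives one bit, so higher entropy flags prompts whose safety verdict depends on the language.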

28. ❌ CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Authors: Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

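Scoring rationale: The paper converts pretrained attention modules (such as GQA) into multi-head latent attention (MLA) to improve expressivity without increasing KV-cache cost, which bears directly on LLM architecture optimization and efficient inference. Highly relevant keywords: 'Large Language Models' (the experiments use LLMs such as Qwen3 and Llama-3.1), 'KV Cache Compression' (the conversion operates under a fixed KV width, so the cache does not grow), and 'Speculative Decoding OR Inference Acceleration' (the goal is more efficient inference). Moderately relevant: 'Post-training OR Supervised Fine-tuning OR SFT' (a brief healing fine-tune after conversion recovers accuracy) and 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (the work touches on parameter-efficient adjustment). 'Quantization OR Model Compression OR Low-bit Weights' is also related (weight decomposition and compression techniques). Other keywords, such as MoE, SLMs, RAG, and CoT, are unrelated to the paper.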

!!! tip deepseek-chat TL;DR

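This paper proposes CARE, which uses covariance-aware, rank-enhanced decomposition to convert pretrained attention modules into multi-head latent attention. At matched KV-cache budgets it outperforms a uniform-rank SVD baseline on Qwen3 and Llama-3.1 models, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x.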

Abstract

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model’s accuracy.

Keywords: multi-head latent attention, KV-cache optimization, efficient inference, attention module conversion, covariance-aware decomposition, rank allocation, activation-preserving factorization, model compression
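
The contrast the abstract draws between weight-only and activation-preserving factorization can be seen in a standard whitening argument. The derivation below is a generic sketch, not CARE's actual procedure: it assumes calibration activations $X \in \mathbb{R}^{n \times d}$, a weight matrix $W \in \mathbb{R}^{d \times m}$, a target rank $r$, and a positive-definite activation covariance $C = X^\top X$ with Cholesky factor $L$.

```latex
\begin{aligned}
\min_{\operatorname{rank}(\hat W)\le r} \lVert XW - X\hat W \rVert_F^2
  &= \min_{\operatorname{rank}(\hat W)\le r}
     \operatorname{tr}\!\big((W-\hat W)^\top C\,(W-\hat W)\big),
     \qquad C = X^\top X = L L^\top \\
  &= \min_{\operatorname{rank}(\hat W)\le r}
     \lVert L^\top (W-\hat W) \rVert_F^2 . \\[4pt]
\text{With } L^\top W &= U \Sigma V^\top \text{ (SVD), the Eckart--Young theorem gives} \\
\hat W^\star &= L^{-\top}\, U_r \Sigma_r V_r^\top .
\end{aligned}
```

Setting $C = I$ recovers plain truncated SVD of $W$, which is the weight-only baseline: it treats every input direction as equally important, whereas the covariance-weighted objective spends rank on the directions the activations actually occupy.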

29. ❌ Differential Privacy in Generative AI Agents: Analysis and Optimal Tradeoffs

Authors: Ya-Ting Yang, Quanyan Zhu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17902v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

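Scoring rationale: The paper studies privacy protection for LLM agents in enterprise systems. It is highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow', because it develops a framework for analyzing privacy risk in LLM agents. Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications, are not mentioned in the abstract and score 0.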

!!! tip deepseek-chat TL;DR

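This paper addresses the risk that LLM agents in enterprise systems leak sensitive information when they access internal databases. It proposes a probabilistic analysis framework based on differential privacy, derives privacy bounds that relate leakage to generation parameters such as temperature and message length, and formulates a privacy-utility design problem that characterizes optimal temperature selection.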

Abstract

Large language models (LLMs) and AI agents are increasingly integrated into enterprise systems to access internal databases and generate context-aware responses. While such integration improves productivity and decision support, the model outputs may inadvertently reveal sensitive information. Although many prior efforts focus on protecting the privacy of user prompts, relatively few studies consider privacy risks from the enterprise data perspective. Hence, this paper develops a probabilistic framework for analyzing privacy leakage in AI agents based on differential privacy. We model response generation as a stochastic mechanism that maps prompts and datasets to distributions over token sequences. Within this framework, we introduce token-level and message-level differential privacy and derive privacy bounds that relate privacy leakage to generation parameters such as temperature and message length. We further formulate a privacy-utility design problem that characterizes optimal temperature selection.

Keywords: Large Language Models, AI Agents, Differential Privacy, Privacy Leakage, Enterprise Systems, Privacy-Utility Tradeoff, Token-level Privacy, Message-level Privacy
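
One textbook way to see how temperature and message length can enter a privacy bound is sketched below. It is not the paper's derivation. It assumes that changing one record in the enterprise dataset shifts every next-token logit by at most `sensitivity` (a hypothetical quantity). Temperature-scaled softmax then behaves like an exponential mechanism, so the log-probability ratio for any token is at most 2 × sensitivity / T, and basic sequential composition multiplies that by the number of generated tokens.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def max_log_ratio(p, q):
    """Largest absolute log-probability ratio between two distributions."""
    return max(abs(math.log(a / b)) for a, b in zip(p, q))

def token_epsilon_bound(sensitivity, temperature):
    """Exponential-mechanism style bound for a single sampled token."""
    return 2.0 * sensitivity / temperature

def message_epsilon_bound(sensitivity, temperature, length):
    """Basic composition over `length` sequentially sampled tokens."""
    return length * token_epsilon_bound(sensitivity, temperature)

# Invented logits for two neighboring datasets; entries differ by at most about 0.3.
logits_d = [2.0, 0.5, -1.0]
logits_d_prime = [1.7, 0.8, -1.2]
SENSITIVITY = 0.3

observed = {}
for temperature in (0.5, 1.0, 2.0):
    p = softmax(logits_d, temperature)
    q = softmax(logits_d_prime, temperature)
    observed[temperature] = max_log_ratio(p, q)
    print(temperature, round(observed[temperature], 3),
          token_epsilon_bound(SENSITIVITY, temperature))
```

Raising the temperature tightens the bound at the cost of noisier output, which is the privacy-utility tension that an optimal temperature choice has to balance.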

30. ❌ scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

Authors: Sergey V. Samsonau Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

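Scoring rationale: scicode-lint uses LLM-generated patterns to detect methodology bugs in scientific Python code, an application of large models to science. It is highly relevant to 'Large Language Models' (10), because frontier models generate the detection patterns; relevant to 'Small Language Models' (8), because a small local model executes them at runtime; and highly relevant to 'AI for Science' (10), because it targets the quality of scientific code directly. Other keywords, such as MoE, Scaling Laws, training methods, inference optimization, and agent systems, are not covered and score 0.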

!!! tip deepseek-chat TL;DR

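This paper presents scicode-lint, a tool that uses LLM-generated patterns to detect methodology bugs (such as data leakage) in scientific Python code. It reaches 65% precision at 100% recall for preprocessing-leakage detection on Kaggle notebooks, and 54% to 62% precision on published scientific papers.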

Abstract

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

Keywords: scientific Python code, methodology bugs, LLM-generated patterns, data leakage detection, automated methodology checking, two-tier architecture, Kaggle notebooks, precision and recall
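
For readers unfamiliar with the bug class, the sketch below shows a minimal instance of preprocessing leakage, the kind of methodology bug the abstract says ordinary linters cannot see because the code runs without error. It illustrates the bug only, not how scicode-lint detects it; the data values are invented.

```python
from statistics import mean, pstdev

def standardize(values, mu, sigma):
    """Z-score values using the supplied statistics."""
    return [(v - mu) / sigma for v in values]

# Invented data: the last point plays the role of the held-out test sample.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: statistics are computed on train + test before the split is respected.
leaky_mu, leaky_sigma = mean(data), pstdev(data)

# Correct: statistics come from the training split only.
clean_mu, clean_sigma = mean(train), pstdev(train)

leaky_test = standardize(test, leaky_mu, leaky_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)
print("leaky:", leaky_mu, leaky_test)
print("clean:", clean_mu, clean_test)
```

Both variants execute cleanly and produce plausible numbers, but in the leaky version the test sample has already influenced the scaling applied to the training data, so any reported test metric is optimistic.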

31. ❌ RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Authors: Arpit Singh Gautam, Saurabh Jha Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17891v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

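Scoring rationale: RAMP focuses on quantization of large language models (LLMs) for efficient on-device inference. Its core contribution is a reinforcement-learning-based adaptive mixed-precision quantization method. Relevant keywords: 'Large Language Models' (the object of study), 'Small Language Models OR On-device AI' (the target deployment setting), 'Post-training' (quantization is a post-training technique), 'Quantization OR Model Compression' (the core method), and 'Speculative Decoding OR Inference Acceleration' (indirectly relevant, since quantization aims to speed up inference). Other keywords, such as MoE, Scaling Laws, and Alignment, are unrelated to the paper.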

!!! tip deepseek-chat TL;DR

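The paper proposes RAMP, a reinforcement-learning-based adaptive mixed-precision quantization method for deploying LLMs on device. On Llama 2 7B it reaches 5.54 perplexity at 3.68 GB, versus 5.60 at 3.90 GB for uniform 4-bit AWQ, and a policy trained only on that model transfers zero-shot to Llama 2 13B and Mistral 7B.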

Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.

Keywords: LLM quantization, mixed precision, on-device inference, reinforcement learning, model compression, post-training quantization, edge AI, parameter efficiency
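
Two pieces of arithmetic behind the abstract can be checked with a short sketch. The first is the notion of "effective bits", taken here to mean the parameter-weighted mean bit width of a per-layer assignment (an assumed definition; the layer sizes are invented). The second recomputes the reported gains over uniform 4-bit AWQ from the figures quoted in the abstract (3.68 GB vs 3.90 GB, perplexity 5.54 vs 5.60).

```python
def effective_bits(layers):
    """Parameter-weighted mean bit width; layers = [(num_params, bits), ...]."""
    total_params = sum(num_params for num_params, _ in layers)
    return sum(num_params * bits for num_params, bits in layers) / total_params

def relative_reduction(baseline, value):
    """Fractional reduction of value relative to baseline."""
    return (baseline - value) / baseline

# Hypothetical three-layer model with a mixed 3-bit / 4-bit assignment.
layers = [(100, 4), (100, 3), (200, 4)]
avg_bits = effective_bits(layers)

# Figures quoted in the abstract for Llama 2 7B (RAMP vs uniform 4-bit AWQ).
size_gain = relative_reduction(3.90, 3.68)
ppl_gain = relative_reduction(5.60, 5.54)

print(f"effective bits: {avg_bits}")
print(f"size reduction: {size_gain:.1%}, perplexity reduction: {ppl_gain:.1%}")
```

The recomputed values, roughly 5.6% smaller and 1.1% lower perplexity, are consistent with the "6% in size" and the lower end of the "1% to 3% in quality" range stated in the abstract.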

32. ❌ Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

Authors: Podakanti Satyajith Chary, Nagarajan Ganapathy Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

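Scoring rationale: The paper modifies BiomedCLIP, a biomedical vision-language foundation model, and applies it to a biomedical task, so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (8) and highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). Because it modifies and optimizes the model, it is also related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8) and 'Post-training OR Supervised Fine-tuning OR SFT' (8). Other keywords, such as MoE, SLMs, RAG, and RLHF, are not mentioned in the paper and score 0.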

!!! tip deepseek-chat TL;DR

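This paper proposes a multi-label classification framework built on a modified BiomedCLIP model to handle the extreme class imbalance in video capsule endoscopy data. On the RARE-VISION test set it achieves a temporal mAP@0.5 of 0.2456 and a temporal mAP@0.95 of 0.2353.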

Abstract

This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.

Keywords: BiomedCLIP, video capsule endoscopy, multi-label classification, class imbalance, differential attention, asymmetric focal loss, temporal coherence, biomedical vision-language model
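
The temporal post-processing mentioned in the abstract (median-filter smoothing followed by gap merging before events are emitted) can be sketched as follows. This is an illustrative version only: the window size of 3, the maximum gap of 2 frames, the edge-padding rule, and the function names are assumptions rather than the paper's settings.

```python
from statistics import median

def median_smooth(flags, window=3):
    """Median-filter a 0/1 sequence; window must be odd, edges repeat the end values."""
    half = window // 2
    padded = [flags[0]] * half + list(flags) + [flags[-1]] * half
    return [int(median(padded[i:i + window])) for i in range(len(flags))]

def to_events(flags):
    """Return inclusive (start, end) frame index pairs for each run of 1s."""
    events, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(flags) - 1))
    return events

def merge_gaps(events, max_gap=2):
    """Merge consecutive events separated by at most max_gap frames."""
    merged = []
    for start, end in events:
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Invented per-frame predictions for one class after thresholding.
raw = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
smooth = median_smooth(raw)
events = merge_gaps(to_events(smooth))
print(smooth)
print(events)
```

The median filter removes the isolated positive at frame 1, and gap merging joins the two remaining runs into a single event, which is how frame-level predictions become coherent event-level output.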

33. ❌ Procedural Generation of Algorithm Discovery Tasks in Machine Learning

作者: Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O’Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

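Scoring rationale: The paper introduces DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, and the DiscoBench benchmark, focusing on automated algorithm development, task generation, and benchmark evaluation. Although it concerns algorithm discovery and optimization in machine learning, its content has no direct connection to any of the scoring keywords (all of which center on large models, deep learning techniques, specific training methods, or AI-for-science applications). It does not mention large models, deep learning, LLMs, MoE, training methods, inference techniques, AI agents, or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, and builds the DiscoBench benchmark from it. The aim is to overcome the limitations of existing task suites for evaluating algorithm discovery systems; experiments demonstrate its use for prompt optimisation of an algorithm discovery agent.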

摘要翻译

机器学习算法的自动化开发具有催生新突破的潜力。然而，目前我们改进和评估算法发现系统的能力受限于现有任务套件。这些套件存在诸多问题，例如：评估方法不完善；数据污染；以及包含已饱和或高度相似的问题。为此，我们提出DiscoGen——一个面向机器学习算法发现任务的程序化生成器，可用于生成如强化学习优化器或图像分类损失函数开发等任务。受程序化生成在强化学习领域成功的启发，DiscoGen能够从多个机器学习领域中生成数百万个难度与复杂度各异的任务。这些任务由少量配置参数定义，可用于优化算法发现智能体（Algorithm Discovery Agents, ADAs）。我们同时推出DiscoBench，这是一个由DiscoGen任务中固定、小型子集构成的基准测试集，用于对ADAs进行系统性评估。最后，除了展示其用于ADA提示优化的实验外，我们还提出了多个由DiscoGen实现的、具有广阔前景的研究方向。DiscoGen已在https://github.com/AlexGoldie/discogen开源发布。

摘要 (Abstract)

Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at https://github.com/AlexGoldie/discogen.

Keywords: algorithm discovery, procedural generation, machine learning tasks, DiscoGen, DiscoBench, benchmark evaluation, reinforcement learning, image classification

34. ❌ How do LLMs Compute Verbal Confidence

Authors: Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Velickovic
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17839v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

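Scoring rationale: The paper studies how LLMs internally compute verbal confidence, which places it in LLM internals and interpretability research. Highly relevant keywords: 'Large Language Models' (the models studied are Gemma 3 and Qwen 2.5) and 'Mechanistic Interpretability' (internal representations are examined with activation steering, patching, and noising experiments). Moderately relevant: 'Self-Correction' (the work concerns LLM self-evaluation, but its focus is on how confidence is generated rather than on correction). Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study shows how LLMs internally compute verbal confidence: confidence representations are formed and cached at answer-adjacent positions ahead of time and are then retrieved for output, indicating automatic, sophisticated self-evaluation rather than post-hoc reconstruction.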

摘要翻译

言语置信度——即提示大型语言模型以数值或类别形式陈述其置信水平——被广泛用于从黑盒模型中提取不确定性估计。然而，大型语言模型内部如何生成此类评分仍属未知。我们探讨了两个问题：第一，置信度是何时计算的——是在被请求时即时计算，还是在答案生成过程中自动计算并缓存以供后续检索；第二，言语置信度代表什么——是词元对数概率，还是对答案质量更丰富的评估？聚焦于Gemma 3 27B和Qwen 2.5 7B模型，我们为缓存检索机制提供了聚合证据。通过激活引导、补丁干预、噪声注入及交换实验，我们发现置信度表征首先出现在答案相邻位置，随后才呈现于言语化节点。注意力阻断实验精确定位了信息流向：置信度从答案词元处收集，缓存于首个答案后位置，随后被检索输出。关键的是，线性探测与方差分解显示，这些缓存表征解释了言语置信度中超出词元对数概率的显著方差，表明其代表了一种更丰富的答案质量评估，而非简单的流畅度读数。这些发现证明，言语置信度反映了自动且复杂的自我评估过程——而非事后重建——这对理解大型语言模型的元认知机制及改进校准方法具有重要意义。

摘要 (Abstract)

Verbal confidence – prompting LLMs to state their confidence as a number or category – is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation – not post-hoc reconstruction – with implications for understanding metacognition in LLMs and improving calibration.
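The variance-partitioning claim (cached representations explain variance in verbal confidence beyond token log-probabilities) can be illustrated with an incremental R² computation. The sketch below is an assumption-laden toy, not the paper's analysis: it reduces the probe output to a single scalar per example, uses the closed-form R² for two predictors, and the data values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def variance_partition(confidence, logprob, probe):
    """Return (R2 of logprob alone, R2 of both predictors, unique R2 of probe).

    Uses the standard two-predictor formula
    R2 = (r_y1^2 + r_y2^2 - 2 r_y1 r_y2 r_12) / (1 - r_12^2).
    """
    r_yl = pearson(confidence, logprob)
    r_yp = pearson(confidence, probe)
    r_lp = pearson(logprob, probe)
    r2_logprob = r_yl ** 2
    r2_full = (r_yl ** 2 + r_yp ** 2 - 2 * r_yl * r_yp * r_lp) / (1 - r_lp ** 2)
    return r2_logprob, r2_full, r2_full - r2_logprob


# Invented toy data: confidence depends on log-probability and on a probe signal.
toy_logprob = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_probe = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
toy_confidence = [l + 2.0 * p for l, p in zip(toy_logprob, toy_probe)]
r2_logprob, r2_full, r2_unique = variance_partition(toy_confidence, toy_logprob, toy_probe)
```

A clearly positive `r2_unique` is the pattern the abstract describes: the second predictor accounts for variance that log-probabilities alone leave unexplained.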

Keywords: Verbal Confidence, LLMs, Uncertainty Estimation, Activation Steering, Cached Retrieval, Self-evaluation, Mechanistic Interpretability, Calibration

35. ❌ Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

Authors: Zunzhe Zhang, Runhan Huang, Yicheng Liu, Shaoting Zhu, Linzhan Mou, Hang Zhao
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

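Scoring rationale: The paper focuses on diffusion models and flow matching for robotic control and proposes GeCO, a time-unconditional optimization framework for adaptive and robust control. Although it mentions integration with Vision-Language-Action (VLA) models, its core contribution is not an advance in large language models (LLMs) or in deep learning fundamentals but a specific algorithmic improvement for robotic imitation learning. All scoring keywords relate directly to LLMs, deep learning techniques, or AI applications in science, whereas this paper deals mainly with robotic control, diffusion models, and optimization methods, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the fixed computational budget of diffusion and flow-matching methods in robotic imitation learning, the paper proposes Generative Control as Optimization (GeCO), a time-unconditional framework that turns action synthesis into an iterative optimization process. This enables adaptive compute allocation and a training-free safety signal, and improves success rates and efficiency.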

摘要翻译

扩散模型与流匹配已成为机器人模仿学习的基石，但其存在结构性低效问题：推理过程通常受限于固定的积分调度机制，该机制对状态复杂度不敏感。这种范式迫使策略在简单动作与复杂任务上消耗相同的计算资源。我们提出生成式控制即优化（Generative Control as Optimization，GeCO），一种时间无关的框架，将动作合成从轨迹积分转化为迭代优化过程。GeCO在动作序列空间中学习一个平稳的速度场，其中专家行为形成稳定的吸引子。因此，测试时推理转变为自适应过程，能够根据收敛状态分配计算量——对简单状态提前终止优化，而对困难状态则进行更长时间的精细化调整。此外，这种平稳几何结构产生了一种无需训练的内在安全信号：优化后动作对应的速度场范数可作为鲁棒的分布外（Out-of-Distribution，OOD）检测器，在分布内状态下保持较低值，而对异常状态则显著升高。我们在标准仿真基准上验证了GeCO，并展示了其向pi0系列视觉-语言-动作（Vision-Language-Action，VLA）模型的无缝扩展能力。作为标准流匹配头的即插即用替代方案，GeCO通过原生优化机制提升了任务成功率和执行效率，为安全部署提供了新途径。视频与代码详见 https://hrh6666.github.io/GeCO/

摘要 (Abstract)

Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence, exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/
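The idea of convergence-based inference can be pictured with a toy stationary field. The sketch below is only an illustration under strong assumptions: the learned velocity field is replaced by a hand-written linear field pointing at a known attractor, and `optimize_action`, `step`, and `tol` are hypothetical names and values, not GeCO's interface. It shows how iteration count adapts to the starting point and how the residual field norm is available as a by-product (the quantity the abstract proposes as an OOD signal; this toy field converges everywhere, so it cannot exhibit OOD behaviour).

```python
from math import sqrt


def velocity_field(action, attractor):
    """Toy stationary field: points from the current action towards the attractor."""
    return [t - a for a, t in zip(action, attractor)]


def norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return sqrt(sum(v * v for v in vec))


def optimize_action(init_action, attractor, step=0.5, tol=1e-3, max_iters=50):
    """Iterate a <- a + step * v(a); stop as soon as ||v(a)|| < tol.

    Returns (action, iterations used, residual field norm).
    """
    action = list(init_action)
    for iters in range(max_iters):
        v = velocity_field(action, attractor)
        if norm(v) < tol:
            return action, iters, norm(v)
        action = [a + step * vi for a, vi in zip(action, v)]
    return action, max_iters, norm(velocity_field(action, attractor))


expert_action = [0.0, 0.0]
# A starting point near the attractor converges in fewer iterations
# than one far away, so compute is spent where it is needed.
_, easy_iters, easy_residual = optimize_action([0.01, 0.0], expert_action)
_, hard_iters, hard_residual = optimize_action([10.0, 0.0], expert_action)
```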

Keywords: Generative Control, Flow Matching, Robotic Control, Diffusion Models, Optimization, Adaptive Inference, Out-of-Distribution Detection, Vision-Language-Action Models

36. ❌ RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy

Authors: Zhenhang Yuan, Shenghai Yuan, Lihua Xie
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17831v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

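Scoring rationale: RPMS addresses planning failures of LLM agents in embodied environments; its core contribution is an LLM agent architecture. Highly relevant (10): 'Large Language Models' and 'LLM Agents', because the paper explicitly uses LLMs (Llama 3.1 8B, Claude Sonnet 4.5, GPT-4) as the agent backbone and improves their planning ability. Moderately relevant (5): 'Chain of Thought' and 'System 2 Thinking' (multi-step reasoning and deliberate planning), 'Self-Correction' (errors are corrected through rules and state feedback), 'Tool Use' (the agent's actions can be viewed as tool use), and 'AI for Science' (evaluation in the ScienceWorld environment, counted as a scientific AI application). The remaining keywords have no direct connection to the paper's technical focus (rule retrieval, memory management, conflict arbitration) and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes RPMS, an architecture that combines rule retrieval with state-gated memory to address planning failures of LLM agents in embodied environments caused by invalid actions and state drift. It raises single-trial success on ALFWorld and the average score on ScienceWorld over the respective baselines.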

摘要翻译

大语言模型智能体在封闭世界具身环境中常因动作需满足严格前提条件（如位置、物品栏及容器状态）而失败，且失败反馈信息稀疏。我们识别出两种结构性耦合的失效模式：（P1）无效动作生成与（P2）状态漂移，二者在退化循环中相互放大。本文提出RPMS——一种冲突管理架构，其通过结构化规则检索确保动作可行性，借助轻量级信念状态控制记忆适用性，并通过规则优先仲裁机制解决两类信息源间的冲突。在ALFWorld（134项未见任务）测试中，RPMS使用Llama 3.1 8B模型实现59.7%的单次尝试成功率（较基线提升23.9个百分点），使用Claude Sonnet 4.5模型达98.5%（提升11.9个百分点）；在8B模型的增益中，仅规则检索单项即贡献14.9个百分点（统计显著），成为主导因素。关键发现表明：情景记忆具有条件效用——若未经状态锚定直接使用会损害特定任务类型的性能，但经过当前状态过滤并受显式动作规则约束后，则能转化为稳定的净增益。将RPMS适配至ScienceWorld环境并使用GPT-4测试，在所有消融条件下均获得稳定提升（平均得分54.0 vs. ReAct基线的44.9），这为核心机制在结构异构环境中具有可迁移性提供了证据。

摘要 (Abstract)

LLM agents often fail in closed-world embodied environments because actions must satisfy strict preconditions – such as location, inventory, and container states – and failure feedback is sparse. We identify two structurally coupled failure modes: (P1) invalid action generation and (P2) state drift, each amplifying the other in a degenerative cycle. We present RPMS, a conflict-managed architecture that enforces action feasibility via structured rule retrieval, gates memory applicability via a lightweight belief state, and resolves conflicts between the two sources via rules-first arbitration. On ALFWorld (134 unseen tasks), RPMS achieves 59.7% single-trial success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp); of the 8B gain, rule retrieval alone contributes +14.9 pp (statistically significant), making it the dominant factor. A key finding is that episodic memory is conditionally useful: it harms performance on some task types when used without grounding, but becomes a stable net positive once filtered by current state and constrained by explicit action rules. Adapting RPMS to ScienceWorld with GPT-4 yields consistent gains across all ablation conditions (avg. score 54.0 vs. 44.9 for the ReAct baseline), providing transfer evidence that the core mechanisms hold across structurally distinct environments.
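The three mechanisms named in the abstract (rule-based feasibility, belief-state gating of memory, rules-first arbitration) can be pictured with a small sketch. Everything here is hypothetical: the dictionary-based state, the rule and memory formats, and the function names are illustrative assumptions and do not reflect how RPMS is actually implemented.

```python
def feasible(action, state, rules):
    """An action is feasible if every precondition in its rule holds in the state.

    Actions without a rule are treated as feasible in this toy version.
    """
    return all(state.get(key) == value for key, value in rules.get(action, {}).items())


def choose_action(state, rules, memories, fallback):
    """Pick the first remembered action that is both applicable and feasible."""
    # Memory gating: keep only episodes recorded under a matching belief state.
    applicable = [m for m in memories
                  if all(state.get(k) == v for k, v in m["when"].items())]
    # Rules-first arbitration: a remembered action is used only if the rules allow it.
    for memory in applicable:
        if feasible(memory["action"], state, rules):
            return memory["action"]
    return fallback


# Hypothetical preconditions and episodic memories for a kitchen task.
rules = {
    "take apple": {"location": "fridge", "fridge_open": True},
    "open fridge": {"location": "fridge"},
}
memories = [
    {"when": {"location": "fridge"}, "action": "take apple"},
    {"when": {"location": "fridge"}, "action": "open fridge"},
    {"when": {"location": "sink"}, "action": "wash apple"},
]
belief = {"location": "fridge", "fridge_open": False}
next_action = choose_action(belief, rules, memories, fallback="look")
```

With the fridge closed, the remembered "take apple" is rejected by its precondition and the agent falls through to "open fridge", which is the kind of invalid-action avoidance the abstract attributes to rule retrieval.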

Keywords: LLM agents, embodied planning, rule retrieval, memory synergy, conflict management, action feasibility, state drift, ALFWorld

37. ❌ CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

Authors: Lintang Sutawika, Aditya Bharat Soni, Bharath Sriraam R R, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Zhou, Yilin Zhang, Leander Melroy Maben, Graham Neubig
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

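Scoring rationale: CodeScout studies reinforcement learning for training code search agents; its core topics are LLM Agents (coding agents) and Tool Use (a Unix terminal as the tool). The paper explicitly uses LLMs as base models and focuses on agentic code search and RL optimization. Other keywords, including MoE, SLMs, Scaling Laws, training methods (Pre-training, SFT, RLHF, etc.), inference optimization, multi-agent systems, model compression, and AI for Science, are not mentioned in the title or abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper shows that an effective reinforcement learning recipe can train a coding agent equipped with only a standard Unix terminal to match or outperform much larger LLMs on code search tasks.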

摘要翻译

实现编码代理在大型代码库中执行任务的前提是代码定位——即识别需要操作的相关文件、类与函数。虽然此前已出现基于嵌入检索（如向量搜索）的仓库级代码定位方法，但近期研究重点转向开发能够独立执行或与实际工作交错进行代码定位的智能代理。现有的大多数代理化代码搜索方法为代理配备了复杂且专门化的工具，例如通过静态分析生成的仓库图。本文证明，通过有效的强化学习方法，仅配备标准Unix终端的编码代理经过训练后即可取得优异效果。我们在三个基准测试（SWE-Bench Verified、Pro和Lite）上的实验表明，我们的模型始终优于或媲美参数量大2-18倍的基础及后训练大语言模型，有时甚至接近Claude Sonnet等闭源模型在使用专用框架时的表现。本研究特别关注现有编码代理环境在代码搜索任务中的改造利用技术、奖励机制设计以及强化学习优化方法。我们向社区发布训练所得的模型家族CodeScout，并公开全部代码与数据以供后续研究。

摘要 (Abstract)

A prerequisite for coding agents to perform tasks on large repositories is code localization - the identification of relevant files, classes, and functions to work on. While repository-level code localization has been performed using embedding-based retrieval approaches such as vector search, recent work has focused on developing agents to localize relevant code either as a standalone precursor to or interleaved with performing actual work. Most prior methods on agentic code search equip the agent with complex, specialized tools, such as repository graphs derived from static analysis. In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results. Our experiments on three benchmarks (SWE-Bench Verified, Pro, and Lite) reveal that our models consistently achieve superior or competitive performance over 2-18x larger base and post-trained LLMs and sometimes approach performance provided by closed models like Claude Sonnet, even when using specialized scaffolds. Our work particularly focuses on techniques for re-purposing existing coding agent environments for code search, reward design, and RL optimization. We release the resulting model family, CodeScout, along with all our code and data for the community to build upon.

Keywords: code search agents, reinforcement learning, LLM agents, tool use, Unix terminal, coding agents, repository-level code localization, RL optimization

38. ❌ FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

Authors: Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng, Vincent Ng, Zhi Wang, Xiangyu Yue, Chuanyi Li, Lewei Lu
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17826v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

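Scoring rationale: FailureMem is an LLM-based multimodal framework for autonomous software repair. Its main contributions are: (1) using LLMs as the base models for code repair (highly relevant); (2) a hybrid workflow-agent architecture that realizes an autonomous agentic workflow (highly relevant); (3) tool use, via active perception tools, to strengthen visual grounding (highly relevant); (4) a Failure Memory Bank that enables self-improvement (highly relevant); (5) multi-step and in-depth reasoning (moderately relevant); and (6) in-context learning from failure cases (moderately relevant). Other keywords, such as MoE, quantization, and RAG, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address rigid workflows, imprecise visual grounding, and unused failure experience in multimodal automated program repair, the paper proposes FailureMem, which combines a hybrid workflow-agent architecture, region-level visual grounding tools, and a Failure Memory Bank. On SWE-bench Multimodal it improves the resolved rate over GUIRepair by 3.7%.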

摘要翻译

多模态自动程序修复（Multimodal Automated Program Repair，MAPR）通过要求模型联合推理源代码、文本问题描述以及图形用户界面截图等视觉信息，扩展了传统的程序修复范畴。尽管近期基于大语言模型的修复系统已展现出良好前景，但现有方法仍面临若干局限：僵化的工作流管道限制了调试过程中的探索空间，视觉推理通常基于整页截图而缺乏局部定位，且失败的修复尝试很少被转化为可复用的知识。为应对这些挑战，我们提出了FailureMem，一种融合三项关键机制的多模态修复框架：一种平衡结构化定位与灵活推理的混合工作流-智能体架构，支持区域级视觉定位的主动感知工具，以及一个将过往修复尝试转化为可复用指导的失败记忆库。在SWE-bench Multimodal数据集上的实验表明，FailureMem将问题解决率较GUIRepair提升了3.7%。

摘要 (Abstract)

Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoning is often performed over full-page screenshots without localized grounding, and failed repair attempts are rarely transformed into reusable knowledge. To address these challenges, we propose FailureMem, a multimodal repair framework that integrates three key mechanisms: a hybrid workflow-agent architecture that balances structured localization with flexible reasoning, active perception tools that enable region-level visual grounding, and a Failure Memory Bank that converts past repair attempts into reusable guidance. Experiments on SWE-bench Multimodal demonstrate FailureMem improves the resolved rate over GUIRepair by 3.7%.

Keywords: Multimodal Automated Program Repair, LLM-based repair systems, hybrid workflow-agent architecture, active perception tools, Failure Memory Bank, autonomous software repair, visual reasoning, self-improvement

39. ❌ ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Authors: Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

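Scoring rationale: The paper focuses on ChopGrad, a training optimization technique for video diffusion models that uses truncated backpropagation to address memory cost; it belongs to computer vision and deep learning optimization. All scoring keywords concern topics such as large language models (LLMs), model alignment, reasoning, agents, and scientific AI, none of which the paper covers, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ChopGrad, a truncated backpropagation scheme for video diffusion models that limits gradient computation to local frame windows while maintaining global consistency. This substantially reduces training memory and makes fine-tuning with pixel-wise losses practical for long or high-resolution videos.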

摘要翻译

近期视频扩散模型通过循环帧处理实现高质量生成，其中每一帧的生成都依赖于先前帧。然而，这种循环机制意味着在像素域训练此类模型会产生极高的内存成本，因为激活值在整个视频序列中持续累积。这一根本性限制也使得基于逐像素损失对长视频或高分辨率视频进行模型微调在计算上难以实现。本文提出ChopGrad——一种用于视频解码的截断反向传播方案，该方案将梯度计算限制在局部帧窗口内，同时保持全局一致性。我们对此近似方法进行了理论分析，并证明其能够通过逐帧损失实现高效微调。ChopGrad将训练内存需求从随视频帧数线性增长（完整反向传播）降低至恒定内存消耗，并在多项基于逐像素损失的条件视频生成任务中——包括视频超分辨率、视频修复、神经渲染场景的视频增强以及可控驾驶视频生成——与现有先进视频扩散模型相比展现出优越性能。

摘要 (Abstract)

Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
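Truncated backpropagation through a recurrent computation can be illustrated on a scalar recurrence. The sketch below is a generic toy, not ChopGrad itself: it assumes the recurrence h_t = a·h_{t-1} + x_t with a per-frame squared loss, and the names `forward`, `grad_a`, and `window` are hypothetical. It compares the exact gradient with one that backpropagates at most `window` steps from each frame.

```python
def forward(a, xs):
    """Run h_t = a * h_{t-1} + x_t from h_0 = 0 and return [h_0, h_1, ..., h_T]."""
    hs = [0.0]
    for x in xs:
        hs.append(a * hs[-1] + x)
    return hs


def grad_a(a, xs, ys, window=None):
    """Gradient of sum_t (h_t - y_t)^2 with respect to a.

    Each frame backpropagates at most `window` steps; None means full
    backpropagation through the whole sequence.
    """
    hs = forward(a, xs)
    total = 0.0
    for t in range(1, len(hs)):
        depth = t if window is None else min(window, t)
        # dh_t/da = sum_{j=0}^{t-1} a^j * h_{t-1-j}; truncation keeps only j < depth.
        dh = sum(a ** j * hs[t - 1 - j] for j in range(depth))
        total += 2.0 * (hs[t] - ys[t - 1]) * dh
    return total


# Toy sequence of 8 "frames".
inputs, targets, weight = [1.0] * 8, [0.0] * 8, 0.5
full_grad = grad_a(weight, inputs, targets)
truncated_grad = grad_a(weight, inputs, targets, window=4)
```

In this toy the truncated gradient only ever reads the last `window` states for each frame, which is why a real implementation can keep memory constant in sequence length; for simplicity the code above still stores every state. The approximation is close here because contributions decay geometrically with distance when |a| < 1.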

Keywords: video diffusion models, truncated backpropagation, training memory reduction, pixel-wise losses, fine-tuning, frame-wise losses, video generation, computational efficiency

40. ❌ Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference

Authors: Antônio Junior Alves Caiado, Michael Hahsler
Journal/Source: arXiv
Published: 2026-03-18
arXiv link: http://arxiv.org/abs/2603.17811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

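Scoring rationale: The paper analyzes the robustness of Transformer models under inference-time MC Dropout (Monte Carlo sampling). It is rated highly relevant to 'Monte Carlo Tree Search OR MCTS AND LLM' (10) on the grounds that MC Dropout is a concrete application of Monte Carlo methods in deep learning. It touches on model reasoning and cognitive analysis, so it is related to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (8) and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8), because it decomposes model performance on reasoning tasks. It analyzes model behavior and reliability, which relates to 'Mechanistic Interpretability OR Explainable AI' (8) as interpretability research. It evaluates Transformer language models, which relates to 'Large Language Models OR LLMs OR Foundation Models' (8), though it does not go deep into LLM-specific techniques. Other keywords such as MoE, SLMs, training methods, alignment, RAG, and compression are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper systematically evaluates the robustness of 19 Transformer models under inference-time MC Dropout, finds that dropout robustness depends on architecture and is uncorrelated with scale, and proposes a cognitive decomposition framework to guide model selection in uncertainty-aware applications.

Abstract (Chinese translation)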

基于Transformer的语言模型已被广泛部署于推理任务，但其在推断时随机性下的行为仍未得到充分探究。尽管训练中常使用dropout技术，但其通过蒙特卡洛采样在推断时产生的影响缺乏跨架构的系统性评估，这限制了对模型在不确定性感知应用中可靠性的理解。

本研究采用MC Dropout方法，对19个Transformer模型在dropout引发的变异性进行分析，每个样本进行100次随机前向传播。研究将dropout鲁棒性定义为在随机推断下保持高准确率与稳定预测的能力，并通过单次运行准确率的标准差进行量化。通过认知分解框架，将模型表现解耦为记忆与推理两个组成部分。实验涵盖五种dropout配置，对1000个样本进行了95项独立评估。

结果显示模型存在显著的架构差异。较小模型展现出完美的预测稳定性，而中等规模模型则表现出明显的波动性。中等规模模型在整体性能上表现最优；更大规模模型在记忆任务中表现突出。关键发现是，53%的模型在基线MC Dropout设置下出现严重准确率下降，其中任务专用模型的准确率损失最高达24个百分点，表明这些架构不适合用于不确定性量化。研究还观察到非对称效应：高dropout率使记忆准确率下降27个百分点，而推理能力仅下降1个百分点，这表明记忆任务依赖于被dropout破坏的稳定表征。84%的模型表现出记忆偏向的性能特征。

本研究首次为Transformer模型建立了全面的MC Dropout基准测试，揭示出dropout鲁棒性具有架构依赖性且与模型规模无关。认知特征分析框架为不确定性感知应用中的模型选择提供了可操作的指导依据。

摘要 (Abstract)

Transformer-based language models are widely deployed for reasoning, yet their behavior under inference-time stochasticity remains underexplored. While dropout is common during training, its inference-time effects via Monte Carlo sampling lack systematic evaluation across architectures, limiting understanding of model reliability in uncertainty-aware applications. This work analyzes dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as maintaining high accuracy and stable predictions under stochastic inference, measured by standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 samples. Results reveal substantial architectural variation. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance; larger models excel at memory tasks. Critically, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating unsuitability for uncertainty quantification in these architectures. Asymmetric effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. 84% of models demonstrate memory-biased performance. This provides the first comprehensive MC Dropout benchmark for transformers, revealing dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications.

Keywords: Transformer models, MC Dropout, stochastic inference, dropout robustness, cognitive profiling, uncertainty quantification, Monte Carlo sampling, architecture evaluation
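The robustness measure described in the abstract (standard deviation of per-run accuracies under repeated stochastic forward passes) can be sketched on a toy model. Everything concrete here is an assumption for illustration: the linear threshold classifier, dropout applied to input features, and the names `predict` and `mc_dropout_robustness` are hypothetical and are not taken from the paper.

```python
import random
import statistics


def predict(weights, x, p, rng):
    """One stochastic forward pass: drop each feature with probability p
    (inverted dropout scaling), then threshold the weighted sum."""
    keep = 1.0 - p
    s = sum(w * xi / keep for w, xi in zip(weights, x) if rng.random() < keep)
    return 1 if s > 0 else 0


def mc_dropout_robustness(weights, samples, labels, p, passes=100, seed=0):
    """Return (mean, std) of accuracy over `passes` runs with dropout kept active."""
    rng = random.Random(seed)
    accs = []
    for _ in range(passes):
        correct = sum(predict(weights, x, p, rng) == y
                      for x, y in zip(samples, labels))
        accs.append(correct / len(samples))
    return statistics.mean(accs), statistics.pstdev(accs)
```

A low standard deviation together with a high mean corresponds to what the abstract calls dropout robustness; with `p=0.0` the passes are deterministic and the standard deviation is zero.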

41. ❌ Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Authors: Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

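Scoring rationale: The paper's core topic is post-training quantization for large vision language models (LVLMs), so it is highly relevant to 'Post-training' (10) as a concrete application of quantization. 'Quantization' is the central theme (15); the paper proposes the QIG method for fine-grained quantization. 'Large Language Models' is relevant (8) because LVLMs are a kind of large model. 'Mechanistic Interpretability' is relevant (10) because the method is inspired by attribution methods from interpretability. 'Speculative Decoding' is somewhat related (8) because quantization aims to accelerate inference. 'Small Language Models' and 'PEFT' each receive 5 because quantization helps lightweight deployment and parameter-efficient adaptation. Other keywords such as MoE, Scaling Laws, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the compute and memory overhead of deploying large vision language models (LVLMs), the paper proposes a fine-grained post-training quantization method based on Quantization-aware Integrated Gradients (QIG). By refining the quantization granularity from the modality level to the token level, it improves the accuracy of quantized models and narrows the gap to full-precision models across multiple models and benchmarks.

Abstract (Chinese translation)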

大型视觉语言模型（LVLMs）在需要多模态交互的一系列下游任务中取得了显著成功，但其能力伴随着巨大的计算和内存开销，这阻碍了实际部署。在众多加速技术中，训练后量化是一种流行且有效的降低内存成本和加速推理的策略。然而，现有的LVLM量化方法通常在模态层面衡量令牌敏感性，这未能捕捉复杂的跨令牌交互，并且在定量衡量令牌层面的量化误差方面存在不足。随着令牌在模型内部交互，模态之间的区别逐渐减弱，这表明需要进行细粒度的校准。受机制可解释性中公理化归因的启发，我们引入了一种基于量化感知积分梯度（Quantization-aware Integrated Gradients, QIG）的细粒度量化策略，该策略利用积分梯度定量评估令牌敏感性，并将粒度从模态层面推进到令牌层面，从而同时反映模态间和模态内的动态。在W4A8和W3A16两种设置下对多个LVLM进行的广泛实验表明，我们的方法在可忽略的延迟开销下，提高了不同模型和基准测试的准确率。例如，在3比特仅权重量化下，我们的方法将LLaVA-onevision-7B的平均准确率提高了1.60%，将其与全精度对应模型的差距缩小至仅1.33%。代码可在 https://github.com/ucas-xiang/QIG 获取。

摘要 (Abstract)

Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.

Keywords: Large Vision Language Models, Post-training Quantization, Quantization-aware Integrated Gradients, Token-level Sensitivity, Model Compression, Inference Acceleration, Fine-grained Calibration, Mechanistic Interpretability
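The abstract builds on integrated gradients, the attribution method that assigns each input coordinate the path integral of the gradient from a baseline to the input. The sketch below shows plain integrated gradients only, approximated with a midpoint Riemann sum and numerical gradients; it does not reproduce the quantization-aware variant (QIG), and the function names and the toy function used in the usage note are hypothetical.

```python
def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """IG_i = (x_i - b_i) * integral_0^1 df/dx_i(b + alpha * (x - b)) d(alpha),
    approximated with a midpoint Riemann sum over `steps` points."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_grad(f, point)):
            total[i] += g / steps
    return [(xi - b) * t for xi, b, t in zip(x, baseline, total)]
```

For `f(v) = v[0] * v[1] + v[2] ** 2` with input `[1.0, 2.0, 3.0]` and a zero baseline, the attributions sum to `f(x) - f(baseline)`, which is the completeness axiom that makes the scores usable as a quantitative sensitivity measure.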

42. ❌ RangeAD: Fast On-Model Anomaly Detection

Authors: Luca Hinkamp, Simon Klüttermann, Emmanuel Müller; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

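Scoring rationale: 'RangeAD: Fast On-Model Anomaly Detection' focuses on anomaly detection (AD) in machine learning. It proposes a new setting, On-Model AD, and the RangeAD algorithm, which uses neuron-wise output ranges from the primary model for efficient anomaly detection. The paper centers on anomaly detection, model efficiency, and inference cost, and does not address large language models (LLMs), innovations in deep learning technique, or applications of large models to specific domains. All scoring keywords concern large models, deep learning techniques, or AI-for-science applications, whereas this paper studies a general machine learning anomaly detection method with no direct connection to them. All relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes On-Model AD, a new anomaly detection setting, and the RangeAD algorithm, which uses the primary model's neuron-wise output ranges to cut inference cost substantially while maintaining strong performance, offering a practical framework for efficient anomaly detection.

Abstract (Chinese translation)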

在实践中，机器学习方法通常需要异常检测（AD）来筛选输入或检测分布偏移。当前典型做法是在主模型旁并行运行独立的异常检测模型。然而，这种分离方式忽略了主模型本身已编码大量目标分布信息的事实。本文提出“模型上异常检测”（On-Model AD）这一新范式，该框架明确利用与主机器学习模型的关联进行异常检测。在此框架下，我们提出RangeAD算法，该算法利用从主模型提取的神经元级输出范围进行检测。RangeAD即使在高维任务中也表现出优越性能，同时显著降低推理成本。我们的研究结果证明了模型上异常检测框架作为高效异常检测实践方案的潜力。

摘要 (Abstract)

In practice, machine learning methods commonly require anomaly detection (AD) to filter inputs or detect distributional shifts. Typically, this is implemented by running a separate AD model alongside the primary model. However, this separation ignores the fact that the primary model already encodes substantial information about the target distribution. In this paper, we introduce On-Model AD, a setting for anomaly detection that explicitly leverages access to a related machine learning model. Within this setting, we propose RangeAD, an algorithm that utilizes neuron-wise output ranges derived from the primary model. RangeAD achieves superior performance even on high-dimensional tasks while incurring substantially lower inference costs. Our results demonstrate the potential of the On-Model AD setting as a practical framework for efficient anomaly detection.

Keywords: anomaly detection, on-model AD, RangeAD, neuron-wise output ranges, inference costs, machine learning, high-dimensional tasks, distributional shifts
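The abstract states only that RangeAD uses neuron-wise output ranges derived from the primary model. One plausible reading is sketched below, with the caveat that the details are assumptions: recording a per-neuron minimum and maximum on normal data and scoring an input by the fraction of neurons that leave those ranges is a guess at the idea, not the authors' algorithm, and `fit_ranges` and `anomaly_score` are hypothetical names.

```python
def fit_ranges(activations):
    """activations: list of per-sample activation vectors taken from the
    primary model on normal data. Returns per-neuron (lows, highs)."""
    lows = [min(col) for col in zip(*activations)]
    highs = [max(col) for col in zip(*activations)]
    return lows, highs


def anomaly_score(activation, lows, highs):
    """Fraction of neurons whose output falls outside the range seen on
    normal data (assumed scoring rule)."""
    outside = sum(1 for a, lo, hi in zip(activation, lows, highs)
                  if a < lo or a > hi)
    return outside / len(activation)
```

Because the activations come from a forward pass the primary model performs anyway, a check of this kind adds only a comparison per neuron, which is consistent with the low inference cost the abstract reports.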

43. ❌ Governed Memory: A Production Architecture for Multi-Agent Workflows

Authors: Hamed Taheri; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

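Scoring rationale: The paper's core topic is a shared memory and governance architecture for multi-agent workflows, so it is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Multi-agent Systems/Agent Coordination' (10 each). It covers memory retrieval and reflection mechanisms, which gives it some connection to 'Retrieval-Augmented Generation/RAG/Retrieval-Generation' and 'Self-Correction/Self-Improvement/Self-Reflection' (5 each). It mentions enterprise AI deploying autonomous agent nodes, which implies the use of large models and gives an indirect connection to 'Large Language Models/LLMs/Foundation Models' (5). Other keywords such as MoE, quantization, inference acceleration, and alignment training are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses five structural problems caused by the lack of shared memory and unified governance in enterprise multi-agent workflows. It proposes the Governed Memory architecture, built on mechanisms such as a dual memory model and tiered governance routing, and reports 99.6% fact recall and zero cross-entity leakage in controlled experiments, plus 74.8% overall accuracy on the LoCoMo benchmark.

Abstract (Chinese translation)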

企业人工智能在工作流中部署了数十个自主智能体节点，这些节点作用于相同的实体，但缺乏共享内存与统一治理机制。我们识别出由这种内存治理缺失引发的五大结构性挑战：跨智能体工作流的内存孤岛；跨团队与工具的治理碎片化；非结构化内存无法被下游系统使用；自主多步执行中的冗余上下文传递；以及缺乏反馈循环的静默质量衰减。本文提出“受治理内存”——一种共享内存与治理层，通过四种机制解决上述问题：结合开放集原子事实与模式强制的类型化属性的双重内存模型；具备渐进式上下文传递的分层治理路由机制；实体范围隔离的反射边界检索；以及包含AI辅助编写与自动化逐属性优化的闭环模式生命周期。我们通过受控实验（N=250，五种内容类型）验证各机制：双重模态互补实现99.6%的事实召回率；治理路由精度达92%；渐进式传递减少50%的令牌消耗；500次对抗性查询中实现零跨实体泄漏；100%的对抗性治理合规性；每个实体约七个受治理内存时输出质量趋于饱和。在LoCoMo基准测试中，该架构整体准确率达到74.8%，证实治理与模式强制不会损害检索质量。该系统已在Personize.ai投入生产应用。

摘要 (Abstract)

Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.

Keywords: Governed Memory, Multi-Agent Workflows, Shared Memory, Governance Layer, Autonomous Agents, Memory Silos, Entity-Scoped Isolation, LoCoMo Benchmark
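Two of the mechanisms named in the abstract, the dual memory model and entity-scoped isolation, can be pictured with a small data structure. This is a minimal sketch under assumptions, not the production system: the class names, the use of Python types as a schema, and the method names are all hypothetical, and governance routing, reflection-bounded retrieval, and the schema lifecycle are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    facts: list = field(default_factory=list)       # open-set atomic facts (free text)
    properties: dict = field(default_factory=dict)  # schema-enforced typed properties


class GovernedStore:
    def __init__(self, schema):
        self.schema = schema   # property name -> required Python type (assumed schema form)
        self.entities = {}     # entity id -> EntityMemory

    def add_fact(self, entity_id, fact):
        self.entities.setdefault(entity_id, EntityMemory()).facts.append(fact)

    def set_property(self, entity_id, name, value):
        """Reject writes that are not declared in the schema or have the wrong type."""
        expected = self.schema.get(name)
        if expected is None or not isinstance(value, expected):
            raise ValueError(f"{name!r} rejected by schema")
        self.entities.setdefault(entity_id, EntityMemory()).properties[name] = value

    def retrieve(self, entity_id):
        """Entity-scoped read: only the requested entity's memory is returned."""
        mem = self.entities.get(entity_id, EntityMemory())
        return list(mem.facts), dict(mem.properties)
```

Keeping free-text facts and typed properties side by side mirrors the complementary coverage the abstract attributes to the dual model: downstream systems can rely on the typed properties, while facts that fit no schema are still retained.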

44. ❌ A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks

Authors: Leonardo Del Grande, Christoph Brune, Marcello Carioni; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17785v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

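Scoring rationale: The paper studies TV-regularized training of infinite-width shallow ReLU neural networks. It belongs to the theoretical analysis of deep learning and has no direct connection to most keywords, especially those on large-model techniques. The only related keyword is 'Mixture of Experts OR MoE OR Sparse Models', because the paper focuses on the sparsity of solutions; since it does not concern MoE architectures, it receives 5 (some relevance). No other keyword applies, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper studies sparsity guarantees for TV-regularized training of infinite-width shallow ReLU neural networks, proving that minimizers have finite support and that, under low label noise and a small regularization parameter, solutions stay sparse and converge.

Abstract (Chinese translation)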

本文研究了无限宽度浅层ReLU神经网络的全变差（TV）正则化训练问题，该问题被表述为单位球面上测度的凸优化问题。我们的方法利用TV正则化优化问题的对偶理论，为训练问题解的稀疏性建立了严格的理论保证。分析进一步刻画了在低噪声区域和小正则化参数下，这种稀疏性如何以及何时得以保持。我们分析的关键动机在于：对于ReLU激活函数，其关联的对偶证书在权重空间中是分段线性的。这些线性区域——我们称之为对偶区域——由数据通过诱导的超平面划分所确定的激活模式决定。利用这一结构，我们证明了在每个对偶区域内，对偶证书至多存在一个极值点。因此，任何极小化子的支撑集都是有限的，其基数可由仅依赖于数据诱导超平面划分几何结构的常数上界所限制。随后，我们进一步研究了保证此类稀疏解唯一性的充分条件。最后，在对偶区域边界上满足适当的对偶证书非退化性条件下，我们证明了在低标签噪声和小正则化参数存在时，训练问题的解仍保持稀疏性，且Dirac测度的数量保持不变。此外，这些测度的位置和振幅会收敛，且当位置位于对偶区域内部时，其收敛速率与噪声和正则化参数呈线性依赖关系。

摘要 (Abstract)

In this paper, we study total variation (TV)-regularized training of infinite-width shallow ReLU neural networks, formulated as a convex optimization problem over measures on the unit sphere. Our approach leverages the duality theory of TV-regularized optimization problems to establish rigorous guarantees on the sparsity of the solutions to the training problem. Our analysis further characterizes how and when this sparsity persists in a low noise regime and for small regularization parameter. The key observation that motivates our analysis is that, for ReLU activations, the associated dual certificate is piecewise linear in the weight space. Its linearity regions, which we name dual regions, are determined by the activation patterns of the data via the induced hyperplane arrangement. Taking advantage of this structure, we prove that, on each dual region, the dual certificate admits at most one extreme value. As a consequence, the support of any minimizer is finite, and its cardinality can be bounded from above by a constant depending only on the geometry of the data-induced hyperplane arrangement. Then, we further investigate sufficient conditions ensuring uniqueness of such sparse solution. Finally, under a suitable non-degeneracy condition on the dual certificate along the boundaries of the dual regions, we prove that in the presence of low label noise and for small regularization parameter, solutions to the training problem remain sparse with the same number of Dirac deltas. Additionally, their location and the amplitudes converge, and, in case the locations lie in the interior of a dual region, the convergence happens with a rate that depends linearly on the noise and the regularization parameter.

Keywords: infinite-width neural networks, total variation regularization, sparsity, convex optimization, dual certificate, ReLU activation, hyperplane arrangement, Dirac deltas
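For orientation, the setting in the abstract can be written out in one common form. The squared loss, the symbols, and the normalization of the certificate below are assumptions chosen for illustration; the abstract does not fix them. The last line shows why the certificate is linear on each region where the set of active data points is constant.

```latex
% Assumed formulation: data (x_i, y_i), i = 1..n, ReLU \sigma(t) = \max(t, 0),
% measures \mu on the unit sphere, regularization parameter \lambda > 0.
\min_{\mu \in \mathcal{M}(\mathbb{S}^{d-1})}
  \frac{1}{2} \sum_{i=1}^{n}
  \Big( \int_{\mathbb{S}^{d-1}} \sigma(\langle w, x_i \rangle) \, d\mu(w) - y_i \Big)^{2}
  + \lambda \, \|\mu\|_{TV}

% Dual certificate for a dual variable p \in \mathbb{R}^n (assumed scaling):
\eta(w) = \sum_{i=1}^{n} p_i \, \sigma(\langle w, x_i \rangle),
\qquad |\eta(w)| \le 1 \ \text{on } \mathbb{S}^{d-1},
\qquad \operatorname{supp}(\mu) \subseteq \{ w : |\eta(w)| = 1 \}

% On a region with fixed activation pattern A = \{ i : \langle w, x_i \rangle > 0 \}:
\eta(w) = \Big\langle \sum_{i \in A} p_i \, x_i , \, w \Big\rangle
```

Under this form, the support of a minimizer sits where the certificate reaches absolute value one, so a bound on the number of such points per region translates directly into a bound on the number of Dirac deltas in the solution.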

45. ❌ Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory

Authors: Oliver Zahn, Simran Chana; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

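Scoring rationale: The paper's core topic is persistent memory for LLMs. It directly involves LLMs as knowledge workers, in-context memory, retrieval-augmented generation, long-context handling, agentic workflows, and in-context learning. By comparing conventional in-context memory with the Knowledge Objects approach, it systematically evaluates failure modes including capacity limits, compaction loss, and goal drift, and proposes a density-adaptive retrieval mechanism. It has some connection to the reasoning and factuality keywords, but none to the other keywords on model architecture, training methods, compression and acceleration, or scientific applications.

!!! tip deepseek-chat TL;DR

The paper studies three failure modes of conventional in-context memory when large language models act as persistent knowledge workers (capacity limits, compaction loss, and goal drift) and shows that the Knowledge Objects approach outperforms it on accuracy, cost, and multi-hop reasoning.

Abstract (Chinese translation)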

大型语言模型日益成为持久性知识工作载体，其中上下文记忆（即存储在提示中的事实）是当前默认策略。本研究对上下文记忆与知识对象（Knowledge Objects，简称KOs）进行了系统性评估，后者采用离散化哈希寻址元组结构，具备O(1)检索复杂度。在上下文窗口内，Claude Sonnet 4.5模型在10至7,000条事实范围内实现了100%的精确匹配准确率（占其20万token窗口的97.5%）。然而实际生产部署揭示了三种失效模式：容量限制（8,000条事实导致提示溢出）、压缩损失（摘要化处理破坏60%的事实）以及目标漂移（级联压缩侵蚀54%的项目约束条件，而模型仍保持完全置信度）。相比之下，知识对象在所有测试条件下均保持100%准确率，且成本降低252倍。在多跳推理任务中，知识对象达到78.9%的准确率，而上下文记忆仅为31.6%。跨四个前沿模型的实验复现证实压缩损失是架构性缺陷而非模型特定问题。研究还发现嵌入检索在对抗性事实上表现失效（precision@1仅为20%），而神经记忆系统（Titans）虽能存储事实却无法实现按需检索。我们提出密度自适应检索机制作为切换策略，并开源完整的基准测试套件。

摘要 (Abstract)

Large language models increasingly serve as persistent knowledge workers, with in-context memory - facts stored in the prompt - as the default strategy. We benchmark in-context memory against Knowledge Objects (KOs), discrete hash-addressed tuples with O(1) retrieval. Within the context window, Claude Sonnet 4.5 achieves 100% exact-match accuracy from 10 to 7,000 facts (97.5% of its 200K window). However, production deployment reveals three failure modes: capacity limits (prompts overflow at 8,000 facts), compaction loss (summarization destroys 60% of facts), and goal drift (cascading compaction erodes 54% of project constraints while the model continues with full confidence). KOs achieve 100% accuracy across all conditions at 252x lower cost. On multi-hop reasoning, KOs reach 78.9% versus 31.6% for in-context. Cross-model replication across four frontier models confirms compaction loss is architectural, not model-specific. We additionally show that embedding retrieval fails on adversarial facts (20% precision at 1) and that neural memory (Titans) stores facts but fails to retrieve them on demand. We introduce density-adaptive retrieval as a switching mechanism and release the benchmark suite.

Keywords: Large Language Models, Persistent Memory, Knowledge Objects, In-context Learning, Retrieval-Augmented Generation, Multi-hop Reasoning, Context Window, LLM Agents
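The abstract describes Knowledge Objects as discrete hash-addressed tuples with O(1) retrieval. A minimal sketch of that idea follows; the `(subject, predicate, value)` tuple layout, the SHA-256 address, and the class and method names are assumptions for illustration and are not taken from the paper or its benchmark suite.

```python
import hashlib


class KnowledgeStore:
    def __init__(self):
        self._objects = {}  # address -> (subject, predicate, value)

    @staticmethod
    def address(subject, predicate):
        """Derive a stable address from the lookup key (assumed scheme)."""
        key = f"{subject}\x1f{predicate}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    def put(self, subject, predicate, value):
        self._objects[self.address(subject, predicate)] = (subject, predicate, value)

    def get(self, subject, predicate):
        """Average-case O(1) lookup; returns None when the fact is absent."""
        obj = self._objects.get(self.address(subject, predicate))
        return obj[2] if obj else None
```

Because each fact is fetched by address rather than read out of a prompt, the store is unaffected by context-window capacity or summarization, and a multi-hop question becomes a chain of lookups such as `get(get("alice", "manager"), "office")`.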

46. ❌ CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Authors: Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17775v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

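Scoring rationale: The paper studies improving the reasoning ability of LLMs through label-free reinforcement learning. Its core involves Chain of Thought/System 2 Thinking (mathematical reasoning) and Self-Correction (generator-verifier co-evolution), and it is highly relevant to LLMs. Hallucination Mitigation is partially relevant (the method addresses systematic errors). Other keywords such as MoE, SFT, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes CoVerRL, a generator-verifier co-evolution framework that addresses the consensus trap in label-free reinforcement learning, where maximizing self-consistency collapses output diversity. It improves over label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks.

Abstract (Chinese translation)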

无标注强化学习使大型语言模型能够在无需真实标注监督的情况下提升推理能力，其典型方法是将多数投票的答案视为伪标签。然而，我们发现了一个关键失效模式：随着训练过程最大化自我一致性，输出多样性会急剧下降，导致模型自信地强化那些难以被察觉的系统性错误。我们将此现象称为共识陷阱。为摆脱该陷阱，我们提出了CoVerRL框架，其中单一模型在生成器与验证器角色间交替运行，两种能力相互引导提升。多数投票为训练验证器提供了带有噪声但信息丰富的监督信号，而不断改进的验证器则逐步从伪标签中筛选出自我一致性的错误。这种协同进化形成了一个良性循环，使模型在整个训练过程中保持较高的奖励准确率。在Qwen和Llama系列模型上的实验表明，CoVerRL在数学推理基准测试中比无标注基线方法提升了4.7-5.9%。此外，自我验证准确率从约55%提升至85%以上，证实了两种能力确实实现了协同进化。

摘要 (Abstract)

Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.

Keywords: label-free reinforcement learning, large language models, reasoning capabilities, consensus trap, generator-verifier co-evolution, mathematical reasoning, self-consistency, pseudo-labels
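The filtering step described in the abstract, where majority-voted answers serve as pseudo-labels and a verifier removes self-consistent errors, can be sketched as follows. This is a simplified illustration under assumptions: `verifier` is a hypothetical callable returning a score in [0, 1], the threshold is arbitrary, and the training of generator and verifier (the co-evolution itself) is not modeled.

```python
from collections import Counter


def majority_pseudo_label(answers):
    """Return (answer, vote share) for the most common sampled answer."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def filtered_pseudo_label(question, answers, verifier, threshold=0.5):
    """Keep the majority answer only if the verifier accepts it;
    otherwise return None so the sample is skipped."""
    answer, _share = majority_pseudo_label(answers)
    if verifier(question, answer) >= threshold:
        return answer
    return None
```

Majority voting alone would reinforce a confidently wrong consensus; the verifier check is what allows a high-agreement but incorrect answer to be dropped from the pseudo-labels.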

47. ❌ Attention Sinks Induce Gradient Sinks

Authors: Yihong Chen, Quanming Yao; Source: arXiv; Published: 2026-03-18; arXiv link: http://arxiv.org/abs/2603.17771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

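Scoring rationale: The paper studies attention sinks in Transformer models and their relationship to gradient sinks, which is research into the underlying principles of large-model techniques. It is relevant to 'Large Language Models' (8) because the Transformer is the core architecture of LLMs; relevant to 'Pre-training' (8) because the study involves pretrained models and training mechanisms; and highly relevant to 'Mechanistic Interpretability' (10) because it explains the inner workings of attention from the backpropagation perspective, which falls under explainable AI. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper studies how attention sinks in Transformer models lead to massive activations through a training-time gradient sink mechanism, and supports this interpretation with the V-scale modification.

Abstract (Chinese translation)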

注意力汇聚点与大规模激活是Transformer模型中反复出现且密切相关的现象。现有研究主要集中于前向传播过程，导致其关联性究竟是直接的还是由训练机制所中介尚不明确。本文从反向传播的视角探讨这一问题。通过实证与理论分析，我们证明在因果掩码条件下，注意力汇聚点会引发显著的梯度集中现象，我们将其称为梯度汇聚点。此外，在采用RMSNorm的预归一化架构中，大规模激活可被理解为训练过程中对这种局部梯度压力的自适应响应。为验证这一假设，我们提出了V-scale——一种调整值路径反向传播梯度的改进方法。在经V-scale调整的预训练模型中，注意力汇聚点得以保留，而大规模激活则受到抑制。这些结果支持了以下解释：梯度汇聚点是连接注意力汇聚点与大规模激活的关键训练时中介因素。

摘要 (Abstract)

Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a training-time mechanism. We study this question from the perspective of backpropagation. Empirically and theoretically, we show that under causal mask, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Furthermore, in pre-norm architectures with RMSNorm, massive activations can be understood as an adaptive response to this localized gradient pressure during training. To test this hypothesis, we introduce V-scale, a modification that adjusts value-path backpropagated gradients. In pretrained V-scale models, attention sinks are preserved whereas massive activations are suppressed. These results support the interpretation that gradient sink is a key training-time mediator linking attention sinks and massive activations.
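
One elementary way to see why a sink token can collect gradient: for an attention output O = A V, the gradient reaching the value vectors is Aᵀ ∂L/∂O, so a key position that receives large attention weight from many queries also receives a large share of the value-path gradient. The sketch below is a toy illustration of that identity only. It is not the paper's analysis and does not implement V-scale; the logits and the `sink_bias` parameter are invented for the example.

```python
import math
from typing import List


def causal_softmax(logits: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax where query i may only attend to keys j <= i."""
    attn = []
    for i, row in enumerate(logits):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in visible]
        z = sum(exps)
        attn.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return attn


def value_grad_weight(attn: List[List[float]]) -> List[float]:
    """Column sums of A: the weight on each V_j in A^T G when all rows of G are equal."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


n, sink_bias = 6, 4.0  # toy sizes; sink_bias is a hypothetical extra logit on key 0
logits = [[sink_bias if j == 0 else 0.0 for j in range(n)] for _ in range(n)]
weights = value_grad_weight(causal_softmax(logits))
print([round(w, 2) for w in weights])
```

In this toy setting the first key position carries most of the column mass, so its value vector would receive most of the gradient if the upstream gradients were similar across query positions.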

Keywords: Attention Sinks, Gradient Sinks, Transformer Models, Backpropagation, Massive Activations, Pre-norm Architectures, RMSNorm, V-scale

48. ❌ Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Authors: Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

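Scoring rationale: The paper focuses on harmful humor detection and touches on the use of large models for safety alignment and deep reasoning. It is related to 'Large Language Models' (8 points) because it evaluates SOTA models; highly related to 'Instruction Tuning OR Alignment OR Value Alignment' (10 points) because it emphasizes safety alignment; and highly related to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (10 points each) because it targets implicit harmful humor, whose detection requires deep reasoning. The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multimodal, multilingual benchmark for harmful humor detection, finds that closed-source models outperform open-source ones, and stresses the need for culturally grounded, reasoning-aware safety alignment.

Translated abstract

Dark humor often relies on subtle cultural connotations and implicit cues that can only be understood through contextual reasoning, which raises safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually annotated dataset contains 3,000 texts and 6,000 images in English and Arabic, plus 1,200 videos covering English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: safe jokes are distinguished from harmful ones, and the latter are further divided into explicit (overt) and implicit (covert) categories in order to probe deep reasoning. We systematically evaluate state-of-the-art open-source and closed-source models across all modalities. The results show that closed-source models significantly outperform open-source ones, and that both groups show a clear performance gap between English and Arabic, which underscores the urgent need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

Abstract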

Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

Keywords: harmful humor detection, multimodal benchmark, multilingual dataset, safety alignment, deep reasoning, cultural nuances, implicit cues, model evaluation

49. ❌ Machine Learning for Network Attacks Classification and Statistical Evaluation of Machine Learning for Network Attacks Classification and Adversarial Learning Methodologies for Synthetic Data Generation

Authors: Iakovos-Christos Zarkadis, Christos Douligeris Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

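Scoring rationale: The paper focuses on traditional machine learning (ML) and adversarial learning in network intrusion detection systems (NIDS), applied to network attack classification and synthetic data generation. It covers supervised learning, adversarial learning, and synthetic data evaluation (the SDV framework, f-divergences, statistical tests), but does not mention large models (LLMs), innovations in deep learning techniques, or concrete AI for Science applications. All of the keywords concern large-model technology, deep learning innovation, or scientific AI applications, whereas this paper studies a specific application of traditional ML to network security and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The study develops stable machine learning models for network intrusion detection and uses adversarial learning to generate synthetic data with high fidelity and utility, evaluated on a unified multimodal NIDS dataset.

Translated abstract

Supervised detection of network attacks has always been a key component of network intrusion detection systems (NIDS). At this pivotal moment for artificial intelligence (AI), as attacks that exploit advanced techniques such as generative AI (GenAI) and reinforcement learning grow more sophisticated, it has become essential if we want to protect the personal data scattered across the web. This paper addresses two tasks on the first unified multimodal NIDS dataset. The dataset combines flow-level data, packet payload information, and temporal context features, and is derived from the CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15, and CIC-DDoS-2019 datasets reprocessed into a common feature space. In the first task, we apply machine learning (ML) algorithms with stratified cross-validation to guard against network attacks in a stable and reliable way. In the second task, we use adversarial learning algorithms to generate synthetic data, compare it with the real data, and assess its fidelity, utility, and privacy using the SDV (Synthetic Data Vault) framework, f-divergences, distinguishability tests, and non-parametric statistical tests. By combining the Synthetic Data Vault framework, the TRTS and TSTR tests, non-parametric statistical tests, and f-divergence measures, the study delivers stable ML models for intrusion detection and generative models with high fidelity and utility.

Abstract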

Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.
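
As a rough illustration of the f-divergence part of such a fidelity check, the sketch below compares the histogram of one real feature with its synthetic counterpart using the KL and Jensen-Shannon divergences, both members of the f-divergence family. It does not use the SDV library and is not the authors' pipeline; the feature values and the bin edges are invented.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized counts over [edges[k], edges[k+1]) bins.

    Values outside every bin are ignored; at least one value must fall in a bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; requires q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def js(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented values of a single feature in the real and the synthetic table.
real = [0.1, 0.2, 0.4, 0.5, 0.6, 0.9]
synthetic = [0.1, 0.3, 0.4, 0.7, 0.8, 0.9]
edges = [0.0, 0.33, 0.66, 1.0]

p, q = histogram(real, edges), histogram(synthetic, edges)
print(f"JS divergence: {js(p, q):.4f}")
```

The Jensen-Shannon form is used for the final number because it stays finite when a bin is empty in one of the two histograms, whereas plain KL does not.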

Keywords: network intrusion detection systems, machine learning, adversarial learning, synthetic data generation, NIDS dataset, statistical evaluation, f-divergence, SDV framework

50. ❌ SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Authors: Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17729v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

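Scoring rationale: The paper proposes the SARE framework, which uses large vision-language models (LVLMs) for training-free fine-grained visual recognition. Its core contributions are: (1) a cascaded design that combines fast candidate retrieval with fine-grained reasoning (related to "Retrieval-Augmented Generation"); (2) a self-reflective experience mechanism in the reasoning process that draws on past failures to provide transferable discriminative guidance (highly related to "Self-Correction OR Self-Improvement OR Self-Reflection"); (3) a multi-step reasoning process (highly related to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning"); and (4) in-depth reasoning to resolve visual ambiguity (related to "System 2 Thinking OR Slow Thinking OR In-depth Reasoning"). The paper builds on LVLMs and is therefore related to "Large Language Models OR LLMs OR Foundation Models", although its focus is vision-language models rather than pure language models. Other keywords, such as MoE, quantization, and alignment, are not covered.

!!! tip deepseek-chat TL;DR

To address uneven per-sample recognition difficulty and the inability to reuse experience from errors when large vision-language models are applied to fine-grained visual recognition, the paper proposes the SARE framework. With a cascaded retrieval-reasoning design and a self-reflective experience mechanism, it achieves state-of-the-art performance on 14 datasets while substantially reducing computational overhead.

Translated abstract

Recent progress in large vision-language models (LVLMs) has made training-free fine-grained visual recognition (FGVR) possible. However, because of the inherent visual ambiguity of subordinate-level categories, using LVLMs effectively for FGVR remains challenging. Existing methods mainly adopt either a retrieval-oriented or a reasoning-oriented paradigm to meet this challenge, but both are limited by two fundamental shortcomings: (1) they apply the same inference pipeline to every sample without considering uneven recognition difficulty, so neither accuracy nor efficiency is optimal; (2) they lack a mechanism for consolidating and reusing error-specific experience, so they fail repeatedly on similar hard cases. To overcome these limitations, we propose SARE, a sample-wise adaptive reasoning framework for training-free FGVR. Specifically, SARE uses a cascaded design that combines fast candidate retrieval with fine-grained reasoning and invokes the latter only when necessary. During reasoning, SARE introduces a self-reflective experience mechanism that uses past failures to provide transferable discriminative guidance, without any parameter updates. Extensive experiments on 14 datasets confirm that SARE achieves state-of-the-art performance while substantially reducing computational overhead.

Abstract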

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
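
The cascaded idea (cheap retrieval first, expensive reasoning only for hard samples) can be sketched as follows. This is a schematic under stated assumptions, not SARE's implementation: the margin-based confidence test, the `margin` threshold, the `reason` callback, and the hint list standing in for stored experience are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Reasoner = Callable[[List[str], List[str]], str]


def classify(
    scores: Dict[str, float],
    reason: Reasoner,
    experience: List[str],
    margin: float = 0.1,
    top_k: int = 3,
) -> Tuple[str, bool]:
    """Return (label, used_reasoning) for one sample.

    scores: retrieval similarity per candidate class (non-empty, higher is better).
    reason: slow path; picks a label from the candidates given stored hints.
    experience: textual hints distilled from past failures (read-only here).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) == 1 or scores[ranked[0]] - scores[ranked[1]] >= margin:
        return ranked[0], False  # easy sample: accept the retrieval result
    return reason(ranked[:top_k], experience), True  # ambiguous: escalate


def toy_reasoner(candidates: List[str], hints: List[str]) -> str:
    """Stand-in for an LVLM call: prefer a candidate that a hint mentions."""
    for c in candidates:
        if any(c in h for h in hints):
            return c
    return candidates[0]


hints = ["ring-billed gull: look for the dark ring on the bill"]
hard = classify({"herring gull": 0.52, "ring-billed gull": 0.50}, toy_reasoner, hints)
easy = classify({"cardinal": 0.90, "blue jay": 0.30}, toy_reasoner, hints)
print(hard)  # ('ring-billed gull', True)
print(easy)  # ('cardinal', False)
```

Only the ambiguous sample pays for the slow path, which is where the reduction in computational overhead would come from in a design of this kind.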

Keywords: Large Vision-Language Models, Fine-Grained Visual Recognition, Training-free, Adaptive Reasoning, Self-reflective Experience, Cascaded Design, Retrieval-Augmented, Computational Efficiency

51. ❌ Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)

Authors: Diederick C. Niehorster, Marcus Nyström Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

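Scoring rationale: The paper studies the application of the Segment Anything Model (SAM) from computer vision to eye image segmentation, and is an evaluation study of vision foundation models. It mainly covers performance comparison between vision models, prompt engineering (visual and concept prompts), and evaluation in a specific domain (eye images). Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the paper (5 points), because eye image segmentation belongs to biomedical image analysis and can be viewed as an application of AI in science and bioinformatics. The other keywords all focus on large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization) and are entirely unrelated to the core content of this computer vision paper (0 points).

!!! tip deepseek-chat TL;DR

The study evaluates Segment Anything Model 3 (SAM3) on eye image segmentation and finds that in most cases SAM3 does not outperform SAM2 and is slower, so SAM2 remains the best choice for eye image segmentation.

Translated abstract

Previous research has shown that vision foundation models deliver good zero-shot performance on eye image segmentation. This paper examines whether the latest version of the Segment Anything Model (SAM3) offers better eye image segmentation performance than SAM2, and explores the effect of its new concept (text) prompting mode. We evaluated eye image segmentation performance on diverse datasets that include both high-resolution, high-quality videos recorded in a lab environment and the challenging TEyeD eye video dataset acquired in uncontrolled settings. The results show that in most cases, with either visual or concept prompts, SAM3 did not outperform SAM2 on either the lab or the uncontrolled datasets. Because SAM2 not only performs better but also runs faster, we conclude that SAM2 currently remains the best choice for eye image segmentation. We provide an adapted version of the SAM3 codebase that supports processing videos of arbitrary duration.

Abstract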

Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3’s codebase that allows processing videos of arbitrary duration.

Keywords: Segment Anything Model, SAM3, eye image segmentation, visual prompts, concept prompts, TEyeD dataset, video processing, foundation models

52. ❌ From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving

Authors: A. Humnabadkar, A. Sikdar, B. Cave, H. Zhang, N. Bessis, A. Behera Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17714v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

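Scoring rationale: The paper is mainly concerned with simulation technology, synthetic data, and domain adaptation strategies for autonomous driving, and is applied research in computer vision and robotics. Vision-language models are mentioned, but only in their role of enhancing scene understanding, not as the paper's core technical focus. The content has no direct connection to the vast majority of the keywords, especially those about the technical principles of large models. The only weak connection is to 'Domain Adaptation' (part of 'Pre-training OR Continual Pre-training OR Domain Adaptation'): the paper discusses domain adaptation strategies from synthetic to real data, hence 5 points (some relevance). The other keywords are either absent from the paper or mentioned only as background, so they score 0.

!!! tip deepseek-chat TL;DR

This survey systematically reviews research progress, tools and platforms, and open challenges in using synthetic data, digital-twin simulation, and domain adaptation strategies in autonomous driving to overcome the scarcity of real-world data, improve generalization, and support safety validation.

Translated abstract

Autonomous driving technology has made significant progress in recent years, but its real-world deployment is still limited by data scarcity, safety requirements, and the need to generalize across diverse environments. In response, synthetic data and virtual environments have become important enabling tools, providing scalable, controllable, and richly annotated scenarios for training and evaluation. This paper surveys recent progress at the intersection of autonomous driving, simulation technology, and synthetic data. We map the research landscape along three core dimensions: (i) the use of synthetic data in perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies that connect synthetic and real data. We also highlight the role of vision-language models and simulation realism in improving scene understanding and generalization. The paper provides a detailed taxonomy of datasets, tools, and simulation platforms, and analyzes trends in benchmark design. Finally, we discuss key challenges and open research directions, including simulation-to-real (Sim2Real) transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning; resolving these will accelerate progress toward safe, generalizable, and globally deployable autonomous driving systems.

Abstract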

Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.

Keywords: autonomous driving, synthetic data, simulation, digital twin, domain adaptation, Sim2Real transfer, safety validation, vision-language models

53. ❌ MALLES: A Multi-agent LLMs-based Economic Sandbox with Consumer Preference Alignment

Authors: Yusen Wu, Yiran Liu, Xiaotie Deng Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

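Scoring rationale: The paper's core work builds a multi-agent economic sandbox on large language models (LLMs). It involves LLM applications, post-training/supervised fine-tuning (SFT), preference alignment, LLM agents, and multi-agent systems, so these keywords are highly relevant (10 points). The paper is an application of large models to economics and has some connection to 'AI for Science' (5 points). Other keywords, such as MoE, quantization, inference acceleration, and interpretability, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MALLES, a multi-agent economic sandbox based on large language models. By aligning the models with consumer preferences through post-training and adopting a multi-agent discussion framework, it significantly improves product selection accuracy, purchase prediction accuracy, and simulation stability.

Translated abstract

In the real economy, high-dimensional, multimodal environments pose a fundamental challenge to modern decision-making, and agent heterogeneity and combinatorial data sparsity make the problem harder still. This paper proposes a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), which uses the inherent generalization ability of large models to build a unified simulation framework for cross-domain and cross-category scenarios. At the core of our method is a preference learning paradigm: large language models are aligned along the economic dimension through post-training on massive, heterogeneous transaction records spanning many product categories. This paradigm lets the models internalize and transfer latent consumer preference patterns, which mitigates the data sparsity common within a single category. To improve simulation stability, we introduce a mean-field mechanism that models the dynamic interaction between the product environment and the customer population, effectively stabilizing sampling in high-dimensional decision spaces. In addition, we propose a multi-agent discussion framework in which specialized agents collaboratively process large volumes of product information. This architecture distributes cognitive load to relieve the attention bottleneck of a single agent, and captures key decision factors through structured dialogue. Experiments show that, compared with existing economic and financial LLM simulation baselines, our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability. The results confirm the potential of large language models to serve as a foundation for high-fidelity, scalable decision simulation and subsequent analysis in the real economy, built on foundational databases.

Abstract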

In the real economy, modern decision-making is fundamentally challenged by high-dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large-scale models to establish a unified simulation framework applicable to cross-domain and cross-category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. Furthermore, we propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single-agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and later analysis in the real economy based on foundational databases.

Keywords: Multi-agent LLMs, Economic Sandbox, Preference Alignment, Post-training, Mean-field Mechanism, Simulation Framework, Consumer Preference, Decision Simulation

54. ❌ Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization

Authors: Joohyoung Jeon, Hongchul Lee Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17692v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

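Scoring rationale: The paper's core work is validating the trustworthiness of LLM trading agents in a multi-agent system, using anonymization to address memorization bias and survivorship bias. Highly relevant keywords: LLMs (the core models), LLM Agents/Multi-agent Systems (the research framework), and Chain of Thought/System 2 Thinking (the agents output their reasoning). Moderately relevant keywords: Hallucination Mitigation (the work involves verifying that signals are genuine) and Explainable AI (interpretability is provided through a graph built from reasoning embeddings). Other keywords, such as MoE, SFT, and RAG, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes BlindTrade, an anonymization framework that blinds tickers and company names to test whether LLM trading agents truly understand market dynamics rather than relying on memorized associations. It achieves a Sharpe ratio of 1.40 on the 2025 YTD test and finds that the strategy performs well in volatile markets but shows reduced alpha in trending bull markets.

Translated abstract

For LLM trading agents to be truly trustworthy, they must demonstrate an understanding of market dynamics rather than merely relying on memorized ticker associations. Building responsible multi-agent systems requires rigorous signal validation: predictions must be shown to reflect legitimate patterns rather than pre-training memory. We study two sources of spurious performance: memorization bias caused by ticker-specific pre-training, and survivorship bias produced by flawed backtesting. Our solution is to "blindfold" the agents, anonymizing all identifiers, and verify whether meaningful signals still remain. BlindTrade anonymizes tickers and company names, and four LLM agents output scores together with their reasoning. We build a GNN graph from the reasoning embeddings and trade with a PPO-DSR policy. In the test covering 2025 year-to-date through August 1, we achieved a Sharpe ratio of 1.40 ± 0.22 across 20 random seeds and validated the legitimacy of the signal through negative control experiments. To assess robustness beyond a single out-of-sample window, we additionally evaluate an extended period (2024-2025), which reveals a dependence on market regime: the strategy excels in volatile markets but shows reduced alpha in trending bull markets.

Abstract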

For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents–anonymizing all identifiers–and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024–2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.
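
For readers unfamiliar with the metric, the sketch below shows one common way to compute an annualized Sharpe ratio from daily returns and to summarize it as mean ± standard deviation over seeds. The 252-trading-day convention, the zero risk-free rate, and the toy return series are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def sharpe(daily_returns: Sequence[float], risk_free_daily: float = 0.0,
           periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return / sample std * sqrt(periods).

    Needs at least two returns with non-zero variance.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


def summarize(per_seed: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed Sharpe ratios."""
    return statistics.mean(per_seed), statistics.stdev(per_seed)


# Toy return series for three hypothetical seeds (not data from the paper).
runs: List[List[float]] = [
    [0.010, -0.004, 0.006, 0.002, -0.001],
    [0.008, -0.006, 0.004, 0.003, 0.000],
    [0.012, -0.002, 0.005, -0.003, 0.001],
]
mean, std = summarize([sharpe(r) for r in runs])
print(f"Sharpe {mean:.2f} +/- {std:.2f}")
```

Reporting the spread over seeds, as the abstract does with 20 seeds, separates the effect of the strategy from run-to-run randomness in training.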

Keywords: LLM trading agents, multi-agent systems, anonymization, portfolio optimization, reasoning embeddings, GNN graph, PPO-DSR policy, signal validation
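The blinding step described in the abstract can be pictured with a small sketch. This is not BlindTrade's implementation: the alias table, the `ASSET_nnn` placeholder format, and the function name `build_blinder` are all assumptions made for illustration. It only shows the general idea of mapping every ticker and company name of an asset to one stable, meaningless tag before the text reaches an agent.

```python
import re

def build_blinder(assets):
    """Return a function that replaces known identifiers with stable placeholders.

    `assets` maps a hypothetical asset key to its aliases (ticker, company names).
    """
    alias_to_tag = {}
    for index, key in enumerate(sorted(assets)):
        for alias in assets[key]:
            alias_to_tag[alias] = f"ASSET_{index:03d}"
    # Try the longest aliases first, so "Apple Inc." is matched before "Apple".
    ordered = sorted(alias_to_tag, key=len, reverse=True)
    # Lookarounds stop a match from starting or ending inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)"
    )
    return lambda text: pattern.sub(lambda m: alias_to_tag[m.group(1)], text)

# Illustrative alias table, not taken from the paper.
blind = build_blinder({
    "AAPL": ["AAPL", "Apple Inc.", "Apple"],
    "MSFT": ["MSFT", "Microsoft"],
})
blinded = blind("Apple Inc. (AAPL) beat estimates while Microsoft guided lower.")
print(blinded)
```

A negative control in this spirit would feed the blinded text to the agents and check whether the signal survives; if performance collapses once names are hidden, the original signal was likely memorized.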

55. ❌ Objective Mispricing Detection for Shortlisting Undervalued Football Players via Market Dynamics and News Signals

Authors: Chinenye Omejieke, Shuyao Chen, Xia Cui Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

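Scoring rationale: The paper studies market-value assessment of football players using gradient-boosted regression and NLP features (sentiment statistics and semantic embeddings). It does not involve large models, the technical principles of deep learning, or applications in scientific domains, and is unrelated to all scoring keywords.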

!!! tip deepseek-chat TL;DR

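The paper proposes an objective mispricing detection framework, based on market dynamics and news signals, for shortlisting undervalued football players. It finds that market dynamics are the primary signal, while NLP features provide secondary gains that improve robustness and interpretability.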

Abstract

We present a practical, reproducible framework for identifying undervalued football players grounded in objective mispricing. Instead of relying on subjective expert labels, we estimate an expected market value from structured data (historical market dynamics, biographical and contract features, transfer history) and compare it to the observed valuation to define mispricing. We then assess whether news-derived Natural Language Processing (NLP) features (i.e., sentiment statistics and semantic embeddings from football articles) complement market signals for shortlisting undervalued players. Using a chronological (leakage-aware) evaluation, gradient-boosted regression explains a large share of the variance in log-transformed market value. For undervaluation shortlisting, ROC-AUC-based ablations show that market dynamics are the primary signal, while NLP features provide consistent, secondary gains that improve robustness and interpretability. SHAP analyses suggest the dominance of market trends and age, with news-derived volatility cues amplifying signals in high-uncertainty regimes. The proposed pipeline is designed for decision support in scouting workflows, emphasizing ranking/shortlisting over hard classification thresholds, and includes a concise reproducibility and ethics statement.

Keywords: mispricing detection, undervalued football players, market dynamics, NLP features, gradient-boosted regression, SHAP analysis, decision support, scouting workflows
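The abstract defines mispricing as the gap between an expected market value and the observed valuation, and stresses ranking over hard thresholds. A minimal sketch of that ranking step follows. It assumes the expected values have already been produced by some regressor, uses a log-scale gap because the paper models log-transformed value, and all player names and figures are invented.

```python
import math

def shortlist(players, k):
    """Rank players by log-scale mispricing (expected minus observed), keep the top k.

    `players` holds (name, expected_value, observed_value) tuples. Only players
    whose expected value exceeds the observed one count as undervalued.
    """
    scored = []
    for name, expected_value, observed_value in players:
        gap = math.log(expected_value) - math.log(observed_value)
        scored.append((gap, name))
    scored.sort(reverse=True)  # largest positive gap first
    return [name for gap, name in scored[:k] if gap > 0]

# Invented figures for illustration only.
players = [
    ("Player A", 12e6, 8e6),   # expected well above observed
    ("Player B", 5e6, 6e6),    # expected below observed, so not undervalued
    ("Player C", 20e6, 18e6),  # mildly undervalued
]
top = shortlist(players, k=2)
print(top)
```

Returning an ordered list rather than a yes/no label matches the decision-support framing: a scout reads the list from the top and stops wherever their budget of attention runs out.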

56. ❌ WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models

Authors: Wanjun Du, Zifeng Yuan, Tingting Chen, Fucai Ke, Beibei Lin, Shunli Zhang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17680v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

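Scoring rationale: The paper mainly studies the reasoning segmentation ability of vision-language models (VLMs) under adverse weather and is unrelated to most keywords. It relates only to "Chain of Thought/CoT Reasoning/Multi-step Reasoning" and "System 2 Thinking/Slow Thinking/In-depth Reasoning", because it evaluates VLMs along five reasoning dimensions (functionality, application scenarios, structural attributes, interactions, and requirement matching); this is reasoning-capability research, though not the paper's core contribution. It is weakly related to "Large Language Models/LLMs/Foundation Models" because an LLM is used to generate queries, but the focus is on VLMs rather than on LLM techniques themselves. The remaining keywords concern specific techniques (such as MoE, quantization, and alignment) or domains (such as bioinformatics) that the paper does not cover.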

!!! tip deepseek-chat TL;DR

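The paper introduces the WeatherReasonSeg benchmark for evaluating the reasoning segmentation ability of vision-language models under adverse weather. It finds that model performance degrades monotonically with weather severity and that different weather types induce distinct vulnerability patterns.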

Abstract

Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.

Keywords: WeatherReasonSeg, vision-language models, reasoning-based segmentation, adverse weather conditions, benchmark, robustness analysis, mask-guided LLM prompting, vulnerability patterns

57. ❌ Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models

Authors: Jaemin Kim, Jong Chul Ye Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

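Scoring rationale: The paper mainly studies retrieval-augmented generation (RAG) in diffusion models and is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points). It applies large models (Masked Diffusion Models) to knowledge-intensive tasks, which relates to "Large Language Models OR LLMs OR Foundation Models" (8 points). It aims to fix the loss of generation quality caused by noisy retrieved context, which relates to "Hallucination Mitigation OR Factuality OR Truthfulness" (8 points). The other keywords are unrelated to the paper (0 points).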

!!! tip deepseek-chat TL;DR

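To address the degradation in generation quality caused by noisy retrieved context in retrieval-augmented generation, the paper proposes ARAM, an adaptive guidance framework for masked diffusion models that dynamically calibrates the guidance scale to improve performance on knowledge-intensive QA tasks.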

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model’s parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.

Keywords: Retrieval-Augmented Generation, Masked Diffusion Models, Adaptive Guidance, Retrieval-Prior Conflicts, Knowledge-Intensive QA, Signal-to-Noise Ratio, Training-Free Framework

58. ❌ Inhibitory normalization of error signals improves learning in neural circuits

Authors: Roy Henha Eyono, Daniel Levenstein, Arna Ghosh, Jonathan Cornford, Blake Richards Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

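Scoring rationale: The paper studies inhibition-mediated normalization in biological neural circuits and its effect on learning, using artificial neural networks for simulation; it belongs to basic and computational neuroscience. All scoring keywords focus on large language models, deep learning techniques, and their applications, and this paper touches none of them: it mentions no language models, pre-training or fine-tuning techniques, inference optimization, alignment methods, agent systems, or model compression. Its core is a biologically inspired learning mechanism for neural networks, not large-model techniques or their application to scientific domains.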

!!! tip deepseek-chat TL;DR

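The study examines whether inhibitory normalization in biological neural circuits improves learning. It finds that normalization applied only in the forward pass does not help, but extending it to back-propagated error signals significantly improves performance on an image recognition task.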

Abstract

Normalization is a critical operation in neural circuits. In the brain, there is evidence that normalization is implemented via inhibitory interneurons and allows neural populations to adjust to changes in the distribution of their inputs. In artificial neural networks (ANNs), normalization is used to improve learning in tasks that involve complex input distributions. However, it is unclear whether inhibition-mediated normalization in biological neural circuits also improves learning. Here, we explore this possibility using ANNs with separate excitatory and inhibitory populations trained on an image recognition task with variable luminosity. We find that inhibition-mediated normalization does not improve learning if normalization is applied only during inference. However, when this normalization is extended to include back-propagated errors, performance improves significantly. These results suggest that if inhibition-mediated normalization improves learning in the brain, it additionally requires the normalization of learning signals.

Keywords: inhibitory normalization, neural circuits, artificial neural networks, learning improvement, error signal normalization, image recognition, back-propagated errors, biological learning mechanisms

59. ❌ Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

Authors: Philipp Normann, Andreas Happe, Jürgen Cito, Daniel Arp Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

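Scoring rationale: The paper's core topic is the application of LLM agents to security, specifically the Linux privilege escalation task. Highly relevant keywords: LLMs (a 4B model is used), SLMs (the emphasis is on small local models), Post-training/SFT (the core of the two-stage post-training pipeline), CoT Reasoning (the task requires multi-step interactive reasoning), and LLM Agents (the research subject). RLHF receives 5 points because the paper uses reinforcement learning with verifiable rewards, which the abstract does not describe as RLHF, RLAIF, or DPO. Other keywords such as MoE, Scaling Laws, PEFT, and RAG are not covered by the paper and receive 0.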

!!! tip deepseek-chat TL;DR

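The paper proposes a two-stage post-training recipe, supervised fine-tuning followed by reinforcement learning, to build a small local model (PrivEsc-LLM) for Linux privilege escalation. It reaches a 95.8% success rate on 12 held-out scenarios, close to the strongest closed model, while cutting the expected inference cost per successful escalation by more than 100x.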

Abstract

LLM agents are increasingly relevant to research domains such as vulnerability discovery. Yet, the strongest systems remain closed and cloud-only, making them resource-intensive, difficult to reproduce, and unsuitable for work involving proprietary code or sensitive data. Consequently, there is an urgent need for small, local models that can perform security tasks under strict resource budgets, but methods for developing them remain underexplored. In this paper, we address this gap by proposing a two-stage post-training pipeline. We focus on the problem of Linux privilege escalation, where success is automatically verifiable and the task requires multi-step interactive reasoning. Using an experimental setup that prevents data leakage, we post-train a 4B model in two stages: supervised fine-tuning on traces from procedurally generated privilege-escalation environments, followed by reinforcement learning with verifiable rewards. On a held-out benchmark of 12 Linux privilege-escalation scenarios, supervised fine-tuning alone more than doubles the baseline success rate at 20 rounds, and reinforcement learning further lifts our resulting model, PrivEsc-LLM, to 95.8%, nearly matching Claude Opus 4.6 at 97.5%. At the same time, the expected inference cost per successful escalation is reduced by over 100x.

Keywords: LLM agents, Linux privilege escalation, post-training, supervised fine-tuning, reinforcement learning, small local models, verifiable rewards, multi-step reasoning
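The abstract reports an "expected inference cost per successful escalation" without saying how it is computed. One common reading, assumed here, is the cost of a single run divided by the success rate, which is the expected spend until the first success when runs are independent retries. The success rates below come from the abstract; the per-run costs are placeholders and not figures from the paper, so the resulting ratio is purely illustrative.

```python
def expected_cost_per_success(cost_per_run, success_rate):
    """Expected spend until the first success, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate

# Success rates are from the abstract; costs per run are PLACEHOLDERS.
local = expected_cost_per_success(cost_per_run=0.01, success_rate=0.958)
cloud = expected_cost_per_success(cost_per_run=1.50, success_rate=0.975)
ratio = cloud / local
print(f"local: {local:.4f}  cloud: {cloud:.4f}  ratio: {ratio:.1f}x")
```

Under this reading, the 1.7-point gap in success rate barely moves the ratio; the saving is driven almost entirely by the difference in cost per run.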

60. ❌ FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Authors: Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

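Scoring rationale: The paper's core topic is hallucination in multimodal large language models (MLLMs), which is highly relevant to "Large Language Models" (10 points). The proposed FINER-Tuning method uses Direct Preference Optimization (DPO), which is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points). The work directly targets hallucination mitigation, which is highly relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points). It includes benchmarking and analysis, which relates to "Mechanistic Interpretability OR Explainable AI" (5 points). FINER-Tuning is a fine-tuning method, which relates to "Post-training OR Supervised Fine-tuning OR SFT" (5 points). The other keywords are unrelated to the paper (0 points).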

!!! tip deepseek-chat TL;DR

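The paper studies hallucination in multimodal large language models under fine-grained negative queries and proposes the FINER benchmarks and the DPO-based FINER-Tuning method, which substantially reduces hallucination and improves general multimodal capabilities.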

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.

Keywords: Multimodal Large Language Models, Hallucinations, Fine-grained Queries, Direct Preference Optimization, Benchmark, FINER-Tuning, MLLMs
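FINER-Tuning is described as applying Direct Preference Optimization to FINER-inspired data. For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from summed response log-probabilities. How the paper builds its pairs is not stated in the abstract; pairing a correct rejection of a fine-grained negative query (chosen) with a hallucinated confirmation (rejected) is an assumption made here, and the numbers are invented.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed log-probabilities.

    loss = -log(sigmoid(beta * margin)), where the margin measures how much
    more the policy favors the chosen response than the frozen reference does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Invented log-probabilities. "Chosen" stands for an answer that correctly says
# the queried detail is absent; "rejected" stands for a hallucinated "yes".
at_reference = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)       # policy favors the chosen answer
print(at_reference, improved)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy shifts probability toward the chosen response, which is the pressure that would discourage confirming details that are not in the image.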

61. ❌ Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Authors: Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

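Scoring rationale: The paper studies generative inbetweening, a computer vision and video generation topic, and tackles frame consistency and semantic alignment with techniques such as Keyframe-anchored Attention Bias and Rescaled Temporal RoPE. Although it involves attention mechanisms and generative models, every keyword is tied to large language models (LLMs), deep learning techniques, or AI applications in science, whereas this paper's core is video frame generation. It does not involve the techniques the keywords describe, such as LLMs, MoE, scaling laws, alignment, reasoning, agents, or quantization, so the relevance of every keyword is 0.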

!!! tip deepseek-chat TL;DR

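To address frame inconsistency and semantic misalignment in generative inbetweening, the paper proposes Keyframe-anchored Attention Bias and Rescaled Temporal RoPE, achieving state-of-the-art frame consistency, semantic fidelity, and pace stability without additional training.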

Abstract

Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

Keywords: Generative Inbetweening, Keyframe-anchored Attention Bias, Rescaled Temporal RoPE, Frame Consistency, Semantic Fidelity, Pace Stability, TGI-Bench, Intermediate Frame Synthesis

62. ❌ Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Authors: Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

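Scoring rationale: The paper studies local vision-language alignment for CLIP-based cross-domain few-shot learning (CDFSL) and proposes the CC-CDFSL method. Its core is fine-tuning and alignment of a vision-language model (CLIP), not large language model (LLM) techniques. Relevant keywords: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): the work involves domain adaptation and the cross-domain use of a pre-trained model (CLIP). (2) "Post-training OR Supervised Fine-tuning OR SFT" (5 points): the paper improves how CLIP is fine-tuned. (3) "Mechanistic Interpretability OR Explainable AI" (8 points): the paper enhances model interpretability and visualizes learned patterns. (4) "AI for Science OR Bioinformatics OR Cheminformatics" (8 points): the method targets scientific domains such as medical diagnosis, which fits AI for Science. The other keywords mainly concern LLMs, reasoning, alignment, and optimization, which are unrelated to the paper's focus on vision-language models and few-shot learning, and receive 0.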

!!! tip deepseek-chat TL;DR

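The paper addresses local vision-language misalignment of CLIP models in cross-domain few-shot learning and proposes CC-CDFSL, a cycle-consistency-based method that improves alignment, enhances interpretability, and achieves state-of-the-art performance.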

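Abstract (translation)

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt models trained on large-scale general data (the source domain) to downstream target domains with only scarce training data; research on vision-language models such as CLIP in this setting is still at an early stage. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, yet we find that current fine-tuned CLIP models, although they can roughly attend to important regions in source domains, struggle to focus on these key cues. While existing work has revealed CLIP's weakness in capturing subtle local patterns, this paper further finds that the domain gap and scarce training data markedly aggravate this weakness, far more than they affect holistic patterns; we call this the local misalignment problem in CLIP-based CDFSL. To address it, given the lack of supervision for aligning local visual features with text semantics, we turn to self-supervised information. Inspired by translation tasks, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then back into visual features (and vice versa), and achieves alignment by constraining the original features to stay close to the back-translated ones. To reduce noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism: it first augments visual features to provide a larger corpus for the text-to-image mapping, then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments across benchmark datasets, backbones, and fine-tuning methods show that our method can (1) effectively improve local vision-language alignment; (2) enhance the interpretability of learned patterns and model decisions by visualizing image patches; and (3) achieve state-of-the-art performance.

Abstract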

Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP’s shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
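
The cycle-consistency idea in the abstract can be illustrated with a toy sketch. Everything named here is an assumption made for illustration: the linear maps `W_v2t` and `W_t2v`, the helper functions, and the squared-error penalty. The abstract does not specify the paper's actual translators or loss.

```python
# Toy sketch of a cycle-consistency penalty (hypothetical names, not the paper's code).

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cycle_loss(x, forward, backward):
    """Squared distance between x and its round trip backward(forward(x))."""
    x_back = matvec(backward, matvec(forward, x))
    return sum((a - b) ** 2 for a, b in zip(x, x_back))


# Assumed linear "translators" between a 2-d visual space and a 2-d text space.
W_v2t = [[2.0, 0.0], [0.0, 0.5]]      # visual -> text
W_t2v = [[0.5, 0.0], [0.0, 2.0]]      # text -> visual (exact inverse of W_v2t)
W_t2v_bad = [[0.5, 0.0], [0.0, 1.0]]  # imperfect inverse, for comparison

visual_patch = [1.0, 3.0]
text_token = [4.0, 1.0]

# Both directions of the cycle ("and vice versa" in the abstract).
loss_v = cycle_loss(visual_patch, W_v2t, W_t2v)        # visual -> text -> visual
loss_t = cycle_loss(text_token, W_t2v, W_v2t)          # text -> visual -> text
loss_bad = cycle_loss(visual_patch, W_v2t, W_t2v_bad)  # nonzero: the cycle is broken
```

With exact inverse maps both losses vanish, while the imperfect inverse is penalized; in training, such a penalty would push the two translators toward mutual consistency without requiring local labels.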

Keywords: Cross-Domain Few-Shot Learning, Vision-Language Models, CLIP, Local Alignment, Interpretability, Cycle Consistency, Medical Diagnosis, Self-supervision

63. ❌ Automated Grammar-based Algebraic Multigrid Design With Evolutionary Algorithms

Authors: Dinesh Parthasarathy, Wayne Mitchell, Arjun Gambhir, Harald Köstler, Ulrich Rüde Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17641v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

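Scoring rationale: The paper studies the automated design of algebraic multigrid methods, using evolutionary algorithms and genetic programming to optimize multigrid cycling patterns; it belongs to computational mathematics and scientific computing. None of the keywords on large models, deep learning, language models, alignment, reasoning, or agents apply, so every keyword except “AI for Science” scores 0. “AI for Science” scores 5 because the paper uses an AI technique (evolutionary algorithms) to solve a scientific computing problem, but this is AI for science in the broad sense rather than a core large-model or deep learning application.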

!!! tip deepseek-chat TL;DR

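The paper proposes a new approach that uses evolutionary algorithms and genetic programming to automatically design efficient algebraic multigrid methods; by optimizing non-standard cycling patterns, it significantly improves multigrid performance both as a solver and as a preconditioner.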

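Abstract (translation)

Although multigrid is asymptotically optimal for solving many important partial differential equations, its efficiency depends heavily on the careful selection of its algorithmic components. Unlike recent approaches that use deep learning techniques to optimize particular multigrid components, we adopt a complementary strategy and use evolutionary algorithms to construct efficient multigrid cycles from proven algorithmic building blocks. This paper presents the application of that approach to generating efficient algebraic multigrid methods with so-called flexible cycling, that is, level-specific smoothing sequences and non-recursive cycling patterns. The search space of such non-standard cycles is too large to navigate manually, and we generate it with genetic programming guided by context-free grammars. Numerical experiments with the linear algebra library hypre show that these non-standard genetic-programming cycles have the potential to improve multigrid performance both as a solver and as a preconditioner.

Abstract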

Although multigrid is asymptotically optimal for solving many important partial differential equations, its efficiency relies heavily on the careful selection of the individual algorithmic components. In contrast to recent approaches that can optimize certain multigrid components using deep learning techniques, we adopt a complementary strategy, employing evolutionary algorithms to construct efficient multigrid cycles from proven algorithmic building blocks. Here, we will present its application to generate efficient algebraic multigrid methods with so-called flexible cycling, that is, level-specific smoothing sequences and non-recursive cycling patterns. The search space with such non-standard cycles is intractable to navigate manually, and is generated using genetic programming (GP) guided by context-free grammars. Numerical experiments with the linear algebra library hypre demonstrate the potential of these non-standard GP cycles to improve multigrid performance both as a solver and a preconditioner.
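
To make "a search space generated by a context-free grammar" concrete, the sketch below samples candidate cycles from a toy grammar. The grammar, the token names, and the `derive` helper are hypothetical and far simpler than anything the paper would use; only the grammar-constrained sampling step is shown, and fitness evaluation, selection, crossover, and mutation are omitted. This toy grammar also yields only nested (V-cycle-like) patterns, whereas the paper targets non-recursive ones.

```python
import random

# Hypothetical toy grammar (not the paper's). Keys are nonterminals;
# every other token is a terminal operation in the generated cycle.
GRAMMAR = {
    "CYCLE": [
        ["SMOOTH", "restrict", "CYCLE", "prolong", "SMOOTH"],  # descend one level and return
        ["coarse_solve"],                                       # stop at this level
    ],
    "SMOOTH": [["jacobi"], ["gauss_seidel"], ["jacobi", "gauss_seidel"], []],
}


def derive(symbol, rng, depth=0, max_depth=4):
    """Expand `symbol` into a flat list of terminals by random derivation."""
    if symbol not in GRAMMAR:
        return [symbol]
    options = GRAMMAR[symbol]
    if symbol == "CYCLE" and depth >= max_depth:
        options = [["coarse_solve"]]  # force termination on the coarsest level
    tokens = []
    for child in rng.choice(options):
        # Only descending into a nested CYCLE increases the level counter.
        child_depth = depth + 1 if symbol == "CYCLE" else depth
        tokens.extend(derive(child, rng, child_depth, max_depth))
    return tokens


# One sampled candidate; each SMOOTH is drawn independently, so the
# smoothing sequence can differ from level to level.
candidate = derive("CYCLE", random.Random(0))
```

Because every candidate is derived from the grammar, it is structurally valid by construction (each `restrict` is matched by a `prolong`), which is the main reason to constrain the search this way.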

Keywords: algebraic multigrid, evolutionary algorithms, genetic programming, flexible cycling, hypre library, solver optimization, preconditioner design, context-free grammars

64. ❌ Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies

Authors: Sinan Ibrahim, Grégoire Ouerdane, Hadi Salloum, Henni Ouerdane, Stefan Streif, Pavel Osinenko Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17631v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

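Scoring rationale: The paper focuses on benchmarking methodology for reinforcement learning (RL) and proposes a framework that uses stochastic converse optimality to generate systems with known optimal policies. Its content centers entirely on RL theory, algorithm evaluation, and system generation, and it does not touch any keyword area: large language models (LLMs), deep learning principles, model training methods (such as pre-training, fine-tuning, and alignment), inference optimization, agent architectures, or AI for Science applications. All keywords relate directly to large models or deep learning, whereas this paper is purely RL theory and methodology, so every keyword has a relevance of 0.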

!!! tip deepseek-chat TL;DR

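To address the difficulty of evaluating RL algorithms, the paper proposes a benchmarking framework based on stochastic converse optimality that generates systems with known optimal policies, giving RL algorithms a controlled and reproducible basis for precise evaluation.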

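Abstract (translation)

The objective comparison of reinforcement learning (RL) algorithms is extremely complex, because the outcomes and performance benchmarks of different RL approaches are highly sensitive to environment design, reward structures, and the stochasticity inherent in both algorithmic learning and environment dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. The framework provides necessary and sufficient conditions under which a prescribed value function and policy are optimal for the constructed system, which enables the systematic generation of benchmark families through homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating the framework's capacity for controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, this work provides a reproducible foundation for precise and rigorous RL benchmarking.

Abstract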

The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions, under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework’s capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.

Keywords: Reinforcement Learning, Benchmarking, Converse Optimality, Stochastic Systems, Optimal Policies, Control-Affine Systems, Nonlinear Systems, Algorithm Evaluation

65. ❌ VeriGrey: Greybox Agent Validation

Authors: Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Böhme, Abhik Roychoudhury Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

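Scoring rationale: VeriGrey focuses on security testing of LLM agents. It studies how LLM agents (agentic AI) interact with their environment through tool calls and develops a grey-box testing method to find security vulnerabilities. It is therefore highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each), which are the paper's core subjects. Other keywords, such as MoE, quantization, inference acceleration, and AI for Science, are not covered by the paper and score 0.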

!!! tip deepseek-chat TL;DR

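The paper proposes VeriGrey, a grey-box testing method for finding security vulnerabilities in LLM agents during tool invocation; compared with black-box methods, it identifies attack scenarios such as indirect prompt injection more effectively.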

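Abstract (translation)

Agentic AI has recently become a research topic of great interest. A Large Language Model (LLM) agent contains one or more LLMs in its back-end; in the front end, it makes autonomous decisions by combining the LLM outputs with results obtained by invoking several external tools. These autonomous interactions with the external environment introduce critical security risks.

This paper presents a grey-box approach for exploring the diverse behaviors of LLM agents and uncovering their security risks. Our approach, VeriGrey, uses the sequence of invoked tools as a feedback function to drive the testing process, which helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the agent's task to an injection task, so that the injection task becomes a necessary step in completing the agent's functionality. Compared with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end.

We also conduct real-world case studies with the widely used coding agent Gemini CLI and the well-known personal assistant OpenClaw. VeriGrey finds prompts that induce several attack scenarios that black-box approaches could not identify. In OpenClaw, by constructing a conversation agent that employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with a 10/10 = 100% success rate on the Kimi-K2.5 LLM backend and a 9/10 = 90% success rate on the Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey for testing agents, which may eventually lead to an agent assurance framework.

Abstract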

Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi-K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.
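
The feedback idea in the abstract (keep a mutated prompt only if it triggers a tool-invocation sequence not seen before) can be sketched with a toy loop. The `toy_agent` stand-in, the tool names, and the mutation tokens are all hypothetical; a real setup would call an actual LLM agent and record its tool calls. Mutations are enumerated deterministically here, in place of random mutation, so the example is reproducible.

```python
# Toy grey-box loop: the feedback signal is the sequence of tools the agent invokes.

def toy_agent(prompt):
    """Stand-in for an LLM agent: returns the tuple of tools it would invoke."""
    tools = ["read_inbox"]
    if "summarize" in prompt:
        tools.append("summarize")
    if "forward" in prompt:
        tools.append("send_email")  # treated as the risky tool in this toy setup
    return tuple(tools)


MUTATION_TOKENS = ["summarize", "forward", "please", "urgent"]


def greybox_fuzz(seed_prompt, rounds):
    """Grow a corpus of prompts, keeping only those with a novel tool sequence."""
    corpus = [seed_prompt]
    seen = {toy_agent(seed_prompt)}
    for _ in range(rounds):
        for parent in list(corpus):          # snapshot: corpus grows during the loop
            for token in MUTATION_TOKENS:
                candidate = f"{parent} {token}"
                trace = toy_agent(candidate)
                if trace not in seen:        # feedback: new behavior, keep the input
                    seen.add(trace)
                    corpus.append(candidate)
    return corpus, seen


corpus, seen = greybox_fuzz("check my mail", rounds=2)
```

A black-box tester would have no reason to prefer one mutated prompt over another; here the tool trace decides which inputs are worth mutating further, which is how the three-tool sequence is reached in the second round.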

Keywords: LLM agents, agentic AI, tool invocation, security risks, grey-box testing, prompt injection, AgentDojo, VeriGrey

66. ❌ rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks

Authors: Suryasis Jana, Abhik Ghosh Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17628v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

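Scoring rationale: The paper studies a robust training method for general neural network classifiers (rSDNet) that targets label noise and adversarial attacks; this is robustness research in conventional deep learning and machine learning. None of the keywords apply, since they cover large-model (LLM) techniques, large-model applications, AI for Science, and similar specific areas. The paper involves no large-model techniques (such as pre-training, fine-tuning, alignment, inference optimization, or agents) and no AI for Science application.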

!!! tip deepseek-chat TL;DR

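The paper proposes rSDNet, a unified robust neural network learning framework based on S-divergences that defends against label noise and adversarial attacks at the same time, and validates its effectiveness on image classification benchmarks.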

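Abstract (translation)

Neural networks are central to modern artificial intelligence, yet their training remains highly sensitive to data contamination. Standard neural classifiers are trained by minimizing the categorical cross-entropy loss, which corresponds to maximum likelihood estimation under a multinomial model. Although statistically efficient under ideal conditions, this approach is highly vulnerable to contaminated observations, including label noise that corrupts supervision in the output space and adversarial perturbations that induce worst-case deviations in the input space. This paper proposes a unified and statistically grounded framework for robust neural classification that handles both forms of contamination within a single learning objective. We formulate neural network training as a minimum-divergence estimation problem and introduce rSDNet, a robust learning algorithm based on the general class of $S$-divergences. The resulting training objective inherits the robustness properties of classical statistical estimation and automatically down-weights aberrant observations through model probabilities. We establish key population-level properties of rSDNet, including Fisher consistency, classification calibration implying Bayes optimality, and robustness guarantees under uniform label noise and infinitesimal feature contamination. Experiments on three benchmark image classification datasets show that rSDNet improves robustness to label corruption and adversarial attacks while maintaining competitive accuracy on clean data. Our results highlight minimum-divergence learning as a principled and effective framework for robust neural classification under heterogeneous data contamination.

Abstract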

Neural networks are central to modern artificial intelligence, yet their training remains highly sensitive to data contamination. Standard neural classifiers are trained by minimizing the categorical cross-entropy loss, corresponding to maximum likelihood estimation under a multinomial model. While statistically efficient under ideal conditions, this approach is highly vulnerable to contaminated observations, including label noise corrupting supervision in the output space and adversarial perturbations inducing worst-case deviations in the input space. In this paper, we propose a unified and statistically grounded framework for robust neural classification that addresses both forms of contamination within a single learning objective. We formulate neural network training as a minimum-divergence estimation problem and introduce rSDNet, a robust learning algorithm based on the general class of $S$-divergences. The resulting training objective inherits robustness properties from classical statistical estimation, automatically down-weighting aberrant observations through model probabilities. We establish essential population-level properties of rSDNet, including Fisher consistency, classification calibration implying Bayes optimality, and robustness guarantees under uniform label noise and infinitesimal feature contamination. Experiments on three benchmark image classification datasets show that rSDNet improves robustness to label corruption and adversarial attacks while maintaining competitive accuracy on clean data. Our results highlight minimum-divergence learning as a principled and effective framework for robust neural classification under heterogeneous data contamination.
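
The abstract's remark that cross-entropy training corresponds to maximum likelihood under a multinomial model can be written out in one line. The notation is assumed here rather than taken from the paper: $y_{ik}$ is the one-hot label of sample $i$ for class $k$, $p_k(x_i;\theta)$ is the network's predicted probability, $n$ is the sample size, and $K$ is the number of classes.

$$
\mathcal{L}_{\mathrm{CE}}(\theta)
= -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\,\log p_k(x_i;\theta)
= -\frac{1}{n}\,\log \prod_{i=1}^{n}\prod_{k=1}^{K} p_k(x_i;\theta)^{\,y_{ik}}
$$

The product on the right is the likelihood of the observed labels under a single-trial multinomial model, so minimizing $\mathcal{L}_{\mathrm{CE}}$ is the same as maximizing that likelihood. Because each sample enters through $\log p_k$, a mislabeled sample with a small predicted probability contributes an unboundedly large term, which is the sensitivity that robust divergence-based objectives aim to reduce.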

Keywords: robust neural classification, label noise, adversarial attacks, S-divergence, minimum-divergence estimation, data contamination, neural network training, image classification

67. ❌ Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

Authors: Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

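Scoring rationale: The paper studies 3D indoor scene editing using natural language instructions and PDDL-style symbolic planning, but it does not involve large models, deep learning principles, or applications in scientific domains. All keywords concern large-model techniques, training methods, inference optimization, agent systems, or AI for Science, whereas this paper focuses on a specific task in computer vision and robotics and is unrelated to these keywords.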

!!! tip deepseek-chat TL;DR

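The paper proposes the Edit-As-Act framework, which tackles natural-language-driven 3D indoor scene editing through goal-regressive planning and a symbolic action language, and significantly outperforms existing methods while preserving physical and semantic consistency.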

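Abstract (translation)

Editing a 3D indoor scene through natural language is conceptually intuitive but technically challenging. Existing open-vocabulary systems often need to regenerate large portions of the scene, or rely on image-space edits that disrupt spatial structure, leading to unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view: a user instruction defines a desired world state, and editing should be the minimal sequence of actions that achieves this state while leaving everything else unchanged. This view motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and a free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.

Abstract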

Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
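
Goal regression over actions with preconditions and effects, as described in the abstract, can be sketched in a few lines. This is a generic STRIPS-style illustration, not EditLang: the `Action` class, the predicate strings, and the two example actions are hypothetical, and the geometric checks and validator mentioned in the abstract are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def regress(goal, action):
    """Subgoal that must hold before `action` so that `goal` holds afterwards.

    Returns None if the action achieves no part of the goal or deletes part of it.
    """
    if not (action.add_effects & goal) or (action.del_effects & goal):
        return None
    return (goal - action.add_effects) | action.preconditions


def plan_backward(goal, initial_state, actions, max_depth=5):
    """Depth-limited backward search; returns action names in execution order."""
    if goal <= initial_state:
        return []
    if max_depth == 0:
        return None
    for action in actions:
        subgoal = regress(goal, action)
        if subgoal is None:
            continue
        prefix = plan_backward(subgoal, initial_state, actions, max_depth - 1)
        if prefix is not None:
            return prefix + [action.name]
    return None


# Hypothetical scene: books occupy the desk, the lamp stands on the floor.
clear_desk = Action(
    "clear_desk",
    preconditions=frozenset({"on(books,desk)"}),
    add_effects=frozenset({"clear(desk)"}),
    del_effects=frozenset({"on(books,desk)"}),
)
place_lamp = Action(
    "place_lamp_on_desk",
    preconditions=frozenset({"clear(desk)", "on(lamp,floor)"}),
    add_effects=frozenset({"on(lamp,desk)"}),
    del_effects=frozenset({"on(lamp,floor)", "clear(desk)"}),
)

initial = frozenset({"on(books,desk)", "on(lamp,floor)"})
goal = frozenset({"on(lamp,desk)"})
plan = plan_backward(goal, initial, [clear_desk, place_lamp])
```

Working backward from the goal means only actions that contribute to the requested state are ever considered, which matches the abstract's emphasis on minimal edits that leave the rest of the scene untouched.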

Keywords: 3D indoor scene editing, natural language instruction, goal-regressive planning, symbolic goal predicates, EditLang action language, physical feasibility, instruction fidelity, E2A-Bench benchmark

68. ❌ Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

Authors: Felix Schur Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17577v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

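Scoring rationale: The paper studies the theoretical problem of recovering latent actions and environment dynamics from offline trajectory data without action labels, which falls under representation learning and identifiability theory in reinforcement learning. All scoring keywords focus on large language models (LLMs) and related techniques (such as training methods, inference optimization, and applications), whereas this paper involves no language models, deep learning models, or concrete AI for Science applications. Its core is mathematical proof (nonnegative matrix factorization, identifiability conditions), which has no direct connection to any technical topic in the keyword list.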

!!! tip deepseek-chat TL;DR

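The paper studies whether latent actions and environment dynamics can be recovered from offline data when trajectories carry no action labels but do carry demonstrator identity, and proves that under demonstrator policy diversity and rank conditions the latent transitions and policies are identifiable (up to a permutation of the labels).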

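Abstract (translation)

Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories have no action annotations but are tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, that the environment dynamics are shared across demonstrators, and that identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels whose mixing weights vary by demonstrator. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to remove this final ambiguity. These results show that demonstrator diversity provides a principled source of identifiability for learning latent actions and dynamics from offline RL data.

Abstract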

Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.
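
The mixture structure described in the abstract can be spelled out as follows. The symbols $\pi_e$ (policy of demonstrator $e$), $T$ (shared transition kernel), $K$ (number of latent actions), and the matrix names are notation assumed here for illustration; only $p(o_{t+1}\mid o_t,e)$ appears in the abstract.

$$
p(o_{t+1}\mid o_t, e) \;=\; \sum_{a=1}^{K} \pi_e(a \mid o_t)\, T(o_{t+1}\mid o_t, a)
$$

For a fixed state $o_t$ with finitely many next observations, stacking the left-hand side over demonstrators gives a matrix identity:

$$
P = T\,\Pi, \qquad
P_{o',e} = p(o'\mid o_t, e), \quad
T_{o',a} = T(o'\mid o_t, a), \quad
\Pi_{a,e} = \pi_e(a\mid o_t)
$$

All three matrices are nonnegative and each of their columns sums to one, which is the column-stochastic nonnegative matrix factorization the abstract refers to. Relabeling the latent actions permutes the columns of $T$ and the rows of $\Pi$ without changing $P$, which is why identifiability can hold only up to a permutation.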

Keywords: latent actions, offline data, demonstrator diversity, identifiability, nonnegative matrix factorization, offline reinforcement learning, policy recovery, dynamics recovery

69. ❌ Unsupervised Symbolic Anomaly Detection

Authors: Md Maruf Hossain, Tim Katzke, Simon Klüttermann, Emmanuel Müller Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

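Scoring rationale: The SYRAN paper focuses on an unsupervised anomaly detection method based on symbolic regression, whose core is learning interpretable symbolic equations to detect anomalies. The method is unrelated to the vast majority of large-model and deep learning keywords (such as LLMs, MoE, training methods, inference optimization, and agents). It has a weak connection to only two keywords: 1) “Mechanistic Interpretability OR Explainable AI” (5 points): the paper emphasizes interpretability through human-readable equations, which belongs to explainable AI but is not mechanistic interpretability of large models; 2) “AI for Science OR Bioinformatics OR Cheminformatics” (5 points): the paper notes that its equations can correspond to known scientific or medical relationships, which suggests potential applications in science but is not its core technical focus. No other keyword has a direct connection.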

!!! tip deepseek-chat TL;DR

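The paper proposes SYRAN, an unsupervised anomaly detection method that uses symbolic regression to learn interpretable equations describing symbolic invariants of normal data, which yields highly interpretable anomaly detection with performance comparable to state-of-the-art methods.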

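Abstract (translation)

We propose SYRAN, an unsupervised anomaly detection method based on symbolic regression. Instead of encoding normal patterns in an opaque, high-dimensional model, our method learns a set of human-readable equations that describe symbolic invariants: functions that remain approximately constant on normal data. Deviations from these invariants yield anomaly scores, so the detection logic is interpretable in itself and does not rely on post-hoc explanation. Experimental results show that SYRAN is highly interpretable, producing equations that match known scientific or medical relationships, while maintaining strong anomaly detection performance comparable to that of state-of-the-art methods.

Abstract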

We propose SYRAN, an unsupervised anomaly detection method based on symbolic regression. Instead of encoding normal patterns in an opaque, high-dimensional model, our method learns an ensemble of human-readable equations that describe symbolic invariants: functions that are approximately constant on normal data. Deviations from these invariants yield anomaly scores, so that the detection logic is interpretable by construction, rather than via post-hoc explanation. Experimental results demonstrate that SYRAN is highly interpretable, providing equations that correspond to known scientific or medical relationships, and maintains strong anomaly detection performance comparable to that of state-of-the-art methods.
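
Scoring by deviation from an invariant can be shown with a small sketch. The invariant below is written by hand and stands in for an equation that symbolic regression would discover; the field names, the use of the median as the fitted constant, and the absolute-deviation score are assumptions for illustration, not SYRAN's actual procedure.

```python
from statistics import median

# Hand-written stand-in for a learned ensemble of invariants. On normal data
# distance = speed * time, so this ratio stays approximately constant.
INVARIANTS = [
    lambda r: r["distance"] / (r["speed"] * r["time"]),
]


def fit_constants(invariants, normal_rows):
    """Estimate the constant value each invariant takes on normal data."""
    return [median(f(r) for r in normal_rows) for f in invariants]


def anomaly_score(row, invariants, constants):
    """Sum of absolute deviations from the fitted invariant constants."""
    return sum(abs(f(row) - c) for f, c in zip(invariants, constants))


normal_rows = [
    {"speed": 10.0, "time": 2.0, "distance": 20.0},
    {"speed": 5.0, "time": 4.0, "distance": 20.2},
    {"speed": 8.0, "time": 1.0, "distance": 7.9},
]
constants = fit_constants(INVARIANTS, normal_rows)

suspicious = {"speed": 10.0, "time": 2.0, "distance": 35.0}
score_normal = anomaly_score(normal_rows[0], INVARIANTS, constants)
score_suspicious = anomaly_score(suspicious, INVARIANTS, constants)
```

The score is explained directly by the equation that was violated (here, the reported distance is inconsistent with speed and time), which is the sense in which the abstract calls the detection logic interpretable by construction.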

Keywords: unsupervised anomaly detection, symbolic regression, interpretable models, symbolic invariants, human-readable equations, anomaly scores, scientific applications, medical relationships
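
The invariant-based scoring described in the SYRAN abstract can be illustrated with a minimal sketch. It assumes a single hand-written invariant (`x2 - x0 * x1`, chosen purely for illustration) instead of equations found by symbolic regression, and it standardizes the deviation by the spread on normal data; neither choice is taken from the paper.

```python
import numpy as np

def fit_invariant(f, X_normal):
    """Estimate the value and spread of the invariant f on normal data."""
    values = f(X_normal)
    return values.mean(), values.std() + 1e-12

def anomaly_score(f, X, center, scale):
    """Standardized deviation of f(x) from its typical value on normal data."""
    return np.abs(f(X) - center) / scale

# Hypothetical invariant: x2 - x0 * x1 stays close to 0 on normal data.
def invariant(X):
    return X[:, 2] - X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
x0 = rng.uniform(1.0, 2.0, size=500)
x1 = rng.uniform(1.0, 2.0, size=500)
noise = rng.normal(0.0, 0.01, size=500)
X_normal = np.column_stack([x0, x1, x0 * x1 + noise])

center, scale = fit_invariant(invariant, X_normal)
X_test = np.array([
    [1.5, 1.5, 2.25],  # consistent with the invariant
    [1.5, 1.5, 3.00],  # violates the invariant
])
scores = anomaly_score(invariant, X_test, center, scale)
```

The second test point receives a much larger score than the first, and the equation that was violated is itself the explanation.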

70. ❌ FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models

Authors: Simon Klüttermann, Tim Katzke, Phuong Huong Nguyen, Emmanuel Müller Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

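Scoring rationale: The paper focuses on tabular foundation models (PFNs) for anomaly detection; its core contribution is the FoMo-X framework, which improves model explainability. It is therefore highly relevant to 'Foundation Models' (10 points), since PFNs are a kind of foundation model, and to 'Explainable AI' (10 points), since the work centers on opening up the black box and providing diagnostic signals. It has some relevance to 'Pre-training' (5 points) because the model uses a pretrained backbone. The remaining keywords (LLMs, MoE, RLHF, and so on) mainly target language models or specific techniques and are unrelated to this tabular anomaly detection study, so they score 0.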

!!! tip deepseek-chat TL;DR

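To address the lack of explainability in outlier detection foundation models, the paper proposes the FoMo-X framework, which attaches diagnostic heads that provide risk tiers and uncertainty measures, combining high performance with explainability.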

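Abstract (translated)

Tabular foundation models, in particular Prior-Data Fitted Networks (PFNs), have transformed outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. Despite their predictive power, however, these models usually operate as opaque black boxes that output only scalar outlier scores, lacking the operational context needed for safety-critical decisions. Existing post-hoc explanation methods are often too computationally expensive for real-time deployment, or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work we propose FoMo-X, a modular framework that gives OD foundation models intrinsic, lightweight diagnostic capabilities. We build on a key observation: the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings; the heads are trained offline with the same generative simulator prior as the backbone. This lets us distill computationally expensive properties, such as epistemic uncertainty based on Monte Carlo dropout, into deterministic, single-forward-pass inference. We instantiate FoMo-X with two novel diagnostic heads: a "Severity Head" that discretizes deviations into interpretable risk tiers, and an "Uncertainty Head" that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) shows that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy zero-shot outlier detection.

Abstract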

Tabular foundation models, specifically Prior-Data Fitted Networks (PFNs), have revolutionized outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. However, despite their predictive power, these models typically function as opaque black boxes, outputting scalar outlier scores that lack the operational context required for safety-critical decision-making. Existing post-hoc explanation methods are often computationally prohibitive for real-time deployment or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work, we introduce FoMo-X, a modular framework that equips OD foundation models with intrinsic, lightweight diagnostic capabilities. We leverage the insight that the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings, trained offline using the same generative simulator prior as the backbone. This allows us to distill computationally expensive properties, such as Monte Carlo dropout based epistemic uncertainty, into a deterministic, single-pass inference. We instantiate FoMo-X with two novel heads: a Severity Head that discretizes deviations into interpretable risk tiers, and an Uncertainty Head that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) demonstrates that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy, zero-shot outlier detection.

Keywords: foundation models, outlier detection, explainability, tabular data, PFNs, zero-shot inference, epistemic uncertainty, diagnostic heads
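
The idea of discretizing deviations into risk tiers can be sketched as follows. This is only an illustration of the discretization step: the abstract does not say how tiers are defined, so the quantile thresholds, the three-tier layout, and the function name `severity_tiers` are all assumptions, and the learned head that would predict these tiers from frozen embeddings is omitted.

```python
import numpy as np

def severity_tiers(scores, quantiles=(0.90, 0.99)):
    """Map scalar outlier scores to ordinal tiers (0 = low, 1 = medium, 2 = high).

    Thresholds are empirical quantiles of the scores (an assumed choice).
    """
    thresholds = np.quantile(scores, quantiles)
    tiers = np.digitize(scores, thresholds)
    return tiers, thresholds

# Hypothetical outlier scores, sorted for readability.
scores = np.linspace(0.0, 1.0, 101)
tiers, thresholds = severity_tiers(scores)
```

Labels produced this way could serve as training targets for a classification head, which is one plausible reading of how a severity head might be supervised.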

71. ❌ FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Authors: Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

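Scoring rationale: The paper focuses on technical problems of diffusion models in image-to-video generation, specifically tiled denoising for 4K high-resolution inputs. All scoring keywords concern large language models (LLMs), model training and alignment techniques, inference optimization, agent systems, or AI-for-science applications, whereas this paper studies diffusion-based video generation in computer vision and involves no LLM techniques or related concepts, so every keyword has relevance 0.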

!!! tip deepseek-chat TL;DR

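The paper addresses the inconsistent global layout that tiled denoising causes when diffusion models generate 4K ultra-high-resolution image-to-video output. It proposes FrescoDiffusion, a training-free method based on a precomputed latent prior that fuses in a global reference from a low-resolution video to improve spatio-temporal consistency while preserving local detail.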

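Abstract (translated)

Diffusion-based image-to-video (I2V) models are becoming increasingly effective, but they struggle to scale to ultra-high-resolution inputs (for example, 4K). Generating video at the model's native resolution usually loses fine-grained structure, while high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is especially severe for fresco animation: large artworks contain many distinct characters, objects, and semantically different sub-scenes that must stay spatially coherent over time. We propose FrescoDiffusion, a training-free method for generating coherent large-format I2V from a single complex image. The core idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the base model's resolution and upsample its latent trajectory to obtain a global reference that captures long-range spatio-temporal structure. For 4K generation, at every diffusion timestep we compute a noise prediction for each tile and fuse it with this reference by minimizing a weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term and yields a closed-form fusion update that strengthens global coherence while retaining fine detail. We also introduce a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show that our method improves global consistency and fidelity over tiled baselines while remaining computationally efficient. Our regularization makes the trade-off between creativity and consistency explicitly controllable.

Abstract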

Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model’s native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.

Keywords: Diffusion models, Image-to-video generation, 4K resolution, Tiled denoising, Global coherence, Latent prior, Fresco animation, Training-free method
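
The closed-form fusion mentioned in the abstract can be sketched on a 1-D signal. The exact objective is not given there, so this sketch assumes the simplest form consistent with the description: per position, minimize the sum of `w_i * (x - e_i)**2` over the tiles covering it plus `lam * (x - r)**2`, where `e_i` are tile predictions and `r` is the upsampled reference. Setting the derivative to zero gives `x = (sum(w_i * e_i) + lam * r) / (sum(w_i) + lam)`. The names `fuse_tiles` and `lam` are illustrative.

```python
import numpy as np

def fuse_tiles(tile_preds, tile_slices, tile_weights, reference, lam):
    """Closed-form minimizer of a weighted least-squares tile-merging objective
    with a quadratic pull toward a global reference (assumed form, lam > 0)."""
    num = lam * reference.astype(float)
    den = np.full(reference.shape, float(lam))
    for pred, sl, w in zip(tile_preds, tile_slices, tile_weights):
        num[sl] += w * pred   # accumulate weighted tile predictions
        den[sl] += w          # accumulate weights for normalization
    return num / den

reference = np.zeros(6)                       # upsampled low-resolution prior
tile_slices = [slice(0, 4), slice(2, 6)]      # two overlapping tiles
tile_preds = [np.ones(4), 3.0 * np.ones(4)]   # per-tile predictions
tile_weights = [1.0, 1.0]

fused = fuse_tiles(tile_preds, tile_slices, tile_weights, reference, lam=1.0)
```

As `lam` approaches 0 the update reduces to ordinary weighted tile averaging; larger values pull every position toward the reference. A spatially varying `lam` array would give the region-level control the abstract mentions.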

72. ❌ CLeAN: Continual Learning Adaptive Normalization in Dynamic Environments

Authors: Isabella Marasco, Davide Evangelista, Elena Loli Piccolomini, Michele Colajanni Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17548v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

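Scoring rationale: The paper focuses on an adaptive normalization technique for continual learning (CLeAN), applied to tabular data, to handle distribution shift in dynamic environments. All scoring keywords relate directly to large models, deep learning techniques, or AI-for-science applications, whereas this paper improves a normalization method from traditional machine learning and involves no large models, deep learning, or specific scientific application, so every keyword has relevance 0.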

!!! tip deepseek-chat TL;DR

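The paper proposes CLeAN, an adaptive normalization method for continual learning on tabular data that adapts to changing data distributions through learnable parameters and exponential-moving-average updates; experiments show that it improves model performance and mitigates catastrophic forgetting.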

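Abstract (translated)

Artificial intelligence systems rely mainly on static data distributions and therefore tend to perform poorly in dynamic real-world environments where data changes frequently, such as cybersecurity, autonomous driving, or finance. Continual learning offers a potential solution by letting models learn from sequential data while retaining prior knowledge. However, a critical and underexplored problem in this area is data normalization. Conventional normalization methods, such as min-max scaling, assume access to the entire dataset, which is incompatible with the sequential nature of continual learning. This paper proposes Continual Learning Adaptive Normalization (CLeAN), a new adaptive normalization technique designed for continual learning on tabular data. CLeAN estimates global feature scales with learnable parameters that are updated through an Exponential Moving Average (EMA) module, allowing the model to adapt to changing data distributions. Through a comprehensive evaluation on two datasets and several continual learning strategies (including Reservoir Experience Replay, A-GEM, and EwC), we show that CLeAN not only improves model performance on new data but also mitigates catastrophic forgetting. These findings highlight the importance of adaptive normalization for the stability and effectiveness of learning on tabular data, and offer a new perspective on using normalization to preserve knowledge in dynamic learning environments.

Abstract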

Artificial intelligence systems predominantly rely on static data distributions, making them ineffective in dynamic real-world environments, such as cybersecurity, autonomous transportation, or finance, where data shifts frequently. Continual learning offers a potential solution by enabling models to learn from sequential data while retaining prior knowledge. However, a critical and underexplored issue in this domain is data normalization. Conventional normalization methods, such as min-max scaling, presuppose access to the entire dataset, which is incongruent with the sequential nature of continual learning. In this paper we introduce Continual Learning Adaptive Normalization (CLeAN), a novel adaptive normalization technique designed for continual learning in tabular data. CLeAN involves the estimation of global feature scales using learnable parameters that are updated via an Exponential Moving Average (EMA) module, enabling the model to adapt to evolving data distributions. Through comprehensive evaluations on two datasets and various continual learning strategies, including Reservoir Experience Replay, A-GEM, and EwC, we demonstrate that CLeAN not only improves model performance on new data but also mitigates catastrophic forgetting. The findings underscore the importance of adaptive normalization in enhancing the stability and effectiveness of tabular data, offering a novel perspective on the use of normalization to preserve knowledge in dynamic learning environments.

Keywords: Continual Learning, Adaptive Normalization, Tabular Data, Dynamic Environments, Exponential Moving Average, Catastrophic Forgetting, Data Normalization, Sequential Data
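
The abstract's point that min-max scaling needs the whole dataset, while a continual learner sees batches one at a time, can be illustrated with a streaming scaler. The sketch keeps only the EMA ingredient: it tracks per-feature minima and maxima with an exponential moving average. The learnable parameters that CLeAN combines with the EMA module are not modelled, and the class name `EMAMinMaxScaler` and the `momentum` value are assumptions.

```python
import numpy as np

class EMAMinMaxScaler:
    """Min-max scaling whose feature ranges are tracked by an EMA over batches."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.min_ = None
        self.max_ = None

    def update(self, batch):
        """Blend the running range with the range observed in the new batch."""
        batch_min, batch_max = batch.min(axis=0), batch.max(axis=0)
        if self.min_ is None:
            self.min_, self.max_ = batch_min, batch_max
        else:
            m = self.momentum
            self.min_ = m * self.min_ + (1.0 - m) * batch_min
            self.max_ = m * self.max_ + (1.0 - m) * batch_max

    def transform(self, batch):
        """Scale with the current running range; guard against zero span."""
        span = np.maximum(self.max_ - self.min_, 1e-8)
        return (batch - self.min_) / span

scaler = EMAMinMaxScaler(momentum=0.5)
scaler.update(np.array([[0.0], [10.0]]))   # running range becomes [0, 10]
scaler.update(np.array([[10.0], [30.0]]))  # range drifts to [5, 20]
scaled = scaler.transform(np.array([[5.0], [20.0]]))
```

A higher `momentum` makes the range change more slowly, which trades responsiveness to drift against stability of previously learned inputs.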

73. ❌ Per-Domain Generalizing Policies: On Learning Efficient and Robust Q-Value Functions (Extended Version with Technical Appendix)

Authors: Nicola J. Müller, Moritz Oster, Isabel Valera, Jörg Hoffmann, Timo P. Gros Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

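Scoring rationale: The paper studies learning Q-value functions in reinforcement learning and planning, using graph neural networks and supervised learning to make policies more efficient and robust. Its content centers entirely on classical planning algorithms, Q-learning, and supervised learning, and involves no large language models, no innovations in deep learning techniques, and no AI-for-science applications. All keywords relate to large models, deep learning techniques, or AI-for-science applications, whereas this paper studies traditional reinforcement learning and planning methods, so every keyword scores 0.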

!!! tip deepseek-chat TL;DR

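The paper studies training efficient and robust Q-value functions with regularized supervised learning in learning for planning, as a replacement for the usual state-value approach, and shows in experiments across multiple domains that the resulting policies outperform state-value policies and are competitive with the LAMA-first planner.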

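Abstract (translated)

Learning per-domain generalizing policies is a core challenge in learning for planning. Standard approaches represent state-value functions as graph neural networks and train them with supervised learning on optimal plans generated by a teacher planner. This work argues for learning Q-value functions instead. Such policies are much cheaper to evaluate in a given state, because they only need to process the current state rather than every successor state. Surprisingly, training Q-values with plain supervised learning works poorly, because it does not learn to distinguish the actions the teacher planner took from those it did not. We solve this by adding regularization terms that enforce the distinction. Experiments show that the resulting Q-value policies consistently outperform state-value policies across 10 domains and are competitive with the LAMA-first planner.

Abstract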

Learning per-domain generalizing policies is a key challenge in learning for planning. Standard approaches learn state-value functions represented as graph neural networks using supervised learning on optimal plans generated by a teacher planner. In this work, we advocate for learning Q-value functions instead. Such policies are drastically cheaper to evaluate for a given state, as they need to process only the current state rather than every successor. Surprisingly, vanilla supervised learning of Q-values performs poorly as it does not learn to distinguish between the actions taken and those not taken by the teacher. We address this by using regularization terms that enforce this distinction, resulting in Q-value policies that consistently outperform state-value policies across a range of 10 domains and are competitive with the planner LAMA-first.

Keywords: Q-value functions, supervised learning, planning, graph neural networks, regularization, per-domain generalization, policy learning, efficiency
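
The regularization idea in the abstract, forcing the learner to separate the teacher's action from the others, can be sketched as a per-state loss. The abstract does not give the actual regularization terms, so the hinge-style margin below is an assumed stand-in. The sketch also assumes that Q-values estimate cost-to-go, so lower is better and the greedy policy takes the argmin.

```python
import numpy as np

def regularized_q_loss(q_values, teacher_action, target, margin=1.0, reg_weight=1.0):
    """Squared error on the teacher's action plus a hinge penalty (assumed form)
    whenever another action is not at least `margin` worse than the teacher's."""
    q_teacher = q_values[teacher_action]
    fit = (q_teacher - target) ** 2
    others = np.delete(q_values, teacher_action)
    hinge = np.maximum(0.0, q_teacher + margin - others)
    return fit + reg_weight * hinge.sum()

# Action 0 is the teacher's choice with cost-to-go 3.
tied = np.array([3.0, 3.0, 5.0])       # action 1 is indistinguishable from action 0
separated = np.array([3.0, 4.0, 5.0])  # every other action is at least 1 worse

loss_tied = regularized_q_loss(tied, teacher_action=0, target=3.0)
loss_separated = regularized_q_loss(separated, teacher_action=0, target=3.0)

# One forward pass over the current state is enough to pick an action.
greedy_action = int(np.argmin(separated))
```

Plain regression on the teacher's action alone would give both `tied` and `separated` zero loss, which is the failure mode the abstract describes.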

74. ❌ Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis

Authors: Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

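Scoring rationale: The paper focuses on an equivariant convolution method for 3D point cloud analysis (ECKConv). Its subject is geometric deep learning in computer vision, which is unrelated to all of the scoring keywords (which center on large models, deep learning techniques, and scientific applications). The paper involves no large models, language models, training techniques, reasoning methods, agent systems, or AI-for-science applications, so every keyword scores 0.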

!!! tip deepseek-chat TL;DR

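The paper proposes a new equivariant coordinate-based kernel convolution (ECKConv) that maintains both strict SE(3) symmetry and scalability in 3D point cloud analysis, and validates its strong performance on several point cloud tasks.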

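Abstract (translated)

Rigid-motion symmetry is one of the key factors in learning 3D point cloud problems efficiently. Group convolution is a representative way to extract equivariant features, but its implementations have struggled to achieve strict symmetry and scalability at the same time. We advocate the intertwiner framework to resolve this tension, but previous work on it did not achieve full SE(3) symmetry or scalability to large-scale problems, which calls for a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution (ECKConv). It obtains SE(3) equivariance from a kernel domain defined in a double coset space, and its explicit kernel design based on coordinate-based networks improves learning capacity and memory efficiency. Experiments on point cloud classification, pose registration, part segmentation, and large-scale semantic segmentation show that, compared with state-of-the-art equivariant methods, ECKConv delivers outstanding performance while maintaining rigid equivariance and memory scalability.

Abstract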

A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.

Keywords: SE(3) equivariance, point cloud analysis, coordinate-based kernels, group convolution, rigid motion symmetry, memory efficiency, 3D vision, equivariant features

75. ❌ Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Authors: Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17531v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

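Scoring rationale: The paper focuses on image zero-watermarking and studies image authentication under diffusion-model editing. All keywords relate to large models, deep learning techniques, or AI-for-science applications, none of which this paper addresses, so every keyword has relevance 0.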

!!! tip deepseek-chat TL;DR

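The paper proposes Rel-Zero, a zero-watermarking framework based on the invariance of relations between image patch pairs, which provides robust content authentication under AI editing without modifying the original image.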

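Abstract (translated)

Recent advances in diffusion-based image editing pose a serious threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to stay robust, which inevitably harms visual fidelity. Meanwhile, existing zero-watermarking methods usually rely on global image features and struggle to withstand sophisticated manipulations. In this work we identify a key phenomenon: although individual image patches change substantially during AI-based editing, the relative distance relations between patch pairs remain relatively invariant. Building on this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that does not modify the original image and instead derives a unique zero-watermark from these editing-invariant relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive and resilient mechanism for content authentication. Extensive experiments show that Rel-Zero achieves substantially better robustness than prior zero-watermarking methods across a range of editing models and manipulations.

Abstract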

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

Keywords: zero-watermarking, AI editing, diffusion models, patch-pair invariance, content authentication, robust watermarking, image manipulation, non-invasive watermarking
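
The patch-pair idea can be illustrated with a toy signature. This is a sketch under strong assumptions, not the Rel-Zero pipeline: patch features are plain mean intensities, the "edit" is a global brightness and contrast change rather than a diffusion edit, and bits are obtained by thresholding pair distances at their median. The function names and the random key `pairs` are hypothetical.

```python
import numpy as np

def patch_means(image, patch=8):
    """Mean intensity of each non-overlapping patch, flattened to a vector."""
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    blocks = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return blocks.mean(axis=(1, 3)).ravel()

def relation_signature(image, pairs, patch=8):
    """One bit per patch pair: is the pair's feature distance above the median?"""
    feats = patch_means(image, patch)
    dist = np.abs(feats[pairs[:, 0]] - feats[pairs[:, 1]])
    return (dist > np.median(dist)).astype(np.uint8)

def match_rate(sig_a, sig_b):
    """Fraction of agreeing bits between two signatures."""
    return float((sig_a == sig_b).mean())

rng = np.random.default_rng(0)
image = rng.random((32, 32))                # 4 x 4 grid of 8 x 8 patches
pairs = rng.integers(0, 16, size=(64, 2))   # secret key: which patch pairs to compare

signature = relation_signature(image, pairs)   # stored; the image is not modified
edited = 0.8 * image + 0.1                     # toy edit: contrast and brightness
rate_edited = match_rate(signature, relation_signature(edited, pairs))
```

Every patch value changes under the toy edit, yet the ordering of pair distances is preserved, so the recomputed signature agrees with the stored one. This is the kind of relational invariance the abstract relies on, shown here only for a trivially simple edit.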

76. ❌ Informative Semi-Factuals for XAI: The Elaborated Explanations that People Prefer

Authors: Saugat Aryal, Mark T. Keane Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

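Scoring rationale: The paper belongs to explainable AI (XAI) and proposes a new semi-factual explanation method (ISF) intended to generate more detailed, more informative explanations. It is unrelated to most keywords (those covering large-model techniques, training methods, inference optimization, application areas, and so on), because those keywords mainly target large language models and deep learning techniques, whereas this paper studies explanation-generation algorithms in traditional XAI. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI': the paper directly studies explainable AI methods, which is core to that area, so it receives 10 points.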

!!! tip deepseek-chat TL;DR

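To address the lack of explanatory power in existing semi-factual explanation methods, the paper proposes the informative semi-factuals (ISF) algorithm, which generates more detailed explanations by revealing hidden features that influence the decision; experiments and a user study show that the resulting explanations are of higher quality and are preferred by users.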

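Abstract (translated)

In recent years, an "even if" explanation strategy known as semi-factual explanation has become increasingly popular in explainable AI (XAI); it shows how a prediction can stay the same even when certain input features change. For example, in the commonly used banking-app scenario, a semi-factual explanation can tell a customer about better options or other alternatives for a successful application by saying: "Even if you asked for double the loan amount, your application would still be accepted." Most semi-factual XAI algorithms focus on finding the maximal value change to a single key feature that does not change the outcome (unlike counterfactual explanations, which usually look for minimal value changes to several features that do change the outcome). However, no current semi-factual method explains why these extreme value changes leave the outcome unaffected; for example, a more informative semi-factual could tell the customer that it is their good credit score that lets them borrow twice the requested amount. In this work we propose a new algorithm, the informative semi-factuals (ISF) method, which extends semi-factual explanations into more elaborated explanations by adding information about additional hidden features that influence the automated decision. Experimental results on benchmark datasets show that the ISF method computes semi-factuals that are both informative and of high quality on key metrics. In addition, a user study shows that people prefer these elaborated explanations over the simple semi-factual explanations produced by existing methods.

Abstract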

Recently, in eXplainable AI (XAI), *even if* explanations – so-called semi-factuals – have emerged as a popular strategy that explains how a predicted outcome *can remain the same* even when certain input-features are altered. For example, in the commonly-used banking app scenario, a semi-factual explanation could inform customers about better options, other alternatives for their successful application, by saying “*Even if* you asked for double the loan amount, you would still be accepted”. Most semi-factuals XAI algorithms focus on finding maximal value-changes to a single key-feature that do *not* alter the outcome (unlike counterfactual explanations that often find minimal value-changes to several features that alter the outcome). However, no current semi-factual method explains *why* these extreme value-changes do not alter outcomes; for example, a more informative semi-factual could tell the customer that it is their good credit score that allows them to borrow double their requested loan. In this work, we advance a new algorithm – the *informative semi-factuals* (ISF) method – that generates more elaborated explanations supplementing semi-factuals with information about additional *hidden features* that influence an automated decision. Experimental results on benchmark datasets show that this ISF method computes semi-factuals that are both informative and of high-quality on key metrics. Furthermore, a user study shows that people prefer these elaborated explanations over the simpler semi-factual explanations generated by current methods.

Keywords: Explainable AI, XAI, semi-factual explanations, informative semi-factuals, algorithm, hidden features, user study, elaborated explanations
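
The basic semi-factual search that the abstract contrasts ISF with (the largest change to one feature that leaves the outcome unchanged) can be sketched as follows. The loan rule in `predict`, the feature names, and the candidate grid are all hypothetical, and the search assumes the outcome flips at most once as the feature grows. The additional ISF step of identifying the hidden feature behind the result is not described in enough detail in the abstract to reproduce here.

```python
def max_unchanged_value(predict, instance, feature, candidates):
    """Largest candidate value of `feature` (above the current one) that keeps
    the prediction unchanged; stops at the first candidate that flips it."""
    base = predict(instance)
    best = instance[feature]
    for value in sorted(candidates):
        if value <= instance[feature]:
            continue
        trial = dict(instance, **{feature: value})
        if predict(trial) != base:
            break
        best = value
    return best

# Hypothetical decision rule: approve loans up to 40x the credit score.
def predict(applicant):
    return applicant["loan"] <= 40 * applicant["credit_score"]

applicant = {"loan": 10_000, "credit_score": 700}
candidates = range(10_000, 40_001, 5_000)
semi_factual_loan = max_unchanged_value(predict, applicant, "loan", candidates)
```

With this toy rule the search returns 25,000, which reads as "even if you asked for 25,000 you would still be accepted". An informative variant would add that the credit score of 700 is what makes this possible.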

77. ❌ AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection

Authors: Manuel Barusco, Davide Dalle Pezze, Francesco Borsatti, Gian Antonio Susto Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

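Rationale: The paper studies visual anomaly detection (VAD) and focuses on teacher-student architectures, lightweight adapters, edge-deployment optimization, and continual learning in computer vision. It is unrelated to every scoring keyword, all of which center on large language models, deep-learning techniques, and their scientific applications. The paper involves no large models, language models, prompt engineering, alignment, reasoning, agents, or compression, and it is not applied to bioinformatics or other scientific fields.

!!! tip deepseek-chat TL;DR

The paper proposes AdapTS, a lightweight teacher-student framework for multi-class and continual visual anomaly detection. A shared frozen backbone with lightweight adapters sharply reduces memory overhead: it matches existing methods on MVTec AD and VisA, and its smallest variant needs only 8 MB of additional memory, making it suitable for edge deployment.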

Abstract

Visual Anomaly Detection (VAD) is crucial for industrial inspection, yet most existing methods are limited to single-category scenarios, failing to address the multi-class and continual learning demands of real-world environments. While Teacher-Student (TS) architectures are efficient, they remain unexplored for the Continual Setting. To bridge this gap, we propose AdapTS, a unified TS framework designed for multi-class and continual settings, optimized for edge deployment. AdapTS eliminates the need for two different architectures by utilizing a single shared frozen backbone and injecting lightweight trainable adapters into the student pathway. Training is enhanced via a segmentation-guided objective and synthetic Perlin noise, while a prototype-based task identification mechanism dynamically selects adapters at inference with 99% accuracy. Experiments on MVTec AD and VisA demonstrate that AdapTS matches the performance of existing TS methods across multi-class and continual learning scenarios, while drastically reducing memory overhead. Our lightest variant, AdapTS-S, requires only 8 MB of additional memory, 13x less than STFPM (95 MB), 48x less than RD4AD (360 MB), and 149x less than DeSTSeg (1120 MB), making it a highly scalable solution for edge deployment in complex industrial environments.

Keywords: Visual Anomaly Detection, Teacher-Student Architecture, Continual Learning, Multi-class Detection, Lightweight Adapters, Edge Deployment, Memory Efficiency, Industrial Inspection
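
The prototype-based task identification mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each task's prototype is the mean of its training feature vectors and that the adapter is chosen by highest cosine similarity; the abstract specifies neither choice, and the function names and toy features below are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_prototypes(features_by_task):
    """One prototype per task: the mean feature vector (an assumption)."""
    return {task: mean_vector(feats) for task, feats in features_by_task.items()}


def select_adapter(feature, prototypes):
    """Return the task whose prototype is most similar to the input feature."""
    return max(prototypes, key=lambda task: cosine(feature, prototypes[task]))


# Toy features standing in for frozen-backbone embeddings (hypothetical values).
features_by_task = {
    "bottle": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "cable": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}
prototypes = build_prototypes(features_by_task)
chosen = select_adapter([0.95, 0.15, 0.05], prototypes)
```

In this toy setup the selected key would index into a table of per-task adapters; how the paper stores and loads adapters is not stated in the abstract.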

78. ❌ AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting

Authors: Binqing Wu, Zongjiang Shang, Shiyu Liu, Jianlong Huang, Jiahui Xu, Ling Chen Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

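Rationale: The paper focuses on air quality forecasting and proposes AirDDE, a deep-learning framework based on neural delay differential equations; it is an application of AI to environmental science. None of the keywords on core large-model topics (model techniques, training methods, inference optimization, alignment, agent systems) apply, so every keyword except “AI for Science” scores 0. “AI for Science” scores 5 because the paper applies AI to a scientific field (environmental science), but it is neither bioinformatics nor cheminformatics and involves no large-model techniques, so its relevance is limited.

!!! tip deepseek-chat TL;DR

To address the neglect of pollutant propagation delays in air quality forecasting, the paper proposes AirDDE, the first neural delay differential equation framework for the task. With a memory-augmented attention module and a physics-guided delay evolving function, it achieves state-of-the-art forecasting performance on three real-world datasets, reducing MAE by 8.79% on average.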

Abstract

Accurate air quality forecasting is essential for public health and environmental sustainability, but remains challenging due to the complex pollutant dynamics. Existing deep learning methods often model pollutant dynamics as an instantaneous process, overlooking the intrinsic delays in pollutant propagation. Thus, we propose AirDDE, the first neural delay differential equation framework in this task that integrates delay modeling into a continuous-time pollutant evolution under physical guidance. Specifically, two novel components are introduced: (1) a memory-augmented attention module that retrieves globally and locally historical features, which can adaptively capture delay effects modulated by multifactor data; and (2) a physics-guided delay evolving function, grounded in the diffusion-advection equation, that models diffusion, delayed advection, and source/sink terms, which can capture delay-aware pollutant accumulation patterns with physical plausibility. Extensive experiments on three real-world datasets demonstrate that AirDDE achieves the state-of-the-art forecasting performance with an average MAE reduction of 8.79% over the best baselines. The code is available at https://github.com/w2obin/airdde-aaai.

Keywords: Air Quality Forecasting, Neural Delay Differential Equations, Pollutant Dynamics, Memory-Augmented Attention, Physics-Guided Modeling, Diffusion-Advection Equation, State-of-the-Art Performance, MAE Reduction
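
To make the idea of a delay differential equation concrete, here is a minimal scalar sketch. It is not AirDDE's learned evolving function: it assumes a hand-written toy equation dx/dt = -decay * x(t) + delayed_gain * x(t - tau) + source, integrated with forward Euler, and all parameter names are hypothetical.

```python
def integrate_dde(x0, decay, delayed_gain, source, delay_steps, dt, n_steps):
    """Forward-Euler integration of a toy scalar delay differential equation.

    dx/dt = -decay * x(t) + delayed_gain * x(t - tau) + source,
    with tau = delay_steps * dt. The history before t = 0 is held at x0.
    """
    history = [x0]
    for step in range(n_steps):
        x_now = history[-1]
        delayed_index = step - delay_steps
        # Before enough history exists, fall back to the constant pre-history.
        x_delayed = history[delayed_index] if delayed_index >= 0 else x0
        dx = -decay * x_now + delayed_gain * x_delayed + source
        history.append(x_now + dt * dx)
    return history


# A concentration that decays locally while receiving a delayed inflow.
trajectory = integrate_dde(
    x0=1.0, decay=1.0, delayed_gain=0.5, source=0.0,
    delay_steps=5, dt=0.1, n_steps=50,
)
```

The point of the sketch is only that the update at time t reads a past state x(t - tau) from a history buffer, which an instantaneous model would not do.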

79. ❌ KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition

Authors: Gaoge Han, Zhengqing Gao, Ziwen Li, Jiaxin Huang, Shaoli Huang, Fakhri Karray, Mingming Gong, Tongliang Liu Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

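Rationale: The paper proposes KineVLA, a vision-language-action (VLA) model, which relates to applying large models (LLMs) in robotics; it therefore has moderate relevance (5 points) to keywords such as “Large Language Models” and “AI for Science”. The paper touches on instruction tuning, supervised fine-tuning, multi-step reasoning, and explainable AI, so it has some relevance (5 points) to “Instruction Tuning”, “Post-training”, “Chain of Thought”, and “Explainable AI”. It does not directly involve specific techniques such as MoE, quantization, RAG, or RLHF, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the KineVLA framework, which uses bi-level action decomposition for robotic manipulation tasks where the goal stays fixed but the motion trajectory must adapt to instruction-level kinematic variations. Simulation and real-robot experiments show more precise, controllable, and generalizable manipulation behavior.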

Abstract

In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens to serve as explicit, supervised intermediate variables that align language and action. To support this task, we construct the kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.

Keywords: vision-language-action, kinematics-aware, bi-level action decomposition, robotic manipulation, instruction-level kinematic variations, fine-grained manipulation, VLA framework, kinematics-rich task

80. ❌ Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Authors: Madhav S. Baidya, S. S. Baidya, Chirag Chawla Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

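Rationale: The paper's core topic is AI-generated text detection, with LLMs (keyword 1) as both the detection target and a detection tool, hence 10 points. It uses fine-tuned transformer encoders for detection, which relates somewhat to SFT (keyword 6), hence 5 points. The interpretability analysis of the XGBoost model relates to Explainable AI (keyword 23), hence 5 points. Other keywords such as MoE, SLMs, Scaling Laws, Alignment, RAG, and Agents are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

Through a comprehensive benchmark, the paper evaluates a range of AI-generated text detection methods. Transformer models perform very well in-distribution but generalize poorly across domains, perplexity-based methods are effective once corrected, and no method generalizes robustly across domains and LLMs.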

Abstract

The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.

Keywords: AI-generated text detection, large language models, benchmark, cross-domain generalization, adversarial robustness, transformer models, perplexity-based detection, stylometric analysis
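
The perplexity-based detection and the polarity issue described in the abstract can be sketched as follows. This is an illustration only, not the paper's detector: it assumes per-token natural-log probabilities have already been obtained from some scoring language model, and the threshold and the `low_means_machine` flag are hypothetical names introduced here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_machine_generated(token_logprobs, threshold, low_means_machine=True):
    """Threshold perplexity with an explicit polarity.

    low_means_machine=True flags text whose perplexity falls below the
    threshold, matching the abstract's observation that modern LLM outputs
    show lower perplexity than human text. Setting it to False gives the
    opposite polarity.
    """
    ppl = perplexity(token_logprobs)
    return ppl < threshold if low_means_machine else ppl > threshold


# Four tokens, each assigned probability 0.5 by the scoring model (toy values).
sample = [math.log(0.5)] * 4
flagged = is_machine_generated(sample, threshold=5.0)
```

Making the polarity a parameter means a calibration set can decide the direction of the comparison instead of hard-coding it.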

81. ❌ QuantFL: Sustainable Federated Learning for Edge IoT via Pre-Trained Model Quantisation

Authors: Charuka Herath, Yogachandran Rahulamathavan, Varuna De Silva, Sangarapillai Lambotharan Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

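Rationale: The paper studies quantization in federated learning and is highly relevant to “Quantization OR Model Compression OR Low-bit Weights” (10 points), because its core idea is using quantization to cut communication overhead. It has some relevance to “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points) because it uses pre-trained model initialization. No other keyword applies, since the paper focuses on federated learning and quantization rather than large language models, reasoning, alignment, AI for science, or similar areas.

!!! tip deepseek-chat TL;DR

The paper proposes the QuantFL framework, which combines pre-trained model initialization with lightweight quantization to substantially reduce the communication energy of federated learning on edge IoT devices, achieving high accuracy with low-bit transmission on MNIST and CIFAR-100.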

Abstract

Federated Learning (FL) enables privacy-preserving intelligence on Internet of Things (IoT) devices but incurs a significant carbon footprint due to the high energy cost of frequent uplink transmission. While pre-trained models are increasingly available on edge devices, their potential to reduce the energy overhead of fine-tuning remains underexplored. In this work, we propose QuantFL, a sustainable FL framework that leverages pre-trained initialisation to enable aggressive, computationally lightweight quantisation. We demonstrate that pre-training naturally concentrates update statistics, allowing us to use memory-efficient bucket quantisation without the energy-intensive overhead of complex error-feedback mechanisms. On MNIST and CIFAR-100, QuantFL reduces total communication by 40% ($\simeq 40\%$ total-bit reduction with full-precision downlink; $\geq 80\%$ on uplink or when downlink is quantised) while matching or exceeding uncompressed baselines under strict bandwidth budgets; BU attains 89.00% (MNIST) and 66.89% (CIFAR-100) test accuracy with orders of magnitude fewer bits. We also account for uplink and downlink costs and provide ablations on quantisation levels and initialisation. QuantFL delivers a practical, “green” recipe for scalable training on battery-constrained IoT networks.

Keywords: Federated Learning, Quantization, IoT, Pre-trained Models, Energy Efficiency, Communication Reduction, Edge Computing, Model Compression
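
The abstract does not define its bucket quantisation scheme, so the following is only a generic sketch of low-bit update compression. It assumes plain uniform quantisation: each value in a client update is mapped to the nearest of `n_levels` evenly spaced levels between the update's minimum and maximum. All function names are hypothetical.

```python
def quantise(update, n_levels):
    """Map each value to the index of the nearest of n_levels uniform levels.

    Returns (codes, lo, step); only the small integer codes plus two floats
    would need to be transmitted.
    """
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")
    lo, hi = min(update), max(update)
    if hi == lo:
        # Constant update: a single level reproduces it exactly.
        return [0] * len(update), lo, 0.0
    step = (hi - lo) / (n_levels - 1)
    codes = [round((value - lo) / step) for value in update]
    return codes, lo, step


def dequantise(codes, lo, step):
    """Reconstruct approximate values from integer codes."""
    return [lo + code * step for code in codes]


update = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, lo, step = quantise(update, n_levels=5)
restored = dequantise(codes, lo, step)
```

With 16 levels each code fits in 4 bits, compared with 32 bits for a float32 value, before counting the two floats sent alongside the codes. The reconstruction error per value is at most half a step, so updates that are tightly concentrated lose less precision at the same bit width.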

82. ❌ Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization

Authors: Ahmet Kaplan Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17478v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

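Rationale: The paper studies an AutoML approach to wireless beamforming and waveform optimization that converts an iterative algorithm into a deep neural network; it is an application of conventional deep learning to a specific engineering domain. All keywords relate to large models (LLMs), whereas this paper involves no large-model techniques and uses only conventional deep neural network architectures. The only slightly relevant keyword is “Mechanistic Interpretability OR Explainable AI”, because the paper mentions a transparency tool (per-layer sum-rate logging); this is a side note rather than core content, hence 5 points. As for “AI for Science”, the paper is a scientific application, but the keyword explicitly targets bioinformatics or cheminformatics, while this paper concerns wireless communication engineering, so it does not match.

!!! tip deepseek-chat TL;DR

The paper proposes an auto-unrolled proximal gradient descent method that combines AutoML with deep unfolding to optimize wireless beamforming and waveforms. With only five unrolled layers and 100 training samples, it reaches 98.8% of the spectral efficiency of a conventional 200-iteration solver while remaining highly interpretable.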

Abstract

This study explores the combination of automated machine learning (AutoML) with model-based deep unfolding (DU) for optimizing wireless beamforming and waveforms. We convert the iterative proximal gradient descent (PGD) algorithm into a deep neural network, wherein the parameters of each layer are learned instead of being predetermined. Additionally, we enhance the architecture by incorporating a hybrid layer that performs a learnable linear gradient transformation prior to the proximal projection. By utilizing AutoGluon with a tree-structured parzen estimator (TPE) for hyperparameter optimization (HPO) across an expanded search space, which includes network depth, step-size initialization, optimizer, learning rate scheduler, layer type, and post-gradient activation, the proposed auto-unrolled PGD (Auto-PGD) achieves 98.8% of the spectral efficiency of a traditional 200-iteration PGD solver using only five unrolled layers, while requiring only 100 training samples. We also address a gradient normalization issue to ensure consistent performance during training and evaluation, and we illustrate per-layer sum-rate logging as a tool for transparency. These contributions highlight a notable reduction in the amount of training data and inference cost required, while maintaining high interpretability compared to conventional black-box architectures.

Keywords: AutoML, deep unfolding, proximal gradient descent, wireless beamforming, waveform optimization, hyperparameter optimization, interpretability, spectral efficiency
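
Deep unfolding treats each iteration of an iterative solver as one network layer with its own parameters. The sketch below shows that structure for proximal gradient descent on a small LASSO problem, chosen here as a stand-in because its proximal step (soft thresholding) is simple; the paper's beamforming objective, projection, and hybrid layer are not reproduced. The per-layer step sizes are fixed by hand, whereas in an unfolded network they would be learned.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |x|: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def unrolled_pgd(A, y, lam, step_sizes):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 with one PGD step per layer.

    step_sizes holds one step size per unrolled layer; its length is the
    network depth.
    """
    At = transpose(A)
    x = [0.0] * len(A[0])
    for step in step_sizes:
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)  # gradient of the smooth term
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, grad)]
    return x


identity = [[1.0, 0.0], [0.0, 1.0]]
solution = unrolled_pgd(identity, y=[3.0, 0.2], lam=0.5, step_sizes=[1.0] * 5)
```

With an identity matrix and unit step, a single layer already lands on the exact minimiser, which is the soft-thresholded observation; harder problems are where tuning a few per-layer step sizes can substitute for many fixed-step iterations.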

83. ❌ Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning

Authors: Zelin Zang, Yehui Yang, Fei Wang, Liangyu Li, Baigui Sun Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

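Rationale: The paper focuses on unsupervised domain adaptation (UDA) in computer vision and proposes a Transformer method that combines beneficial noise with cross-scale matching. The only highly relevant keyword is “Pre-training OR Continual Pre-training OR Domain Adaptation”, because the paper explicitly studies domain adaptation, which is a direct sub-area of that keyword. The other keywords mainly concern LLM techniques, applications, or related concepts (such as alignment, reasoning, and agents), whereas this paper studies vision Transformers for cross-domain feature alignment and does not involve LLMs, AI-for-science applications, or the other specific techniques. All keywords other than the domain adaptation one therefore score 0.

!!! tip deepseek-chat TL;DR

To address the performance degradation that domain and scale gaps cause in unsupervised domain adaptation, the paper proposes a framework that combines cross-attention regularized by beneficial noise with cross-scale matching, achieving state-of-the-art performance on several benchmark datasets.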

Abstract

Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging “truck” class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.

Keywords: Unsupervised Domain Adaptation, Cross-Attention, Beneficial Noise, Domain-Adaptive Transformer, Cross-Scale Matching, Feature Alignment, Transformer, Robust Representation Learning
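
The idea of regularising cross-attention with injected perturbations can be sketched in a few lines. The abstract does not say where or how the noise enters, so this example assumes zero-mean Gaussian noise added to the scaled attention logits; that placement, the function name, and the toy inputs are assumptions, not the DACSM design.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_cross_attention(queries, keys, values, noise_std=0.0, rng=None):
    """Scaled dot-product cross-attention with optional Gaussian logit noise.

    queries come from one domain, keys/values from the other. Adding noise
    to the logits is an assumed placement for the perturbation.
    """
    rng = rng or random.Random(0)
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        logits = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        if noise_std > 0:
            logits = [logit + rng.gauss(0.0, noise_std) for logit in logits]
        weights = softmax(logits)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


source_queries = [[1.0, 0.0]]
target_keys = [[1.0, 0.0], [1.0, 0.0]]
target_values = [[0.0, 2.0], [4.0, 6.0]]
clean = noisy_cross_attention(source_queries, target_keys, target_values)
perturbed = noisy_cross_attention(
    source_queries, target_keys, target_values,
    noise_std=0.5, rng=random.Random(1),
)
```

Without noise, the two identical keys receive equal weight and the output is the mean of the two value vectors; with noise, the weights shift at random while the output remains a convex combination of the values.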

84. ❌ VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection

Authors: Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

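Rationale: The paper proposes Visual-referred Probabilistic Prompt Learning (VirPro) for weakly supervised monocular 3D detection. The method involves multi-modal pre-training (the “Pre-training” keyword scores 8) because it develops an adaptive multi-modal pre-training paradigm that can be integrated into weakly supervised frameworks. However, the paper mainly concerns computer vision (3D detection) and vision-language fusion rather than innovations in large models or deep-learning principles. It does not cover LLMs, MoE, SLMs, scaling laws, post-training, alignment, RLHF, parameter-efficient fine-tuning, RAG, long context, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science. All keywords other than the pre-training one therefore score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty hand-crafted text descriptions have in capturing visual diversity in weakly supervised monocular 3D detection. It proposes Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pre-training approach that improves average precision by up to 4.8% on the KITTI benchmark.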

摘要翻译

单目三维目标检测通常依赖伪标签技术以减少对真实世界标注的依赖。最新研究表明，确定性语言线索可作为有效的辅助弱监督信号，提供互补的语义上下文。然而，人工设计的文本描述难以捕捉不同场景中个体固有的视觉多样性，限制了模型学习场景感知表征的能力。为解决这一挑战，我们提出视觉参照概率提示学习（Visual-referred Probabilistic Prompt Learning, VirPro），这是一种可自适应融入多种弱监督单目三维检测框架的多模态预训练范式。具体而言，我们生成一组跨场景的多样化、可学习的实例条件提示，并将其存储于自适应提示库（Adaptive Prompt Bank, APB）中。随后，我们引入多高斯提示建模（Multi-Gaussian Prompt Modeling, MGPM），将基于场景的视觉特征融入对应的文本嵌入中，使文本提示能够表达视觉不确定性。接着，我们从融合的视觉-语言嵌入中解码出针对提示的高斯分布，并从中为每个实例推导出统一的对象级提示嵌入。通过采用感兴趣区域对比匹配来加强模态对齐，使同一场景中共现对象的嵌入在潜在空间中更接近，从而提升语义连贯性。在KITTI基准上的大量实验表明，集成我们的预训练范式能持续带来显著的性能提升，相比基线方法平均精度最高提升4.8%。

Abstract

Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model’s ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement than the baseline.

Keywords: weakly-supervised monocular 3D detection, visual-referred probabilistic prompt learning, multi-modal pretraining, adaptive prompt bank, multi-Gaussian prompt modeling, vision-language fusion, RoI-level contrastive matching, KITTI benchmark
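
The abstract above describes decoding a prompt-targeted Gaussian and deriving one object-level prompt embedding per instance. The report does not say how that embedding is derived, so the following is only a minimal sketch under an assumed recipe: draw a few samples with the reparameterization trick and average them. The function name, the `log_var` parameterization, and `num_samples` are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_prompt_embedding(mu, log_var, num_samples=8, rng=None):
    """Average `num_samples` draws from a diagonal Gaussian N(mu, exp(log_var)).

    Assumed recipe for illustration only: each draw is mu + sigma * eps
    with eps ~ N(0, 1), i.e. the reparameterization trick.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    dim = len(mu)
    acc = [0.0] * dim
    for _ in range(num_samples):
        for d in range(dim):
            sigma = math.exp(0.5 * log_var[d])
            acc[d] += mu[d] + sigma * rng.gauss(0.0, 1.0)
    return [a / num_samples for a in acc]
```

With a small variance the averaged embedding stays close to `mu`; a larger variance lets the same prompt express more visual uncertainty, which is the intuition the abstract gives for modeling prompts as distributions.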

85. ❌ VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Authors: Junyoung Kim, Woojoo Kim, Jaehyung Lim, Dongha Kim, Hwanjo Yu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17450v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

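Scoring rationale: The paper studies Vision-Language Models (VLMs) as multimodal encoders for sequential recommendation; its core contribution is resolving modality collapse during supervised fine-tuning (SFT). It is therefore highly relevant to 'Supervised Fine-tuning OR SFT' (10 points), since the paper explicitly mentions 'standard contrastive supervised fine-tuning (SFT)' and proposes an improved method. It is somewhat relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points): the paper takes inspiration from LLMs in using VLMs, but its main focus is VLMs rather than pure LLMs. Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, and RAG, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the modality collapse that vision-language models (VLMs) exhibit during supervised fine-tuning for multimodal sequential recommendation. It proposes the VLM2Rec framework, which uses Weak-modality Penalized Contrastive Learning and Cross-Modal Relational Topology Regularization to balance modality utilization, improving recommendation accuracy and robustness.

Abstract (Chinese translation)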

多模态场景下的序列推荐通常依赖于小型冻结预训练编码器，这限制了语义容量，并阻碍了协同过滤信号充分整合到物品表征中。受近期大语言模型作为高容量嵌入器取得成功的启发，我们探索将视觉语言模型用作序列推荐中具有协同过滤感知能力的多模态编码器。然而，我们发现，为适应嵌入生成并注入协同过滤信号而采用的标准对比监督微调方法，可能加剧模型固有的模态坍缩问题。在此状态下，优化过程被单一模态主导，而其他模态性能退化，最终损害推荐准确性。为解决这一问题，我们提出VLM2Rec——一个基于视觉语言模型嵌入器的多模态序列推荐框架，旨在确保模态利用的平衡。具体而言，我们引入弱模态惩罚对比学习以修正优化过程中的梯度失衡，并采用跨模态关系拓扑正则化来保持模态间的几何一致性。大量实验表明，在不同场景下，VLM2Rec在准确性和鲁棒性方面均持续优于现有先进基线方法。

Abstract

Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify its inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.

Keywords: Vision-Language Models, Multimodal Sequential Recommendation, Modality Collapse, Supervised Fine-tuning, Contrastive Learning, Cross-Modal Regularization, Collaborative Filtering, Embedding Generation
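
The abstract above refers to "standard contrastive supervised fine-tuning" without spelling out the objective. A common choice for such training is in-batch InfoNCE, sketched below as an assumption. This is not the paper's Weak-modality Penalized Contrastive Learning, whose form is not given in this report; the function names and the temperature of 0.07 are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, items, temperature=0.07):
    """Mean in-batch InfoNCE loss; items[i] is the positive for queries[i].

    Every other item in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, item) / temperature for item in items]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)
```

If one modality dominates the similarity scores, the gradient of a loss like this is driven mostly by that modality, which is the kind of imbalance the abstract says VLM2Rec is designed to correct.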

86. ❌ When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution

Authors: Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17445v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

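Scoring rationale: The paper focuses on traceability and auditing in multi-agent language systems. It is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Multi-agent Systems/Agent Coordination' (10 points each), because its core subject is interaction, coordination, and attribution in multi-agent systems. It is somewhat relevant to 'Large Language Models/LLMs/Foundation Models' (8 points), since multi-agent systems are typically built on large language models. Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how to trace accountability in multi-agent language systems when execution logs are unavailable. It proposes the IET framework, which embeds keyed signals into generated text to enable token-level attribution and interaction topology reconstruction; experiments show that it recovers agent contributions and coordination structure with high accuracy.

Abstract (Chinese translation)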

当多智能体系统产生错误或有害答案时，若执行日志与智能体标识不可用，责任应如何追溯？多智能体语言系统日益依赖委托、迭代优化等结构化交互，但最终输出往往掩盖了底层的交互拓扑与智能体贡献。本文提出IET（隐式执行追踪），一种不依赖元数据的框架，可直接从生成文本中实现词元级归因，并提供简单的交互拓扑重建机制。在生成过程中，特定于智能体的密钥信号被嵌入词元分布，将文本转化为仅能通过密钥检测的自描述执行轨迹。在检测阶段，一种基于转移感知的评分方法可识别智能体交接点并重建交互图。实验表明，IET能以高精度恢复智能体分段与协作结构，同时保持生成质量，从而为多智能体语言系统实现隐私保护的审计功能。

Abstract

When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.

Keywords: multi-agent systems, agent attribution, execution tracing, interaction topology, language agents, privacy-preserving auditing, token-level attribution, agent coordination
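
The abstract above says that agent-specific keyed signals are embedded into the token distribution and can be detected only with the secret key. The report gives no construction, so the sketch below borrows a generic keyed vocabulary-split watermark as a stand-in. The HMAC-based split, the hard "always pick a favored token" bias, and all function names are assumptions made for illustration, not IET's actual scheme.

```python
import hashlib
import hmac


def favored(key: bytes, prev_token: str, token: str) -> bool:
    """Keyed pseudo-random split of the vocabulary, reseeded by the previous token."""
    message = f"{prev_token}|{token}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return digest[0] % 2 == 0


def agent_score(key: bytes, tokens: list[str]) -> float:
    """Fraction of transitions that land in the key's favored half.

    Text generated without this key should score near 0.5.
    """
    hits = sum(favored(key, p, t) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)


def generate(key: bytes, start: str, vocab: list[str], length: int) -> list[str]:
    """Toy generator: at each step emit the first favored token in a rotated vocabulary."""
    out = [start]
    for step in range(length):
        offset = step % len(vocab)
        rotated = vocab[offset:] + vocab[:offset]
        choice = next((t for t in rotated if favored(key, out[-1], t)), rotated[0])
        out.append(choice)
    return out
```

Scoring a sliding window of the text under each agent's key, and looking for where the best-scoring key changes, is one plausible way to locate the handover points that the abstract's transition-aware scoring is meant to find.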

Authors: Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17441v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

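Scoring rationale: The paper studies GUI grounding, a task at the intersection of computer vision and natural language processing that centers on how vision-language models (VLMs) localize elements and interpret instructions on GUI screenshots. Although the paper mentions training with Group Relative Policy Optimization (GRPO), its content focuses on the visual grounding framework (AdaZoom-GUI), the instruction refinement module, and the conditional zoom-in strategy. It does not address the technical principles of large language models (LLMs), training methods (such as pretraining, fine-tuning, or alignment), inference optimization, agent systems, or AI-for-science applications. All of the keywords concern core LLM techniques, training paradigms, application scenarios, or related fields, whereas this paper's core is a specific VLM task in GUI understanding, so it is judged unrelated to every keyword and all scores are 0.

!!! tip deepseek-chat TL;DR

The paper targets the challenges of high-resolution images, small UI elements, and ambiguous instructions in GUI grounding. It proposes the AdaZoom-GUI framework, which improves localization accuracy and instruction understanding through instruction refinement and a conditional zoom-in strategy, and reports state-of-the-art performance on public benchmarks among models of comparable or even larger size.

Abstract (Chinese translation)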

GUI grounding（图形用户界面定位）是视觉语言模型（VLMs）的一项关键能力，它能够通过自然语言指令定位目标元素，从而实现与图形用户界面的自动化交互。然而，由于高分辨率图像、微小的UI元素以及模糊的用户指令，在GUI截图上进行精准定位仍然具有挑战性。在本研究中，我们提出了AdaZoom-GUI，一种基于自适应缩放的GUI定位框架，旨在同时提升定位精度和指令理解能力。我们的方法引入了一个指令精化模块，能够将自然语言指令重写为明确且详细的描述，从而使定位模型能够专注于精确的元素定位。此外，我们设计了一种条件性放大策略，该策略选择性地对预测出的小型元素进行第二阶段推理，从而在提升定位精度的同时，避免了在简单案例上进行不必要的计算和上下文信息丢失。为了支持该框架，我们构建了一个高质量的GUI定位数据集，并采用组相对策略优化（Group Relative Policy Optimization, GRPO）训练定位模型，使其能够同时预测点击坐标和元素边界框。在公开基准测试上的实验表明，我们的方法在参数量相当甚至更大的模型中实现了最先进的性能，凸显了其在高分辨率GUI理解及实用GUI智能体部署方面的有效性。

Abstract

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.

Keywords: GUI grounding, vision-language models, instruction refinement, adaptive zoom, element localization, Group Relative Policy Optimization, high-resolution GUI understanding, GUI agent deployment
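
The conditional zoom-in strategy described in the abstract above runs a second inference pass only when the first pass predicts a small element. A minimal sketch of that control flow follows. The `predict` callable (returning a box normalized to the region it was shown), the area threshold `small_area`, and the crop size `crop_frac` are all hypothetical; the report does not give the paper's actual criterion or crop policy.

```python
def to_pixels(norm_box, region):
    """Map a box normalized to `region` back to full-image pixel coordinates."""
    rx0, ry0, rx1, ry1 = region
    nx0, ny0, nx1, ny1 = norm_box
    w, h = rx1 - rx0, ry1 - ry0
    return (rx0 + nx0 * w, ry0 + ny0 * h, rx0 + nx1 * w, ry0 + ny1 * h)


def ground_with_zoom(predict, image_size, small_area=0.001, crop_frac=0.25):
    """Two-stage grounding: zoom in only when the first-stage box is small.

    `predict(region)` is a stand-in for the grounding model; it returns
    (x0, y0, x1, y1) normalized to the given pixel region.
    """
    width, height = image_size
    full = (0.0, 0.0, float(width), float(height))
    first = predict(full)
    nx0, ny0, nx1, ny1 = first
    if (nx1 - nx0) * (ny1 - ny0) >= small_area:
        return to_pixels(first, full)  # large element: keep the first-stage result
    # Small element: crop a window around its center, clamped to the image.
    cx, cy = (nx0 + nx1) / 2 * width, (ny0 + ny1) / 2 * height
    cw, ch = width * crop_frac, height * crop_frac
    left = min(max(cx - cw / 2, 0.0), width - cw)
    top = min(max(cy - ch / 2, 0.0), height - ch)
    region = (left, top, left + cw, top + ch)
    return to_pixels(predict(region), region)
```

Skipping the second pass for large elements is what avoids the extra computation and the loss of surrounding context that the abstract mentions for simpler cases.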

88. ❌ Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates

Authors: Linxiao Yang, Xue Jiang, Gezheng Xu, Tian Zhou, Min Yang, ZhaoYang Zhu, Linyuan Geng, Zhipeng Zeng, Qiming Chen, Xinyue Gu, Rong Jin, Liang Sun Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17439v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

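Scoring rationale: Baguan-TS focuses on time series forecasting. Its core contribution is a unified framework that combines raw-sequence representation learning with in-context learning (ICL), using a 3D Transformer that attends jointly over the temporal, variable, and context axes. It is therefore highly relevant to 'In-context Learning OR Many-shot Learning' (10 points), since ICL is the paper's core method. The paper is also evaluated on energy datasets, which falls under scientific applications, so it is somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), though this is not its core. The remaining keywords mainly concern LLM-specific techniques, alignment, reasoning, agents, and so on; although the paper uses a Transformer and ICL, it concentrates on time series forecasting and does not involve LLMs, MoE, alignment, RAG, inference acceleration, or similar techniques, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes Baguan-TS, a unified framework that combines raw-sequence representation learning with in-context learning (ICL) for time series forecasting with covariates. Using a 3D Transformer and a calibration strategy, it significantly outperforms existing baselines on public benchmarks and real-world energy datasets.

Abstract (Chinese translation)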

Transformer实现了情境学习（ICL），能够在时间序列预测中实现快速、无需梯度的自适应，然而大多数ICL方法依赖于表格化、手工构建的特征，而端到端的序列模型则缺乏推理时的自适应能力。我们通过统一框架Baguan-TS弥合了这一差距，该框架将原始序列表示学习与ICL相结合，并通过一个在时间、变量和情境轴上进行联合注意力计算的3D Transformer实现。为使这一高容量模型具备实用性，我们解决了两个关键障碍：（i）校准与训练稳定性问题，通过一种与特征无关、基于目标空间检索的局部校准方法加以改进；（ii）输出过度平滑问题，通过情境过拟合策略进行缓解。在包含协变量的公开基准测试中，Baguan-TS持续超越现有基线方法，取得了最高的胜率，并在点预测和概率预测指标上均实现显著降低。在多样化的现实世界能源数据集上的进一步评估验证了其鲁棒性，带来了实质性的性能提升。

Abstract

Transformers enable in-context learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates the raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via context-overfitting strategy. On public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.

Keywords: time series forecasting, in-context learning, Transformer, covariates, 3D attention, calibration, energy datasets, probabilistic forecasting

89. ❌ TimeAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting

Authors: Yue Hu, Jialiang Tang, Siwei Yu, Baosheng Yu, Jing Zhang, Dacheng Tao Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17436v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

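Scoring rationale: The paper 'TimeAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting' focuses on time series forecasting and proposes an adaptive normalization framework for handling non-stationarity. Its core contributions lie in time series analysis, signal processing (time- and frequency-domain modeling), and forecasting methods; it does not involve large language models (LLMs), innovations in deep learning fundamentals, or applications of AI to science. All scoring keywords relate to LLMs, deep learning techniques, or AI for Science, whereas this paper addresses a traditional time series forecasting problem, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper addresses non-stationarity in multivariate long-term time series forecasting. It proposes TimeAPN, an adaptive amplitude-phase non-stationarity normalization framework that predicts non-stationary factors by modeling the time and frequency domains jointly; experiments show that it significantly improves long-term forecasting accuracy and outperforms existing reversible normalization methods.

Abstract (Chinese translation)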

非平稳性是多元长期时间序列预测中的一个根本性挑战，通常表现为幅值和相位的快速变化。这些变化会导致严重的分布偏移，从而降低预测性能。现有的基于归一化的方法主要依赖一阶和二阶统计量，隐含地假设分布是平滑演变的，而忽略了细粒度的时间动态特性。为解决这些局限性，我们提出了TimeAPN——一种自适应幅相非平稳性归一化框架，该框架从时域和频域两个维度显式地建模并预测非平稳性因素。具体而言，TimeAPN首先在时域和频域联合建模均值序列，进而预测其在未来时段内的演变趋势。同时，在频域中提取相位信息，并显式建模预测序列与真实未来序列之间的相位差异，以捕捉时序错位问题。此外，TimeAPN将幅值信息融入自适应归一化机制，使模型能够有效应对信号能量的突变波动。预测得到的非平稳性因子随后通过协同反归一化过程与主干预测网络的输出相结合，以重建最终的非平稳时间序列。所提出的框架与模型无关，可无缝集成到多种预测主干网络中。在七个真实世界多元数据集上的大量实验表明，TimeAPN能在多个预测时间跨度上持续提升长期预测精度，其性能优于当前最先进的可逆归一化方法。

Abstract

Non-stationarity is a fundamental challenge in multivariate long-term time series forecasting, often manifested as rapid changes in amplitude and phase. These variations lead to severe distribution shifts and consequently degrade predictive performance. Existing normalization-based methods primarily rely on first- and second-order statistics, implicitly assuming that distributions evolve smoothly and overlooking fine-grained temporal dynamics. To address these limitations, we propose TimeAPN, an Adaptive Amplitude-Phase Non-Stationarity Normalization framework that explicitly models and predicts non-stationary factors from both the time and frequency domains. Specifically, TimeAPN first models the mean sequence jointly in the time and frequency domains, and then forecasts its evolution over future horizons. Meanwhile, phase information is extracted in the frequency domain, and the phase discrepancy between the predicted and ground-truth future sequences is explicitly modeled to capture temporal misalignment. Furthermore, TimeAPN incorporates amplitude information into an adaptive normalization mechanism, enabling the model to effectively account for abrupt fluctuations in signal energy. The predicted non-stationary factors are subsequently integrated with the backbone forecasting outputs through a collaborative de-normalization process to reconstruct the final non-stationary time series. The proposed framework is model-agnostic and can be seamlessly integrated with various forecasting backbones. Extensive experiments on seven real-world multivariate datasets demonstrate that TimeAPN consistently improves long-term forecasting accuracy across multiple prediction horizons and outperforms state-of-the-art reversible normalization methods.

Keywords: time series forecasting, non-stationarity, amplitude-phase normalization, multivariate forecasting, frequency domain analysis, adaptive normalization, long-term prediction, distribution shift
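
The abstract above contrasts normalization based on first- and second-order statistics with explicit amplitude and phase modeling in the frequency domain. The sketch below illustrates both ideas in plain Python: a reversible mean/std normalization, and a DFT-based estimate of how many steps a prediction lags the ground truth at the dominant frequency. It is a generic signal-processing illustration, not TimeAPN's learned modules; all function names are hypothetical, and the naive O(N^2) DFT is used only to stay dependency-free.

```python
import cmath
import math


def normalize(window):
    """Reversible normalization using only the window's mean and std."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window)) or 1.0
    return [(v - mean) / std for v in window], (mean, std)


def denormalize(values, stats):
    """Invert `normalize` using the stored statistics."""
    mean, std = stats
    return [v * std + mean for v in values]


def amplitude_phase(x):
    """Amplitude and phase of every DFT bin (naive DFT for self-containment)."""
    n = len(x)
    spec = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]
    return [abs(c) for c in spec], [cmath.phase(c) for c in spec]


def phase_lag_steps(pred, truth):
    """Lag of `pred` behind `truth`, in time steps, at truth's dominant frequency."""
    n = len(truth)
    amp_t, ph_t = amplitude_phase(truth)
    _, ph_p = amplitude_phase(pred)
    k = max(range(1, n // 2 + 1), key=lambda i: amp_t[i])  # skip the DC bin
    diff = (ph_t[k] - ph_p[k] + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return diff * n / (2 * math.pi * k)
```

Mean and standard deviation are unchanged when a periodic signal is shifted in time, so statistics-based normalization cannot see the kind of temporal misalignment that the phase difference exposes.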

90. ❌ The Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle

Authors: Dibakar Sigdel Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17433v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

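Scoring rationale: The paper proposes a new Transformer architecture (the Phasor Transformer) for time series prediction. Although this is an innovation in deep learning fundamentals, all of the keywords target specific techniques, applications, or evaluation methods for large language models (LLMs). The paper concentrates on improving the basic Transformer architecture for time series modeling (replacing attention with phase representations and the DFT) and does not involve any LLM-related content, techniques, or application scenarios, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper targets the quadratic computational bottleneck of self-attention in conventional Transformers on long-context time series. It proposes the Phasor Transformer architecture, which couples phase representations on the unit-circle manifold with the discrete Fourier transform to achieve efficient global sequence mixing, and validates it on multi-frequency time series prediction tasks.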

Abstract (Chinese translation)

Transformer模型重新定义了序列学习，但点积自注意力机制为长上下文时间序列引入了二次复杂度的标记混合瓶颈。我们提出相位变换器（Phasor Transformer）模块，这是一种原生基于相位表示的替代方案，将序列状态映射至单位圆流形$S^1$上。每个模块将轻量级可训练相移与无需参数的离散傅里叶变换（Discrete Fourier Transform, DFT）标记耦合相结合，无需显式注意力图即可实现全局$\mathcal{O}(N\log N)$复杂度的混合。堆叠此类模块构成了大型相位模型（Large Phasor Model, LPM）。我们在合成多频率基准测试上通过自回归时间序列预测验证了LPM的性能。该模型以高度紧凑的参数规模运行，能够学习稳定的全局动态，并在预测性能上与传统的自注意力基线模型相竞争。我们的研究结果明确划定了效率与性能的边界，表明时间序列的大模型扩展可通过几何约束的相位计算与确定性全局耦合实现，为振荡领域中的可扩展时序建模提供了一条实用路径。

Abstract

Transformer models have redefined sequence learning, yet dot-product self-attention introduces a quadratic token-mixing bottleneck for long-context time-series. We introduce the Phasor Transformer block, a phase-native alternative representing sequence states on the unit-circle manifold $S^1$. Each block combines lightweight trainable phase-shifts with parameter-free Discrete Fourier Transform (DFT) token coupling, achieving global $\mathcal{O}(N\log N)$ mixing without explicit attention maps. Stacking these blocks defines the Large Phasor Model (LPM). We validate LPM on autoregressive time-series prediction over synthetic multi-frequency benchmarks. Operating with a highly compact parameter budget, LPM learns stable global dynamics and achieves competitive forecasting behavior compared to conventional self-attention baselines. Our results establish an explicit efficiency-performance frontier, demonstrating that large-model scaling for time-series can emerge from geometry-constrained phase computation with deterministic global coupling, offering a practical path toward scalable temporal modeling in oscillatory domains.

Keywords: Phasor Transformer, attention bottlenecks, unit-circle manifold, Discrete Fourier Transform, time-series prediction, global token mixing, Large Phasor Model, scalable temporal modeling
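
The abstract above describes a block that combines trainable phase shifts with parameter-free DFT token coupling on the unit circle. The sketch below is one way such a block could look, written under assumptions: tokens are stored as angles, the shifts are added to those angles, a radix-2 FFT mixes all positions in O(N log N), and the result is projected back to the unit circle by keeping only the phase. The projection step and the function names are guesses for illustration; the report does not specify the paper's exact block.

```python
import cmath
import math


def fft(z):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(z)
    if n == 1:
        return list(z)
    even, odd = fft(z[0::2]), fft(z[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def phasor_block(thetas, shifts):
    """One assumed phasor block: shift phases, mix with the FFT, keep the phase.

    `thetas` are token states as angles on S^1; `shifts` stand in for the
    trainable per-position phase shifts.
    """
    phasors = [cmath.exp(1j * (t + s)) for t, s in zip(thetas, shifts)]
    mixed = fft(phasors)  # global token coupling with no attention map
    return [cmath.phase(c) for c in mixed]  # project back onto the unit circle
```

Because every FFT output bin depends on every input position, a single block already mixes information globally, and the only learnable quantities in this sketch are the per-position shifts.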

91. ❌ Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction, Belief Updating, and Path-Aware Action Planning

Authors: Zhenhai Pan, Yan Liu, Jia You Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

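Scoring rationale: The paper studies a proactive knowledge-inquiry framework for doctor-patient dialogue that combines stateful extraction, belief updating, and POMDP-lite action planning, an application of AI in the biomedical domain. It is unrelated to most large-model keywords (such as LLMs, MoE, and Scaling Laws), because those techniques are neither mentioned nor used in the paper. The relevant keywords are 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), because the paper involves medical dialogue and electronic medical record generation, which the scorer treats as a bioinformatics application, and 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (5 points), because the paper mentions 'hybrid retrieval over objectified medical knowledge', although RAG is not its core method. Other keywords, such as instruction tuning, alignment, and reasoning methods, do not apply.

!!! tip deepseek-chat TL;DR

The paper proposes a proactive knowledge-inquiry framework for doctor-patient dialogue that combines stateful extraction, belief updating, and hybrid retrieval with POMDP-lite planning. In a controlled simulated setting it reaches 83.3% coverage and 81.4% structural completeness, but it is presented as a pilot concept demonstration and not as evidence of clinical deployment readiness.

Abstract (Chinese translation)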

当前大多数自动化电子病历（EMR）流程仍以输出为导向：它们在诊疗结束后进行转录、提取和总结，但并未显式建模已知信息、缺失内容、最关键的不确定性，或下一步应提出的问题与建议。本文将医患对话形式化为部分可观测条件下的主动知识探询问题。所提出的框架融合了状态化信息提取、序列化信念更新、缺口感知状态建模、基于结构化医学知识的混合检索，以及轻量级部分可观测马尔可夫决策过程（POMDP-lite）行动规划器。该框架不将电子病历视为唯一目标产物，而是将病历文档视为持续探询循环的结构化投射。为具体呈现该形式化模型，我们报告了一项基于十组标准化多轮对话的受控试点评估，以及跨对话汇总的300条查询检索基准测试。在此试点方案中，完整框架实现了83.3%的信息覆盖率、80.0%的风险召回率、81.4%的结构完整性，且相较于纯分块检索和模板密集型交互基线系统展现出更低冗余度。这些试点结果并不代表临床泛化能力；而是表明在严格受控条件下，主动探询机制在方法论层面具有研究价值，可被视为一种概念上具有吸引力的形式化框架，值得在基于对话的电子病历生成领域深入探索。本研究应被视为受控模拟环境下的概念验证演示，而非临床部署成熟度的证据。不应从本试点方案中推断任何关于临床部署就绪度、临床安全性或实际临床效用的暗示。

Abstract

Most automated electronic medical record (EMR) pipelines remain output-oriented: they transcribe, extract, and summarize after the consultation, but they do not explicitly model what is already known, what is still missing, which uncertainty matters most, or what question or recommendation should come next. We formulate doctor-patient dialogue as a proactive knowledge-inquiry problem under partial observability. The proposed framework combines stateful extraction, sequential belief updating, gap-aware state modeling, hybrid retrieval over objectified medical knowledge, and a POMDP-lite action planner. Instead of treating the EMR as the only target artifact, the framework treats documentation as the structured projection of an ongoing inquiry loop. To make the formulation concrete, we report a controlled pilot evaluation on ten standardized multi-turn dialogues together with a 300-query retrieval benchmark aggregated across dialogues. On this pilot protocol, the full framework reaches 83.3% coverage, 80.0% risk recall, 81.4% structural completeness, and lower redundancy than the chunk-only and template-heavy interactive baselines. These pilot results do not establish clinical generalization; rather, they suggest that proactive inquiry may be methodologically interesting under tightly controlled conditions and can be viewed as a conceptually appealing formulation worth further investigation for dialogue-based EMR generation. This work should be read as a pilot concept demonstration under a controlled simulated setting rather than as evidence of clinical deployment readiness. No implication of clinical deployment readiness, clinical safety, or real-world clinical utility should be inferred from this pilot protocol.

Keywords: doctor-patient dialogue, proactive knowledge inquiry, stateful extraction, belief updating, POMDP-lite action planner, electronic medical record generation, hybrid retrieval, controlled pilot evaluation
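
The abstract names sequential belief updating but does not state the update rule. As a minimal sketch, assuming a discrete Bayesian filter over candidate hypotheses (the function name `update_belief` and the labels `h1`/`h2` are hypothetical and do not come from the paper), one update step could look like this:

```python
def update_belief(prior, likelihood):
    """One Bayesian update over a discrete set of hypotheses.

    prior:      dict mapping hypothesis -> prior probability
    likelihood: dict mapping hypothesis -> P(observation | hypothesis)
    """
    # Multiply prior by likelihood for every hypothesis.
    unnormalized = {h: p * likelihood.get(h, 0.0) for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observation is impossible under every hypothesis; keep the prior.
        return dict(prior)
    # Normalize so the posterior sums to 1.
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical two-hypothesis example (illustrative numbers only).
prior = {"h1": 0.5, "h2": 0.5}
likelihood = {"h1": 0.8, "h2": 0.2}
posterior = update_belief(prior, likelihood)
```

In an inquiry loop of the kind the abstract describes, the posterior after each patient answer would become the prior for the next turn; how the paper actually represents beliefs is not specified in the abstract.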

92. ❌ From Digital Twins to World Models: Opportunities, Challenges, and Applications for Mobile Edge General Intelligence

Authors: Jie Zheng, Dusit Niyato, Changyuan Zhao, Jiawen Kang, Jiacheng Wang Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17420v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

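Rationale: The paper surveys the transition from digital twins to world models and its application to edge general intelligence. It is highly relevant only to the keyword "World Models AND General World Models" (its core topic) and has no direct connection to the other keywords, which mainly concern LLM technical details, training methods, inference optimization, and application domains.

!!! tip deepseek-chat TL;DR

The paper systematically surveys the transition from digital twins to world models and discusses the design principles, architectures, applications, and challenges of enabling edge general intelligence in wireless edge computing environments.

Abstract (Chinese translation)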

向6G及未来通信系统的快速演进正加速数字孪生与世界模型在网络边缘的融合。传统数字孪生提供物理系统的高保真表征，支持监测、分析与离线优化。然而，在高度动态的边缘环境中，其在自主性、适应性与可扩展性方面面临局限。本文系统综述了从数字孪生向世界模型的演进，并探讨其在实现边缘通用智能（Edge General Intelligence, EGI）中的作用。首先，本文厘清了数字孪生与世界模型在概念上的差异，强调其从基于物理、集中式、以系统为中心的复制体，转向数据驱动、分布式、以智能体为中心的内部模型的演变。这一讨论有助于读者清晰理解该转变如何促使网络边缘实现更具适应性、自主性和资源高效性的智能。本文回顾了世界模型的设计原则、架构与关键组件，包括感知、潜在状态表征、动态学习、基于想象的规划与记忆机制。此外，文章探讨了世界模型与数字孪生在无线EGI系统中的融合，并综述了其在集成感知与通信、语义通信、空天地一体化网络及低空无线网络等新兴领域的应用。最后，本综述为在无线与边缘计算环境中设计世界模型驱动的边缘智能系统提供了系统化路线图与实践洞见，同时展望了面向可扩展、可靠、可互操作的边缘原生智能体人工智能世界模型的关键研究挑战与未来方向。

Abstract

The rapid evolution toward 6G and beyond communication systems is accelerating the convergence of digital twins and world models at the network edge. Traditional digital twins provide high-fidelity representations of physical systems and support monitoring, analysis, and offline optimization. However, in highly dynamic edge environments, they face limitations in autonomy, adaptability, and scalability. This paper presents a systematic survey of the transition from digital twins to world models and discusses its role in enabling edge general intelligence (EGI). First, the paper clarifies the conceptual differences between digital twins and world models and highlights the shift from physics-based, centralized, and system-centric replicas to data-driven, decentralized, and agent-centric internal models. This discussion helps readers gain a clear understanding of how this transition enables more adaptive, autonomous, and resource-efficient intelligence at the network edge. The paper reviews the design principles, architectures, and key components of world models, including perception, latent state representation, dynamics learning, imagination-based planning, and memory. In addition, it examines the integration of world models and digital twins in wireless EGI systems and surveys emerging applications in integrated sensing and communications, semantic communication, air-ground networks, and low-altitude wireless networks. Finally, this survey provides a systematic roadmap and practical insights for designing world-model-driven edge intelligence systems in wireless and edge computing environments. It also outlines key research challenges and future directions toward scalable, reliable, and interoperable world models for edge-native agentic AI.

Keywords: World Models, Digital Twins, Edge General Intelligence, Wireless Edge Computing, Agentic AI, Integrated Sensing and Communications, Semantic Communication, Air-ground Networks

93. ❌ Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Authors: Saikat Maiti Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17419v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

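Rationale: The paper studies a security architecture for autonomous AI agents in healthcare. Its core involves LLM-driven autonomous agents (highly relevant), tool-use capabilities (highly relevant), multi-agent systems (somewhat relevant), and the application of AI in science and healthcare (highly relevant). Because it focuses on secure deployment rather than on model techniques themselves, it is unrelated to most of the technical keywords.

!!! tip deepseek-chat TL;DR

Targeting the security vulnerabilities of LLM-based autonomous AI agents in healthcare environments, the paper proposes and deploys a four-layer defense-in-depth security architecture and reports four HIGH-severity findings discovered and remediated over 90 days of deployment.

Abstract (Chinese translation)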

由大型语言模型驱动的自主人工智能代理正被部署于生产环境，其能力涵盖Shell执行、文件系统访问、数据库查询及多方通信。近期红队研究表明，这些代理在真实场景中表现出严重漏洞：未经授权遵从非所有者指令、敏感信息泄露、身份欺骗、不安全实践的跨代理传播，以及通过外部资源实现的间接提示注入[7]。在处理受保护健康信息（Protected Health Information）的医疗环境中，每一项此类漏洞都可能构成潜在的HIPAA违规。本文提出一套安全架构，已为某医疗科技公司生产环境中的九个自主人工智能代理部署实施。我们构建了涵盖六个领域的医疗AI代理威胁模型，包括凭证暴露、执行能力滥用、网络出口窃取、提示完整性失效、数据库访问风险及集群配置漂移。我们实现了四层纵深防御：(1) 在Kubernetes上使用gVisor实现内核级工作负载隔离，(2) 通过凭证代理边车阻止代理容器访问原始密钥，(3) 网络出口策略将各代理限制在允许列表目的地，(4) 包含结构化元数据封装与不可信内容标记的提示完整性框架。我们报告了90天部署结果：自动化安全审计代理发现并修复了四项高危风险；通过三代虚拟机镜像实现渐进式集群强化；防御覆盖范围映射至近期文献中全部十一种攻击模式。所有配置、审计工具及提示完整性框架均已开源发布。

Abstract

Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, cross-agent propagation of unsafe practices, and indirect prompt injection through external resources [7]. In healthcare environments processing Protected Health Information, every such vulnerability becomes a potential HIPAA violation. This paper presents a security architecture deployed for nine autonomous AI agents in production at a healthcare technology company. We develop a six-domain threat model for agentic AI in healthcare covering credential exposure, execution capability abuse, network egress exfiltration, prompt integrity failures, database access risks, and fleet configuration drift. We implement four-layer defense in depth: (1) kernel level workload isolation using gVisor on Kubernetes, (2) credential proxy sidecars preventing agent containers from accessing raw secrets, (3) network egress policies restricting each agent to allowlisted destinations, and (4) a prompt integrity framework with structured metadata envelopes and untrusted content labeling. We report results from 90 days of deployment including four HIGH severity findings discovered and remediated by an automated security audit agent, progressive fleet hardening across three VM image generations, and defense coverage mapped to all eleven attack patterns from recent literature. All configurations, audit tooling, and the prompt integrity framework are released as open source.

Keywords: autonomous AI agents, large language models, healthcare security, zero trust architecture, threat modeling, defense in depth, HIPAA compliance, prompt integrity
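
The paper enforces its egress allowlist at the network-policy layer on Kubernetes; the abstract gives no configuration. As a rough application-level analogue only, the idea of "restrict each agent to allowlisted destinations" can be sketched as a host check. The host name `api.example.internal` and the function `egress_allowed` are hypothetical placeholders, not part of the released tooling.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would load this per agent.
ALLOWED_HOSTS = frozenset({"api.example.internal"})


def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only if the URL's host is on the allowlist (default deny)."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    return host is not None and host in allowed_hosts


ok = egress_allowed("https://api.example.internal/v1/records")
blocked = egress_allowed("https://evil.example.com/upload")
```

The design choice worth noting is default deny: anything that cannot be parsed to a known host is rejected rather than passed through.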

94. ❌ Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression

Authors: Xinning Chai, Zhengxue Cheng, Xin Li, Rong Xie, Li Song Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17408v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

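Rationale: The paper focuses on image compression and super-resolution in computer vision, using diffusion models for image reconstruction. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or AI applications in science, whereas this paper addresses a specific image-processing problem and has no direct connection to them. It does not involve language models, model alignment, reasoning techniques, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes ASSR-EIC, a variable-rate extreme image compression framework based on arbitrary-scale super-resolution. It removes the need to train a separate model for each bitrate, supports flexible bitrate control and adaptive reconstruction within a single model, and reports state-of-the-art performance on extreme image compression.

Abstract (Chinese translation)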

近期基于扩散模型的极端图像压缩方法在超低码率下展现出卓越性能。然而，大多数方法需要针对每个目标码率训练独立的扩散模型，导致巨大的计算开销并阻碍实际部署。与此同时，近期研究表明联合超分辨率可作为增强低码率重建的有效途径。但在超低码率场景下，这些方法因严重信息丢失而表现不佳，且其对固定超分辨率尺度的依赖限制了跨不同码率的灵活适配。

为应对这些局限，我们提出ASSR-EIC——一种利用任意尺度超分辨率（ASSR）支持可变码率极端图像压缩（EIC）的新型图像压缩框架。编码端引入任意尺度下采样模块以实现可控的码率压缩，而基于扩散的联合退化感知ASSR解码器则支持在单一模型内实现码率自适应重建。我们利用压缩与重缩放感知的扩散先验来指导重建过程，从而在多样化压缩与重缩放设置下实现高保真、高真实感的复原。具体而言，我们设计了全局压缩-重缩放适配器以提供码率适应的整体指导，并构建局部压缩-重缩放调制器来动态平衡生成式与保真导向的行为，从而实现细粒度、码率自适应的细节复原。为进一步提升重建质量，我们引入了双重语义增强设计。

大量实验表明，ASSR-EIC在极端图像压缩任务中实现了最先进的性能，同时支持灵活的码率控制与自适应的码率相关重建。

Abstract

Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high fidelity and high realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.

Keywords: extreme image compression, arbitrary-scale super-resolution, diffusion models, variable-rate compression, degradation-aware reconstruction, rate adaptation, image restoration, low-bitrate reconstruction

95. ❌ CRE-T1 Preview Technical Report: Beyond Contrastive Learning for Reasoning-Intensive Retrieval

Authors: Guangzhi Wang, Yinghao Jiao, Zhi Liu Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17387v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

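Rationale: The paper proposes Thought 1 (T1), a generative retrieval model whose core innovation is shifting relevance modeling from static alignment to dynamic reasoning. It is highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (10 points), because the model dynamically generates intermediate reasoning trajectories to bridge implicit reasoning relationships, which is the core idea of CoT. It is also related to "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (8 points), because the paper emphasizes deep reasoning over surface-level semantic matching. The paper does not address innovations in LLM technical principles or scientific applications, nor other keyword-related techniques (e.g., MoE, RLHF, RAG), so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

To address the inability of static contrastive learning to handle implicit reasoning relationships in reasoning-intensive retrieval, the paper proposes the T1 generative retrieval model. By dynamically generating reasoning trajectories and optimizing with reinforcement learning (GRPO), T1-4B overall outperforms larger contrastively trained models on the BRIGHT benchmark.

Abstract (Chinese translation)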

推理密集型检索的核心挑战在于识别查询与文档之间的隐含推理关系，而非表面的语义或词汇相似性。对比学习范式本质上是一种静态表征固化技术：在训练阶段，它将层级化的相关性概念编码为向量空间中的固定几何结构，而在推理阶段无法根据每个查询的具体推理需求动态调整相关性判断。因此，当查询与文档之间存在词汇不匹配或需要隐含推理来建立相关性时，性能会显著下降。本文提出Thought 1（T1），一种生成式检索模型，将相关性建模从静态对齐转向动态推理。在查询侧，T1为每个查询动态生成中间推理轨迹以桥接隐含推理关系，并使用作为推理输出的语义聚合点。在文档侧，它采用指令+文本+的编码格式以支持高吞吐量的索引。为了将动态推理能力内化到向量表征中，我们采用三阶段训练课程，并在第三阶段引入GRPO，使模型能够通过试错式强化学习为不同查询学习最优推导策略。在BRIGHT基准测试中，T1-4B在原始查询设置下表现出强劲性能，整体上超越了采用对比学习训练的更大规模模型，并达到了与多阶段检索流程相当的效果。结果表明，用动态推理生成替代静态表征对齐能有效提升推理密集型检索的性能。

Abstract

The central challenge of reasoning-intensive retrieval lies in identifying implicit reasoning relationships between queries and documents, rather than superficial semantic or lexical similarity. The contrastive learning paradigm is fundamentally a static representation consolidation technique: during training, it encodes hierarchical relevance concepts into fixed geometric structures in the vector space, and at inference time it cannot dynamically adjust relevance judgments according to the specific reasoning demands of each query. Consequently, performance degrades noticeably when vocabulary mismatch exists between queries and documents or when implicit reasoning is required to establish relevance. This paper proposes Thought 1 (T1), a generative retrieval model that shifts relevance modeling from static alignment to dynamic reasoning. On the query side, T1 dynamically generates intermediate reasoning trajectories for each query to bridge implicit reasoning relationships and uses as a semantic aggregation point for the reasoning output. On the document side, it employs an instruction + text + encoding format to support high-throughput indexing. To internalize dynamic reasoning capabilities into vector representations, we adopt a three-stage training curriculum and introduce GRPO in the third stage, enabling the model to learn optimal derivation strategies for different queries through trial-and-error reinforcement learning. On the BRIGHT benchmark, T1-4B exhibits strong performance under the original query setting, outperforming larger models trained with contrastive learning overall, and achieving performance comparable to multi-stage retrieval pipelines. The results demonstrate that replacing static representation alignment with dynamic reasoning generation can effectively improve reasoning-intensive retrieval performance.

Keywords: generative retrieval, dynamic reasoning, reasoning-intensive retrieval, contrastive learning, GRPO, BRIGHT benchmark, intermediate reasoning trajectories, semantic aggregation
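
The abstract introduces GRPO in the third training stage without describing its reward or hyperparameters. As a minimal sketch of the group-relative advantage step that GRPO is known for, assuming several reasoning trajectories are sampled for the same query and each receives a scalar reward (the reward values and the name `group_relative_advantages` are illustrative assumptions, not taken from the paper):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Every trajectory sampled for one query is scored relative to the
    group mean and standard deviation, so no learned value function
    is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for one query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat their group average get positive advantages and are reinforced; the `eps` term only guards against division by zero when all rewards in a group are equal.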

96. ❌ SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction

Authors: Shuizhou Chen, Lang Yu, Kedu Jin, Songming Zhang, Hao Wu, Wenxuan Huang, Sheng Xu, Quan Qian, Qin Chen, Lei Bai, Siqi Sun, Zhangyang Gao Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17380v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

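Rationale: The paper proposes the SCALE model for virtual cell perturbation prediction, which falls under AI for Science (bioinformatics) and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). It explicitly describes building a "large-scale foundation model", making it relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). The model involves pre-training and domain adaptation, relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (8 points). The paper emphasizes scalable infrastructure and training/inference efficiency gains (12.51× pre-training speedup, 1.29× inference speedup), which relates loosely to "Scaling Laws AND Data Quality" (5 points) and "Speculative Decoding OR Inference Acceleration" (5 points). Other keywords such as MoE, SFT, RLHF, RAG, and CoT are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper presents SCALE, a large-scale foundation model that co-designs scalable infrastructure, stable transport modeling, and biologically faithful evaluation to address three problems in virtual cell perturbation prediction: inefficient training and inference, unstable modeling in high-dimensional sparse expression space, and evaluation protocols with insufficient biological fidelity. On the Tahoe-100M benchmark it improves PDCorr by 12.02% and DE Overlap by 10.66% over STATE.

Abstract (Chinese translation)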

虚拟细胞模型旨在通过单细胞测量预测细胞如何响应遗传、化学或细胞因子扰动，从而实现计算机模拟实验。然而在实践中，大规模扰动预测仍受限于三个相互关联的瓶颈：低效的训练与推断流程、高维稀疏表达空间中的建模不稳定性，以及过度强调类重建精度却低估生物学保真度的评估方案。本研究提出专用于虚拟细胞扰动预测的大规模基础模型SCALE，以协同解决上述局限。首先，我们构建了基于BioNeMo的训练与推断框架，显著提升了数据吞吐量、分布式扩展性和部署效率，在相同系统设置下相比现有最优流程实现了预训练12.51倍加速和推断1.29倍加速。其次，我们将扰动预测形式化为条件传输问题，并通过结合基于LLaMA的细胞编码与面向端点的监督机制，构建了集合感知流架构来实现该框架。该设计实现了更稳定的训练效果和更强的扰动效应还原能力。第三，我们在Tahoe-100M数据集上采用以生物学意义指标为核心（而非单纯重建精度）的严格细胞级评估方案进行模型验证。在此基准测试中，我们的模型相较于现有最优方法将PDCorr提升了12.02%，DE重叠率提高了10.66%。这些结果表明，推进虚拟细胞研究不仅需要更优的生成目标，更需要可扩展基础设施、稳定传输建模与生物学可信评估体系的协同设计。

Abstract

Virtual cell models aim to enable in silico experimentation by predicting how cells respond to genetic, chemical, or cytokine perturbations from single-cell measurements. In practice, however, large-scale perturbation prediction remains constrained by three coupled bottlenecks: inefficient training and inference pipelines, unstable modeling in high-dimensional sparse expression space, and evaluation protocols that overemphasize reconstruction-like accuracy while underestimating biological fidelity. In this work we present a specialized large-scale foundation model SCALE for virtual cell perturbation prediction that addresses the above limitations jointly. First, we build a BioNeMo-based training and inference framework that substantially improves data throughput, distributed scalability, and deployment efficiency, yielding 12.51× speedup on pretrain and 1.29× on inference over the prior SOTA pipeline under matched system settings. Second, we formulate perturbation prediction as conditional transport and implement it with a set-aware flow architecture that couples LLaMA-based cellular encoding with endpoint-oriented supervision. This design yields more stable training and stronger recovery of perturbation effects. Third, we evaluate the model on Tahoe-100M using a rigorous cell-level protocol centered on biologically meaningful metrics rather than reconstruction alone. On this benchmark, our model improves PDCorr by 12.02% and DE Overlap by 10.66% over STATE. Together, these results suggest that advancing virtual cells requires not only better generative objectives, but also the co-design of scalable infrastructure, stable transport modeling, and biologically faithful evaluation.

Keywords: virtual cell perturbation prediction, large-scale foundation model, BioNeMo-based framework, conditional transport, LLaMA-based cellular encoding, scalable infrastructure, biologically faithful evaluation, Tahoe-100M benchmark

97. ❌ Efficient Exploration at Scale

Authors: Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17378v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

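Rationale: The paper's core contribution is an improved RLHF algorithm, which is highly relevant to the "RLHF" keyword (10 points); it runs experiments with Gemma LLMs, which is highly relevant to "Large Language Models" (10 points). The algorithm substantially improves data efficiency by reducing the amount of labeled data required, giving it some relation to "Scaling Laws AND Data Quality" (5 points). Other keywords such as MoE, SLMs, PEFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an online learning algorithm that combines a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. It improves RLHF data efficiency by more than 10x, matching offline RLHF trained on 200K labels while using fewer than 20K labels.

Abstract (Chinese translation)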

我们开发了一种在线学习算法，显著提升了基于人类反馈的强化学习（RLHF）的数据效率。该算法能够在接收选择数据时，逐步更新奖励模型和语言模型。奖励模型根据选择数据进行拟合，而语言模型则通过一种改进的强化学习算法（reinforce）进行更新，其强化信号由奖励模型提供。多项关键设计共同促成了效率提升：在每个强化信号中加入小幅正向激励、采用建模奖励不确定性的认知神经网络，以及基于信息导向的探索策略。使用Gemma大语言模型（LLMs）进行实验时，我们的算法仅需不到2万个标注数据即可达到离线RLHF使用20万个标注数据训练的性能，实现了超过10倍的数据效率提升。根据结果外推，我们预计使用100万个标注数据训练的算法可匹配离线RLHF使用10亿标注数据训练的效果，这相当于1000倍的效率提升。据我们所知，这是首次研究证明如此大幅度的效率改进是可能实现的。

Abstract

We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of reinforce, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.

Keywords: Reinforcement Learning from Human Feedback, RLHF, Large Language Models, LLMs, Data Efficiency, Online Learning, Reward Model, Language Model
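
The abstract says the reward model is fit to choice data but does not name the likelihood. A common choice in RLHF, assumed here rather than confirmed by the paper, is the Bradley-Terry model, where the probability that the chosen response beats the rejected one is the sigmoid of their reward difference. The sketch below computes the per-pair negative log-likelihood; the function name `bradley_terry_nll` is hypothetical.

```python
import math


def bradley_terry_nll(reward_chosen, reward_rejected):
    """Negative log-likelihood of one human choice under Bradley-Terry.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), written in a
    numerically stable form so large margins do not overflow exp().
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A reward model that cannot separate the pair pays log(2) per label.
loss_tied = bradley_terry_nll(0.0, 0.0)
# Agreeing with the human label costs less than disagreeing with it.
loss_agree = bradley_terry_nll(2.0, 0.0)
loss_disagree = bradley_terry_nll(0.0, 2.0)
```

Minimizing the sum of this loss over labeled pairs fits the reward model; the paper's additional components (affirmative nudge, epistemic network, information-directed exploration) are not represented in this sketch.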

98. ❌ Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Authors: Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

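Scoring rationale: The paper studies the drop in safety that large reasoning models (LRMs) show when chain-of-thought (CoT) is enabled, and proposes a safety alignment method. The core relevant keywords are Large Language Models (LRMs belong to this class), Chain of Thought (CoT is a central concept), Post-training / Supervised Fine-tuning (the proposed alignment method falls under this), and Instruction Tuning / Alignment (safety alignment is the main contribution). Partially relevant keywords are System 2 Thinking (related to reasoning but not central), Self-Correction (related to safety decision-making), and Hallucination Mitigation (related to safety). The remaining keywords are unrelated to the paper.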
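
To make the numbers in the score table easier to read, here is a small sketch of the arithmetic. The pass threshold formula (5.0 + 0.8 × total weight) is stated in the report's configuration section. The per-row formula `weight * relevance` is an assumption: the report does not state it, and it is only consistent with the rows shown. The function names are illustrative.

```python
def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight


def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed formula)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance values from the table above, each listed with weight 0.0.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0), (0.0, 10.0),
        (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]

threshold = passing_score(27.0)  # 27 keywords at weight 1.0 in the configuration
score = total_score(rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Under this assumed formula, every row with a weight of 0.0 contributes nothing, so the total is 0.0 regardless of the relevance values.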

!!! tip deepseek-chat TL;DR

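The paper finds that enabling chain-of-thought reasoning in large reasoning models significantly reduces their safety, and proposes an alignment method that strengthens safety decision-making before CoT generation, improving safety while preserving reasoning performance.

Abstract translation (Chinese)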

大型推理模型（LRMs）通过思维链（CoT）技术取得了显著性能，但近期研究表明，此类增强的推理能力是以安全性能力显著下降为代价的。本文揭示，LRMs的安全性下降仅发生在启用CoT之后，而在禁用CoT时并未观察到这种退化。这一发现促使我们探索在生成CoT之前鼓励LRMs优先进行安全决策的途径。为此，我们提出一种新颖的安全对齐方法，旨在促进LRMs在开始生成CoT之前完成安全决策。具体而言，我们首先利用基于Bert的分类器从安全模型（例如禁用CoT的LRM）中提取安全决策信号，随后将这些信号作为辅助监督信息整合到LRMs的安全对齐过程中。通过这种方式，安全梯度能够反向传播至LRMs的潜在表征层，从而有效增强LRMs在生成CoT过程中的安全决策能力。大量实验证明，我们的方法在有效保持LRMs通用推理性能的同时，显著提升了其安全能力。

Abstract

Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs’ safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a Bert-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs’ safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs’ latent representations, effectively strengthening the LRMs’ safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs’ general reasoning performance.

Keywords: Large Reasoning Models, Chain-of-Thought, Safety Degradation, Safety Alignment, Safety Decision-Making, Supervised Fine-tuning, Reasoning Performance, Auxiliary Supervision

Authors: Karan Goyal, Dikshant Kukreja, Vikram Goyal, Mukesh Mohania Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17361v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

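Scoring rationale: The paper studies citation recommendation for academic literature. It proposes Profiler, a lightweight non-learnable module that efficiently captures human citation patterns, and introduces the DAVINCI reranking model. The work focuses on information retrieval, recommender systems, and scholarly literature analysis, and does not address large models, the principles of deep learning techniques, or AI for Science. Because none of the keywords apply, every relevance score is 0.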

!!! tip deepseek-chat TL;DR

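Existing citation recommendation systems overlook human citation behaviour and are evaluated under protocols that do not reflect real-world use. To address this, the paper proposes Profiler, a lightweight non-learnable module that efficiently captures citation patterns, and introduces a strict inductive evaluation setting and the DAVINCI reranking model, reporting new state-of-the-art results on multiple benchmark datasets.

Abstract translation (Chinese)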

对相关文献的规范引用对于界定科学贡献的语境与验证其有效性至关重要。现有的引文推荐系统虽能利用局部与全局文本信息，却常忽视人类引用行为的细微差异。近期一些纳入此类行为模式的方法虽提升了性能，却带来了高昂的计算成本，并为下游重排序器引入了系统性偏差。为解决这一问题，我们提出Profiler——一个轻量级、非可学习的模块，该模块能高效且无偏地捕捉人类引用模式，显著提升候选文献的检索效果。此外，我们发现当前评估体系存在关键局限：系统在直推式设定下进行评估，这无法反映真实场景。为此，我们引入严格的归纳式评估设定，通过强制实施严格的时间约束来模拟对新撰写论文的引文推荐场景。最后，我们提出DAVINCI——一种新颖的重排序模型，它通过自适应向量门控机制，将Profiler生成的置信度先验信息与语义信息相融合。我们的系统在多个基准数据集上取得了最新的最优性能，展现出卓越的效率与泛化能力。

Abstract

Proper citation of relevant literature is essential for contextualising and validating scientific contributions. While current citation recommendation systems leverage local and global textual information, they often overlook the nuances of the human citation behaviour. Recent methods that incorporate such patterns improve performance but incur high computational costs and introduce systematic biases into downstream rerankers. To address this, we propose Profiler, a lightweight, non-learnable module that captures human citation patterns efficiently and without bias, significantly enhancing candidate retrieval. Furthermore, we identify a critical limitation in current evaluation protocol: the systems are assessed in a transductive setting, which fails to reflect real-world scenarios. We introduce a rigorous Inductive evaluation setting that enforces strict temporal constraints, simulating the recommendation of citations for newly authored papers in the wild. Finally, we present DAVINCI, a novel reranking model that integrates profiler-derived confidence priors with semantic information via an adaptive vector-gating mechanism. Our system achieves new state-of-the-art results across multiple benchmark datasets, demonstrating superior efficiency and generalisability.

Keywords: citation recommendation, human citation patterns, inductive evaluation, reranking model, Profiler, DAVINCI, temporal constraints, academic literature

100. ❌ WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

Authors: Nathan Zhao Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

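Scoring rationale: The paper studies visual PII detection for computer-use agents and builds the WebPII benchmark dataset and the WebRedact model. Although it is an AI application (computer vision), all keywords target the technical principles of large models and deep learning or specific scientific domains such as bioinformatics. The paper does not cover large-model architecture, training methods, inference optimization, alignment, agent systems, or any other keyword topic, nor scientific AI applications such as bioinformatics or cheminformatics. All keyword relevance scores are therefore 0.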

!!! tip deepseek-chat TL;DR

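Targeting the privacy risks of computer-use agents, the paper builds WebPII, the first fine-grained synthetic benchmark for detecting personally identifiable information in web screenshots, and trains the WebRedact model, which more than doubles the accuracy of a text-extraction baseline (0.753 vs 0.357 mAP@50) at real-time CPU latency.

Abstract translation (Chinese)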

计算机使用代理带来了新的隐私风险：从真实网站收集的训练数据不可避免地包含敏感信息，而云端托管的推理过程会暴露用户屏幕截图。检测网页截图中的个人可识别信息对于隐私保护部署至关重要，但目前该任务缺乏公开基准。我们推出WebPII——一个包含44,865张标注电子商务界面图像的细粒度合成基准数据集，其设计具有三个关键特性：扩展的PII分类体系包含可实现重新识别的交易级标识符；针对用户正在输入数据的部分填写表单设计的前瞻性检测能力；以及通过基于视觉语言模型（VLM）的界面复现实现的规模化生成。实验验证表明，这些设计选择提升了跨多样界面的布局无关检测能力，并对未见的页面类型具有良好泛化性。我们训练了WebRedact模型以展示其实用价值，在实时CPU延迟（20毫秒）下，其文本提取准确率较基线提升超过一倍（0.753 vs 0.357 mAP@50）。我们公开数据集与模型，以支持隐私保护计算机使用研究。

Abstract

Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: extended PII taxonomy including transaction-level identifiers that enable reidentification, anticipatory detection for partially-filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. We train WebRedact to demonstrate practical utility, more than doubling text-extraction baseline accuracy (0.753 vs 0.357 mAP@50) at real-time CPU latency (20ms). We release the dataset and model to support privacy-preserving computer use research.

Keywords: PII detection, computer use agents, privacy preservation, web screenshots, synthetic benchmark, UI images, real-time detection, visual information extraction
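
The metric reported in the abstract, mAP@50, counts a predicted box as a correct detection when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching test follows, with boxes given as hypothetical `(x1, y1, x2, y2)` pixel tuples. Full mAP also ranks detections by confidence and averages precision over classes, which is omitted here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred, truth) >= threshold


truth = (100, 200, 300, 240)  # made-up ground-truth box for a form field
good = (110, 200, 300, 240)   # slightly shifted prediction
bad = (250, 200, 450, 240)    # prediction that mostly misses the field

print(is_match(good, truth), is_match(bad, truth))
```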

101. ❌ Learning Permutation Distributions via Reflected Diffusion on Ranks

Authors: Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17353v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

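Scoring rationale: The paper studies diffusion-model methods for learning probability distributions over permutations on the symmetric group S_n and proposes the Soft-Rank Diffusion framework, which belongs to generative modeling and combinatorial optimization within machine learning. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on), since those keywords refer specifically to large language models and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper runs experiments on sorting and combinatorial optimization benchmarks, which can be viewed as AI applied to scientific computing or optimization problems, but this is not the paper's core focus, so it receives 5 points (some relevance).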

!!! tip deepseek-chat TL;DR

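To address the challenge of learning permutation distributions on large discrete symmetric groups, the paper proposes the Soft-Rank Diffusion framework, which lifts permutations to a continuous soft-rank representation to obtain smoother diffusion trajectories, and outperforms prior diffusion baselines on sorting and combinatorial optimization tasks.

Abstract translation (Chinese)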

有限对称群S_n为排列提供了自然定义域，但由于其阶乘级增长的规模以及离散、非欧几里得的结构，学习S_n上的概率分布具有挑战性。近期的排列扩散方法通过基于洗牌的随机游走（如鸽尾式洗牌）定义前向噪声过程，并利用普拉克特-卢斯模型的变体学习逆向转移，但由此产生的轨迹可能较为突变，且随着n增大去噪难度递增。我们提出软排名扩散，这是一种离散扩散框架，用结构化的软排名前向过程替代基于洗牌的破坏过程：通过将离散排名松弛为软排名，将排列提升至连续的序潜在表示，从而产生更平滑且更易处理的轨迹。对于逆向过程，我们引入了情境化广义普拉克特-卢斯去噪器，该模型推广了先前的普拉克特-卢斯式参数化方法，并提升了序列决策结构的表达能力。在排序与组合优化基准测试上的实验表明，软排名扩散始终优于先前的扩散基线方法，在长序列和本质序列化场景中尤其表现出显著优势。

Abstract

The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

Keywords: permutation distributions, diffusion models, soft ranks, Plackett-Luce, combinatorial optimization, symmetric group, generative modeling, sequential decision structures
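
The Plackett-Luce (PL) model referenced in the abstract assigns a probability to a permutation by placing items one at a time, each chosen with probability proportional to the exponential of its score among the items not yet placed. The sketch below computes that log-probability for plain PL. It is not the paper's cGPL denoiser, and the function name and example scores are illustrative.

```python
import math
from itertools import permutations


def plackett_luce_log_prob(scores, ordering):
    """Log-probability of `ordering` (item indices, first-ranked first)."""
    remaining = list(ordering)
    log_prob = 0.0
    for item in ordering:
        # Log-softmax of the chosen item over the items not yet placed.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob


scores = [2.0, 0.5, -1.0]
# Sanity check: probabilities over all 3! orderings should sum to 1.
total = sum(
    math.exp(plackett_luce_log_prob(scores, p)) for p in permutations(range(3))
)
print(round(total, 6))
```

Because the normalizer changes at every step, the number of orderings grows factorially with n, which is the scale problem the abstract points to.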

102. ❌ A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

Authors: Weiming Wu, Zi-Jian Cheng, Jie Meng, Peng Zhen, Shan Huang, Qun Li, Guobin Wu, Lan-Zhe Guo Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

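Scoring rationale: The paper proposes the RideJudge framework, whose core aim is to solve the visual-logic alignment problem of multimodal LLMs in ride-hailing liability adjudication. It touches on LLM applications, alignment, RLHF, context optimization, reasoning chains, hallucination mitigation, and interpretability. It is highly relevant (10 points) to foundational LLM techniques, alignment, RLHF, context window extension, chain-of-thought reasoning, hallucination mitigation, and interpretability; somewhat relevant (5 points) to pre-training and domain adaptation; and unrelated (0 points) to the other keywords such as MoE, SLMs, and quantization.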

!!! tip deepseek-chat TL;DR

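To address the visual-logic alignment problem of multimodal LLMs in ride-hailing liability adjudication, the paper proposes the RideJudge framework, which combines trajectory synthesis, adaptive context optimization, a Chain-of-Adjudication mechanism, and ordinal-sensitive reinforcement learning, reaching 88.41% accuracy with an 8B model and surpassing 32B baselines.

Abstract translation (Chinese)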

责任纠纷的高效裁决对于维护市场公平至关重要。然而，网约车订单量的指数级增长使得人工审核难以为继，而传统的自动化方法又缺乏准司法决策所需的推理透明度。尽管多模态大语言模型提供了一个有前景的范式，但其本质上难以弥合通用视觉语义与严谨证据规则之间的鸿沟，常常导致感知幻觉与逻辑松散。为应对这些系统性偏差，我们提出了RideJudge——一个渐进式视觉-逻辑对齐框架。我们摒弃依赖通用预训练，转而通过SynTraj（一种将抽象责任概念锚定于具体轨迹模式的合成引擎）来弥合语义鸿沟。为解决海量规则文本与有限上下文窗口之间的矛盾，我们提出了一种提炼专家知识的自适应上下文优化策略，并辅以一个裁决链机制来强制执行主动证据质询。此外，针对稀疏二元反馈在复杂责任判定中的不足，我们实现了一种新颖的序数敏感强化学习机制，该机制能依据分级的严重程度校准决策边界。大量实验表明，我们的RideJudge-8B模型达到了88.41%的准确率，超越了32B规模的基线模型，为可解释的裁决系统树立了新标准。

Abstract

The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.

Keywords: Multimodal LLMs, Visual-Logic Alignment, Ride-hailing Adjudication, Chain-of-Adjudication, Ordinal-Sensitive Reinforcement Learning, Hallucination Mitigation, Interpretable AI, Adaptive Context Optimization
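
The abstract contrasts sparse binary feedback with an ordinal-sensitive reward calibrated against hierarchical severity. One simple way to express that idea, offered purely as an assumption and not as the paper's reward, is to scale the reward by the distance between the predicted and true severity levels, so that a near miss earns more than a gross error. The severity labels below are hypothetical.

```python
LEVELS = ["no_fault", "minor", "major", "full"]  # hypothetical severity scale


def binary_reward(pred, truth):
    """Sparse feedback: 1 for an exact match, 0 otherwise."""
    return 1.0 if pred == truth else 0.0


def ordinal_reward(pred, truth, levels=LEVELS):
    """Reward that decays linearly with the distance between severity levels."""
    distance = abs(levels.index(pred) - levels.index(truth))
    return 1.0 - distance / (len(levels) - 1)


for pred in LEVELS:
    print(pred, binary_reward(pred, "full"), round(ordinal_reward(pred, "full"), 3))
```

The binary reward treats `major` and `no_fault` as equally wrong when the truth is `full`, while the ordinal variant separates them.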

103. ❌ ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling

Authors: Ang Li, Xinyang Gong, Bozhou Chen, Yunlong Lu, Jiaming Ji, Yongyi Wang, Yaodong Yang, Wenxin Li Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17324v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

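Scoring rationale: The paper focuses on building ShuttleEnv, a data-driven badminton simulation environment for reinforcement learning and strategy analysis, and does not involve large models, the principles of deep learning techniques, or scientific AI applications. All keywords concern large-model techniques, training methods, inference optimization, alignment, compression, AI for Science, and the like, whereas this work is purely about constructing a reinforcement learning environment and applying AI to sports, so it is unrelated to all of them.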

!!! tip deepseek-chat TL;DR

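The work builds ShuttleEnv, an interactive badminton simulation environment grounded in elite-player match data, to support reinforcement learning and strategic behavior analysis, and demonstrates trained agents in the environment along with its visualization features.

Abstract translation (Chinese)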

我们推出ShuttleEnv，一个用于羽毛球的交互式数据驱动仿真环境，旨在支持快节奏对抗性体育中的强化学习与策略行为分析。该环境基于精英运动员比赛数据，采用显式概率模型模拟回合级动态，从而在不依赖基于物理仿真的情况下实现逼真且可解释的智能体-对手交互。在本演示中，我们展示了ShuttleEnv中多个训练完成的智能体，并提供羽毛球回合的实时逐步可视化，使参与者能够探索不同比赛风格、观察涌现策略，并交互式分析决策行为。ShuttleEnv可作为体育人工智能领域中智能体研究、可视化与演示的可复用平台。我们的ShuttleEnv演示视频地址：https://drive.google.com/file/d/1hTR4P16U27H2O0-w316bR73pxE2ucczX/view

Abstract

We present ShuttleEnv, an interactive and data-driven simulation environment for badminton, designed to support reinforcement learning and strategic behavior analysis in fast-paced adversarial sports. The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics, enabling realistic and interpretable agent-opponent interactions without relying on physics-based simulation. In this demonstration, we showcase multiple trained agents within ShuttleEnv and provide live, step-by-step visualization of badminton rallies, allowing attendees to explore different play styles, observe emergent strategies, and interactively analyze decision-making behaviors. ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI. Our ShuttleEnv demo video URL: https://drive.google.com/file/d/1hTR4P16U27H2O0-w316bR73pxE2ucczX/view

Keywords: ShuttleEnv, badminton, reinforcement learning, strategic behavior analysis, data-driven simulation, interactive environment, sports AI, agent-opponent interactions

104. ❌ Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing

Authors: Aniruddha Bora, Julie Chalfant, Chryssostomos Chryssostomidis Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17319v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

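Scoring rationale: The paper applies physics-informed offline reinforcement learning (PIER) to maritime route optimization, aiming to cut fuel consumption and carbon emissions. Its core is a reinforcement learning framework, not an innovation in large language models (LLMs) or in the principles of deep learning techniques. Every keyword except the last relates directly to LLMs, deep learning techniques, or specific AI methods (such as MoE, RLHF, or RAG), none of which the paper addresses. The last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is scored 5 because the paper applies AI (reinforcement learning) to a scientific problem (optimizing maritime routing to reduce greenhouse gas emissions), which falls under "AI for Science" but is not a core match such as bioinformatics or cheminformatics. All other keywords are scored 0, and the low weighted total indicates weak relevance to the large-model and deep learning topics this review targets.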

!!! tip deepseek-chat TL;DR

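The paper proposes PIER, a physics-informed offline reinforcement learning framework for optimizing maritime routes to reduce fuel consumption and carbon emissions; experiments show that it sharply reduces extreme fuel waste while maintaining stable performance.

Abstract translation (Chinese)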

国际航运约占全球温室气体排放量的3%，但船舶航路规划仍主要依赖启发式方法。本文提出PIER（物理信息、节能、风险感知路由）框架——一种基于离线强化学习的航路规划系统，该框架从基于历史船舶轨迹数据和海洋再分析产品构建的物理校准环境中，学习具有燃油效率与安全意识的航路策略，且无需在线模拟器。通过在墨西哥湾七条航线上对全年（2023年）AIS数据（每种方法840个航次）进行验证，PIER相较于大圆航线平均减少10%的CO2排放。然而，PIER的主要贡献在于消除了灾难性燃油浪费：大圆航线有4.8%的航次出现极端燃油消耗（>1.5倍中位数）；PIER将此比例降至0.5%，实现了9倍的降低。单航次燃油消耗方差降低至3.5倍（p<0.001），平均节油量的自助法95%置信区间为[2.9%，15.7%]。基于实际AIS船舶行为的局部验证表明，PIER与最快实际航行的表现一致，同时方差降低23.1倍。关键的是，PIER不依赖预报：与A*路径优化方法在现实预报不确定性下波浪防护性能下降4.5倍不同，PIER仅依靠局部观测即可保持稳定性能。该框架融合了物理信息状态构建、演示增强的离线数据以及解耦的事后安全防护层，其架构可迁移至野火疏散、航空轨迹优化及未知地形自主导航等领域。

Abstract

International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER’s primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (>1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p<0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.

Keywords: offline reinforcement learning, physics-informed, maritime routing, fuel efficiency, greenhouse gas emissions, AIS data, safety-aware, environmental sustainability
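
The headline robustness figures in the abstract are tail statistics: the share of voyages whose fuel use exceeds 1.5 times the median. A small sketch of that metric follows. The fuel values are made up for illustration; only the 4.8% and 0.5% rates come from the abstract. Which median serves as the reference is not stated there, so the `reference` argument is an assumption that leaves the choice open.

```python
from statistics import median


def extreme_rate(fuel, factor=1.5, reference=None):
    """Fraction of voyages whose fuel use exceeds factor * reference median."""
    ref = median(fuel) if reference is None else reference
    return sum(f > factor * ref for f in fuel) / len(fuel)


# Made-up fuel totals (tonnes) for ten voyages; the last one is an outlier.
fuel = [10.0, 10.5, 9.8, 10.2, 11.0, 9.9, 10.1, 10.4, 10.3, 17.0]
print(extreme_rate(fuel))

# Ratio of the two rates quoted in the abstract (4.8% vs 0.5%).
fold_reduction = 0.048 / 0.005
print(round(fold_reduction, 1))
```

The ratio of the quoted rates comes to about 9.6, which the abstract reports as a 9-fold reduction.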

105. ❌ InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

Authors: Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17310v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

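Scoring rationale: The paper's core topic is the reasoning efficiency of LLMs. It directly involves LLMs, CoT reasoning, and RLHF (it trains with reinforcement learning), and is related to System 2 Thinking (the quality of in-depth reasoning). Other keywords such as MoE, SFT, and RAG are either not mentioned in the abstract or unrelated.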

!!! tip deepseek-chat TL;DR

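To address the redundant and verbose traces that large language models produce during reasoning, the paper proposes InfoDensity, a reward framework based on information density. Trained with reinforcement learning, it matches state-of-the-art accuracy on mathematical reasoning benchmarks while significantly reducing token usage, achieving a good balance between accuracy and efficiency.

Abstract translation (Chinese)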

具备扩展推理能力的大语言模型（LLMs）常生成冗长且冗余的推理轨迹，导致不必要的计算成本。现有的强化学习方法虽通过优化最终响应长度来解决此问题，却忽视了中间推理步骤的质量，使模型易受奖励破解的影响。我们认为，冗长性不仅是长度问题，更是中间推理质量低下的表现。为探究此问题，我们开展了一项实证研究，追踪推理步骤中答案分布的条件熵变化。研究发现，高质量的推理轨迹展现出两个一致特性：低不确定性收敛与单调进展。这些发现表明，高质量的推理轨迹具有信息密集性，即每个步骤相对于总推理长度都能带来有意义的熵减。受此启发，我们提出InfoDensity——一个用于强化学习训练的奖励框架，该框架结合了基于AUC的奖励和单调性奖励，作为推理质量的统一度量，并通过长度缩放项加权，以鼓励以更简洁的方式达到同等推理质量。在数学推理基准测试上的实验表明，InfoDensity在准确率上达到或超越了现有先进基线，同时显著减少了令牌使用量，实现了优异的准确率与效率权衡。

Abstract

Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.

Keywords: Large Language Models, reasoning traces, reinforcement learning, information density, computational efficiency, mathematical reasoning, token reduction, accuracy-efficiency trade-off
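
The InfoDensity abstract describes a reward that combines an AUC-based term and a monotonicity term, weighted by a length scaling term. The abstract does not give the formulas, so the following is only a minimal sketch of how such a reward could be assembled. Every function name, the mixing weight `alpha`, the reference length `ref_tokens`, and the specific form of each term are assumptions made for illustration, not the paper's definitions.

```python
from typing import Sequence


def auc_reward(entropies: Sequence[float]) -> float:
    """Return 1 minus the normalised area under the entropy curve.

    Entropies are divided by the initial entropy and clipped to [0, 1], so a
    trace whose uncertainty drops early scores closer to 1. (Assumed form.)
    """
    if len(entropies) < 2 or entropies[0] <= 0:
        return 0.0
    h0 = entropies[0]
    norm = [min(max(h / h0, 0.0), 1.0) for h in entropies]
    # Trapezoid rule over unit-width steps, averaged over the number of steps.
    area = sum((a + b) / 2 for a, b in zip(norm, norm[1:])) / (len(norm) - 1)
    return 1.0 - area


def monotonicity_reward(entropies: Sequence[float], tol: float = 1e-9) -> float:
    """Fraction of consecutive steps where entropy does not increase. (Assumed form.)"""
    pairs = list(zip(entropies, entropies[1:]))
    if not pairs:
        return 0.0
    return sum(1 for prev, cur in pairs if cur <= prev + tol) / len(pairs)


def length_scale(num_tokens: int, ref_tokens: int = 1024) -> float:
    """Hypothetical length scaling in (0, 1]: shorter traces get a larger weight."""
    return ref_tokens / (ref_tokens + num_tokens)


def info_density_reward(entropies: Sequence[float], num_tokens: int,
                        alpha: float = 0.5, ref_tokens: int = 1024) -> float:
    """Quality (AUC + monotonicity) weighted by the length scaling term."""
    quality = alpha * auc_reward(entropies) + (1 - alpha) * monotonicity_reward(entropies)
    return quality * length_scale(num_tokens, ref_tokens)


# Conditional entropy of the answer distribution after each reasoning step (toy values).
dense = [2.0, 1.0, 0.0]            # steady entropy reduction
wandering = [2.0, 2.5, 1.0, 0.0]   # uncertainty rises before it falls

print(info_density_reward(dense, num_tokens=1024))
print(info_density_reward(wandering, num_tokens=1024))
```

Under these assumptions, a trace that reduces entropy steadily scores higher than one that wanders, and the same trace scores higher when it uses fewer tokens.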

106. ❌ Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Authors: Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17312v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

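Rationale: The paper proposes the R²VLM model, whose core innovation is integrating Chain of Thought (CoT) reasoning into a vision-language model to estimate long-horizon task progress for embodied agents. It is therefore highly relevant to CoT Reasoning (10) and System 2 Thinking (8). The model is a Vision-Language Model, which can be viewed as a large-model application in embodied intelligence, hence the relevance to LLMs (5). It is trained on datasets from ALFRED and Ego4D, which involves pre-training and fine-tuning, hence some relevance to Pre-training (5) and SFT (5). The paper does not cover the technical details or application scenarios of the other keywords, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes R²VLM, a recurrent-reasoning vision-language model that uses an evolving Chain of Thought to explicitly record task decomposition and completion status. This addresses the computational bottleneck of processing long video trajectories and sets a new state of the art in long-horizon embodied task progress estimation.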

Abstract

Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (R²VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train R²VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that R²VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available at https://huggingface.co/collections/zhangyuelin/r2vlm.

Keywords: Vision-Language Models, Recurrent Reasoning, Chain of Thought, Long-horizon Embodied Tasks, Task Progress Estimation, Video Understanding, Temporal Dependencies, Autonomous Agents
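
The recurrent reasoning loop in the R²VLM abstract (process local snippets one at a time, carry global context in an evolving Chain of Thought) can be sketched roughly as below. This is an illustrative sketch only: the `ChainOfThought` class, the `StepFn` interface, the `toy_step_fn` stand-in for the VLM, and the definition of progress as the fraction of completed steps are all assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChainOfThought:
    """Evolving global context: key steps mapped to their completion status."""
    steps: Dict[str, bool] = field(default_factory=dict)

    def progress(self) -> float:
        # Assumed progress measure: fraction of recorded steps marked complete.
        if not self.steps:
            return 0.0
        return sum(self.steps.values()) / len(self.steps)


# Hypothetical model interface: (local snippet, current CoT) -> updated CoT.
StepFn = Callable[[str, ChainOfThought], ChainOfThought]


def estimate_progress(snippets: List[str], step_fn: StepFn,
                      cot: ChainOfThought) -> List[float]:
    """Process snippets one by one; only the CoT is carried between iterations."""
    history = []
    for snippet in snippets:
        cot = step_fn(snippet, cot)   # the model sees one snippet plus the CoT
        history.append(cot.progress())
    return history


def toy_step_fn(snippet: str, cot: ChainOfThought) -> ChainOfThought:
    """Stand-in for the VLM: marks a step done if its name appears in the snippet text."""
    updated = dict(cot.steps)
    for name in updated:
        if name in snippet:
            updated[name] = True
    return ChainOfThought(updated)


initial = ChainOfThought({"pick up mug": False, "rinse mug": False, "place mug": False})
snippets = ["agent walks to sink", "pick up mug from counter", "rinse mug under tap"]
history = estimate_progress(snippets, toy_step_fn, initial)
print(history)
```

The point of the design is that each iteration handles a short snippet and a compact text state, so the cost per step does not grow with the full video length.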

107. ❌ ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization

Authors: Panuganti Chirag Sai, Gandholi Sarat, R. Raghunatha Sarma, Venkata Kalyan Tavva, Naveen M Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

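Rationale: The paper studies memory controller optimization using reinforcement learning and a multi-agent system, and is unrelated to most of the large-model and deep-learning keywords. It has moderate relevance (5) only to 'Multi-agent Systems' (the paper uses a multi-agent framework) and 'Explainable AI' (the paper emphasizes explainability); all other keywords are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper proposes ReLMXEL, an explainable multi-agent memory controller framework based on reinforcement learning that dynamically tunes memory parameters to reduce latency and energy consumption. Experiments show it outperforms baseline configurations across different workloads.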

Abstract

Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), an explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behaviour. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.

Keywords: memory controller, reinforcement learning, multi-agent system, explainable AI, energy optimization, latency reduction, workload adaptation, reward decomposition

108. ❌ Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Authors: Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

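Rationale: The paper proposes Symphony, a multi-agent system for long-video understanding, whose core innovations are multi-agent collaboration and a cognitively inspired reasoning mechanism. Highly relevant keywords: LLM Agents (10, core content), Multi-agent Systems (10, core content), Chain of Thought (8, involves deep reasoning), System 2 Thinking (8, emulates deep human cognition), Self-Reflection (8, includes a reflection mechanism), and Large Language Models (8, built on MLLMs). Moderately relevant keywords: Retrieval-Augmented Generation (5, mentions embedding-based retrieval) and Long Context LLMs (5, handles long video context). The remaining keywords (e.g., MoE, quantization, RLHF) are unrelated to the paper's technical details and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limited reasoning ability of multimodal LLM agents on long-video understanding and proposes Symphony, a multi-agent system inspired by human cognition. Through fine-grained task decomposition and a deep-reasoning collaboration mechanism, it achieves state-of-the-art performance on multiple benchmarks.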

Abstract

Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

Keywords: multi-agent system, long-video understanding, cognitive-inspired, deep reasoning collaboration, reflection mechanism, MLLM agents, task decomposition, state-of-the-art performance

109. ❌ Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Authors: Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17305v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

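Rationale: The paper proposes the CRAFT framework, which focuses on safety alignment of large language models and uses reinforcement learning to optimize objectives over the hidden representation space, improving robustness to jailbreak attacks. Core relevant keywords: Large Language Models (the paper uses Qwen3-4B-Thinking and R1-Distill-Llama-8B), Alignment / Value Alignment (the core research problem), RLHF/DPO (uses GRPO), Chain of Thought / Multi-step Reasoning (uses the model's reasoning ability to generate safety-aware reasoning traces), and System 2 Thinking / In-depth Reasoning (emphasizes reasoning-level safety alignment). Other keywords such as MoE, quantization, and RAG are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper proposes the CRAFT framework, which combines contrastive representation learning with reinforcement learning to align the hidden representation space of large language models. It substantially improves robustness in both reasoning safety and final-response safety and outperforms existing defenses on multiple safety benchmarks.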

Abstract

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

Keywords: CRAFT, alignment, reinforcement learning, hidden representations, reasoning safety, jailbreak attacks, contrastive learning, robustness

110. ❌ From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Authors: Bangju Han, Yingqi Wang, Huang Qing, Tiyuan Li, Fengyi Yang, Ahtamjan Ahmat, Abibulla Atawulla, Yating Yang, Xi Zhou Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17303v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

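Rationale: The paper studies the evaluation of cross-cultural understanding in machine translation, building the CulT-Eval benchmark to assess how well large language models handle cultural expressions. It explicitly reports an extensive evaluation of large language models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8). The other keywords concern large-model technical principles, training methods, optimization techniques, and application scenarios, whereas this paper focuses on benchmark construction and error analysis and does not touch those technical details or application areas, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty machine translation systems have with cultural expressions (idioms, slang, and the like) and builds the CulT-Eval benchmark. Its evaluation finds systematic failure modes in how current large language models preserve cultural meaning and capture cultural nuance, and it proposes a complementary evaluation metric targeting culturally induced meaning deviations.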

Abstract

Culture-expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, with a comprehensive error taxonomy covering culturally grounded expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.

Keywords: machine translation, cross-cultural understanding, cultural expressions, benchmark evaluation, large language models, error analysis, culturally grounded meaning, translation metrics

111. ❌ GUIDE: GenAI Units In Digital Design Education

Authors: Weihua Xiao, Jason Blocklove, Matthew DeLorenzo, Johann Knechtel, Ozgur Sinanoglu, Kanad Basu, Jeyavijayan Rajendran, Siddharth Garg, Ramesh Karri Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17296v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

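Rationale: The paper presents the GUIDE educational platform, which applies large language models to digital design education, particularly chip design and hardware security. Core relevant keywords: 1) 'Large Language Models' (10): the paper repeatedly discusses LLM applications such as RTL generation and testbench generation; 2) 'Chain of Thought' (8): the VeriThoughts unit covers reasoning and formal-verification-backed RTL generation, which relates to multi-step reasoning; 3) 'AI for Science' (10): it is an application of large models in science and engineering, particularly chip design and hardware security. Other keywords such as MoE, quantization, and alignment are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper presents GUIDE, an open courseware repository for applying generative AI (LLMs in particular) to digital design education. Through standardized teaching units and practical cases such as hardware security projects, it demonstrates educational uses of LLMs in chip design and hardware security.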

Abstract

GenAI Units In Digital Design Education (GUIDE) is an open courseware repository with runnable Google Colab labs and other materials. We describe the repository’s architecture and educational approach based on standardized teaching units comprising slides, short videos, runnable labs, and related papers. This organization enables consistency for both the students’ learning experience and the reuse and grading by instructors. We demonstrate GUIDE in practice with three representative units: VeriThoughts for reasoning and formal-verification-backed RTL generation, enhanced LLM-aided testbench generation, and LLMPirate for IP Piracy. We also provide details for four example course instances (GUIDE4ChipDesign, Build your ASIC, GUIDE4HardwareSecurity, and Hardware Design) that assemble GUIDE units into full semester offerings, learning outcomes, and capstone projects, all based on proven materials. For example, the GUIDE4HardwareSecurity course includes a project on LLM-aided hardware Trojan insertion that has been successfully deployed in the classroom and in Cybersecurity Games and Conference (CSAW), a student competition and academic conference for cybersecurity. We also organized an NYU Cognichip Hackathon, engaging students across 24 international teams in AI-assisted RTL design workflows. The GUIDE repository is open for contributions and available at: https://github.com/FCHXWH823/LLM4ChipDesign.

Keywords: GenAI, Digital Design Education, LLM, RTL Generation, Hardware Security, Courseware, Chip Design, AI-assisted Workflow

112. ❌ ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Authors: Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17962v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

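Rationale: The paper studies gender bias in machine translation and proposes the ConGA framework for gender annotation. It explicitly mentions the challenges LLMs face in handling gender across languages, so it is highly relevant to the 'Large Language Models' keyword (8). It does not cover the technical principles, methods, or applications of the other keywords, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper addresses gender bias in machine translation and large language models and proposes the ConGA gender annotation framework. Applying ConGA to the gENder-IT dataset, the evaluation finds systematic masculine overuse and inconsistent feminine realisation in current systems.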

Abstract

Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.

Keywords: Gender Annotation, Machine Translation, Large Language Models, Gender Bias, Multilingual NLP, Contextual Gender, Translation Accuracy, Linguistic Annotation
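
The ConGA abstract describes a word-level scheme with three semantic gender tags for English (M, F, A), two grammatical gender tags for Italian (M, F), and entity-level identifiers for cross-sentence tracking. One way such annotations might be represented is sketched below. The class and field names, the example word pairs, and the `masculine_share_for_ambiguous` statistic are hypothetical illustrations, not the gENder-IT file format or the paper's evaluation metric.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class SemanticGender(Enum):
    """English side: three tags, as described in the abstract."""
    M = "Masculine"
    F = "Feminine"
    A = "Ambiguous"


class GrammaticalGender(Enum):
    """Italian side: grammatical gender realisation."""
    M = "Masculine"
    F = "Feminine"


@dataclass(frozen=True)
class GenderAnnotation:
    entity_id: str                 # links mentions of the same entity across sentences
    en_word: str
    en_gender: SemanticGender
    it_word: str
    it_gender: GrammaticalGender


def masculine_share_for_ambiguous(
        annotations: Iterable[GenderAnnotation]) -> Optional[float]:
    """Share of English-ambiguous words realised as masculine in Italian.

    Returns None when there are no ambiguous words to count.
    """
    counts = Counter(a.it_gender for a in annotations
                     if a.en_gender is SemanticGender.A)
    total = sum(counts.values())
    return counts[GrammaticalGender.M] / total if total else None


# Invented examples for illustration; not taken from the dataset.
sample = [
    GenderAnnotation("e1", "teacher", SemanticGender.A, "professore", GrammaticalGender.M),
    GenderAnnotation("e2", "student", SemanticGender.A, "studente", GrammaticalGender.M),
    GenderAnnotation("e3", "nurse", SemanticGender.A, "infermiera", GrammaticalGender.F),
    GenderAnnotation("e4", "mother", SemanticGender.F, "madre", GrammaticalGender.F),
]
share = masculine_share_for_ambiguous(sample)
print(share)
```

A statistic of this kind is one simple way to surface a masculine default: when the English source is ambiguous, a share well above one half suggests the system is resolving the ambiguity toward masculine forms.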

113. ❌ Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Authors: Chiara Manna, Hosein Mohebbi, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17952v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

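Scoring rationale: The paper studies gender bias of large language models in machine translation, and its core topics are LLMs and post-training. The abstract explicitly mentions 'Large Language Models' and 'post-training (e.g., instruction tuning)', so these two keywords are rated highly relevant (10 points each). 'Instruction Tuning' is mentioned as one form of post-training, so it is rated partially relevant (5 points). The other keywords, such as MoE, SLMs, Scaling Laws, and RAG, are not covered in the paper and are rated 0.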

!!! tip deepseek-chat TL;DR

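This paper studies the gender bias of large language models in decoder-only machine translation and introduces a new measure called 'Prior Bias'. It finds that post-training (e.g., instruction tuning) improves contextual awareness and reduces the masculine Prior Bias.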

Abstract

While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined “Prior Bias”, capturing a model’s default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.

Keywords: Gender Disambiguation, Machine Translation, Large Language Models, Decoder-Only Architectures, Gender Bias, Prior Bias, Post-training, Instruction Tuning

114. ❌ ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Authors: Xuyang Cao, Qianying Liu, Chuan Xiao, Yusuke Oda, Pontus Stenetorp, Daisuke Kawahara, Makoto Onizuka, Sadao Kurohashi, Shuyuan Zheng Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17945v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

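Scoring rationale: The paper's core topic is scaling laws for multilingual pretraining. It is rated highly relevant to 'Scaling Laws AND Data Quality' (10 points) and to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points) because it focuses on optimizing language mixture ratios during pretraining. The paper concerns LLM pretraining but does not go into specific LLM architectures, so 'Large Language Models OR LLMs OR Foundation Models' is rated relevant rather than highly relevant (8 points). The other keywords, such as MoE, SFT, and RAG, are not covered and are rated 0.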

!!! tip deepseek-chat TL;DR

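This paper addresses the optimization of language mixture ratios in multilingual pretraining. It proposes ShapleyLaw, a method based on cooperative game theory that quantifies cross-lingual transfer, and experiments show that it outperforms baseline methods in model performance prediction and language mixture optimization.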

Abstract

In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the cross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called ShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.

Keywords: multilingual pretraining, scaling laws, language mixture ratios, cross-lingual transfer, cooperative game theory, Shapley value, model performance prediction, optimization
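
The cooperative-game framing in this abstract can be illustrated with a small exact Shapley-value computation. The sketch below is not the ShapleyLaw method itself, which the abstract does not specify in detail. The three-language payoff table `TOY_PAYOFF` and the function `shapley_values` are hypothetical, and the payoff numbers are invented purely for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical payoff: reduction in test loss achieved by each coalition of
# languages. These numbers are invented for illustration only.
TOY_PAYOFF = {
    frozenset(): 0.0,
    frozenset({"en"}): 0.50,
    frozenset({"de"}): 0.30,
    frozenset({"ja"}): 0.20,
    frozenset({"en", "de"}): 0.90,
    frozenset({"en", "ja"}): 0.72,
    frozenset({"de", "ja"}): 0.52,
    frozenset({"en", "de", "ja"}): 1.14,
}


def shapley_values(players, payoff):
    """Exact Shapley value of each player for a payoff defined on all coalitions."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (payoff[s | {p}] - payoff[s])
        values[p] = total
    return values


if __name__ == "__main__":
    for lang, value in shapley_values(["en", "de", "ja"], TOY_PAYOFF).items():
        print(f"{lang}: {value:.3f}")
```

The per-language values sum to the payoff of the full coalition, which is the efficiency property that makes Shapley values a natural way to split a joint loss reduction among languages. Exact enumeration scales exponentially in the number of players, so a setting with many languages would presumably need sampling or another approximation.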

115. ❌ Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Authors: Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

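Scoring rationale: The paper's core topic is inference acceleration for LLMs. It proposes a training-free multi-token prediction method that predicts tokens in parallel by probing the embedding space. It is rated highly relevant to 'Large Language Models' (10 points) because LLMs are the object of study, and highly relevant to 'Speculative Decoding OR Inference Acceleration' (10 points) because the method is an inference acceleration technique that reduces the number of model calls by verifying candidate predictions in parallel. The other keywords, such as MoE, SFT, RAG, and quantization, are not covered and are rated 0.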

!!! tip deepseek-chat TL;DR

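This paper proposes a training-free multi-token prediction method that predicts tokens in parallel by probing the embedding space. It keeps generation lossless while substantially reducing the number of model calls, increasing acceptance length by about 12% on LLaMA3 and 8-12% on Qwen3, with throughput gains of up to 15-19%.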

Abstract

Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and achieving throughput gains of up to 15–19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.

Keywords: Large Language Models, Multi-token Prediction, Training-free Method, Embedding-space Probing, Speculative Decoding, Inference Acceleration, Parallel Prediction, Token Throughput
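
The abstract states that candidate predictions are verified in parallel so that generation stays lossless. A minimal sketch of that verification step under greedy decoding follows. It uses a single linear draft instead of the paper's token tree, and both `verify_draft` and the stand-in `toy_model` are hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification of a drafted continuation.

    Returns the tokens to append: the longest drafted prefix that matches what
    the model would have generated greedily, plus one token from the model
    itself. A real system would check all positions in one batched forward
    pass; here the checks are sequential calls for clarity.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # First mismatch: keep the model's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Whole draft accepted; the model contributes one extra token.
    accepted.append(next_token(context))
    return accepted


def toy_model(context: Sequence[int]) -> int:
    """Hypothetical stand-in for an LLM: next token is last token plus one, mod 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The first two drafted tokens match the model; the third does not.
    print(verify_draft([1, 2], [3, 4, 9], toy_model))  # [3, 4, 5]
```

Because every accepted token equals the token the model would have produced on its own, the output is identical to ordinary greedy decoding. The gain comes from accepting several tokens per verification step, which is what the acceptance-length figures in the abstract measure.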

116. ❌ Only relative ranks matter in weight-clustered large language models

Authors: Borja Aizpurua, Sukhbinder Singh, Román Orús Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

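Scoring rationale: The paper's core topic is weight-clustering compression for large language models (LLMs), so it is rated highly relevant to 'Large Language Models' (10 points). The work is a model compression technique, so it is rated highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (10 points). The paper uses pretrained models (Llama 3.1-8B-Instruct and SmolLM2-135M), so it is rated partially relevant to 'Pre-training' (5 points). The method includes optional fine-tuning of the cluster centroids, so it is rated partially relevant to 'Post-training' and 'PEFT' (5 points each). The study examines the importance of the relative rank of weights, so it is rated partially relevant to 'Mechanistic Interpretability' (5 points). The method involves weight clustering, which resembles weight averaging, so it is rated partially relevant to 'Model Merging' (5 points). The paper mentions SmolLM2-135M, so it is rated partially relevant to 'Small Language Models' (5 points). The other keywords, such as MoE, Scaling Laws, Instruction Tuning, and RAG, are unrelated to the paper or not mentioned and are rated 0.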

!!! tip deepseek-chat TL;DR

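This paper finds that in large language models the relative rank of weights matters more than their exact values. Weight clustering that reduces each weight matrix to 16-64 shared values preserves strong accuracy without retraining, giving a simple, training-free way to compress LLMs on disk.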

Abstract

Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude), even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w' = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.

Keywords: Large Language Models, Weight Clustering, Model Compression, Relative Rank, K-means, Llama 3.1, SmolLM2, Perplexity
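
The weight clustering described in this abstract, which replaces each matrix with K shared values from K-means, can be sketched in a few lines. This is only an illustration in plain Python on a small random matrix. The quantile initialization, the iteration count, and the helper names `kmeans_1d`, `cluster_matrix`, and `rank_order` are assumptions made here, not details taken from the paper.

```python
import random


def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, assignments)."""
    ordered = sorted(values)
    n = len(ordered)
    # Start from evenly spaced quantiles so every centroid lies in the data range.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    def assign():
        return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [v for v, a in zip(values, labels) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, assign()


def cluster_matrix(matrix, k):
    """Replace every entry of a 2-D list with its nearest shared centroid."""
    flat = [w for row in matrix for w in row]
    centroids, labels = kmeans_1d(flat, k)
    cols = len(matrix[0])
    quantized = [
        [centroids[labels[r * cols + c]] for c in range(cols)]
        for r in range(len(matrix))
    ]
    return quantized, centroids


def rank_order(values):
    """Indices sorted by value; lists with equal rank_order share relative ranks."""
    return sorted(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    random.seed(0)
    weights = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(16)]
    quantized, centroids = cluster_matrix(weights, k=8)
    print("distinct values:", len({w for row in quantized for w in row}))
    # An affine map with a > 0 changes magnitudes but keeps the centroid ordering.
    rescaled = [0.5 * c + 0.1 for c in centroids]
    print("ranks preserved:", rank_order(rescaled) == rank_order(centroids))
```

After clustering, a matrix can be stored as K centroid values plus a small integer index per weight, which is where the on-disk saving would come from. The final check illustrates why an affine correction with a > 0 is rank-preserving: it rescales the centroids without reordering them.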

117. ❌ Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages

Authors: Yue Zhao, Jiatao Gu, Paloma Jeretič, Weijie Su Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

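Scoring rationale: The paper's core idea is to use the attention mechanisms of pretrained multilingual Transformer models, specifically Attention Transport Distance, to quantify the distance between languages, which is a novel application of large models to linguistics. Highly relevant keywords: Pre-training (10 points, the paper builds on pretrained models), Large Language Models (8 points, it uses multilingual Transformers), KV Cache Compression (8 points, it involves analysis of attention mechanisms), and Mechanistic Interpretability (8 points, it explains internal model mechanisms). AI for Science receives 5 points because the work applies AI to linguistics, a branch of science. The other keywords, such as MoE, SFT, and RAG, are unrelated to the paper and are rated 0.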

!!! tip deepseek-chat TL;DR

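This paper proposes Attention Transport Distance, a new method based on the attention mechanisms of pretrained multilingual Transformer models, to quantify the distance between human languages. It shows that the method accurately reflects linguistic classification and geographic relationships and that it can improve low-resource machine translation.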

Abstract

Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.

Keywords: pretrained multilingual transformers, attention mechanisms, language distance, Attention Transport Distance, quantitative linguistics, machine translation, low-resource transfer, optimal transport
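
The abstract describes treating attention matrices as probability distributions and comparing them via optimal transport. As a minimal illustration of that idea, the sketch below computes the Wasserstein-1 distance between two attention rows that live on the same evenly spaced grid of positions. It is generic 1-D optimal transport, not the paper's ATD definition, which the abstract does not spell out. The function name `wasserstein_1d` and the example rows are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(p, q, spacing=1.0):
    """W1 distance between two distributions on the same evenly spaced 1-D grid.

    In one dimension the optimal transport cost equals the area between the
    two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if abs(sum(p) - 1.0) > 1e-9 or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("p and q must each sum to 1")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return spacing * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


if __name__ == "__main__":
    # Hypothetical attention rows over three source positions.
    row_a = [0.5, 0.5, 0.0]
    row_b = [0.0, 0.5, 0.5]
    print(wasserstein_1d(row_a, row_b))
```

Unlike an entrywise difference, this distance grows with how far attention mass has to move, so two rows that attend to neighboring positions are closer than two rows that attend to opposite ends of the sentence.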

118. ❌ DebugLM: Learning Traceable Training Data Provenance for LLMs

Authors: Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Wenxuan Zhou, Zhe Zhao, Muhao Chen Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17884v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

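Scoring rationale: The paper's core topic is a framework for tracing and debugging the training data provenance of LLMs, so it is rated highly relevant to 'Large Language Models' (10 points). It involves multi-stage training pipelines, so it is rated partially relevant to 'Pre-training' and 'Post-training' (5 points each). It studies the explainability of model behavior, so it is rated partially relevant to 'Mechanistic Interpretability' (5 points). The other keywords, such as MoE, quantization, inference acceleration, and AI for science, are not covered and are rated 0.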

!!! tip deepseek-chat TL;DR

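The paper proposes DebugLM, a framework that equips LLMs with data provenance to address the difficulty of tracing where training data came from. The model can trace its behaviors back to specific training data sources, and the framework supports targeted test-time remediation without retraining.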

Abstract

Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.

Keywords: Large Language Models, Data Provenance, Training Data Tracing, Debugging Framework, Test-time Remediation, Behavior Attribution, Multi-stage Training, Model Observability
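
To make the idea of provenance tags and test-time remediation concrete, here is one way a serving layer could act on such tags. Everything in this sketch is an assumption: the `[source: NAME]` tag format, the `remediate` function, and the refusal text are hypothetical. The paper's own mechanism may differ, for example by having the model itself produce the refusal.

```python
import re

# Hypothetical tag format: the model ends each response with "[source: NAME]".
TAG_PATTERN = re.compile(r"\[source:\s*([\w\-]+)\]\s*$")
REFUSAL = "I can't help with that."


def remediate(response: str, blocked_sources: set) -> str:
    """Return a refusal if the response is attributed to a blocked data source."""
    match = TAG_PATTERN.search(response)
    if match and match.group(1) in blocked_sources:
        return REFUSAL
    return response


if __name__ == "__main__":
    blocked = {"forum_dump"}  # hypothetical dataset name
    print(remediate("Here is the answer. [source: forum_dump]", blocked))
    print(remediate("Here is the answer. [source: textbook]", blocked))
```

The point of the sketch is that the set of blocked sources can change at test time without retraining or touching model parameters, which matches the property the abstract emphasizes.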

119. ❌ Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark

Authors: Yao Wang, Xin Liu, Zhuochen Liu, Jiankang Chen, Adam Jatowt, Kyoungsook Kim, Noriko Kando, Haitao Yu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

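Scoring rationale: The paper studies human value understanding in the news domain, builds the NEVU benchmark dataset, and evaluates it with LLMs. It is rated highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (10 points) because its core concern is recognizing human values and judging their direction. It is rated relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) because it uses LLMs for evaluation and assisted annotation. It is rated relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (8 points) because it mentions using LoRA for lightweight adaptation. The other keywords are unrelated to the paper's technical focus (a news value understanding benchmark) and are rated 0.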

!!! tip deepseek-chat TL;DR

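This paper presents the NEVU benchmark dataset for evaluating how well large models recognize and understand human values in news text. Its experiments find that lightweight adaptation methods such as LoRA effectively improve open-source models on this task.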

Abstract

Existing human value datasets do not directly support value understanding in factual news: many are actor-agnostic, rely on isolated utterances or synthetic scenarios, and lack explicit event structure or value direction. We present NEVU (News Event-centric Value Understanding), a benchmark for actor-conditioned, event-centric, and direction-aware human value recognition in factual news. NEVU evaluates whether models can identify value cues, attribute them to the correct actor, and determine value direction from grounded evidence. Built from 2,865 English news articles, NEVU organizes annotations at four semantic unit levels (Subevent, behavior-based composite event, story-based composite event, and Article) and labels (unit, actor) pairs for fine-grained evaluation across local and composite contexts. The annotations are produced through an LLM-assisted pipeline with staged verification and targeted human auditing. Using a hierarchical value space with 54 fine-grained values and 20 coarse-grained categories, NEVU covers 45,793 unit–actor pairs and 168,061 directed value instances. We provide unified baselines for proprietary and open-source LLMs, and find that lightweight adaptation (LoRA) consistently improves open-source models, showing that although NEVU is designed primarily as a benchmark, it also supports supervised adaptation beyond prompting-only evaluation. Data availability is described in the paper's appendix.

Keywords: human value understanding, news domain, benchmark, actor-conditioned, event-centric, LLM-assisted, LoRA adaptation, value recognition

120. ❌ The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Authors: Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen, Eng Siong Chng, Yoshua Bengio Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17837v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

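Scoring rationale: The paper proposes FLAIR, a latent reasoning mechanism that lets a spoken dialogue system think while listening. Core relevant keywords: 1) 'Post-training OR Supervised Fine-tuning OR SFT' (10 points): it uses an ELBO-based objective for efficient supervised fine-tuning; 2) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points): it implements a recursive latent reasoning process; 3) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points): it models human internal cognitive processing; 4) 'Large Language Models OR LLMs OR Foundation Models' (8 points): it is an application of large models to dialogue systems; 5) 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points): it involves internal cognitive processing. The other keywords are unrelated to the paper or not mentioned.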

!!! tip deepseek-chat TL;DR

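This paper proposes FLAIR, a full-duplex latent and internal reasoning method that performs continuous latent thinking during speech perception to model the internal cognitive processing humans engage in during conversation. Experiments show that it achieves competitive results on multiple speech benchmarks.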

摘要 (Abstract)

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional “thinking” mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user’s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Keywords: full-duplex spoken dialogue, latent reasoning, internal cognition, think-while-listening, supervised fine-tuning, causal reasoning, speech perception, conversational dynamics
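
The recursion described in the abstract, where the latent output of one step is fed into the next while the user is still speaking, can be illustrated with a toy loop. This is only a sketch under loud assumptions: the tanh cell, the weights, the frame values, and the name `listen_and_think` are all hypothetical, and nothing here reproduces the FLAIR architecture or its ELBO-based objective.

```python
import math

def listen_and_think(frames, w_in, w_rec):
    """Causal recurrence: the latent at step t depends only on frames[0..t]."""
    latent = [0.0] * len(w_rec)
    latents = []
    for frame in frames:  # one update per incoming audio frame
        # Feed the previous latent back in together with the new frame.
        latent = [
            math.tanh(
                sum(wi * x for wi, x in zip(w_in[k], frame))
                + sum(wr * h for wr, h in zip(w_rec[k], latent))
            )
            for k in range(len(w_rec))
        ]
        latents.append(latent)
    return latents

# Toy inputs (hypothetical values, not real speech features).
frames = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.1], [0.2, 0.2]]
w_in = [[0.5, -0.2], [0.1, 0.3]]
w_rec = [[0.4, 0.0], [-0.1, 0.2]]
latents = listen_and_think(frames, w_in, w_rec)

# Causality check: changing a later frame must not alter earlier latents.
perturbed = [list(f) for f in frames]
perturbed[2][0] += 1.0
latents_perturbed = listen_and_think(perturbed, w_in, w_rec)
```

Because each latent is computed as its frame arrives, no separate "thinking" pass is needed after the user stops speaking, which is the property the abstract refers to as adding no extra latency.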

121. ❌ Multi-Source Evidence Fusion for Audio Question Answering

Authors: Aivo Olev, Tanel Alumäe Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

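Scoring rationale: The paper addresses audio question answering with large audio language models (LALMs) as core components, so it is highly relevant to 'Large Language Models'. Its focus on the quality of the reasoning process (logical soundness, factual accuracy) ties directly to 'Chain of Thought', 'System 2 Thinking', 'Hallucination Mitigation', and 'Mechanistic Interpretability'. The system uses multi-source evidence fusion and tool calls (25 acoustic tools) to verify its reasoning, which reflects the core ideas of 'LLM Agents' and 'Tool Use'. Other keywords (MoE, quantization, RAG, and so on) are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper presents a multi-source evidence fusion system for audio question answering. It combines observations from two large audio language models, cross-checks them against 25 acoustic tools, and produces dense, verifiable reasoning chains, ranking first in the Agent Track of the Interspeech 2026 Audio Reasoning Challenge.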

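Abstract (translation)

Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, but their internal reasoning is largely opaque and hard to verify. This paper describes TalTech's solution for the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, which evaluates the quality of a system's reasoning process, in particular the factual accuracy, logical soundness, and completeness of its reasoning chains. Our multi-source ensemble pipeline has two LALMs produce independent observations, while a separate text-only reasoning model cross-checks them against the outputs of 25 acoustic tools organized into reliability tiers. Because every inference step is grounded in evidence with an explicit reliability tag, the system produces dense, verifiable reasoning chains. In the challenge, our system ranked first, beating all competitors by a wide margin on the reasoning quality metric.

Abstract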

Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech’s solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin on the challenge’s reasoning quality metric.

Keywords: Audio Question Answering, Large Audio Language Models (LALMs), Multi-source Evidence Fusion, Reasoning Chain Quality, Acoustic Tools, Factual Accuracy, Agent Track, Verifiable Reasoning
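
One way to picture the cross-checking of model observations against reliability-tiered tool outputs is the small data structure below. It is a hedged sketch, not the authors' pipeline: the `Evidence` fields, the tier convention (1 is most reliable), the `max_tier` cutoff, and the source names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    source: str  # hypothetical tool or model label
    claim: str   # attribute being described, e.g. "speaker_count"
    value: str
    tier: int    # assumed convention: 1 = most reliable, larger = less reliable

def cross_check(observation, tool_outputs, max_tier=2):
    """Label one model observation using tool evidence about the same claim."""
    relevant = [
        e for e in tool_outputs
        if e.claim == observation.claim and e.tier <= max_tier
    ]
    if not relevant:
        return "unverified", None
    best = min(relevant, key=lambda e: e.tier)  # trust the most reliable tool
    status = "supported" if best.value == observation.value else "contradicted"
    return status, best

tools = [
    Evidence("vad_tool", "speaker_count", "2", tier=1),
    Evidence("genre_tool", "music_genre", "jazz", tier=3),
]
obs_a = Evidence("lalm_a", "speaker_count", "2", tier=4)
obs_b = Evidence("lalm_b", "speaker_count", "3", tier=4)
obs_c = Evidence("lalm_a", "music_genre", "rock", tier=4)

status_a, _ = cross_check(obs_a, tools)
status_b, _ = cross_check(obs_b, tools)
status_c, _ = cross_check(obs_c, tools)
```

Returning the supporting evidence alongside the status keeps each reasoning step traceable to a tagged source, which is the property the abstract calls reliability-tagged evidence.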

122. ❌ Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Authors: Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17815v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

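Scoring rationale: The paper's core topic is process supervision of LLM chain-of-thought reasoning, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 each). Its information-theoretic method for automatically generating step-level labels targets in-depth reasoning and relates to 'System 2 Thinking' (8). The aim of making reasoning more reliable gives some connection to 'Self-Correction' and 'Hallucination Mitigation' (5 each). The method offers a degree of interpretability (5) and is applied to tasks including scientific question answering (5). Other keywords such as MoE, SFT, and RAG are not covered (0).

!!! tip deepseek-chat TL;DR

The paper proposes an information-theoretic method that automatically generates step-level labels by estimating how each reasoning step affects the likelihood of the correct answer, enabling efficient process supervision of LLM chain-of-thought reasoning and more reliable multi-step reasoning.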

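Abstract (translation)

Multi-step reasoning improves the capabilities of large language models (LLMs) but also raises the risk that errors propagate through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, which enables fine-grained supervision and improves reliability. Existing PRM training methods rely on expensive human annotation or computationally intensive automatic labeling. We propose a new method that uses information theory to generate step-level labels automatically. The method estimates how each reasoning step affects the likelihood of the correct answer, which provides a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving on earlier $\mathcal{O}(N \log N)$ methods. We show that, across diverse reasoning benchmarks covering mathematics, Python programming, SQL, and scientific question answering, these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings. This work offers scalable and efficient supervision of LLM reasoning, especially for tasks where error propagation matters.

Abstract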

Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving over the previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.

Keywords: Chain-of-Thought Reasoning, Process Supervision, Monte Carlo Net Information Gain, Large Language Models, Multi-step Reasoning, Information Theory, Step-level Labels, Reasoning Reliability
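
The idea of scoring a step by how it changes the likelihood of the correct answer can be sketched as a log-likelihood difference between consecutive prefixes. This is an illustrative reading, not the paper's estimator: the probabilities below are made-up inputs that would in practice come from sampling, and the sign-based labeling rule is an assumption.

```python
import math

def step_information_gain(answer_probs):
    """Per-step change in log-likelihood of the correct answer.

    answer_probs[i] is an estimate of P(correct answer | first i steps),
    for i = 0..N, so a chain of N steps needs N + 1 estimates.
    """
    return [
        math.log(after) - math.log(before)
        for before, after in zip(answer_probs, answer_probs[1:])
    ]

# Hypothetical estimates for a 3-step chain (index 0 = before any step).
probs = [0.20, 0.40, 0.10, 0.80]
gains = step_information_gain(probs)

# Assumed labeling rule: a step is "good" if it raises the likelihood.
labels = [1 if g > 0 else 0 for g in gains]
```

Each step is scored from two adjacent estimates, so the number of likelihood evaluations grows linearly with the number of steps, which is consistent with the $\mathcal{O}(N)$ cost quoted in the abstract. The gains also telescope: their sum equals the total log-likelihood change over the whole chain.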

123. ❌ Modeling Overlapped Speech with Shuffles

Authors: Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17769v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

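Scoring rationale: The paper is a speech processing study. It proposes alignment and speaker-attributed transcription of overlapped speech based on the shuffle product and partial order finite-state automata (FSAs). It covers speech recognition, multi-talker processing, finite-state automata, Viterbi alignment, and other traditional speech processing techniques, and does not touch on large language models, innovations in deep learning principles, AI for Science, or the other topics the keywords target. All keywords concern large models, deep learning, or scientific applications of AI, whereas this paper is purely speech signal processing research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new algorithm based on the shuffle product and partial order finite-state automata for alignment and speaker-attributed transcription of multi-talker recordings. The authors describe it as, to their knowledge, the first algorithm to enable single-pass alignment of such recordings.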

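Abstract (translation)

We propose modeling parallel data streams, such as overlapped speech, with shuffles. Specifically, the paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train with the total score on these FSAs as the loss function, marginalizing over all possible serializations of the overlapping sequences at the subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly yields one-pass alignment. We evaluate the algorithms on synthetic overlapped speech built from LibriSpeech. To our knowledge, this is the first algorithm to enable single-pass alignment of multi-talker recordings. All algorithms are implemented with k2 / Icefall.

Abstract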

We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.

Keywords: overlapped speech, shuffle product, partial order FSAs, speaker-attributed transcription, Viterbi alignment, multi-talker recordings, single-pass alignment, speech recognition
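
For readers unfamiliar with the shuffle product, the sketch below enumerates it for two short sequences of (token, speaker) tuples: every interleaving that keeps each speaker's own word order. It is a plain enumeration for illustration only; the paper works with FSAs in k2 / Icefall instead of explicit lists, and the example utterances are invented.

```python
from math import comb

def shuffle_product(a, b):
    """Yield every interleaving of a and b that preserves each sequence's order."""
    if not a:
        yield tuple(b)
        return
    if not b:
        yield tuple(a)
        return
    # Either the next item comes from a ...
    for rest in shuffle_product(a[1:], b):
        yield (a[0],) + rest
    # ... or it comes from b.
    for rest in shuffle_product(a, b[1:]):
        yield (b[0],) + rest

# Invented two-speaker example using (token, speaker) tuples.
spk1 = [("hello", 1), ("there", 1)]
spk2 = [("hi", 2)]
serializations = list(shuffle_product(spk1, spk2))

for s in serializations:
    print(s)
```

Two sequences of lengths m and n have `comb(m + n, m)` interleavings, so the count grows quickly with utterance length. That growth is the reason the abstract gives for adding temporal constraints through partial order FSAs to reduce graph size.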

124. ❌ Complementary Reinforcement Learning

Authors: Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17621v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

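Scoring rationale: The paper's core topic is a reinforcement learning training method for LLM-based agents; it proposes the Complementary RL framework, in which an experience extractor and a policy actor co-evolve. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10) because it explicitly studies LLM-based agents, and to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10) because it focuses on an agent learning paradigm. The other keywords, including MoE, SLMs, Scaling Laws, training techniques (pre-training, SFT, RLHF, and so on), reasoning methods (CoT, MCTS), model efficiency (quantization, speculative decoding), and application areas (AI for Science), are neither mentioned in nor connected to the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address low sample efficiency in reinforcement learning for LLM-based agents and the mismatch between stored historical experience and the actor's evolving capability, the paper proposes Complementary RL, which jointly optimizes an experience extractor and a policy actor. It reports a 10% performance improvement in single-task scenarios and robust scalability in multi-task settings.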

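Abstract (translation)

Reinforcement learning (RL) has become a powerful paradigm for training agents based on large language models (LLMs), but it remains limited by low sample efficiency. This limitation stems not only from sparse outcome feedback but also from the agent's inability to make effective use of historical experience across episodes. Augmenting agents with historical experience is a promising remedy, yet existing methods share a key flaw: the experience distilled from history is either stored statically or fails to co-evolve with the continually improving actor, so the experience gradually falls out of step with the actor's evolving capability and loses utility during training. Inspired by complementary learning systems in neuroscience, we propose Complementary RL, which achieves seamless co-evolution of an experience extractor and a policy actor inside the RL optimization loop. Specifically, the actor is optimized with sparse outcome rewards, while the experience extractor is optimized according to whether its distilled experience demonstrably contributes to the actor's success, so that its experience management strategy evolves in step with the actor's growing capability. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and showing strong scalability in multi-task settings. These results establish Complementary RL as a new paradigm for efficient experience-driven agent learning.

Abstract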

Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent’s inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor’s evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor’s success, thereby evolving its experience management strategy in lockstep with the actor’s growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.

Keywords: Reinforcement Learning, LLM-based agents, sample efficiency, historical experience, co-evolution, experience extractor, policy actor, multi-task learning
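
The two reward signals named in the abstract can be contrasted with a toy calculation. The abstract does not say how "demonstrably contribute" is measured, so the counterfactual comparison below (success rate with the experience minus success rate without it) is purely an assumption, as are the function names and rollout outcomes.

```python
def actor_reward(success):
    """Sparse outcome reward: 1.0 only when the episode succeeds."""
    return 1.0 if success else 0.0

def extractor_reward(outcomes_with, outcomes_without):
    """Assumed credit rule: change in the actor's success rate.

    Both arguments are lists of booleans, one entry per rollout, with and
    without the distilled experience in the actor's context.
    """
    rate_with = sum(outcomes_with) / len(outcomes_with)
    rate_without = sum(outcomes_without) / len(outcomes_without)
    return rate_with - rate_without

# Invented rollout outcomes for two pieces of distilled experience.
helpful = extractor_reward([True, True, False, True], [True, False, False, False])
stale = extractor_reward([False, True, False, False], [True, False, True, False])
```

Under this reading, experience that no longer matches the actor's current capability earns a negative reward, which would push the extractor to revise it as the actor improves.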

125. ❌ Temporal Narrative Monitoring in Dynamic Information Environments

Authors: David Farr, Stephen Prochaska, Jack Moody, Lynnette Hui Xian Ng, Iain Cruickshank, Kate Starbird, Jevin West Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

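Scoring rationale: The paper "Temporal Narrative Monitoring in Dynamic Information Environments" studies a narrative monitoring framework for dynamic information environments during crisis events. Its core method integrates semantic embeddings, density-based clustering, and rolling temporal linkage to model semantic structures that evolve over time. Although the paper uses semantic embeddings (which may involve NLP techniques), its title and abstract make no mention of large language models (LLMs), deep learning, MoE, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, inference acceleration, hallucination mitigation, model compression, agents, or any other technique named in the keywords. The paper belongs to information science and crisis informatics and is concerned with narrative modeling and situational awareness, not with the technical principles of large models or their novel applications in science. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a system-oriented framework that models emerging narratives in dynamic information environments (such as crisis events) as temporally evolving semantic structures without requiring prior labels. Applied to a real-world crisis event, the framework produces highly coherent clusters and reveals heterogeneous narrative lifecycles.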

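Abstract (translation)

Understanding the information environment (IE) during crisis events is challenging because the domain changes rapidly and is abstract in nature. Many existing approaches capture static snapshots of the IE through classification or network analysis and overlook how information evolves over time.

This work presents a system-oriented framework that models emerging narratives as temporally evolving semantic structures without requiring labels to be specified in advance. By integrating semantic embeddings, density-based clustering, and rolling temporal linkage, the framework represents narratives as persistent yet adaptive entities in a shared semantic space. We apply the method to a real-world crisis event and evaluate system behavior through stratified cluster validation and temporal lifecycle analysis. The results show high cluster coherence and reveal heterogeneous narrative lifecycles in which transient fragments coexist with stable narrative anchors.

Our approach is grounded in situational awareness theory: it supports perception and comprehension of the IE by turning unstructured social media streams into interpretable, temporally structured representations. The resulting system provides a methodology for monitoring and decision support in dynamic information environments.

Abstract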

Comprehending the information environment (IE) during crisis events is challenging due to the rapid change and abstract nature of the domain. Many approaches focus on snapshots via classification methods or network approaches to describe the IE in crisis, ignoring the temporal nature of how information changed over time. This work presents a system-oriented framework for modeling emerging narratives as temporally evolving semantic structures without requiring prior label specification. By integrating semantic embeddings, density-based clustering, and rolling temporal linkage, the framework represents narratives as persistent yet adaptive entities within a shared semantic space. We apply the methodology to a real-world crisis event and evaluate system behavior through stratified cluster validation and temporal lifecycle analysis. Results demonstrate high cluster coherence and reveal heterogeneous narrative lifecycles characterized by both transient fragments and stable narrative anchors. We ground our approach in situational awareness theory, supporting perception and comprehension of the IE by transforming unstructured social media streams into interpretable, temporally structured representations. The resulting system provides a methodology for monitoring and decision support in dynamic information environments.

Keywords: temporal narrative monitoring, dynamic information environments, semantic embeddings, density-based clustering, rolling temporal linkage, crisis events, situational awareness, narrative lifecycles
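
The "rolling temporal linkage" step can be pictured as matching each window's clusters to the previous window's clusters. The sketch below assumes clustering has already been done upstream and links clusters by cosine similarity of their centroids; the similarity measure, the 0.8 threshold, and the centroid values are assumptions, since the abstract does not specify the linkage rule.

```python
import math
from itertools import count

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_windows(windows, threshold=0.8):
    """Assign a narrative id to every cluster centroid in every time window.

    windows: list of lists of centroids, one inner list per time window.
    A cluster inherits the id of its most similar predecessor when the
    similarity reaches the threshold; otherwise it starts a new narrative.
    """
    next_id = count()
    previous = []  # (narrative_id, centroid) pairs from the prior window
    assignments = []
    for centroids in windows:
        current = []
        for c in centroids:
            best = max(previous, key=lambda p: cosine(p[1], c), default=None)
            if best is not None and cosine(best[1], c) >= threshold:
                nid = best[0]        # narrative persists into this window
            else:
                nid = next(next_id)  # new narrative appears
            current.append((nid, c))
        assignments.append([nid for nid, _ in current])
        previous = current
    return assignments

# Toy 2-D centroids for three consecutive windows (invented values).
windows = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [-1.0, 0.0]],
    [[1.0, 0.2]],
]
narrative_ids = link_windows(windows)
```

In this toy run narrative 0 persists through all three windows while narratives 1 and 2 each appear once, a small-scale analogue of the stable anchors and transient fragments the abstract describes. Note that this simple rule allows two clusters in one window to inherit the same id, which a fuller system would have to treat as a merge.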

126. ❌ VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation

Authors: Yaoxiang Wang, Qi Shi, ShangZhan Li, Qingguo Hu, Xinyu Yin, Bo Guo, Xu Han, Maosong Sun, Jinsong Su Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

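Scoring rationale: The paper studies the application of LLMs to hardware design. It proposes VeriAgent, a multi-agent system made up of a Programmer Agent, a Correctness Agent, and a PPA Agent, which optimizes RTL code generation through tool integration and an evolving memory mechanism. Highly relevant keywords: LLMs (the paper explicitly uses LLMs for code generation), LLM Agents and Multi-agent Systems (the core framework is a multi-agent system), and Tool Use (EDA tools are integrated into a closed-loop workflow). Other keywords such as MoE, SFT, and RAG are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes VeriAgent, a PPA-aware multi-agent system that integrates EDA tools with an evolving memory mechanism and significantly improves the power, performance, and area (PPA) of generated hardware designs while preserving functional correctness.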

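Abstract (translation)

Large language models have recently shown strong capabilities in automatically generating register-transfer level (RTL) code, achieving high syntactic and functional correctness. However, most existing methods focus only on functional correctness and neglect key physical design objectives, namely power, performance, and area (PPA). This paper proposes a PPA-aware, tool-integrated multi-agent framework for generating high-quality Verilog code. The framework explicitly integrates electronic design automation (EDA) tools into a closed-loop workflow composed of a programming agent, a correctness verification agent, and a PPA optimization agent, jointly optimizing functional correctness and physical metrics. To support continuous improvement without retraining the model, we introduce an evolving memory mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager maintains the memory pool dynamically, so the system can keep refining its strategies based on historical execution trajectories. Extensive experiments show that our method significantly improves PPA metrics while maintaining strong functional correctness. By combining tool-driven feedback with structured, evolvable memory, the framework turns RTL code generation from one-shot reasoning into a continual, feedback-driven optimization process and offers a scalable path for deploying LLMs in real hardware design flows.

Abstract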

LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality Verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a Programmer Agent, a Correctness Agent, and a PPA Agent, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an Evolved Memory Mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.

Keywords: LLMs, multi-agent system, tool integration, RTL code generation, PPA optimization, evolved memory, hardware design, EDA tools

127. ❌ Modeling Changing Scientific Concepts with Complex Networks: A Case Study on the Chemical Revolution

Authors: Sofía Aguilar-Valdez, Stefania Degaetano-Ortlieb Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

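Scoring rationale: The paper models the evolution of scientific concepts with complex networks, using the Chemical Revolution as a case study. This is an application of AI to science, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8). The paper mentions that contextual embeddings from LLMs can be used to estimate conceptual change but points out their limitations, so it has some connection to 'Large Language Models OR LLMs OR Foundation Models' (5). The other keywords mainly concern the technical principles of large models, training methods, inference optimization, and agent systems, none of which the paper addresses, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper develops a complex-network framework for modeling how scientific concepts change, using the phlogiston and oxygen theories of the Chemical Revolution as a case study. It finds that onomasiological (naming) change is linked to higher entropy and topological density, indicating greater diversity of ideas and connectivity effort.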

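Abstract (translation)

Although contextual embeddings produced by large language models can be used to assess conceptual change, these representations often lack both interpretability and time awareness. In addition, bias augmentation in historical data poses a non-trivial risk for researchers in the digital humanities. To build reliable models of concept trajectories in evolving scholarship, this study develops a framework that represents prototypical concepts through topic-based complex networks. Using the Royal Society Corpus, we analyze two competing theories from the Chemical Revolution (phlogiston vs. oxygen) as a case study. The results show that onomasiological change is associated with higher entropy and topological density, reflecting greater diversity of ideas and increased connectivity effort.

Abstract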

While context embeddings produced by LLMs can be used to estimate conceptual change, these representations are often neither interpretable nor time-aware. Moreover, bias augmentation in historical data poses a non-trivial risk to researchers in the Digital Humanities. Hence, to model reliable concept trajectories in evolving scholarship, in this work we develop a framework that represents prototypical concepts through complex networks based on topics. Utilizing the Royal Society Corpus, we analyzed two competing theories from the Chemical Revolution (phlogiston vs. oxygen) as a case study to show that onomasiological change is linked to higher entropy and topological density, indicating increased diversity of ideas and connectivity effort.

Keywords: conceptual change, complex networks, Chemical Revolution, topic modeling, historical data, entropy, topological density, Digital Humanities
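
To make "entropy" and "topological density" concrete, the sketch below computes both for two toy undirected graphs. The density formula is the standard one for simple undirected graphs; the entropy shown is the Shannon entropy of each node's share of edge endpoints, which is only one possible definition and an assumption here, since the abstract does not state which entropy the authors use. The graphs are invented and carry no historical data.

```python
import math
from collections import Counter

def density(nodes, edges):
    """Fraction of possible undirected edges that are present."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

def degree_entropy(edges):
    """Shannon entropy (bits) of each node's share of edge endpoints."""
    degree = Counter(node for edge in edges for node in edge)
    total = sum(degree.values())
    return -sum((d / total) * math.log2(d / total) for d in degree.values())

nodes = ["a", "b", "c", "d"]

# Toy network 1: a star, where one concept links to all others.
star = [("a", "b"), ("a", "c"), ("a", "d")]

# Toy network 2: fully connected, every concept linked to every other.
complete = [("a", "b"), ("a", "c"), ("a", "d"),
            ("b", "c"), ("b", "d"), ("c", "d")]

star_stats = (density(nodes, star), degree_entropy(star))
complete_stats = (density(nodes, complete), degree_entropy(complete))
```

The fully connected graph scores higher on both measures than the star, which shows the direction of the effect the abstract reports: more evenly spread, more densely linked concepts raise entropy and density together.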

128. ❌ From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation

Authors: Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu, Yong Huang, Tianrui Guo, Wei Lu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

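Scoring rationale: The paper studies the use of LLMs for evaluating scientific papers, an application of large models in the scientific domain. Highly relevant keywords: 1) 'Large Language Models' (the paper explicitly studies an LLM application); 2) 'Post-training/SFT' (the paper uses supervised fine-tuning); 3) 'RLHF' (the paper uses reinforcement learning with comparison-based rewards); 4) 'AI for Science' (the method is applied to scientific paper evaluation). Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the limitations of isolated scoring in LLM-based scientific paper evaluation, the paper proposes CNPE, a comparison-based collaborative ranking framework that combines graph-based similarity sampling, supervised fine-tuning, and reinforcement learning for more robust paper quality assessment. Experiments show an average relative improvement of 21.8% over the DeepReview-14B baseline.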

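Abstract translation

Large language models (LLMs) are currently applied to academic paper evaluation mainly by assigning each paper an absolute score independently. However, because scoring standards differ across conferences, time periods, and evaluation criteria, models trained on absolute scores tend to fit narrow, context-specific rules instead of developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. Specifically, we design the Comparison-Native framework for Paper Evaluation (CNPE), which builds comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm that helps sample more informative and discriminative paper pairs from a collection. We then strengthen the model's relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model compares sampled paper pairs and aggregates the resulting preference signals into a global relative quality ranking. Experiments show that the framework achieves an average relative improvement of **21.8%** over the strong baseline DeepReview-14B and generalizes well to five unseen datasets. The code is open source: https://github.com/ECNU-Text-Computing/ComparisonReview.

Abstract (original)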

Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design Comparison-Native framework for Paper Evaluation (CNPE), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Code: https://github.com/ECNU-Text-Computing/ComparisonReview.
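
The abstract says that pairwise preference signals are aggregated into a global ranking but does not say how. As a rough illustration only, the sketch below assumes the simplest rule, ranking papers by their win rate over the sampled pairs. The function name `rank_by_win_rate` and the paper IDs are hypothetical and do not come from the CNPE code.

```python
from collections import defaultdict


def rank_by_win_rate(comparisons):
    """Aggregate pairwise outcomes into a global ranking.

    comparisons: iterable of (winner_id, loser_id) tuples, one per judgment.
    Win-rate aggregation is an assumption made for this sketch; it is not
    necessarily the rule CNPE uses.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Highest win rate first; ties are broken by paper ID for determinism.
    return sorted(games, key=lambda p: (-wins[p] / games[p], p))


# Hypothetical judgments over three papers.
judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
ranking = rank_by_win_rate(judgments)
print(ranking)
```

Win rate ignores the strength of the opponents a paper was compared against; a Bradley-Terry style fit would be a natural alternative when the sampled pairs are unevenly distributed.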

Keywords: Large Language Models, Paper Evaluation, Collaborative Ranking, Comparison-Native Framework, Supervised Fine-tuning, Reinforcement Learning, Scientific Paper Assessment, Relative Quality Ranking

129. ❌ KA2L: A Knowledge-Aware Active Learning Framework for LLMs

Authors: Haoxuan Yin, Bojian Liu, Chen Tang, Yangfan Wang, Lian Yan, Jingchi Jiang Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

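Scoring rationale: KA2L focuses on fine-tuning and active learning for LLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points each). The paper assesses LLMs' knowledge mastery by analyzing hidden states, which is part of model understanding and optimization, but it does not directly involve the other keywords (MoE, SLMs, Scaling Laws, Instruction Tuning, RLHF, PEFT, RAG, inference acceleration, hallucination mitigation, explainable AI, AI for Science, and so on), so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Knowledge-Aware Active Learning framework (KA2L) that analyzes LLM hidden states to assess knowledge mastery and generate unknown questions, improving model performance while cutting annotation and computation costs by 50%.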

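Abstract translation

Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to improve their performance. However, little research has examined how deeply LLMs understand domain-specific knowledge, and targeted active learning methods for improving their expertise are lacking. To fill this gap, we propose the Knowledge-Aware Active Learning (KA2L) framework. The framework uses latent space analysis to assess how well an LLM has mastered specific knowledge points, which helps construct questions the model cannot answer or does not know. This active learning strategy improves training efficiency by focusing on knowledge the model has not yet mastered, minimizing repeated learning of information it already has. The study employs a knowledge distribution probing technique that analyzes the hidden states of specific Transformer layers to identify how known and unknown knowledge is distributed within the LLM. We also propose a hidden-state decoding method that generates large numbers of unknown questions in natural language from the latent knowledge space. In the experiments, we select nine open-source LLMs to validate the framework. The results show that, on two open-domain datasets and one vertical-domain dataset, KA2L reduces annotation and computation costs by 50% while achieving better performance, offering useful insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.

Abstract (original)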

Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is a paucity of research on the depth of domain-specific knowledge comprehension by LLMs and the application of targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. This framework assesses LLMs’ mastery of specific knowledge points to aid in constructing unanswerable or unknowable questions through latent space analysis. This active learning strategy enhances training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundancy in learning already acquired information. This study innovatively employs a knowledge distribution probing technique to examine the hidden states of specific Transformer layers and identify the distribution of known and unknown knowledge within the LLM. Additionally, a hidden-state decoding method is proposed to generate numerous unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only significantly reduces 50% annotation and computation costs across two open-domain and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.

Keywords: Large Language Models, Fine-tuning, Active Learning, Knowledge-Aware, Hidden States, Transformer Layers, Domain-Specific Knowledge, Training Efficiency

130. ❌ Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

Authors: Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17558v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

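Scoring rationale: The paper's core contribution is the Zipper-LoRA framework, a novel improvement on LoRA (Parameter-efficient Fine-tuning), so it is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (15 points). The paper builds on Speech-LLMs (large language models applied to speech) and is directly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points each). It involves multilingual adaptation, which is somewhat related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points). Other keywords, such as MoE, SLMs, RAG, and reasoning methods, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the stability-plasticity dilemma caused by imbalanced data distributions in multilingual speech recognition, the paper proposes Zipper-LoRA, which dynamically decouples LoRA updates into shared and language-specific subspaces. In a 12-language mixed-resource setting it consistently outperforms fully shared and fully independent baselines, particularly in extremely low-resource scenarios.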

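Abstract translation

Speech Large Language Models (Speech-LLMs), which align speech encoders with large language models, have become a powerful approach to automatic speech recognition (ASR). However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such cases a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative cross-lingual interference for low-resource languages, while fully language-specific tuning limits the beneficial cross-lingual knowledge transfer that low-resource tasks need. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. Using a lightweight language-conditioned router, Zipper-LoRA dynamically controls each subspace's contribution at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start, which significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and fully independent baselines, especially in extremely low-resource scenarios. We also show that these gains hold across both chunked and non-chunked encoder configurations, confirming the framework's reliability for practical large-scale multilingual ASR. Our code and data will be released at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.

Abstract (original)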

Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework’s reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
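
The abstract describes rank-level mixing of shared and language-specific LoRA subspaces but gives no formula. The sketch below is one possible reading, stated as an assumption: each subspace is a rank-r factorization (B, A), and a per-rank gate g[k] in [0, 1] weights rank k of the shared update while 1 - g[k] weights rank k of the language-specific update. In the paper the gate would come from the language-conditioned router; here it is passed in by hand. The function name `gated_lora_delta` and all values are hypothetical.

```python
def gated_lora_delta(shared, specific, gate):
    """Mix two rank-r LoRA updates rank by rank (illustrative assumption).

    shared, specific: (B, A) pairs; B is d_out x r and A is r x d_in,
    both given as lists of lists.
    gate: r values in [0, 1]; gate[k] weights rank k of the shared update,
    1 - gate[k] weights rank k of the language-specific update.
    """
    (B_s, A_s), (B_l, A_l) = shared, specific
    d_out, d_in, r = len(B_s), len(A_s[0]), len(gate)
    delta = [[0.0] * d_in for _ in range(d_out)]
    for k in range(r):
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += gate[k] * B_s[i][k] * A_s[k][j]
                delta[i][j] += (1.0 - gate[k]) * B_l[i][k] * A_l[k][j]
    return delta


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
# A 0/1 gate: rank 0 is fully shared, rank 1 is fully language-specific.
delta = gated_lora_delta((identity, identity), (doubled, identity), [1.0, 0.0])
print(delta)
```

A gate restricted to 0 and 1 gives strict decoupling per rank, while fractional values blend the two subspaces; whether this matches the paper's Hard and Soft variants is not something the abstract confirms.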

Keywords: Speech-LLMs, Multilingual ASR, Parameter-Efficient Fine-Tuning, LoRA, Zipper-LoRA, Stability-Plasticity Dilemma, Dynamic Parameter Decoupling, Language-Conditioned Router

131. ❌ AURORA Model of Formant-to-Tongue Inversion for Didactic and Clinical Applications

Authors: Patrycja Strycharczuk, Sam Kirkham Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

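Scoring rationale: The paper studies an acoustic-articulatory model from phonetics (AURORA) that predicts tongue position from formants. It is a traditional computational phonetics and biomedical engineering application and does not involve large models, deep learning principles, or novel AI for Science research. All of the keywords concern large models, deep learning techniques, or AI applications in science, whereas this paper presents a traditional domain-specific computational model, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents the AURORA model, which predicts tongue displacement and shape from a vowel's first two formant values, and develops a didactic tool and a prototype of real-time biofeedback software.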

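Abstract translation

This paper describes the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. The model predicts tongue displacement and tongue shape in vowel articulation from the first two formant values. It is designed both as a didactic aid that explains the relationship between formants and the underlying articulation and as a foundation for biofeedback applications. The model is built on ultrasound tongue imaging and acoustic data from 40 native speakers of English. The paper discusses the motivation for the model, the modelling objectives, and the architecture, and gives a qualitative evaluation focused on selected tongue features. It then introduces two tools developed to make the model more widely accessible: an interactive Shiny app and prototype software for real-time tongue biofeedback. Potential users include students of phonetics, researchers in fields adjacent to phonetics, and speech and language therapy practitioners and clients.

Abstract (original)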

This paper outlines the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. AURORA predicts tongue displacement and shape in vowel sounds based on the first two formant values. It is intended as a didactic aid helping to explain the relationship between formants and the underlying articulation, as well as a foundation for biofeedback applications. The model is informed by ultrasound tongue imaging and acoustic data from 40 native speakers of English. In this paper we discuss the motivation for the model, the modelling objectives as well as the model architecture. We provide a qualitative evaluation of the model, focusing on selected tongue features. We then present two tools developed to make the model more accessible to a wider audience, a Shiny app and a prototype software for real-time tongue biofeedback. Potential users include students of phonetics, linguists in fields adjacent to phonetics, as well as speech and language therapy practitioners and clients.

Keywords: AURORA model, formant-to-tongue inversion, tongue displacement prediction, acoustic-articulatory modeling, ultrasound tongue imaging, real-time biofeedback, phonetics education, speech therapy applications

132. ❌ Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

Authors: Mengyu Bu, Yang Feng Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

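Scoring rationale: The paper studies how to extend the multilingual capability of LLMs by composing them with pretrained translation models (the XBridge architecture) to address multilingual imbalance. Highly relevant keyword: LLMs (the core object of study, 10 points). Moderately relevant: Pre-training (it uses pretrained translation models), Post-training/SFT (it fine-tunes mapping layers), and PEFT (the lightweight mapping layers are a form of parameter-efficient fine-tuning), at 5 points each. Other keywords, such as MoE, SLMs, RAG, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the imbalanced multilingual performance of LLMs, the paper proposes XBridge, an architecture that composes LLMs with pretrained translation models and aligns their semantics through lightweight mapping layers, improving multilingual understanding and generation for low-resource and unseen languages.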

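Abstract translation

Large language models (LLMs) show strong general intelligence, but their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to connect this knowledge reliably to low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already have balanced multilingual capability, which makes them a natural complement to LLMs. This paper proposes XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models while keeping the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, achieving fine-grained semantic consistency for multilingual generation. Experiments on four LLMs covering multilingual understanding, reasoning, summarization, and generation show that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.

Abstract (original)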

Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
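
The abstract mentions an optimal transport-based alignment objective without naming a solver or cost function. For readers unfamiliar with the idea, the sketch below shows a generic entropic optimal transport plan computed with Sinkhorn iterations between two small sets of tokens with uniform mass. This is a standard textbook technique swapped in for illustration, not XBridge's objective; the function name `sinkhorn_plan`, the cost matrix, and the regularization value are all hypothetical.

```python
import math


def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic optimal transport plan between two uniform distributions.

    cost: n x m matrix of pairwise costs (for example, distances between
    token representations from two models).
    reg: entropic regularization strength; smaller values give sharper plans.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m  # uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two tokens on each side; matching tokens are cheap to align.
plan = sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]])
print(plan)
```

The resulting plan puts almost all mass on the low-cost diagonal, which is the soft token-to-token matching an alignment loss could then build on.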

Keywords: Large Language Models, Multilingual, Encoder-Decoder Translation Models, Compositional Architecture, Semantic Alignment, Low-resource Languages, Knowledge Processing, Cross-model Mapping

133. ❌ Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination

Authors: Cem Uluoglakci, Tugba Taskaya Temizel Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17504v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

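Scoring rationale: The paper studies LLM hallucination. It builds the HypoTermInstruct SFT dataset and the HypoTermQA-Enhanced benchmark and runs LoRA SFT experiments on Llama3.1-8B and Gemma3-4B, reducing hallucination and improving factuality scores. It is therefore highly relevant to LLMs, SFT, LoRA, and Hallucination Mitigation (10 points each); somewhat relevant to Instruction Tuning, Self-Correction, and Mechanistic Interpretability (5 points each); and unrelated to other keywords such as MoE, SLMs, and RAG (0 points).

!!! tip deepseek-chat TL;DR

By creating a dedicated training dataset and benchmark and applying LoRA SFT, the paper reduces hallucination in large language models and improves factual accuracy while preserving general capabilities.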

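Abstract translation

Large language models (LLMs) often hallucinate, generating fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards models for always responding. We introduce HypoTermInstruct, an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility, that is, the ability to recognize the limits of their own knowledge and admit uncertainty. It does so by asking questions about non-existent “hypothetical” terms. We also release HypoTermQA-Enhanced, a benchmark for assessing hallucination tendency that is strengthened through multiple validations. We ran 800 controlled LoRA SFT experiments on Llama3.1-8B and Gemma3-4B (base and instruct versions), testing 100 fine-tuning configurations with paired controls. The results show that replacing generic instruction data with HypoTermInstruct significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%) while keeping MMLU performance stable (minimal decreases of 0.26% to 0.35%). Our work shows that targeted, high-quality SFT data that teaches meta-cognitive skills can effectively reduce hallucination without preference learning or reinforcement learning (RL) pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.

Abstract (original)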

Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce HypoTermInstruct, an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility: the ability to recognize the limits of their own knowledge and admit uncertainty. This is achieved through questions about non-existent “hypothetical” terms. We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations. We conducted 800 controlled LoRA SFT runs across Llama3.1-8B and Gemma3-4B (base and instruct), testing 100 fine-tuning configurations with paired controls. Our results demonstrate that replacing generic instruction data with HypoTermInstruct significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable performance on MMLU (minimal decreases of 0.26% to 0.35%). Our work demonstrates that targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.

Keywords: Large Language Models, Hallucination, Supervised Fine-tuning, LoRA, Epistemological Humility, FactScore, Parameter-efficient Fine-tuning, HypoTermInstruct

134. ❌ Learning When to Attend: Conditional Memory Access for Long-Context LLMs

Authors: Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

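Scoring rationale: The paper studies efficient attention for long-context LLMs. It is highly relevant to 'Large Language Models', 'Context Window Extension', and 'KV Cache Compression' (10 points each), and it reports speedups over FlashAttention. It is relevant to 'Pre-training' and 'Post-training' (8 points each) because it involves continued pretraining and post-training pruning. It is somewhat relevant to 'Mixture of Experts', 'Retrieval-Augmented Generation', 'Quantization', and 'Speculative Decoding' (5 points each) because it touches on sparse models, retrieval, model compression, and inference acceleration. Other keywords, such as SLMs, alignment, and reasoning methods, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address LLMs' difficulty generalizing beyond their pretraining context length, the paper proposes L2A, a conditional memory access layer that decides when to invoke global attention. It extends the effective context length to 128K tokens while skipping global attention for about 80% of tokens, which improves training throughput and reduces KV cache memory.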

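Abstract translation

Language models struggle to generalize beyond their pretraining context length, which limits long-horizon reasoning and retrieval. Continued pretraining on long-context data can mitigate this, but it is expensive because of the quadratic complexity of Attention. We observe that most tokens do not need (Global) Attention over the entire sequence and can rely on local context alone. Based on this, we propose L2A (Learning To Attend), a layer that provides conditional (token-wise) long-range memory access by dynamically deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. While skipping Global Attention for about 80% of tokens, L2A stays within 3% of the performance of standard long-context training and outperforms existing baselines. We also design custom Triton kernels that implement this token-wise conditional Attention efficiently on GPUs, achieving up to about 2x improvements in training throughput and time-to-first-token over FlashAttention. In addition, L2A supports post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.

Abstract (original)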

Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for ~80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to ~2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
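
The core idea in the abstract, that each token either attends over a local window or over the whole preceding sequence, can be sketched in a few lines. The version below is a plain-Python illustration under stated assumptions: the per-token decision is passed in as a boolean list `use_global` (in L2A it is learned), attention is single-head and causal, and the window size is arbitrary. The function names and toy values are hypothetical and unrelated to the paper's Triton kernels.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def conditional_attention(queries, keys, values, use_global, window):
    """Causal attention where token t sees the full prefix only if
    use_global[t] is True, and the last `window` tokens otherwise."""
    outputs = []
    for t, query in enumerate(queries):
        start = 0 if use_global[t] else max(0, t - window + 1)
        outputs.append(attend(query, keys[start:t + 1], values[start:t + 1]))
    return outputs


# Zero queries make attention uniform, so outputs are easy to check by hand.
queries = [[0.0], [0.0], [0.0]]
keys = [[1.0], [2.0], [3.0]]
values = [[3.0], [0.0], [6.0]]
outputs = conditional_attention(queries, keys, values,
                                use_global=[False, False, True], window=1)
print(outputs)
```

Only the third token pays for attention over the full prefix here; the first two touch a single key each, which is where the compute savings described in the abstract would come from.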

Keywords: Long-context LLMs, Conditional Attention, Global Attention, KV Cache Compression, Training Efficiency, Context Window Extension, Sparse Attention, FlashAttention

135. ❌ Humans and transformer LMs: Abstraction drives language learning

Authors: Jasper Jian, Christopher D. Manning Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

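Scoring rationale: The paper studies how a transformer language model (GPT-2 small) learns linguistic categories and compares this with theories of human language acquisition. Core relevant keywords: 1) 'Large Language Models OR LLMs OR Foundation Models' (10 points): the paper explicitly studies a transformer-based LM (GPT-2), which is its core object of study. 2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points): the paper analyzes the model's learning trajectory over the course of training, which bears directly on learning during pretraining. 3) 'Mechanistic Interpretability OR Explainable AI' (10 points): the paper tracks learning trajectories with novel metrics to understand how the LM learns, which falls under interpretability research. Other keywords, such as MoE, SFT, RAG, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

By comparing the behaviour of a transformer language model (GPT-2 small) during training with theories of human language acquisition, the study finds that abstract class-level behaviour appears earlier than lexical item-specific behaviour and that different linguistic behaviours emerge abruptly in sequence during training, indicating that abstraction plays a key role in how language models learn.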

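Abstract translation

Categorization is a core component of human linguistic competence. This study investigates how language models acquire linguistic categories by comparing the behaviour of a transformer-based language model (LM) during training with two theoretical accounts of human language acquisition, one based on abstract features and one based on concrete exemplars. Using novel metrics based on distributional divergence, we track the learning trajectories of next-token prediction distributions to examine how lexical semantic and syntactic categories emerge. In experiments with GPT-2 small, we find that (i) when a linguistic construction is acquired, behaviour at the abstract class level appears earlier than behaviour at the level of specific lexical items, and (ii) different linguistic behaviours emerge abruptly and in sequence during training, showing that abstraction plays a key role in how language models learn. This result sheds new light on the models of human language acquisition for which language models may serve as an existence proof.

Abstract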

Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature-based and concrete exemplar-based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item-specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.
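
To make the idea of a divergence-based learning trajectory concrete, the following is a minimal sketch that compares next-token distributions from several training checkpoints against the final checkpoint. The abstract does not say which divergence the authors use, so Jensen-Shannon divergence is an assumed stand-in, and the checkpoint steps, the four-word vocabulary, and the probabilities are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a 4-word vocabulary for the
# same context, recorded at three training steps (illustrative values only).
checkpoints = {
    1000: [0.25, 0.25, 0.25, 0.25],
    5000: [0.40, 0.30, 0.20, 0.10],
    20000: [0.70, 0.20, 0.05, 0.05],
}

final = checkpoints[20000]
# Trajectory: distance of each checkpoint's distribution from the final one.
trajectory = {step: js_divergence(dist, final) for step, dist in checkpoints.items()}
```

A trajectory like this shrinks toward zero as the checkpoint's predictions approach the final model's; an abrupt drop between two steps is the kind of signal the abstract describes as sudden emergence.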

Keywords: transformer language model, language acquisition, abstraction, learning trajectories, GPT-2, linguistic categories, next-token distributions, human language learning

136. ❌ TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL

Authors: Tingcheng Bian, Jinchang Luo, Mingquan Cheng, Jinyu Zhang, Xiaoling Xia, Ni Li, Yan Tao, Haiwei Wang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17449v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
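
As a reading aid for the score tables in this report, here is a minimal sketch of how a total and a pass/fail mark could be derived. The passing-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0) is the one stated in the report configuration. The per-row rule `score = weight × relevance` and the names `passing_score`, `row_score`, and `evaluate` are assumptions made for illustration, not the report generator's actual code.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report configuration."""
    return base + factor * total_weight


def row_score(weight, relevance):
    """Assumption: a row's score is its weight times its relevance."""
    return weight * relevance


def evaluate(rows, total_weight):
    """Return (total score, passed?) for a list of (weight, relevance) rows."""
    total = sum(row_score(w, r) for w, r in rows)
    return total, total >= passing_score(total_weight)


# (weight, relevance) pairs for the non-zero relevance rows of the table
# above; every weight in that table is listed as 0.0.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0), (0.0, 5.0)]
total, passed = evaluate(rows, total_weight=27.0)
```

Under this assumed rule, a weight of 0.0 on every row gives a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this entry.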

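Scoring rationale: The paper's core topic is the redundancy of Chain of Thought sequences in large language models (LLMs) on complex reasoning tasks. It proposes a theoretical metric, Minimal Sufficient Length (MSL), and develops the TRiMS method, based on reinforcement learning (the GRPO algorithm), to compress reasoning chains, achieving a token reduction of more than 80%. It is therefore highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points), and somewhat relevant to 'System 2 Thinking' (in-depth reasoning) and 'Speculative Decoding' (inference efficiency) (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the computational redundancy caused by overly long chain-of-thought sequences when large language models perform complex reasoning, the paper proposes a theoretical metric, Minimal Sufficient Length (MSL), and the reinforcement-learning-based TRiMS method, which compresses reasoning tokens by more than 80% while slightly improving accuracy.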

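Abstract translation

Large language models have achieved breakthroughs on complex reasoning tasks through long chain-of-thought sequences, but this often leads to severe reasoning inflation and substantial computational redundancy. To maximize intelligence per token, we introduce a theoretical metric, MSL (Minimal Sufficient Length). MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We give a recursive definition based on independently sampled sequences and prove that its limit exists, establishing the first measurable lower bound for reasoning-chain compression. By analyzing mainstream chain-of-thought compression strategies, we identify the key structural factors that allow a model to approach MSL. Based on these findings, we propose TRiMS, which combines the GRPO algorithm with MSL-based estimation during training, while mitigating training instability through dynamic batch aggregation and advantage computation using the batch-level standard deviation. TRiMS achieves a chain-of-thought token reduction of more than 80% on all benchmarks, together with a slight improvement in accuracy.

Abstract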

Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL-Minimal Sufficient Length. MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.
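
The abstract mentions advantage computation using a batch-level standard deviation but gives no formula. The sketch below shows one plausible reading, stated as an assumption: rewards are centered on their own group's mean, as in group-relative methods such as GRPO, and then scaled by the standard deviation of all rewards in the batch rather than of each group. The function name `batch_std_advantages` and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def batch_std_advantages(groups, eps=1e-8):
    """Group-centered advantages scaled by a batch-level standard deviation.

    groups: one list of scalar rewards per prompt (one group per prompt).
    Assumption: each reward is centered on its group mean, then divided by
    the population standard deviation of every reward in the batch.
    """
    all_rewards = [r for group in groups for r in group]
    batch_std = pstdev(all_rewards)
    return [
        [(r - mean(group)) / (batch_std + eps) for r in group]
        for group in groups
    ]


# Hypothetical rewards: two prompts, four sampled responses each.
rewards = [[1.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
adv = batch_std_advantages(rewards)
```

One motivation for a shared scale, under this reading, is that a group whose rewards are nearly identical has a tiny group-level standard deviation, so dividing by it would inflate small differences; a batch-level denominator avoids that.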

Keywords: Large Language Models, Chain of Thought, Reasoning Compression, Minimal Sufficient Length, Reinforcement Learning, GRPO Algorithm, Computational Efficiency, Token Reduction

137. ❌ Argument Reconstruction as Supervision for Critical Thinking in LLMs

Authors: Hyun Ryu, Gyouk Chu, Gregor Betz, Eunho Yang, Carolyn Rose, Sean Welleck Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

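Scoring rationale: The paper's core topic is improving the critical thinking of LLMs through training on argument reconstruction, which directly involves LLMs and SFT training methods (10 points). Critical thinking tasks relate to multi-step reasoning and in-depth thinking (8 points). The notion of self-improvement is somewhat relevant (5 points). Other keywords such as MoE, quantization, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

The study examines whether training LLMs to learn argument reconstruction improves their performance on seven critical thinking tasks; the experimental results show that models that learn argument reconstruction perform better.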

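Abstract translation

To develop the ability to think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes the implicit reasoning behind an argument explicit. However, it remains unclear whether large language models can similarly improve their critical thinking ability by learning to reconstruct arguments. To investigate this question, we propose a holistic framework with three contributions. We (1) design an engine that automatically reconstructs arbitrary arguments (GAAR), (2) use this engine to synthesize a new high-quality argument reconstruction dataset (Arguinas), and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Experimental results show that, across seven critical thinking tasks, models trained on argument reconstruction outperform models without such training, with the largest gains observed when training on the proposed Arguinas dataset. The source code and dataset will be made publicly available.

Abstract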

To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument’s underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.

Keywords: Argument Reconstruction, Critical Thinking, Large Language Models, Supervised Fine-tuning, Reasoning, Dataset, GAAR Engine, Arguinas

138. ❌ PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval

Authors: Guangzhi Wang, Xiaohui Yang, Kai Li, Jiawen He, Kai Yang, Ruixuan Zhang, Zhi Liu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

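Scoring rationale: The paper 'PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval' focuses on building a diagnostic evaluation benchmark for the person-job matching retrieval task, covering dataset construction, evaluation methodology, domain analysis, and module-level diagnosis. Although the paper belongs to an applied AI area (recruitment retrieval), all of the keywords focus on the technical principles, training methods, reasoning techniques, or optimization techniques of large models and deep learning, or on specific scientific applications (such as bioinformatics). This paper does not involve any large-model technique, deep learning innovation, or concrete AI for Science content, and runs its experiments only with conventional dense retrieval models, so every keyword is entirely irrelevant and receives 0 points.

!!! tip deepseek-chat TL;DR

To address the lack of systematic diagnostic capability in person-job matching retrieval, the paper builds PJB, a reasoning-aware evaluation benchmark based on real recruitment data. Its experiments find that performance differences across industry domains far exceed the gains from upgrading model modules, and that the query understanding and reranking modules face fundamentally different improvement bottlenecks.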

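Abstract translation

As retrieval models converge on generic benchmarks, the pressing question is no longer "who scores higher" but "where do systems fail, and why?" Person-job matching is a domain that urgently needs this kind of diagnostic capability: it requires systems not only to verify explicit constraints but also to perform skill-transfer inference and job-competency reasoning, yet existing benchmarks offer no systematic diagnostic support for this task. We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset. It uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, is grounded in nearly 200,000 real-world recruitment records spanning six industry domains, and uses diagnostic labels for domain family and reasoning type to move evaluation from "who scores higher" to "where do systems differ, and why". Diagnostic experiments based on dense retrieval show that performance heterogeneity across industry domains far exceeds the gains from module upgrades to the same model, which demonstrates that relying on aggregate scores alone can seriously mislead optimization decisions. At the module level, reranking brings stable improvements, whereas query understanding not only fails to help but actually lowers overall performance when combined with reranking; the two modules face fundamentally different improvement bottlenecks. The value of PJB lies not in providing yet another leaderboard of average scores, but in drawing a capability map for recruitment retrieval systems that pinpoints where to invest development effort.

Abstract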

As retrieval models converge on generic benchmarks, the pressing question is no longer “who scores higher” but rather “where do systems fail, and why?” Person-job matching is a domain that urgently demands such diagnostic capability – it requires systems not only to verify explicit constraints but also to perform skill-transfer inference and job-competency reasoning, yet existing benchmarks provide no systematic diagnostic support for this task. We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset that uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, is grounded in real-world recruitment data spanning six industry domains and nearly 200,000 resumes, and upgrades evaluation from “who scores higher” to “where do systems differ, and why” through domain-family and reasoning-type diagnostic labels. Diagnostic experiments using dense retrieval reveal that performance heterogeneity across industry domains far exceeds the gains from module upgrades for the same model, indicating that aggregate scores alone can severely mislead optimization decisions. At the module level, reranking yields stable improvements while query understanding not only fails to help but actually degrades overall performance when combined with reranking – the two modules face fundamentally different improvement bottlenecks. The value of PJB lies not in yet another leaderboard of average scores, but in providing recruitment retrieval systems with a capability map that pinpoints where to invest.

Keywords: Person-Job Retrieval, Reasoning-Aware Benchmark, Diagnostic Evaluation, Dense Retrieval, Industry Domain Heterogeneity, Query Understanding, Reranking, Recruitment Systems

139. ❌ SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

Authors: Rima Hazra, Bikram Ghuku, Ilona Marchenko, Yaroslava Tokarieva, Sayan Layek, Somnath Banerjee, Julia Stoyanovich, Mykola Pechenizkiy Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

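Scoring rationale: The paper's core topic is evaluating the pedagogical safety of large language models used as AI tutors, which is highly relevant to 'Large Language Models' (10 points). The paper involves the application of AI in education, which is somewhat relevant to 'AI for Science' (8 points). The paper mentions model safety and factuality issues, which is weakly relevant to 'Hallucination Mitigation' (5 points). Other keywords (such as MoE, Scaling Laws, and RLHF) are not covered in the paper and all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the pedagogical safety of large language models in AI tutoring systems and finds that all models exhibit broad pedagogical harm, that multi-turn dialogue significantly worsens teaching behaviour, and that harms vary by subject.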

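Abstract translation

Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation and fail to capture whether a model can be both pedagogically effective and safe in student-tutor interaction. We argue that tutoring safety differs fundamentally from conventional LLM safety: the main risk is not harmful content but the hidden erosion of the learning process through answer over-disclosure, reinforcement of misconceptions, and the absence of scaffolding. To study this failure mode systematically, we introduce the SafeTutors benchmark, which jointly evaluates safety and pedagogical effectiveness in mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy built from the learning-science literature, covering 11 harm dimensions and 48 sub-risks. We find that all models exhibit broad harm; that increasing model scale does not bring reliable improvement; and that multi-turn dialogue worsens model behaviour, with the pedagogical failure rate rising from 17.7% to 77.8%. The degree of harm also varies by subject, so mitigations need to be discipline-aware, and single-turn "safe/helpful" evaluation results can mask systematic tutor failure over extended interaction.

Abstract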

Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn’t reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn “safe/helpful” results can mask systematic tutor failure over extended interaction.

Keywords: AI tutoring systems, pedagogical safety, large language models, benchmark evaluation, multi-turn dialogue, harm taxonomy, mathematics physics chemistry, scaffolding abdication

140. ❌ PACE-RAG: Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation

Authors: Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Jinse Park, Jong Chul Ye Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

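Scoring rationale: The paper's core contribution is the PACE-RAG framework, which combines LLMs with RAG techniques for clinical drug recommendation and falls within the AI for Science (bioinformatics) area. It is therefore highly relevant to 'Large Language Models', 'Retrieval-Augmented Generation', and 'AI for Science' (10 points). The abstract mentions generating an 'explainable clinical summary', which is somewhat relevant to 'Explainable AI' (5 points). Other keywords such as MoE, SFT, and quantization are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the difficulty that existing LLMs and RAG methods have in capturing the subtle differences between individual patients in clinical drug recommendation, the study proposes the PACE-RAG framework, which combines patient context with the prescribing patterns of similar cases and achieves state-of-the-art performance on a Parkinson's disease cohort and the MIMIC-IV benchmark.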

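Abstract translation

Drug recommendation requires a deep understanding of the individual patient's situation, especially for complex conditions such as Parkinson's disease. Although large language models have broad medical knowledge, they struggle to capture the subtle differences in actual prescribing patterns. Existing retrieval-augmented generation methods are limited by the same complexities: guideline-based retrieval is too generic, and similar-patient retrieval tends to replicate majority patterns without sufficiently accounting for the unique clinical details of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to analyze individual patient context together with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson's disease cohort and the MIMIC-IV benchmark dataset using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieves state-of-the-art performance, with F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded system for personalized decision support. The code is open source: https://github.com/ChaeYoungHuh/PACE-RAG.

Abstract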

Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson’s disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson’s cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.
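
Since the headline results are F1 scores, a small worked example may help readers interpret them. The abstract does not state how F1 is aggregated (per patient, micro, or macro), so the sketch below assumes a per-patient, set-based F1 between recommended and actually prescribed drugs. The function name `set_f1` and the placeholder drug names are hypothetical.

```python
def set_f1(predicted, reference):
    """F1 between a set of recommended items and a set of reference items.

    Assumption: each item counts once, and F1 is the harmonic mean of
    precision and recall over the two sets.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # nothing to recommend and nothing recommended
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Placeholder drug names, for illustration only.
recommended = {"drug_a", "drug_b"}
prescribed = {"drug_b", "drug_c"}
score = set_f1(recommended, prescribed)  # one of two correct on each side
```

Here one of the two recommendations matches one of the two prescriptions, so precision and recall are both 0.5 and the F1 is 0.5.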

Keywords: PACE-RAG, clinical drug recommendation, patient-aware contextual, evidence-based policy, Retrieval-Augmented Generation, Parkinson's disease, personalized decision support, explainable clinical summary

141. ❌ Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity

Authors: Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, Lei Jiang, Hayden Kwok-Hay So, Ngai Wong Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

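Scoring rationale: The paper focuses on layer-wise mixed-precision quantization (LMPQ) and proposes NSDS, a calibration-free quantization framework based on numerical and structural dual sensitivity. Its core subject is model compression and quantization, so it is highly relevant only to the keyword 'Quantization OR Model Compression OR Low-bit Weights' (10 points), because the paper directly studies mixed-precision quantization methods. The other keywords concern large-model architectures, training methods, inference optimization, alignment techniques, agent systems, scientific applications, and so on, none of which the paper covers, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes NSDS, a calibration-free layer-wise mixed-precision quantization framework based on numerical and structural dual sensitivity. By decomposing each layer into operational roles and quantifying their sensitivity, it achieves effective model compression under extreme low-bit settings and outperforms existing baselines across a range of models and downstream tasks.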

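Abstract translation

Layer-wise mixed-precision quantization (LMPQ) achieves effective compression under extremely low bit-width settings by allocating higher precision to sensitive layers. However, existing methods usually treat all weight modules within a layer uniformly and rely on a single numerical property when estimating sensitivity, overlooking their different operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by numerical and structural dual sensitivity. Specifically, the method first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both a numerical and a structural perspective. These dual sensitivity scores are then combined into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR, which guides bit allocation. Extensive experiments show that, without relying on any calibration data, NSDS consistently outperforms a variety of baselines across diverse models and downstream tasks.

Abstract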

Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Dual-Sensitivity. Specifically, it first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both numerical and structural perspectives. These dual-aspect scores are then aggregated into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR to guide bit allocation. Extensive experiments demonstrate that NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without relying on any calibration data.
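
The abstract names MAD-Sigmoid and Soft-OR as the aggregation scheme but does not define them. The sketch below shows one plausible reading, under clearly stated assumptions: MAD-Sigmoid is taken to mean centering scores on the median, scaling by the median absolute deviation, and squashing with a sigmoid; Soft-OR is taken to be the probabilistic OR `1 - (1 - a)(1 - b)`. The top-k bit allocation rule, the function names, and the sensitivity values are all hypothetical and are not taken from the paper.

```python
import math
from statistics import median


def mad_sigmoid(scores, eps=1e-8, clip=50.0):
    """Assumed MAD-Sigmoid: robust z-score (median, MAD) passed to a sigmoid."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    normalized = []
    for s in scores:
        z = (s - med) / (mad + eps)
        z = max(-clip, min(clip, z))  # clamp to avoid overflow in exp
        normalized.append(1.0 / (1.0 + math.exp(-z)))
    return normalized


def soft_or(a, b):
    """Assumed Soft-OR: high if either input is high (probabilistic OR)."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def allocate_bits(layer_scores, n_high, high_bits=4, low_bits=2):
    """Hypothetical rule: the n_high most sensitive layers get high_bits."""
    ranked = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i], reverse=True)
    chosen = set(ranked[:n_high])
    return [high_bits if i in chosen else low_bits for i in range(len(layer_scores))]


# Illustrative per-layer sensitivities from two perspectives (made-up values).
numerical = [0.2, 0.9, 0.4, 0.3]
structural = [0.8, 0.1, 0.5, 0.2]

layer_scores = [soft_or(a, b) for a, b in zip(mad_sigmoid(numerical), mad_sigmoid(structural))]
bits = allocate_bits(layer_scores, n_high=2)
```

Median and MAD are less affected by a few extreme layers than mean and standard deviation, and an OR-style combination keeps a layer flagged as sensitive when only one of the two perspectives says so.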

Keywords: Layer-wise mixed-precision quantization, LMPQ, Numerical and Structural Dual-Sensitivity, NSDS, calibration-free, model compression, low-bit quantization, bit allocation

142. ❌ Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

Authors: Risham Sidhu, Julia Hockenmaier Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17333v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

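Scoring rationale: The paper's core topic is evaluating the ability of LLMs on spatial reasoning tasks and exploring how fine-tuning small models can match frontier model performance. It is highly relevant to 'Large Language Models' (10 points), because the paper evaluates LLMs on spatial reasoning tasks; relevant to 'Small Language Models' (8 points), because the paper mentions the potential of fine-tuning a small LM; relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (8 points), because the paper involves fully fine-tuning small models; highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because the paper explicitly mentions LoRA fine-tuning of a small LLM; and highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because the paper studies specialized embodied agents. Other keywords such as MoE, Scaling Laws, and RAG are unrelated to the paper's content (0 points).

!!! tip deepseek-chat TL;DR

The paper introduces GSU, a text-only grid dataset, to evaluate LLMs on spatial reasoning tasks such as navigation, object localization, and structure composition. It finds that frontier models can solve these tasks, and that fully fine-tuning a small LM or LoRA fine-tuning a small LLM can match frontier model performance, offering a new route to specialized embodied agents.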

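Abstract translation

We introduce GSU, a text-only grid dataset for evaluating the spatial reasoning capabilities of large language models on three core tasks: navigation, object localization, and structure composition. By dropping visual inputs and thereby separating spatial reasoning from perception, we find that although most models grasp basic grid concepts, they still struggle with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists. The study also shows that exposure to a visual modality does not give vision-language models a generalizable understanding of 3D space that they can use for these tasks. Finally, we find that although the latest frontier models can solve the existing tasks (though harder variants may still stump them), both full-parameter fine-tuning of a small language model and LoRA fine-tuning of a small LLM show potential to match frontier model performance, which suggests a new research direction for developing specialized embodied agents.

Abstract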

We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LORA fine-tuning a small LLM show potential to match frontier model performance, suggesting an avenue for specialized embodied agents.
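
To illustrate what a frame of reference relative to an embodied agent involves on a text grid, here is a minimal sketch that converts an egocentric relation ("left", "front", and so on) into an absolute grid cell. The coordinate convention (x grows east, y grows north), the names `HEADINGS` and `target_cell`, and the example positions are assumptions for illustration; the abstract does not describe GSU's actual task format.

```python
# Forward unit vectors for each heading; x grows east, y grows north (assumed).
HEADINGS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}


def target_cell(position, heading, relation):
    """Return the cell one step away from `position` in an egocentric direction.

    position: (x, y) of the agent; heading: key of HEADINGS;
    relation: "front", "behind", "left", or "right" from the agent's viewpoint.
    """
    fx, fy = HEADINGS[heading]
    rx, ry = fy, -fx  # right-hand vector: forward rotated 90 degrees clockwise
    offsets = {
        "front": (fx, fy),
        "behind": (-fx, -fy),
        "right": (rx, ry),
        "left": (-rx, -ry),
    }
    dx, dy = offsets[relation]
    return (position[0] + dx, position[1] + dy)


# The same egocentric word maps to different absolute cells as the agent turns.
facing_north = target_cell((2, 2), "north", "left")
facing_south = target_cell((2, 2), "south", "left")
```

"Left" resolves to the west cell when the agent faces north and to the east cell when it faces south, which is the kind of viewpoint-dependent mapping the abstract reports models struggling with.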

Keywords: spatial reasoning, LLMs, grid dataset, embodied agents, fine-tuning, LoRA, visual-language models, 3D shapes

143. ❌ Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language

Authors: Gexin Zhao Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

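Scoring rationale: The paper uses three large language models (LLMs) to study systematic associations between English phonemes and meaning, which makes it an application of large models to linguistics and cognitive science. Its core is the use of LLMs as a research instrument for probing sound symbolism in language, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Because it touches linguistics and cognitive science, it can be read as AI applied to science and has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). The remaining keywords concern the technical principles, training methods, optimization techniques, or specific application scenarios of large models; the paper does not address those details, so they score 0.

!!! tip deepseek-chat TL;DR

Using large language models, the study systematically shows that individual letter-phonemes in English carry structured, multidimensional semantic signals, challenging the traditional assumption that the sound-meaning relationship in language is entirely arbitrary. It also finds that these associations can be predicted from articulatory-phonetic features and are confirmed by behavioral experiments.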

Abstract (Chinese translation)

语言学的一个基础假设认为，词语的发音与其意义之间的关系是任意的。来自语音象征研究的不断累积的证据正挑战这一观点，然而尚无研究系统性地描绘一种语言内每个音系单位的多维语义轮廓。本文研究表明，英语中的单个字母-音素携带着结构化的、多维的语义信号。通过一个涵盖所有220对字母对比的最小对立对范式，三个大型语言模型独立地在九个感知维度上恢复出一致的音素-意义关联。这些关联可由发音-语音特征系统性地预测，其中发音方式和发音部位映射到不同的语义维度。来自英语母语者的行为数据以远高于随机水平（80.8%）的准确率证实了这些模式，并且来自五种类型学上不同语言的初步跨语言证据表明，核心映射关系可推广至英语之外。我们的研究结果表明，语音-意义象似性并非偶然的奇特现象，而是语音信号中一种普遍的、结构化的属性，其系统性如此之强，以至于大型语言模型在仅接收文本输入、任务过程中未接触语音或发音的情况下，仍能将其恢复出来。

Abstract

A foundational assumption in linguistics holds that the relationship between a word’s sound and its meaning is arbitrary. Accumulating evidence from sound symbolism challenges this view, yet no study has systematically mapped the multidimensional semantic profile of every phonological unit within a language. Here we show that individual letter-phonemes in English carry structured, multidimensional semantic signals. Using a minimal-pair paradigm spanning all 220 pairwise letter contrasts, three large language models independently recover consistent phoneme-meaning associations across nine perceptual dimensions. These associations are systematically predicted by articulatory-phonetic features, with manner and place of articulation mapping onto distinct semantic dimensions. Behavioral data from English speakers confirm these patterns at rates well above chance (80.8%), and preliminary cross-linguistic evidence from five typologically diverse languages suggests that core mappings generalize beyond English. Our findings indicate that sound-meaning iconicity is not an occasional curiosity but a pervasive, structured property of the phonological signal, one so systematic that large language models recover it when given only text input, without exposure to speech or articulation during the task.

Keywords: sound symbolism, phoneme-meaning associations, large language models, articulatory-phonetic features, minimal-pair paradigm, semantic dimensions, cross-linguistic evidence, iconicity

144. ❌ LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

Authors: Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17265v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

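Scoring rationale: The paper studies structural error detection in document layout analysis and proposes the LED benchmark. Keyword relevance: (1) the paper explicitly mentions the use of LLMs in document analysis, so 'Large Language Models' is rated fairly high (8 points); (2) the paper includes a 'Hallucination' error type, which gives some relevance to 'Hallucination Mitigation' (5 points); (3) the paper emphasizes interpretable evaluation, which relates to 'Explainable AI' (5 points); (4) other keywords such as MoE, SFT, and RAG are unrelated to the paper's core content and all receive 0.

!!! tip deepseek-chat TL;DR

To address the inability of conventional overlap-based metrics to detect structural errors in document layout analysis, the paper proposes the Layout Error Detection (LED) benchmark, which defines eight standard error types and a set of evaluation tasks to enable fine-grained, interpretable assessment of the structural reasoning ability of document understanding models.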

Abstract (Chinese translation)

近年来，大型语言模型（LLMs）与大型多模态模型（LMMs）的进展推动了文档布局分析（Document Layout Analysis, DLA）的发展，但区域合并、分裂及遗漏等结构错误依然普遍存在。传统的基于重叠度的评估指标（如IoU、mAP）无法有效捕捉此类逻辑不一致问题。为克服这一局限，我们提出布局错误检测（Layout Error Detection, LED）基准，旨在超越表层精度，评估DLA预测中的结构推理能力。LED定义了八种标准化错误类型（缺失、幻觉、尺寸错误、分裂、合并、重叠、重复及分类错误），并提供了量化规则与注入算法以模拟真实错误。基于这些定义，我们构建了LED数据集，并设计了三项评估任务：文档级错误检测、文档级错误类型分类以及元素级错误类型分类。通过对前沿多模态模型的实验，LED能够对结构理解能力进行细粒度、可解释的评估，揭示不同模态与架构间的明显缺陷。总体而言，LED为诊断文档理解模型的结构鲁棒性与推理能力建立了一个统一且可解释的基准。

Abstract

Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
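The claim that overlap-based metrics miss structural errors can be illustrated with a minimal sketch of IoU. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)`; the layout coordinates and the `iou` helper are illustrative assumptions and do not come from the LED dataset.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical page: a large paragraph with a small caption directly below it.
paragraph = (0, 0, 100, 90)
caption = (0, 90, 100, 100)
merged_prediction = (0, 0, 100, 100)  # one box covering both regions (a Merge error)

iou_paragraph = iou(merged_prediction, paragraph)  # high overlap with the paragraph
iou_caption = iou(merged_prediction, caption)      # low overlap with the caption
```

In this toy case the merged box overlaps the paragraph at an IoU of about 0.9, which clears a typical matching threshold such as 0.5. An overlap metric therefore records a matched paragraph plus a missed caption, and nothing in that result says that two regions were fused into one.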

Keywords: Layout Error Detection, Document Layout Analysis, Structural Reasoning, Benchmark Evaluation, Multimodal Models, Error Simulation, Interpretable Assessment, Document Understanding

145. ❌ Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models

Authors: Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

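Scoring rationale: The paper studies neuron-level emotion control in large audio-language models (LALMs), which is novel research on the technical principles of large models. It is highly relevant to 'Large Language Models' (10 points) because LALMs extend LLMs to the audio domain. It is highly relevant to 'Mechanistic Interpretability' (10 points) because the study builds a mechanistic framework for emotion control through neuron-level analysis. It has some relevance to 'Hallucination Mitigation' (5 points) because the study touches on content fidelity problems such as hallucination and paraphrasing. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the difficulty of reliable emotion control in large audio-language models: by identifying emotion-sensitive neurons it achieves training-free emotion steering and establishes a mechanistic framework for neuron-level emotion control.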

Abstract (Chinese translation)

大型音频-语言模型（LALMs）能够生成富有表现力的语音，但可靠的情感控制仍难以实现：转换结果常偏离目标情感，并可能因拒绝、幻觉或转述而损害语言保真度。据我们所知，我们首次对语音生成LALMs中的情感控制进行了神经元层面的研究，并证明紧凑的情感敏感神经元（ESNs, emotion-sensitive neurons）具有因果可操作性，可在推理阶段实现无需训练的情感调控。ESNs通过基于成功筛选的激活聚合方法识别，该方法同时强制实现情感表达与内容保持。在三种LALMs（Qwen2.5-Omni-7B、MiniCPM-o 4.5、Kimi-Audio）上的实验表明，ESN干预能产生特定于情感的提升效果，并可泛化至未见过的说话人，自动评估与人工评估均支持这一结论。可控性取决于选择器设计、掩码稀疏度、筛选策略及干预强度。我们的研究结果为语音生成中无需训练的情感控制建立了一个机制性框架。

Abstract

Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.

Keywords: Large audio-language models, emotion control, neuron-level study, emotion-sensitive neurons, training-free steering, mechanistic framework, speech generation, content preservation

146. ❌ TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

Authors: Prajwal Panth, Agniva Maiti Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17220v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

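Scoring rationale: The core of the paper is using LLMs (Gemini) to generate synthetic data for a low-resource language (Tharu) and building on that data a specialized instruction-following model, Tharu-LLaMA (3B), to address the hallucination that existing models exhibit because of data contamination. It is therefore highly relevant to LLMs, instruction tuning, supervised fine-tuning, and hallucination mitigation (10 points each). At 3B the model counts as a small language model, and the paper stresses that it is achievable on consumer-grade hardware, so it is relevant to SLMs (8 points). Pre-training/domain adaptation (using synthetic data to adapt to a low-resource language) and data quality (analyzing the limitations of synthetic data) are partly relevant (5 to 8 points). Other keywords such as MoE, RAG, and quantization are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the data scarcity and model hallucination facing the low-resource Tharu language, the paper proposes generating synthetic data with an LLM and building a specialized instruction-following model, Tharu-LLaMA (3B). This effectively lowers perplexity and demonstrates the feasibility of using generative AI to preserve endangered languages.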

Abstract (Chinese translation)

大型语言模型（LLM）的快速扩散造成了深刻的数字鸿沟，实质上将全球南方的原住民语言排除在人工智能革命之外。塔鲁语——一种在尼泊尔和印度特莱地带约170万人使用的印度-雅利安语族方言——正是这一危机的例证。尽管拥有丰富的口头传统，塔鲁语仍面临严重的数据稀缺和语言碎片化问题，导致最先进的多语言模型在处理时频繁出现“幻觉”或默认转向印地语、尼泊尔语等占主导地位的高资源邻接语言，这源于预训练语料库中的污染。

本文提出了Tharu-LLaMA（3B），一个专门设计的指令遵循模型，旨在应对这种排斥问题。我们引入了TharuChat，这是一个通过LLM到人类的引导式流程构建的新型数据集。我们利用经过提示工程优化的Gemini模型，输入拉纳塔鲁语语法和民间传说，合成了训练数据。与精心筛选的黄金标准语料库不同，TharuChat反映了该地区嘈杂、异质的语言现实：它以拉纳塔鲁方言为主（约70%），同时融入了丹高拉和科奇拉方言的元素。我们对数据集的局限性进行了透明分析，包括方言间的语码混合以及残留的阿瓦迪语/印地语影响。通过严格的实证消融研究，我们证明尽管存在这些不完美之处，小规模合成数据仍然非常有效：将数据集规模从25%增加到100%会导致困惑度从6.42线性下降至2.88。最终模型作为一个概念验证，展示了通过生成式人工智能在消费级硬件上实现保护资源匮乏的喜马拉雅地区语言的可行性。

Abstract

The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely “hallucinate” or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via an LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of Dangaura and Kochila dialects. We provide a transparent analysis of the dataset’s limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that despite these imperfections, small-scale synthetic data is highly effective: increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof-of-concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware.
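For readers less familiar with the perplexity figures quoted in this abstract, the sketch below shows the usual definition: the exponential of the mean negative log-likelihood per token. It assumes natural-log token probabilities and that the paper follows this standard definition, which the abstract does not state; the `perplexity` helper and the sample values are illustrative and are not data from the paper.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values only: a model that assigns probability 0.25 to every token.
uniform_over_four = [math.log(0.25)] * 8
ppl = perplexity(uniform_over_four)  # as uncertain as a uniform choice among 4 tokens

# Under the same definition, a reported perplexity maps back to a mean NLL in nats.
mean_nll_at_25_percent = math.log(6.42)
mean_nll_at_100_percent = math.log(2.88)
```

Read this way, the reported drop from 6.42 to 2.88 corresponds to the mean per-token loss falling from roughly 1.86 to roughly 1.06 nats, again assuming the standard definition applies.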

Keywords: Large Language Models, Low-Resource Language, Synthetic Data, Instruction-Following Model, Hallucination Mitigation, Domain Adaptation, Small Language Models, Data Quality

147. ❌ Alignment Makes Language Models Normative, Not Descriptive

Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	15.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

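Scoring rationale: The paper's core question is how post-training alignment affects a language model's ability to predict human behavior, which is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' and 'Instruction Tuning OR Alignment OR Value Alignment' (10 and 15 points respectively). The paper explicitly names 'Post-training alignment' as its central object of study, so the 'Post-training' keyword scores 10. 'Alignment' is the paper's core theme and runs through the entire study, so it receives the maximum of 15. 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' scores 5 because alignment usually involves these techniques, although the paper does not name a specific method. 'Large Language Models OR LLMs OR Foundation Models' scores 10 because the study is based on 120 base-aligned model pairs and so falls under large-model research. The other keywords are unrelated to the paper's content and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that post-training alignment gives language models a normative bias when predicting human behavior: aligned models predict better where human behavior matches normative solutions (for example, one-shot textbook games), while base models predict better in multi-round strategic interactions, where behavior is shaped by descriptive dynamics such as reciprocity and retaliation. This reveals a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.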

Abstract (Chinese translation)

训练后对齐技术旨在优化语言模型以匹配人类偏好信号，但这一目标并不等同于对人类实际行为进行建模。我们基于多轮策略博弈场景——包括议价、说服、谈判和重复矩阵博弈——中的一万余条真实人类决策数据，对120组基础模型与对齐模型进行了比较。在这些情境中，基础模型在预测人类选择方面以接近10:1的优势稳定优于其对应对齐模型，该结果在不同模型家族、提示词设计及博弈配置中均保持稳健。然而，当人类行为更可能遵循规范性预测时，这一模式发生逆转：在所有测试的12类单次经典博弈任务以及非策略性彩票选择任务中，对齐模型均表现更优；甚至在多轮博弈内部的第一轮（尚未形成交互历史时），对齐模型同样占据优势。这种边界条件模式表明，对齐过程会引入规范性偏差：当人类行为相对符合规范性解决方案时，对齐能提升预测能力；但在多轮策略环境中，当行为受到互惠性、报复性及历史依赖适应等描述性动态机制影响时，对齐反而会损害预测准确性。这些结果揭示了在优化模型以供人类使用与将其作为人类行为代理之间存在的根本性权衡。

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

Keywords: post-training alignment, language models, human behavior prediction, normative bias, strategic games, base models, aligned models, descriptive dynamics

148. ❌ SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization

Authors: Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Charles Mackin, Ashutosh Jadhav, David Beymer, Ehsan Degan, Vandana Mukherjee Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17208v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

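Scoring rationale: SYMDIREC focuses on RTL synthesis and summarization in hardware design automation, and its core is combining neuro-symbolic methods to strengthen LLM performance. Highly relevant keywords: 'Large Language Models' (LLMs are central to the paper) and 'Retrieval-Augmented Generation' (the paper explicitly mentions RAG methods and improves on them). Moderately relevant: 'Chain of Thought' and 'System 2 Thinking' (the work involves LLM reasoning and task decomposition) and 'AI for Science' (hardware design is an engineering-science application). The remaining keywords, such as MoE, quantization, and alignment, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address weak LLM performance on RTL synthesis and summarization in hardware design automation, caused by rigid HDL syntax and limited supervision, the paper proposes the SYMDIREC neuro-symbolic framework. It uses symbolic subgoal decomposition, retrieval augmentation, and LLM reasoning to achieve about 20% higher Pass@1 synthesis rates and 15-20% ROUGE-L summarization improvements over baselines on Verilog and VHDL.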

Abstract (Chinese translation)

寄存器传输级（Register-Transfer Level, RTL）综合与摘要生成是硬件设计自动化的核心任务，但由于硬件描述语言（HDL）语法严格、监督数据有限以及与自然语言的对齐性较弱，这对大语言模型（Large Language Models, LLMs）而言仍具挑战。现有的提示方法与检索增强生成（Retrieval-Augmented Generation, RAG）方法未能融入符号规划，限制了其结构精确性。我们提出了SYMDIREC，一种神经符号框架，该框架将RTL任务分解为符号子目标，通过微调的检索器获取相关代码，并借助LLM推理组装经过验证的输出。SYMDIREC无需对大语言模型进行微调即可同时支持Verilog和VHDL，在综合任务上实现了比提示方法与RAG基线高约20%的Pass@1成功率，在摘要任务上获得了15-20%的ROUGE-L提升，这证明了符号化指导在RTL任务中的优势。

Abstract

Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.
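The Pass@1 figure quoted in this abstract is a member of the pass@k family of metrics. The sketch below shows the combinatorial estimator commonly used for code generation benchmarks, where `n` candidates are sampled per task and `c` of them pass the checks. Whether SYMDIREC uses exactly this estimator is an assumption, since the abstract does not describe its evaluation protocol, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled candidates of which c pass the checks.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n candidates contains no passing candidate.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing candidates: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 generated RTL candidates, 3 pass simulation.
p1 = pass_at_k(10, 3, 1)  # reduces to c / n when k == 1
p5 = pass_at_k(10, 3, 5)
```

For `k = 1` the estimator reduces to the plain fraction of passing candidates, which is why Pass@1 is often read as a single-attempt success rate.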

Keywords: RTL synthesis, hardware design automation, neuro-symbolic framework, retrieval-augmented generation, LLM reasoning, Verilog, VHDL, symbolic planning

149. ❌ OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

Authors: Haoyang Fang, Shuai Zhang, Yifei Ma, Hengyi Wang, Cuixiong Hu, Katrin Kirchhoff, Bernie Wang, George Karypis Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17205v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

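Scoring rationale: The paper studies efficient fine-tuning of retrieval models, and its core contribution is the OPERA data pruning framework. Keyword relevance: (1) highly relevant (8 to 10 points): 'Post-training/SFT' (the core fine-tuning method), 'Pre-training/Domain Adaptation' (domain adaptation is involved), and 'Retrieval-Augmented Generation/RAG' (retrieval for generation is involved); (2) moderately relevant (5 points): 'Large Language Models' (the experiments use the Qwen3-Embedding LLM) and 'Scaling Laws AND Data Quality' (the trade-off between data quality and performance); (3) unrelated (0 points): the remaining keywords, since the paper does not touch model architecture innovations (such as MoE or quantization), alignment methods (such as RLHF), reasoning techniques (such as CoT), or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes OPERA, a data pruning framework that uses static and dynamic pruning strategies to optimize the fine-tuning of retrieval models, improving ranking performance while substantially reducing training time.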

Abstract (Chinese translation)

领域特定的微调对于稠密检索模型至关重要，但并非所有训练样本对学习过程的贡献均等。我们提出OPERA，一种数据剪枝框架，利用这种异质性来提升检索模型适应的效果与效率。我们首先研究了静态剪枝方法，该方法仅保留高相似度的查询-文档对，揭示了内在的质量-覆盖权衡：排序性能（NDCG）得到提升，而检索性能（召回率）可能因查询多样性减少而下降。为解决这一权衡，我们提出了一种两阶段动态剪枝策略，该策略在训练过程中自适应地调整查询级和文档级的采样概率，在优先处理高质量样本的同时保持对完整训练集的访问。在涵盖六个领域的八个数据集上的评估证明了两种方法的有效性：静态剪枝相比标准微调提升了排序性能（NDCG@10 +0.5%），而动态剪枝在排序（NDCG@10 +1.9%）和检索（召回率@20 +0.7%）上均取得了最强性能，在所有方法中平均排名达到1.38。这些发现可扩展至基于大语言模型的稠密检索器Qwen3-Embedding，证实了其架构无关的益处。值得注意的是，动态剪枝仅需标准微调不到50%的训练时间即可达到相当的性能。

Abstract

Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.
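This abstract separates ranking quality (NDCG@10) from retrieval coverage (Recall@20). The minimal sketch below shows how the two metrics differ. It assumes linear gain with a log2 rank discount and computes the ideal ordering from the judged list itself. The function names and toy inputs are illustrative and are not taken from the OPERA evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG normalized by the DCG of the best ordering of the same list.

    Assumes the list holds the relevance of every judged document, in ranked order.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Toy example: relevant documents sit at ranks 1 and 3 of the returned list.
ndcg = ndcg_at_k([1, 0, 1], k=3)
recall = recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)
```

NDCG rewards placing relevant documents near the top, while recall only asks whether they appear within the cutoff, so the two can move in opposite directions as the abstract reports for static pruning.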

Keywords: data pruning, retrieval model adaptation, dense retrievers, domain-specific finetuning, training efficiency, query-document pairs, dynamic pruning, Qwen3-Embedding

150. ❌ Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

Authors: Parsa Mirtaheri, Mikhail Belkin Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17199v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

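Scoring rationale: The paper's core is motivated reasoning in the CoT reasoning of LLMs, detected by probing the model's internal representations through its activations. Highly relevant keywords: LLMs (the object of study), CoT Reasoning (the phenomenon studied), and Mechanistic Interpretability (activation probing is used to explain model behavior). Moderately relevant: System 2 Thinking (the reasoning process is involved), Self-Correction (related to detecting faulty reasoning), and Hallucination Mitigation (factuality issues are involved). Other keywords such as MoE, SFT, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The paper studies motivated reasoning in the chain-of-thought reasoning of large language models, applying activation probes to the model's internal representations before and after CoT generation, and finds that this identifies the bias more reliably than CoT-based monitoring.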

Abstract (Chinese translation)

大型语言模型（LLM）生成的思维链（CoT）可能无法准确反映驱动其答案的实际因素。在存在偏向特定选项的注入提示的多选场景中，模型可能将其最终答案转向提示选项，并生成一个合理化该回答的思维链，而不承认提示的影响——这是一种动机性推理的实例。我们研究了多种LLM系列和数据集中的这一现象，证明即使无法从思维链中轻易识别动机性推理，仍可通过探测内部激活状态来识别它。利用在模型残差流上训练的监督探针，我们发现：（i）预生成探针（在生成任何思维链标记之前应用）预测动机性推理的效果与基于LLM的思维链监控器（可访问完整思维链轨迹）相当；（ii）后生成探针（在思维链生成后应用）的表现优于同一监控器。这些结果表明，从内部表征检测动机性推理比通过思维链监控更为可靠。此外，预生成探针能够早期标记动机性行为，从而可能避免不必要的生成过程。

Abstract

Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model’s residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as an LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.
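A supervised probe of the kind this abstract describes is, in its simplest form, a linear classifier trained on activation vectors. The sketch below trains a logistic-regression probe on synthetic vectors that stand in for residual-stream activations. The data generator, dimensionality, learning rate, and all function names are assumptions made for illustration; the paper's actual probe architecture and training setup are not given in the abstract.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, lr=0.1, epochs=100):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def make_synthetic(n, dim, rng):
    """Stand-in activations: label 1 ('motivated') shifts every feature by +1, label 0 by -1."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        xs.append([center + rng.gauss(0.0, 0.5) for _ in range(dim)])
        ys.append(y)
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_x, train_y = make_synthetic(200, 4, rng)
test_x, test_y = make_synthetic(100, 4, rng)
w, b = train_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_x)
```

In a real setting the inputs would be activations captured at a chosen layer and token position, either before any CoT token is generated or after the CoT is complete, and the labels would mark whether the final answer followed the injected hint.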

Keywords: Large Language Models, Chain of Thought, Motivated Reasoning, Activation Probing, Internal Representations, Residual Stream, Rationalization, Model Interpretability
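The supervised probing idea in the abstract can be illustrated with a minimal sketch. It assumes the probe is a logistic-regression classifier and uses synthetic vectors as a hypothetical stand-in for residual-stream activations; the abstract does not state the probe architecture, and the names `train_probe`, `probe_predict`, and `make_synthetic_activations` are illustrative only.

```python
import math
import random


def train_probe(xs, ys, lr=0.5, epochs=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Return 1 (motivated reasoning flagged) or 0 for one activation vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_synthetic_activations(n, dim, rng):
    """Assumption: the label shifts activations along one fixed direction."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_synthetic_activations(200, 8, rng)
test_x, test_y = make_synthetic_activations(100, 8, rng)
w, b = train_probe(train_x, train_y)
accuracy = sum(probe_predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"held-out probe accuracy on synthetic data: {accuracy:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations read at a chosen layer, either before any CoT token is generated (pre-generation probe) or after the CoT is complete (post-generation probe).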

151. ❌ Abstraction as a Memory-Efficient Inductive Bias for Continual Learning

Authors: Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

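Scoring rationale: The paper studies forgetting in online continual learning and proposes Abstraction-Augmented Training (AAT), which stabilizes learning by encouraging the model to capture the latent relational structure shared across examples. Although the paper concerns model training, its core topics (continual learning, forgetting, abstraction as an inductive bias, memory efficiency) have no direct connection to the keyword list, which focuses on large language models and their specific techniques, applications, and optimization methods. None of the keywords is mentioned or implied in the title or abstract, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

To address catastrophic forgetting in online continual learning, this paper proposes Abstraction-Augmented Training (AAT), a loss-level modification that jointly optimizes over concrete instances and their abstract representations to introduce a memory-efficient inductive bias. Without a replay buffer, it matches or exceeds strong experience replay baselines.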

Abstract

The real world is non-stationary and infinitely complex, requiring intelligent agents to learn continually without the prohibitive cost of retraining from scratch. While online continual learning offers a framework for this setting, learning new information often interferes with previously acquired knowledge, causing forgetting and degraded generalization. To address this, we propose Abstraction-Augmented Training (AAT), a loss-level modification encouraging models to capture the latent relational structure shared across examples. By jointly optimizing over concrete instances and their abstract representations, AAT introduces a memory-efficient inductive bias that stabilizes learning in strictly online data streams, eliminating the need for a replay buffer. To capture the multi-faceted nature of abstraction, we introduce and evaluate AAT on two benchmarks: a controlled relational dataset where abstraction is realized through entity masking, and a narrative dataset where abstraction is expressed through shared proverbs. Our results show that AAT achieves performance comparable to or exceeding strong experience replay (ER) baselines, despite requiring zero additional memory and only minimal changes to the training objective. This work highlights structural abstraction as a powerful, memory-free alternative to ER.

Keywords: Continual Learning, Abstraction-Augmented Training, Memory-efficient, Inductive Bias, Online Learning, Forgetting, Generalization, Replay Buffer
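The two ingredients named in the abstract, entity masking and a joint objective over concrete and abstract inputs, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the additive form of the loss, the `abstraction_weight` parameter, and the `[ENT0]` placeholder format are all hypothetical.

```python
import math


def mask_entities(tokens, entities, placeholder="[ENT{}]"):
    """Replace each distinct entity with an indexed placeholder.

    The relational words are kept, so the masked sequence exposes only the
    structure shared across examples.
    """
    mapping = {}
    masked = []
    for tok in tokens:
        if tok in entities:
            if tok not in mapping:
                mapping[tok] = placeholder.format(len(mapping))
            masked.append(mapping[tok])
        else:
            masked.append(tok)
    return masked


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])


def joint_loss(concrete_probs, abstract_probs, target_index, abstraction_weight=1.0):
    """Assumed form: loss on the concrete input plus a weighted loss on its abstraction."""
    concrete = cross_entropy(concrete_probs, target_index)
    abstract = cross_entropy(abstract_probs, target_index)
    return concrete + abstraction_weight * abstract


tokens = "Alice is the mother of Bob".split()
print(mask_entities(tokens, {"Alice", "Bob"}))
# ['[ENT0]', 'is', 'the', 'mother', 'of', '[ENT1]']
print(joint_loss([0.5, 0.5], [0.25, 0.75], target_index=0))
```

Because the abstraction is derived from the current example alone, a sketch like this stores no past data, which is consistent with the abstract's claim of zero additional memory.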

152. ❌ Tabular LLMs for Interpretable Few-Shot Alzheimer’s Disease Prediction with Multimodal Biomedical Data

Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen, Li Shen Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

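Scoring rationale: The paper applies large language models (LLMs) to Alzheimer’s disease prediction, which falls under AI for Science, so the corresponding keywords score highest. It also involves a pretrained model (TableGPT2), domain adaptation, supervised fine-tuning (SFT), self-reflection, LLM agents, multi-agent systems, and explainable AI; these keywords receive 8 points. Other keywords, such as MoE, quantization, and inference acceleration, are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

This study proposes TAP-GPT, a domain-adapted tabular LLM framework built on TableGPT2 for few-shot Alzheimer’s disease classification. On four ADNI-derived datasets it outperforms traditional machine learning baselines in the few-shot setting, stays stable under missing data without imputation, produces interpretable reasoning, and is more stable under self-reflection, which supports its use in multi-agent systems.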

Abstract

Accurate diagnosis of Alzheimer’s disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT (Tabular Alzheimer’s Prediction GPT), a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain texts. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: https://github.com/sophie-kearney/TAP-GPT.

Keywords: Tabular LLMs, Alzheimer’s disease prediction, Few-shot learning, Multimodal biomedical data, Domain adaptation, Interpretable AI, Self-reflection, Multi-agent systems

153. ❌ Exploiting the English Grammar Profile for L2 grammatical analysis with LLMs

Authors: Stefano Bannò, Penny Karanasou, Kate Knill, Mark Gales Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

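Scoring rationale: The paper applies LLMs to second-language grammatical analysis, so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points): it explicitly uses LLMs to classify grammatical constructs and assess proficiency. It is partly relevant to “AI for Science OR Bioinformatics OR Cheminformatics” (5 points), since it applies AI to education and language science, though not to bioinformatics or cheminformatics. The remaining keywords (such as MoE, SFT, and RAG) concern innovations in LLM techniques or specific application methods that the paper does not address, so all score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a new framework that uses large language models and the English Grammar Profile to assess the grammatical competence of second-language learners and provide feedback automatically. Experiments show that LLMs outperform rule-based methods at classifying semantically and pragmatically nuanced grammatical constructs, and that a hybrid approach performs best for proficiency assessment.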

Abstract

Evaluating the grammatical competence of second language (L2) learners is essential both for providing targeted feedback and for assessing proficiency. To achieve this, we propose a novel framework leveraging the English Grammar Profile (EGP), a taxonomy of grammatical constructs mapped to the proficiency levels of the Common European Framework of Reference (CEFR), to detect learners’ attempts at grammatical constructs and classify them as successful or unsuccessful. This detection can then be used to provide fine-grained feedback. Moreover, the grammatical constructs are used as predictors of proficiency assessment by using automatically detected attempts as predictors of holistic CEFR proficiency. For the selection of grammatical constructs derived from the EGP, rule-based and LLM-based classifiers are compared. We show that LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs, while rule-based approaches remain competitive for constructs that rely purely on morphological or syntactic features and do not require semantic interpretation. For proficiency assessment, we evaluate both rule-based and hybrid pipelines and show that a hybrid approach combining a rule-based pre-filter with an LLM consistently yields the strongest performance. Since our framework operates on pairs of original learner sentences and their corrected counterparts, we also evaluate a fully automated pipeline using automatic grammatical error correction. This pipeline closely approaches the performance of semi-automated systems based on manual corrections, particularly for the detection of successful attempts at grammatical constructs. Overall, our framework emphasises learners’ successful attempts in addition to unsuccessful ones, enabling positive, formative feedback and providing actionable insights into grammatical development.

Keywords: Large Language Models, grammatical analysis, second language learning, English Grammar Profile, CEFR proficiency, automatic feedback, hybrid approach, grammatical error correction

154. ❌ How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

Authors: Rebecca Ansell, Autumn Toney-Wails Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

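Scoring rationale: The paper studies the multi-step deductive reasoning ability of LLMs in a multi-agent environment. It builds a text-based Clue testbed with GPT-4o-mini and Gemini-2.5-Flash and explores how fine-tuning affects reasoning. Highly relevant keywords: LLMs (the core object of study), Multi-step Reasoning (the focus of the evaluation), LLM Agents (the vehicle of the study), and Multi-agent Systems (the experimental setup). It is also relevant to Supervised Fine-tuning (the experiments include a fine-tuning study) and System 2 Thinking (it evaluates in-depth reasoning). Other keywords such as MoE, SLMs, and RAG are not addressed in the paper.

!!! tip deepseek-chat TL;DR

This paper builds a text-based multi-agent Clue testbed to evaluate LLMs on multi-step deductive reasoning. It finds that LLMs struggle to maintain consistent reasoning over a full game and that fine-tuning does not reliably improve performance.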

Abstract

Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.

Keywords: LLM agents, multi-step deductive reasoning, text-based game, multi-agent system, fine-tuning, GPT-4o-mini, Gemini-2.5-Flash, reasoning evaluation

155. ❌ Multilingual Reference Need Assessment System for Wikipedia

Authors: Aitolkyn Baigutanova, Francisco Navas, Pablo Aragon, Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Miriam Redi, Diego Saez-Trumper Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

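Scoring rationale: The paper presents a multilingual machine learning system that helps Wikipedia editors identify claims needing citations, which makes it an applied AI paper. It is unrelated to most keywords (such as MoE, SFT, and RAG), which concern LLM techniques or specific methods that the paper does not mention. The only relevant keyword is “Large Language Models OR LLMs OR Foundation Models”: the abstract notes that Wikipedia is a key resource for large language models, but the paper does not study LLM techniques and the link is only contextual, hence 5 points (partly relevant). Other keywords such as “AI for Science” are not relevant, because the paper concerns content verification on Wikipedia rather than AI applied to scientific domains.

!!! tip deepseek-chat TL;DR

This paper develops a multilingual machine learning system that automatically identifies Wikipedia claims needing citations. It outperforms existing benchmarks in 10 language editions and is deployed in production, balancing model accuracy against computational efficiency.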

Abstract

Wikipedia is a critical source of information for millions of users across the Web. It serves as a key resource for large language models, search engines, question-answering systems, and other Web-based applications. In Wikipedia, content needs to be verifiable, meaning that readers can check that claims are backed by references to reliable sources. This depends on manual verification by editors, an effective but labor-intensive process, especially given the high volume of daily edits. To address this challenge, we introduce a multilingual machine learning system to assist editors in identifying claims requiring citations. Our approach is tested in 10 language editions of Wikipedia, outperforming existing benchmarks for reference need assessment. We not only consider machine learning evaluation metrics but also system requirements, allowing us to explore the trade-offs between model accuracy and computational efficiency under real-world infrastructure constraints. We deploy our system in production and release data and code to support further research.

Keywords: Wikipedia, reference need assessment, multilingual machine learning, citation detection, computational efficiency, production deployment, verifiable content, editor assistance

156. ❌ Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency

Authors: Lucas Bandarkar, Alan Ansell, Trevor Cohn Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

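Scoring rationale: The paper studies knowledge localization in mixture-of-experts (MoE) large language models, using cross-lingual inconsistency for interpretability analysis, so it is highly relevant to “Large Language Models”, “Mixture of Experts”, and “Mechanistic Interpretability” (10 points each). Other keywords, covering SLMs, training methods, reasoning techniques, agent systems, and scientific applications, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes using cross-lingual inconsistency to locate where knowledge is stored in mixture-of-experts LLMs. By contrasting router activations between languages where the model succeeds and languages where it fails, it identifies the experts that matter for a given piece of knowledge. Deactivating about 20 of roughly 6000 experts makes the model answer incorrectly in over 40% of cases.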

Abstract

Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate “success” and “failure” activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.

Keywords: Mixture-of-Experts, LLMs, knowledge localization, cross-lingual inconsistency, interpretability, router logits, experts, statistical contrastive analysis
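The two-stage procedure in the abstract (bucket router logits by success or failure, contrast them statistically, then deactivate the top experts) can be sketched on toy data. The abstract does not name the statistic; Welch's t-statistic is used here purely as an assumption, and `rank_experts` and `deactivate` are hypothetical helper names.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_experts(success_logits, failure_logits, top_k=1):
    """Rank experts by how much higher their router logits are in the success bucket.

    Each bucket is a list of rows; each row holds one router logit per expert.
    """
    num_experts = len(success_logits[0])
    scores = []
    for e in range(num_experts):
        s = [row[e] for row in success_logits]
        f = [row[e] for row in failure_logits]
        scores.append((welch_t(s, f), e))
    scores.sort(reverse=True)
    return [e for _, e in scores[:top_k]]


def deactivate(router_logits, expert_ids):
    """Mask the chosen experts so that top-k routing can never select them."""
    return [float("-inf") if e in expert_ids else v for e, v in enumerate(router_logits)]


# Toy buckets with four experts; expert 2 is routed to far more often on success.
success = [[0.1, 0.5, 2.0, -0.3], [0.2, 0.4, 2.2, -0.1], [0.0, 0.6, 1.9, -0.2]]
failure = [[0.1, 0.6, 0.1, -0.2], [0.3, 0.4, 0.3, -0.3], [0.1, 0.5, 0.2, -0.1]]

important = rank_experts(success, failure, top_k=1)
print(important)
print(deactivate(success[0], set(important)))
```

Re-asking the question with the masked logits corresponds to the validation step described in the abstract, where deactivating the identified experts is used to test whether they are necessary.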

Authors: Ryo Kamoi, Ameya Godbole, Longqi Yang, Rui Zhang, Mengting Wan, Pei Zhou Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

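Scoring rationale: The paper studies how well LLMs simulate human conversations, which directly concerns the application of LLMs, so “Large Language Models” scores 10. It explicitly uses supervised fine-tuning (SFT), so “Post-training OR Supervised Fine-tuning OR SFT” scores 8. Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Agents, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies whether LLMs can simulate the inconsistent and uncollaborative behaviors found in human conversations. It finds that LLM-generated conversations contain fewer of these behaviors than human conversations, that prompt engineering cannot reliably control them, and that supervised fine-tuning can lead to their overproduction.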

Abstract

Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.

Keywords: LLM-simulated conversations, human social interaction, inconsistent behaviors, uncollaborative behaviors, evaluation framework, supervised fine-tuning, prompt engineering, CoCoEval

158. ❌ Ensemble Self-Training for Unsupervised Machine Translation

Authors: Ido Aharon, Jonathan Shaki, Sarit Kraus Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

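Rationale: The paper studies an ensemble self-training framework for unsupervised neural machine translation (UNMT), which belongs to conventional machine translation rather than to innovation in large-model or deep-learning techniques. It does not touch any of the large-model techniques among the scoring keywords (e.g., LLMs, MoE, RLHF), large-model applications (e.g., AI for Science), or related techniques (e.g., quantization, inference acceleration). All keywords are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The paper proposes an ensemble self-training framework for unsupervised neural machine translation that improves translation quality by generating pseudo-translations with a multi-model ensemble. Experiments report statistically significant mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.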

摘要翻译

我们提出一种基于集成学习的自训练框架，用于无监督神经机器翻译（UNMT）。以核心语言对为起点，我们训练多个共享相同翻译任务但采用不同辅助语言的UNMT模型，从而在模型间引入结构化差异。随后，我们通过词元级集成解码为核心语言对生成伪翻译，即对双向翻译的模型预测结果进行平均。这些集成输出被用作合成平行数据以进一步训练每个模型，使各模型能够通过共享监督信号实现性能提升。在部署阶段，我们根据验证集表现选择单一模型，从而保持单模型推理成本。实验表明，该框架相比单模型UNMT基线取得了统计学意义上的显著改进：从英语翻译时平均提升1.7 chrF，译入英语时平均提升0.67 chrF。

Abstract

We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.

Keywords: unsupervised neural machine translation, ensemble self-training, pseudo-translations, token-level ensemble decoding, synthetic parallel data, model diversity, translation performance, chrF improvement
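
The token-level ensemble decoding described in the abstract can be illustrated with a minimal sketch. It assumes each model exposes a next-token probability distribution through a hypothetical `NextTokenFn` interface, and it uses greedy decoding in one direction only; the toy models, the `make_toy_model` helper, and the token tables are invented for illustration and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a model maps (source sentence, target prefix)
# to a probability distribution over the next target token.
NextTokenFn = Callable[[str, Sequence[str]], Dict[str, float]]


def ensemble_next_token(models: List[NextTokenFn], source: str,
                        prefix: Sequence[str]) -> Dict[str, float]:
    """Average the next-token distributions of all models."""
    averaged: Dict[str, float] = {}
    for model in models:
        for token, prob in model(source, prefix).items():
            averaged[token] = averaged.get(token, 0.0) + prob / len(models)
    return averaged


def ensemble_greedy_decode(models: List[NextTokenFn], source: str,
                           eos: str = "</s>", max_len: int = 20) -> List[str]:
    """Pick the most probable token under the averaged distribution at each step."""
    prefix: List[str] = []
    for _ in range(max_len):
        dist = ensemble_next_token(models, source, prefix)
        token = max(dist, key=dist.get)
        if token == eos:
            break
        prefix.append(token)
    return prefix


def make_toy_model(steps: List[Dict[str, float]]) -> NextTokenFn:
    """Build a toy model that ignores the source and replays fixed distributions."""
    def model(source: str, prefix: Sequence[str]) -> Dict[str, float]:
        return steps[len(prefix)] if len(prefix) < len(steps) else {"</s>": 1.0}
    return model


# Two toy models that disagree on the first token.
model_a = make_toy_model([{"hallo": 0.6, "hi": 0.4}, {"welt": 0.9, "</s>": 0.1}])
model_b = make_toy_model([{"hallo": 0.3, "hi": 0.7}, {"welt": 0.8, "</s>": 0.2}])
pseudo_translation = ensemble_greedy_decode([model_a, model_b], "hello world")
```

In the framework summarized above, outputs like `pseudo_translation` would be paired with their source sentences and used as synthetic parallel data to continue training each model.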

159. ❌ Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts

Authors: Lucas Bandarkar, Alan Ansell, Trevor Cohn Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

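Rationale: The paper studies the limitations of large reasoning LLMs in cross-lingual knowledge transfer, in particular the script barrier (a mismatch between writing systems). Highly relevant keywords: 1) 'Large Language Models' (the paper explicitly studies modern reasoning LLMs); 2) 'Post-training/SFT' (the paper builds SFT samples to improve model reasoning); 3) 'Chain of Thought/System 2 Thinking' (the paper analyzes the behavior of reasoning models and improves their reasoning). Other keywords such as MoE, SLMs, Scaling Laws, and RAG have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The study finds that the cross-lingual knowledge-transfer gap in large reasoning models is primarily a script barrier, and shows that SFT samples designed to improve reasoning about transliteration ambiguities reduce the cross-script transfer gap for the two models trained.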

摘要翻译

本研究分析了现代大型推理语言模型在跨语言知识迁移中存在的缺陷。我们证明，当前观察到的知识迁移差距主要源于文字书写体系的障碍。首先，我们对多个思维模型在两个包含全球本土知识的数据集（ECLeKTic和MultiLoKo）上的表现进行了观测性数据分析。回归分析表明，在控制模型能力与问题难度后，文字书写体系的匹配度——而非语言或语系——是预测知识迁移失败的主要指标。我们通过向模型提供源语言的关键实体信息进一步验证了这一发现，结果显示该方法对跨文字体系问题的提升效果尤为显著。基于此，我们提出假设：这些模型在测试阶段可能具备更强的推理潜力。为验证该假设，我们开发了合成数据生成流程，设计监督微调样本以鼓励模型在推理时检索参数化知识过程中更好地处理音译歧义问题。实验证明，通过引导两个模型提升推理能力，跨文字体系的迁移差距得以缩小。由此我们得出结论：在预训练后阶段改善跨语言参数化知识迁移具有显著潜力。

Abstract

In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.

Keywords: Large Language Models, cross-lingual knowledge transfer, script barrier, reasoning models, parametric knowledge, supervised fine-tuning, knowledge transfer failure, cross-script transfer gap
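
The "script match" predictor mentioned in the abstract can be made concrete with a rough sketch. This is not the paper's feature extraction: it assumes, as a heuristic, that the first word of a character's Unicode name (for example `LATIN` or `CYRILLIC`) is a usable stand-in for its script, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script label among the letters of `text`.

    Heuristic assumption: the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC', 'CJK') approximates the character's script.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def script_match(source_text: str, target_text: str) -> bool:
    """True when both texts are written mostly in the same script."""
    return dominant_script(source_text) == dominant_script(target_text)
```

A binary feature like `script_match(question_in_source_language, question_in_target_language)` is the kind of variable a regression on transfer failure could include alongside controls for model capability and question difficulty.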

160. ❌ LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

Authors: Lifu Tu, Rongguang Wang, Tao Sheng, Sujjith Ravi, Dan Roth Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

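Rationale: The paper evaluates the robustness of LLMs on the NL2SQL task, testing several state-of-the-art LLMs (e.g., GPT-5.2, Claude-Opus-4.6) under both traditional and agentic settings, so it is highly relevant to 'Large Language Models' (10 points). Because the evaluation is explicitly carried out under 'agentic settings', it is also highly relevant to 'LLM Agents' (10 points). Other keywords such as MoE, SFT, RAG, and CoT are neither mentioned nor touched on in the abstract, so they score 0. The paper is applied large-model research and fits the scoring background's requirement of 'research applications of large models in different domains'.

!!! tip deepseek-chat TL;DR

The paper studies the robustness of large language models on natural-language-to-SQL tasks. It finds notable performance degradation under surface-level noise and under linguistic variation that preserves semantics; surface-level noise causes larger drops in traditional pipelines, whereas linguistic variation is harder in agentic settings.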

摘要翻译

自然语言转SQL（NL2SQL）系统的鲁棒性评估至关重要，因为现实世界的数据库环境具有动态性、噪声干扰且持续演变，而传统基准评估通常假设静态模式与规范化的用户输入。本研究引入了一个包含约十类扰动的鲁棒性评估基准，并在传统流程与智能体化（agentic）设置下分别进行评测。我们评估了包括Grok-4.1、Gemini-3-Pro、Claude-Opus-4.6和GPT-5.2在内的多种前沿大语言模型（LLMs）。结果表明，这些模型在多数扰动下能保持较强性能；但在表层噪声（如字符级损坏）以及保持语义不变而改变词汇或句法形式的语言变异情况下，模型性能出现显著下降。此外，我们发现表层噪声在传统流程中导致更大幅度的性能衰减，而语言变异则在智能体化场景中构成更大挑战。这些发现凸显了构建鲁棒NL2SQL系统仍面临的难题，尤其在处理语言多样性方面。

Abstract

Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.

Keywords: NL2SQL, robustness evaluation, large language models, agentic settings, linguistic variation, surface noise, benchmark, performance degradation
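
Character-level corruption, the abstract's example of surface-level noise, can be sketched as follows. This is one possible perturbation under assumed settings, not the benchmark's actual generator: the function name, the adjacent-swap strategy, and the `rate` and `seed` parameters are all illustrative choices.

```python
import random


def swap_adjacent_chars(question: str, rate: float, seed: int = 0) -> str:
    """Swap neighbouring letters with probability `rate` to simulate typos.

    Only pairs of alphabetic characters are swapped, so spaces, digits,
    and punctuation stay in place.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(question)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


clean = "List the names of all customers in Paris"
noisy = swap_adjacent_chars(clean, rate=0.3, seed=42)
```

A robustness evaluation would then compare the SQL produced for `clean` and for `noisy` against the same gold query.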

161. ❌ HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Authors: Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

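Rationale: The paper proposes the HopChain framework, which synthesizes multi-hop reasoning data for vision-language models (VLMs) to improve their reasoning. The core keyword is 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points), because the paper directly studies the synthesis of, and training on, multi-step reasoning chains; 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points) applies because the work involves in-depth reasoning; 'Large Language Models OR LLMs OR Foundation Models' (8 points) applies because VLMs fall under large models. 'Scaling Laws AND Data Quality' (5 points) and 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points) are indirectly related, through the effect of data synthesis on performance and the analysis of hallucination errors, respectively. 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) is included because the paper evaluates on STEM benchmarks. Other keywords such as MoE, SFT, and RAG are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

Targeting the failure modes of vision-language models in fine-grained multi-step reasoning, the paper proposes HopChain, a framework for synthesizing multi-hop vision-language reasoning data. Adding this data to RLVR training improves 20 of 24 benchmarks on both models and substantially strengthens long-CoT reasoning.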

摘要翻译

视觉语言模型展现出强大的多模态能力，但在细粒度视觉语言推理方面仍面临挑战。我们发现，长链思维推理过程会暴露多种错误模式，包括感知错误、推理错误、知识错误和幻觉错误，这些错误可能在中间步骤中累积放大。然而，现有大多数用于强化学习视觉推理的视觉语言数据并未涉及全程依赖视觉证据的复杂推理链，使得这些缺陷难以被充分揭示。为此，我们提出HopChain——一个可扩展的框架，专门用于合成多跳视觉语言推理数据以支持视觉语言模型的强化学习视觉推理训练。每个合成的多跳查询构成一条逻辑上相互依赖的实例锚定跳链：早期跳步为后续跳步建立所需的实例、集合或条件，而最终答案仍保持为具体、明确的数值，便于进行可验证的奖励计算。我们将HopChain合成的多跳数据添加到用于训练Qwen3.5-35B-A3B和Qwen3.5-397B-A17B的原始强化学习视觉推理数据中，并在涵盖STEM与谜题、通用视觉问答、文本识别与文档理解、视频理解等领域的24个基准测试上，与仅使用原始强化学习视觉推理数据的训练结果进行对比。尽管这些多跳数据并非针对特定基准设计，但其加入使得两个模型在24个基准中的20个上表现均有提升，表明该方法能带来广泛且可泛化的性能增益。为证明完整链式查询的重要性，我们将其替换为半多跳或单跳变体，导致24个基准的平均准确率分别下降5.3和7.0个百分点。多跳训练还显著增强了长链思维下的视觉语言推理能力，在超长链思维推理场景中准确率提升峰值超过50个百分点。这些实验证明，HopChain是一个高效、可扩展的框架，能够合成提升视觉语言推理泛化能力的多跳数据。

Abstract

VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.

Keywords: Vision-Language Models, Multi-hop Reasoning, Chain of Thought, Data Synthesis, Generalizable Reasoning, RLVR Training, Visual Evidence, Benchmark Evaluation
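
The abstract notes that each multi-hop query ends in a specific, unambiguous number so that the reward can be verified. A minimal sketch of such a check is shown below; the parsing rule (take the last number in the response) and the function names are assumptions made for illustration, not the paper's reward implementation.

```python
import re
from typing import Optional

# Matches integers and simple decimals, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[float]:
    """Return the last number that appears in the response, if any."""
    matches = _NUMBER.findall(response)
    return float(matches[-1]) if matches else None


def verifiable_reward(response: str, gold: float, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the final number equals the gold answer, else 0.0."""
    predicted = extract_final_number(response)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold) <= tol else 0.0
```

Because the check depends only on the final number, it can score a long reasoning chain without a learned judge, which is what makes numeric answers convenient for RLVR-style training.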

162. ❌ Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Authors: Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

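Rationale: The paper studies the application of multimodal large language models (MLLMs) to a non-visual modality (human skeletons), which counts as an innovative application of large models to a new domain. It is highly relevant to "Large Language Models" (10 points), because MLLMs extend LLMs; relevant to "Chain of Thought" and "System 2 Thinking" (8 points each), because the paper refers to "step-by-step reasoning" and "reasoning capabilities"; somewhat related to "Post-training" (5 points), because it uses fine-tuning strategies; and somewhat related to "AI for Science" (5 points), because human skeleton analysis can be viewed as a bioinformatics-adjacent application. The remaining keywords are unrelated to the paper or not covered by it.

!!! tip deepseek-chat TL;DR

The paper addresses the inability of multimodal large language models to directly process non-visual structured data such as human skeleton sequences. It proposes the SkeletonLLM framework and the DrAction differentiable renderer, which convert arbitrary skeleton sequences into the visual modality for recognition, captioning, and reasoning tasks, showing the potential of MLLMs on non-native modalities.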

摘要翻译

多模态大语言模型（MLLMs）展现出强大的视觉-语言推理能力，但仍受限于其原生模态，无法直接处理结构化、非视觉的数据（如人体骨骼）。现有方法要么将骨骼动态压缩为有损特征向量以对齐文本，要么将运动量化为离散标记，这些标记在不同异构骨骼格式间泛化能力较差。本文提出SkeletonLLM，通过将任意骨骼序列转换为MLLM的原生视觉模态，实现了通用骨骼理解。其核心是DrAction——一个可微分、格式无关的渲染器，能够将骨骼运动学转换为紧凑的图像序列。由于该流程是端到端可微分的，MLLM的梯度可以直接指导渲染过程，生成富含任务信息的视觉标记。为进一步增强推理能力，我们引入协同训练策略：因果推理蒸馏从教师模型迁移结构化的逐步推理能力，而判别性微调则锐化易混淆动作间的决策边界。SkeletonLLM在识别、描述、推理及跨格式迁移等多种任务上表现出强大的泛化能力，为将MLLMs应用于非原生模态提供了一条可行路径。代码将在论文录用后公开。

Abstract

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM’s native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer – suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.

Keywords: Multimodal Large Language Models, Skeleton Understanding, Differentiable Rendering, Visual-Language Reasoning, Cross-format Transfer, Causal Reasoning Distillation, Task-informative Visual Tokens, Non-native Modalities
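
To give a feel for why rendering a skeleton can be differentiable, the sketch below draws 2D joints as Gaussian blobs. It is an assumption-laden toy, not DrAction: the abstract does not describe the renderer's internals, and the function name, the Gaussian splatting choice, and the `sigma` parameter are illustrative. The point is only that every pixel value is a smooth function of the joint coordinates, so gradients could flow back to them in an autodiff framework.

```python
import math
from typing import List, Sequence, Tuple


def render_joints(joints: Sequence[Tuple[float, float]], height: int,
                  width: int, sigma: float = 1.0) -> List[List[float]]:
    """Render 2D joints (x, y) as a sum of Gaussian blobs on a pixel grid.

    Each pixel intensity is exp(-d^2 / (2 * sigma^2)) summed over joints,
    where d is the distance from the pixel to the joint.
    """
    image = [[0.0] * width for _ in range(height)]
    for jx, jy in joints:
        for y in range(height):
            for x in range(width):
                squared_dist = (x - jx) ** 2 + (y - jy) ** 2
                image[y][x] += math.exp(-squared_dist / (2.0 * sigma ** 2))
    return image


# One frame with a single joint at x=2, y=1 on a 4x5 grid.
frame = render_joints([(2.0, 1.0)], height=4, width=5)
```

A skeleton sequence would produce one such frame per time step, giving an image sequence that a vision encoder can consume.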

163. ❌ EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

Authors: Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

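Rationale: EchoGen focuses on layout-to-image generation and image grounding in computer vision, proposing a unified framework and a progressive training strategy (the PMTP, DJO, and Cycle RL stages). Although the paper involves multi-task learning, pre-training, and reinforcement learning (the GRPO strategy), its core content concerns deep learning applied to computer vision rather than large language models (LLMs) or innovation in the listed techniques. Most keywords (e.g., LLMs, MoE, Scaling Laws, RLHF, RAG) are therefore unrelated (0 points). Only "Pre-training OR Continual Pre-training OR Domain Adaptation" is somewhat related (5 points), because the paper describes a "Parallel Multi-Task Pre-training (PMTP)" stage, but this is not its core contribution. Other keywords, such as AI for Science, are also not directly relevant.

!!! tip deepseek-chat TL;DR

The paper proposes EchoGen, a framework that unifies layout-to-image generation and image grounding through a progressive training strategy (pre-training, joint optimization, and cycle reinforcement learning). It reports state-of-the-art results on benchmarks for both tasks and synergistic gains from training them together.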

摘要翻译

本研究提出EchoGen——一个面向布局生成图像与图像定位的统一框架，该框架能够生成具有精确布局且高度贴合文本描述（如空间关系）的图像，同时实现鲁棒的图像定位。我们认为图像定位任务具备强大的文本与布局理解能力，可弥补布局生成图像任务中相应的局限性；同时，基于布局生成的图像在内容上具有高度多样性，从而能增强图像定位的鲁棒性。在统一模型中联合训练这两项任务可促进各自性能的提升。然而，我们发现这种联合训练范式存在若干优化挑战，并导致性能受限。为解决这些问题，我们提出了渐进式训练策略：首先，通过并行多任务预训练阶段，利用共享标记加速训练，使模型获得两项任务的基础能力；随后，在双重联合优化阶段，利用任务对偶性将两项任务顺序整合，实现统一优化；最后，在循环强化学习阶段，通过一致性约束作为奖励消除对视觉监督的依赖，并借助GRPO策略显著增强模型的统一能力。大量实验表明，本方法在布局生成图像与图像定位基准测试中均取得了最先进的性能，并明确揭示了通过联合优化两项任务所产生的显著协同增益。

Abstract

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model’s unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.

Keywords: layout-to-image generation, image grounding, unified framework, progressive training, cycle-consistent learning, multi-task learning, GRPO strategy, visual supervision
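
The Cycle RL stage uses consistency constraints as rewards. One plausible way to express such a constraint, sketched below, is to compare the input layout with the layout recovered by grounding the generated image and to score the agreement with mean IoU. The abstract does not specify the reward, so the IoU choice, the box format, and the function names are assumptions for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cycle_consistency_reward(input_layout: Sequence[Box],
                             recovered_layout: Sequence[Box]) -> float:
    """Mean IoU between each input box and the box recovered for the same object.

    Assumes the two layouts list the same objects in the same order;
    mismatched lengths receive zero reward.
    """
    if not input_layout or len(input_layout) != len(recovered_layout):
        return 0.0
    pairs = zip(input_layout, recovered_layout)
    return sum(iou(a, b) for a, b in pairs) / len(input_layout)
```

A reward of this kind needs no ground-truth image, which is consistent with the abstract's statement that the stage removes the reliance on visual supervision.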

164. ❌ LoST: Level of Semantics Tokenization for 3D Shapes

Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

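Rationale: The paper focuses on generative modeling of 3D shapes, proposing a new semantic-level tokenization method (LoST) and the Relational Inter-Distance Alignment loss (RIDA); it belongs to computer vision and 3D geometry processing. Its content is unrelated to the vast majority of keywords (which concern large language model techniques, training methods, inference optimization, alignment, agent systems, and so on). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because 3D shape generation can be viewed as an application of AI to scientific computing or graphics (taken as a science domain in the broad sense); since this is not the paper's core, it receives 5 points (somewhat related).

!!! tip deepseek-chat TL;DR

The paper addresses the lack of semantic coherence and the token inefficiency of conventional geometric level-of-detail tokenization in autoregressive 3D shape generation. It proposes a tokenization ordered by semantic salience (LoST) and a matching alignment loss (RIDA), which surpass prior methods by large margins on geometric and semantic reconstruction metrics while using far fewer tokens, and which support high-quality generation and downstream tasks such as semantic retrieval.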

摘要翻译

分词是多种模态生成建模的一项基础技术。尤其在自回归模型中，它发挥着关键作用——该模型近年来已成为三维生成领域备受关注的选择。然而，三维形状的最优分词方法仍是一个开放性问题。当前最先进的方法主要依赖于最初为渲染和压缩设计的几何细节层次结构。这类空间层次结构通常存在分词效率低下的问题，且缺乏适用于自回归建模的语义连贯性。我们提出了语义层次分词法，该方法依据语义显著性对词元进行排序，使得早期前缀能解码为具备主体语义的完整、合理形状，而后续词元则用于细化实例特有的几何与语义细节。为训练语义层次分词法，我们引入了关系性内部距离对齐——一种新颖的三维语义对齐损失函数，它能够将三维形状潜在空间的关系结构与语义DINO特征空间的关系结构进行对齐。实验表明，语义层次分词法实现了最先进的重建效果，在几何与语义重建指标上均大幅超越此前基于细节层次的三维形状分词方法。此外，该方法仅需先前自回归模型所需词元数量的0.1%-10%，就能实现高效、高质量的自回归三维生成，并支持语义检索等下游任务。

Abstract

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

Keywords: 3D shape tokenization, autoregressive modeling, semantic salience, level-of-detail, generative modeling, geometric reconstruction, semantic alignment, relational structure alignment
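
The idea of aligning the relational structure of two feature spaces can be sketched with pairwise distances. The code below is a generic illustration and should not be read as the RIDA loss itself, whose exact form the abstract does not give: the mean-normalization step, the squared-error penalty, and all function names are assumptions.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pairwise_distances(points: Sequence[Sequence[float]]) -> Matrix:
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def normalize_by_mean(matrix: Matrix) -> Matrix:
    """Divide by the mean off-diagonal distance so the two spaces share a scale."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag) if off_diag else 0.0
    if mean == 0.0:
        return [row[:] for row in matrix]
    return [[value / mean for value in row] for row in matrix]


def relational_alignment_loss(shape_latents: Sequence[Sequence[float]],
                              semantic_feats: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between the two normalized distance matrices.

    Both inputs describe the same items in the same order; their
    dimensionalities may differ because only distances are compared.
    """
    if len(shape_latents) != len(semantic_feats):
        raise ValueError("both spaces must contain the same number of items")
    d_shape = normalize_by_mean(pairwise_distances(shape_latents))
    d_sem = normalize_by_mean(pairwise_distances(semantic_feats))
    n = len(d_shape)
    total = sum((d_shape[i][j] - d_sem[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

The loss is zero when the two spaces arrange the items identically up to scale, and grows as their neighborhood structure diverges.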

165. ❌ GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Authors: Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17993v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

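Scoring rationale: GMT focuses on 6-DOF object trajectory synthesis in 3D scenes with a multimodal Transformer framework; its core concerns are robot manipulation planning, 3D geometric understanding, and trajectory generation. All scoring keywords revolve around large language models (LLMs) and related techniques (training, alignment, inference, deployment, applications, and so on), whereas this paper does not involve language models, text processing, or natural language tasks at all, nor does it mention any of the technical concepts in the scoring keywords. Every keyword therefore receives a relevance of 0.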
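The 26.6 pass mark shown in each score line follows from the report's configuration (27 keywords with weight 1.0 each, pass mark = 5.0 + 0.8 × total weight). A minimal sketch of that arithmetic; the function name and parameter names are hypothetical, and how the per-keyword scores themselves are derived is not modeled here because the report does not state it.

```python
def passing_score(weights, base=5.0, factor=0.8):
    """Pass mark as given in the report header: base + factor * total weight."""
    return base + factor * sum(weights)

weights = [1.0] * 27              # 27 keywords, each configured with weight 1.0
threshold = passing_score(weights)
print(round(threshold, 1))        # 26.6
```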

!!! tip deepseek-chat TL;DR

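The paper proposes GMT, a goal-conditioned multimodal Transformer framework for synthesizing controllable 6-DOF object manipulation trajectories in 3D scenes. By fusing geometric, semantic, and contextual information, it outperforms existing baselines in spatial accuracy and orientation control.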

Abstract (Chinese translation)

在三维环境中合成可控的六自由度物体操控轨迹对于实现机器人与复杂场景的交互至关重要，但由于需要精确的空间推理、物理可行性以及多模态场景理解，这仍然是一项具有挑战性的任务。现有方法通常依赖于二维或部分三维表征，限制了其捕捉完整场景几何结构的能力，并制约了轨迹的精确性。我们提出了GMT，一种多模态变换器框架，通过联合利用三维边界框几何、点云上下文、语义物体类别以及目标末端姿态，生成逼真且目标导向的物体轨迹。该模型将轨迹表示为连续的六自由度姿态序列，并采用一种定制的条件调节策略，融合几何、语义、上下文及目标导向信息。在合成与真实世界基准测试上的大量实验表明，GMT在空间精度和姿态控制方面均取得显著提升，其性能超越了CHOIS和GIMO等当前最先进的人体运动及人-物交互基线方法。我们的方法为基于学习的操控规划设立了新基准，并展现出对多样化物体及杂乱三维环境的强大泛化能力。项目页面：https://huajian-zeng.github.io/projects/gmt/。

Abstract

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.

Keywords: 6-DOF object trajectory synthesis, multimodal transformer, 3D scenes, goal-conditioned, manipulation planning, point cloud, spatial reasoning, robot interaction

166. ❌ Versatile Editing of Video Content, Actions, and Dynamics without Training

Authors: Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

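Scoring rationale: The paper focuses on video editing and proposes DynaEdit, a training-free editing method that uses pretrained text-to-video flow models. It is unrelated to most keywords, since it does not involve large language models, reasoning methods, alignment techniques, agent systems, and so on. The only relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper uses a pretrained text-to-video model; this is a basic tool rather than the core contribution, so it receives 5 points (some relevance). No other keyword is touched on, so the rest receive 0.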

!!! tip deepseek-chat TL;DR

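The paper proposes DynaEdit, a training-free, general-purpose video editing method that addresses the difficulty existing methods have in editing actions, dynamic events, and object interactions in videos. By improving on the inversion-free approach, it achieves state-of-the-art results on complex text-driven video editing.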

Abstract (Chinese translation)

近年来，可控视频生成技术取得了显著进展。然而，对真实世界视频中的动作与动态事件进行编辑，或插入会影响其他对象行为的内容，仍然是一个重大挑战。现有的已训练模型在处理复杂编辑任务时表现不佳，这很可能源于相关训练数据收集的困难。同样，现有的免训练方法本质上局限于保持结构和运动的编辑，不支持对运动或交互关系的修改。本文提出DynaEdit，一种基于预训练文本到视频流模型的免训练编辑方法，能够实现多样化的视频编辑功能。我们的方法基于近期提出的免反转技术，该技术不干预模型内部结构，因此具有模型无关性。我们发现，若直接尝试将此方法应用于一般无约束编辑，会导致严重的低频错位与高频抖动问题。我们分析了这些现象的产生根源，并提出了克服它们的新机制。通过大量实验，我们证明DynaEdit在基于文本的复杂视频编辑任务上取得了最先进的结果，包括修改动作、插入与场景交互的对象，以及引入全局视觉效果。

Abstract

Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.

Keywords: video editing, training-free method, text-to-video models, action modification, object insertion, dynamic events, DynaEdit, inversion-free approach

167. ❌ Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

Authors: Shuyao Shi, Kang G. Shin Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17980v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

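Scoring rationale: The paper's core topic is the application of Multimodal Large Language Models (MLLMs) to 3D scene understanding, which is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points): it explicitly mentions "Multimodal Large Language Models (MLLMs)" and proposes the Motion-MLLM framework. It has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because 3D scene understanding can be viewed as an application of AI in science and engineering, although the paper does not explicitly mention bioinformatics or cheminformatics. The other keywords (MoE, SFT, RAG, and so on) do not appear in the title or abstract, and the paper focuses on a specific application rather than on these techniques, so they are scored 0.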

!!! tip deepseek-chat TL;DR

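The paper addresses the reliance of existing multimodal large language models on computationally expensive 3D representations, or their lack of physical grounding, in 3D scene understanding. It proposes Motion-MLLM, a framework that incorporates egomotion modality data and, through a keyframe filtering module and a cross-modal fusion module, significantly improves the accuracy and efficiency of 3D scene understanding and spatial reasoning.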

Abstract (Chinese translation)

近期，多模态大语言模型（Multimodal Large Language Models, MLLMs）在三维场景空间推理方面展现出巨大潜力。然而，这些模型通常依赖于计算成本高昂的三维表征（如点云或重建的鸟瞰图（Bird’s-Eye View, BEV）），或缺乏物理基础来消除尺度与尺寸的模糊性。本文通过引入与视频同步采集的惯性测量单元（Inertial Measurement Units, IMUs）自运动模态数据，显著增强了多模态大语言模型的性能。具体而言，我们提出了一种名为Motion-MLLM的新框架，该框架包含两个关键组件：（1）级联运动-视觉关键帧筛选模块，该模块利用IMU数据与视觉特征，高效选取一组稀疏且具代表性的关键帧；（2）非对称跨模态融合模块，其中运动标记作为中介，将自运动线索与跨帧视觉上下文融入视觉表征中。通过将视觉内容锚定于物理自运动轨迹，Motion-MLLM能够对场景中的绝对尺度与空间关系进行推理。我们的大量实验评估表明，Motion-MLLM在多项涉及三维场景理解与空间推理的任务中均取得显著提升。与基于视频帧和显式三维数据的当前最优（state-of-the-art, SOTA）方法相比，Motion-MLLM在显著降低开销（即成本效益分别提高1.40倍和1.63倍）的同时，实现了相当甚至更高的准确度。

Abstract

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird’s-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40$\times$ and 1.63$\times$ higher cost-effectiveness, respectively).

Keywords: Multimodal Large Language Models, 3D scene understanding, egomotion, spatial reasoning, motion-visual fusion, keyframe filtering, cross-modal fusion, cost-effectiveness

168. ❌ AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception

Authors: Jinho Park, Se Young Chun, Mingoo Seok Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17979v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

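Scoring rationale: The paper focuses on radar data compression and has no direct connection to deep learning or large-model techniques. The only relevant keyword is "Quantization OR Model Compression OR Low-bit Weights", because the paper uses quantization to compress radar data; since the target is radar data rather than a deep learning model, it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0.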

!!! tip deepseek-chat TL;DR

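The paper proposes AdaRadar, an adaptive radar data compression method that combines gradient-descent adjustment of the compression ratio, the discrete cosine transform, and quantization to achieve over 100x feature size reduction while maintaining detection performance.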

Abstract (Chinese translation)

雷达因其全天候特性及测量距离与多普勒速度的能力，成为自动驾驶系统中关键的感知模态。然而，高维原始雷达数据量巨大，易使至计算引擎（如NPU）的通信链路饱和，该链路通常是低带宽接口，其数据速率仅能支持少数低分辨率距离-多普勒帧。目前明显缺乏利用高维雷达数据的通用编解码器，而现有的图像域方法并不适用，因为它们通常以固定压缩比运行，无法适应多变或对抗性条件。有鉴于此，我们提出了一种带自适应反馈的雷达数据压缩方法。该方法通过从检测置信度相对于压缩率的代理梯度执行梯度下降，动态调整压缩比。我们采用零阶梯度近似，因为它即使在不支持微分的核心操作——剪枝与量化中也能实现梯度计算。这同时避免了在带宽受限的链路上传输梯度张量，若对其进行估计，其大小将与原始雷达数据相当。此外，我们发现雷达特征图高度集中于少数频率分量上。因此，我们对雷达数据立方体应用离散余弦变换，并有效选择性地剪除系数。我们通过缩放量化保持每个雷达数据块的动态范围。结合这些技术，我们提出的在线自适应压缩方案在性能下降极小（约1%p）的情况下实现了超过100倍的特征尺寸缩减。我们在RADIal、CARRADA和Radatron数据集上验证了结果。

Abstract

Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations–pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.
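The abstract describes applying a discrete cosine transform to radar data, pruning coefficients, and preserving each patch's dynamic range with scaled quantization. A minimal sketch of that pipeline on a 1-D signal, assuming an orthonormal DCT-II, top-k magnitude pruning, and symmetric 8-bit quantization with one scale per patch. The function names, the pruning rule, and the bit width are illustrative assumptions rather than AdaRadar's actual codec, and the rate-adaptation feedback loop is not covered.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def compress(patch, keep, bits=8):
    """Keep the `keep` (>= 1) largest-magnitude coefficients, then quantize with a per-patch scale."""
    coeffs = dct(patch)
    top = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    peak = max(abs(coeffs[k]) for k in top) or 1.0
    scale = peak / (2 ** (bits - 1) - 1)   # largest kept coefficient maps to 127 for 8 bits
    quantized = {k: round(coeffs[k] / scale) for k in top}
    return quantized, scale, len(coeffs)

def decompress(quantized, scale, n):
    """Rebuild the signal from the sparse integer coefficients."""
    coeffs = [0.0] * n
    for k, q in quantized.items():
        coeffs[k] = q * scale
    return idct(coeffs)

# Toy patch whose energy sits in two frequency components (indices 3 and 5).
patch = [math.cos(math.pi * (i + 0.5) * 3 / 16) + 0.5 * math.cos(math.pi * (i + 0.5) * 5 / 16)
         for i in range(16)]
quantized, scale, n = compress(patch, keep=2)
recon = decompress(quantized, scale, n)
max_error = max(abs(a - b) for a, b in zip(patch, recon))
```

Because the toy signal is concentrated in two components, storing 2 of 16 coefficients as small integers plus one scale reconstructs it with only quantization error, which is the property the abstract attributes to radar feature maps.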

Keywords: radar data compression, adaptive compression, quantization, discrete cosine transform, autonomous driving, gradient descent, feature size reduction, perception systems

169. ❌ AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

Authors: Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

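Scoring rationale: The paper belongs to computer vision and 3D reconstruction. It studies the reconstruction of animatable 3D Gaussian avatars from occluded monocular video, using diffusion models to generate supervision data. Its content is entirely unrelated to the scoring keywords (which all concern large language models, deep learning technical principles, AI for Science, and so on); it involves no large-model techniques, no innovation in deep learning principles, and no scientific AI application.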

!!! tip deepseek-chat TL;DR

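The paper proposes AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from heavily occluded monocular video. It generates dense supervision with identity-finetuned diffusion models and achieves state-of-the-art reconstruction quality.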

Abstract (Chinese translation)

我们提出了AHOY方法，用于从严重遮挡的野外单目视频中重建完整、可动画化的三维高斯化身。现有方法通常假设输入无遮挡——主体完全可见且常处于规范姿态，这排除了现实世界中绝大多数人物被家具、物体或他人常规遮挡的影像素材。从这类素材重建面临根本性挑战：身体的大片区域可能从未被观测到，且每个姿态缺乏多视角监督。我们通过四项创新应对这些挑战：（一）采用幻觉即监督流程，利用身份微调的扩散模型为先前未观测到的身体区域生成密集监督信号；（二）设计两阶段规范态至姿态依赖架构，从稀疏观测引导至完整的姿态依赖高斯映射；（三）通过映射姿态/线性混合蒙皮姿态解耦机制，吸收生成数据中的多视角不一致性；（四）实施头身分离监督策略以保持面部身份特征。我们在YouTube视频及存在显著遮挡的多视角采集数据上进行评估，展示了业界领先的重建质量。实验还表明，所生成的化身具备足够鲁棒性，可用于新姿态动画并合成到通过手机视频采集的三维高斯溅射（3DGS）场景中。项目页面详见 https://miraymen.github.io/ahoy/

Abstract

We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/

Keywords: 3D Gaussian avatars, monocular video, occlusion, diffusion models, hallucination-as-supervision, animatable reconstruction, Gaussian Splatting, video priors

170. ❌ Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization

Authors: Yoan David, Pierre-Marc Jodoin, Alzheimer’s Disease Neuroimaging Initiative, The TRACK-TBI Investigators Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

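Scoring rationale: The paper focuses on improving statistical methods for harmonizing medical imaging (diffusion MRI) data, in particular handling the effect of pathological outliers on ComBat. It involves neuroimaging, statistical modeling, and machine learning (a simple MLP), but has nothing to do with large language models (LLMs), innovation in deep learning principles, or any of the specific large-model techniques listed in the scoring keywords (MoE, Scaling Laws, RLHF, RAG, and so on). The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI in biomedicine (neuroimaging); the relevance is weak (5 points) because the emphasis is on statistical methods rather than core AI model innovation. All other keywords are unrelated to the paper and receive 0.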

!!! tip deepseek-chat TL;DR

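The paper addresses the distortions that ComBat produces in diffusion MRI data harmonization when patients with pathological abnormalities are included. It proposes Robust-ComBat, an MLP-based method that significantly reduces harmonization error on multi-site data in which up to 80% of subjects have neurological disorders.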

Abstract (Chinese translation)

诸如ComBat及其变体等标准化方法被广泛应用于减轻扩散磁共振成像(dMRI)的站点特异性偏差。然而，ComBat假设受试者分布呈现高斯形态。实际上，神经系统疾病患者常表现出明显偏离健康对照组的扩散指标，这些病理异常值会扭曲站点效应的估计。该问题在临床实践中尤为棘手，因为大多数接受脑部成像的患者都存在潜在且尚未确诊的病症，难以将其排除在标准化队列之外——因为这些扫描正是为明确诊断而进行的。本文证明，在使用ComBat将数据标准化至规范参考人群时，若纳入病理病例会导致显著失真。我们针对7种神经系统疾病，在多种场景下评估了10种异常值剔除方法与4种ComBat变体的组合，结果显示许多过滤策略在存在病理数据时均告失效。相比之下，简单的多层感知器(MLP)能提供稳健的异常值补偿，在保留疾病相关信号的同时实现可靠的标准化。在对照组和真实多站点队列（包含高达80%的神经系统疾病受试者）上的实验表明，鲁棒性ComBat(Robust-ComBat)在所有ComBat变体中均以更低的标准化误差持续优于传统统计基线方法。

Abstract

Harmonization methods such as ComBat and its variants are widely used to mitigate diffusion MRI (dMRI) site-specific biases. However, ComBat assumes that subject distributions exhibit a Gaussian profile. In practice, patients with neurological disorders often present diffusion metrics that deviate markedly from those of healthy controls, introducing pathological outliers that distort site-effect estimation. This problem is particularly challenging in clinical practice as most patients undergoing brain imaging have an underlying and yet undiagnosed condition, making it difficult to exclude them from harmonization cohorts, as their scans were precisely prescribed to establish a diagnosis. In this paper, we show that harmonizing data to a normative reference population with ComBat while including pathological cases induces significant distortions. Across 7 neurological conditions, we evaluated 10 outlier rejection methods with 4 ComBat variants over a wide range of scenarios, revealing that many filtering strategies fail in the presence of pathology. In contrast, a simple MLP provides robust outlier compensation enabling reliable harmonization while preserving disease-related signal. Experiments on both control and real multi-site cohorts, comprising up to 80% of subjects with neurological disorders, demonstrate that Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants.
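The abstract's central claim, that pathological outliers distort site-effect estimation, can be illustrated with a heavily simplified location-scale adjustment. This sketch is not ComBat itself (it has no covariates and no empirical Bayes shrinkage) and it does not implement the paper's MLP-based compensation; the function names and all numeric values are hypothetical.

```python
import statistics

def site_effects(values):
    """Estimate a site's additive (location) and multiplicative (scale) effect."""
    return statistics.mean(values), statistics.stdev(values)

def harmonize(values, ref_mean, ref_std):
    """Map one site's measurements onto the reference distribution."""
    loc, scale = site_effects(values)
    return [ref_mean + ref_std * (v - loc) / scale for v in values]

# Hypothetical diffusion-metric values; the scanner site adds a +0.05 offset.
reference = [0.40, 0.42, 0.44, 0.46, 0.48]
site_controls = [v + 0.05 for v in reference]
site_with_patients = site_controls + [0.20, 0.22, 0.25]   # pathological outliers

ref_mean, ref_std = site_effects(reference)
clean_offset = site_effects(site_controls)[0] - ref_mean        # recovers the true +0.05
biased_offset = site_effects(site_with_patients)[0] - ref_mean  # pulled negative by the patients
harmonized_controls = harmonize(site_controls, ref_mean, ref_std)
```

With controls only, the estimated offset matches the injected site effect and harmonization recovers the reference values. Once the three patient values are pooled in, the estimated offset changes sign, so the "correction" would push healthy controls in the wrong direction.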

Keywords: diffusion MRI, data harmonization, ComBat, outlier mitigation, neurological disorders, multi-site cohorts, MLP, pathological cases

171. ❌ TransText: Transparency Aware Image-to-Video Typography Animation

Authors: Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

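Scoring rationale: TransText focuses on transparency-aware text animation with image-to-video models, which belongs to computer vision and generative modeling. It is entirely unrelated to the scoring keywords (mainly about large language models, deep learning technical principles, AI for Science applications, and so on), so every keyword receives a relevance of 0.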

!!! tip deepseek-chat TL;DR

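The paper proposes the TransText framework, which uses an Alpha-as-RGB paradigm to address transparency-aware text animation generation in image-to-video models, producing high-quality, coherent transparent animations.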

Abstract (Chinese translation)

据我们所知，我们首次提出了将图像到视频模型适配于图层感知文本（字形）动画的方法，这一能力对实际动态视觉设计至关重要。现有方法主要将透明度编码（Alpha通道）作为RGB空间附加的额外潜在维度处理，这需要重建以RGB为核心的变分自编码器（VAE）。然而，鉴于高质量透明字形数据的稀缺性，重新训练VAE计算成本高昂，且可能削弱从海量RGB语料库中学习到的鲁棒语义先验，导致潜在模式混合。为缓解这些局限，我们提出了TransText框架，该框架基于一种新颖的“Alpha-as-RGB”范式，在不修改预训练生成流形的情况下联合建模外观与透明度。TransText通过潜在空间拼接将Alpha通道嵌入为与RGB兼容的视觉信号，在防止特征纠缠的同时，显式确保了严格的跨模态（RGB与Alpha）一致性。实验表明，TransText显著优于基线方法，能够生成具有多样、细粒度效果的连贯且高保真度的透明动画。

Abstract

We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
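The Alpha-as-RGB idea, treating the alpha channel as an RGB-compatible visual signal and joining it to the appearance spatially, can be pictured with a pixel-level toy. The abstract places the concatenation in latent space and does not specify the layout, so the side-by-side arrangement, the gray rendering of alpha, and the function names below are all assumptions made for illustration only.

```python
def alpha_as_rgb(rgba_frame):
    """Place an RGB image and a gray 3-channel rendering of its alpha side by side."""
    combined = []
    for row in rgba_frame:
        rgb_part = [(r, g, b) for r, g, b, _ in row]
        alpha_part = [(a, a, a) for _, _, _, a in row]
        combined.append(rgb_part + alpha_part)   # concatenate along the width axis
    return combined

def split_back(combined):
    """Recover the RGBA frame from the side-by-side RGB layout."""
    rgba = []
    for row in combined:
        half = len(row) // 2
        rgba.append([(r, g, b, alpha[0])
                     for (r, g, b), alpha in zip(row[:half], row[half:])])
    return rgba

# A 2x2 toy frame: opaque red glyph pixels on a fully transparent background.
frame = [
    [(255, 0, 0, 255), (0, 0, 0, 0)],
    [(0, 0, 0, 0), (255, 0, 0, 255)],
]
combined = alpha_as_rgb(frame)
restored = split_back(combined)
```

The point of the layout is that every element of `combined` is an ordinary RGB pixel, so a model trained only on RGB data could consume or produce it without a change to its channel count.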

Keywords: Transparency-aware animation, Image-to-video models, Alpha channel, Generative manifold, Latent spatial concatenation, Transparent glyph animation, VAE reconstruction, Cross-modal consistency

172. ❌ LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Authors: Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

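Scoring rationale: LaDe proposes a latent diffusion framework for layered media design in which an LLM serves as a prompt expander that generates structured descriptions; this is one of the paper's core contributions. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). The paper is mainly about computer vision and generative AI, centered on diffusion models and layered generation, and has no direct connection to the other keywords in the list (MoE, SLMs, Scaling Laws, the various training and tuning techniques, inference acceleration, AI for Science, and so on), so those keywords receive 0.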

!!! tip deepseek-chat TL;DR

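The paper proposes the LaDe framework, which combines an LLM prompt expander, a latent diffusion transformer, and an RGBA VAE to address the fixed layer counts or restricted layer content of existing methods for generating editable layered media designs, and achieves better text-to-layer alignment than the baseline model.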

Abstract (Chinese translation)

媒体设计图层生成技术使得仅通过自然语言提示即可创建完全可编辑的分层设计文档，如海报、传单和标识。现有方法要么将输出限制在固定数量的图层，要么要求每个图层仅包含空间连续区域，导致图层数量随设计复杂度线性增长。我们提出LaDe（分层媒体设计），一种潜在扩散框架，能够生成数量灵活且具有语义意义的图层。LaDe结合了三个组件：基于大语言模型（LLM）的提示扩展器，将简短的用户意图转化为结构化的逐层描述以指导生成；采用4D RoPE位置编码机制的潜在扩散变换器，联合生成完整的媒体设计及其组成的RGBA图层；以及支持完整Alpha通道的RGBA变分自编码器（VAE），用于解码每个图层。通过在训练中对图层样本进行条件化，我们的统一框架支持三项任务：文本到图像生成、文本到图层的媒体设计生成以及媒体设计分解。我们在Crello测试集上，针对文本到图层和图像到图层任务，将LaDe与Qwen-Image-Layered进行了比较。经两个VLM-as-a-judge评估器（GPT-4o mini和Qwen3-VL）验证，LaDe在文本到图层生成中通过提升文本与图层的对齐度，性能优于Qwen-Image-Layered。

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).

Keywords: Layered Media Design, Latent Diffusion, LLM-based Prompt Expander, RGBA Layers, Text-to-Layers Generation, Media Design Decomposition, Unified Framework

173. ❌ Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Authors: Jingchun Yang, Jinchang Zhang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17930v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

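Scoring rationale: The paper's core is an LLM-based legal multi-agent framework for determining responsibility in traffic accidents. It is highly relevant (10 points) to 'Large Language Models', 'LLM Agents', 'Multi-agent Systems', 'Chain of Thought', 'System 2 Thinking', and 'Mechanistic Interpretability', because these are core components of the paper's method. It is somewhat related to 'AI for Science' (5 points): the paper applies AI to the legal domain, which is not a typical scientific field such as bioinformatics. The remaining keywords (MoE, quantization, RAG, and so on) are not covered in the paper and score 0.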
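The arithmetic behind a score line such as "0.0 / 26.6" can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section. The rule that a keyword's score is its weight multiplied by its relevance is an assumption that happens to be consistent with the tables (weight 0.0 always yields score 0.0); the report does not state how relevance maps to points. The function names are hypothetical.

```python
# Hypothetical sketch of the report's scoring arithmetic; names are illustrative.
BASE = 5.0        # constant term of the passing-score formula (from the report config)
PER_WEIGHT = 0.8  # multiplier applied to the total keyword weight (from the report config)


def passing_score(total_weight: float) -> float:
    """Passing threshold: 5.0 + 0.8 * total keyword weight, rounded to one decimal."""
    return round(BASE + PER_WEIGHT * total_weight, 1)


def paper_score(rows) -> float:
    """Sum per-keyword scores.

    rows: iterable of (weight, relevance) pairs.
    ASSUMPTION: per-keyword score = weight * relevance; the report does not
    document the exact rule.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Three rows shaped like the table above: weight 0.0 with various relevance values.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 0.0)]
total = paper_score(rows)
threshold = passing_score(27.0)
print(f"score {total} / {threshold}, pass = {total >= threshold}")
```

Under this assumption, a paper with high relevance values still scores 0.0 whenever every weight in its table is 0.0, which matches the rows listed for this entry.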

!!! tip deepseek-chat TL;DR

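The paper proposes a method built on a legal multi-agent framework that uses large language models to determine traffic accident responsibility from dashcam video automatically and to generate interpretable legal judgment reports; it outperforms existing methods on the C-TRAIL and MM-AU datasets.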

Abstract

The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming “what happened in the video” into “who is responsible under which legal provisions” still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.

Keywords: traffic accident responsibility, dashcam video, legal multi-agent framework, LLM-based legal reasoning, interpretable legal judgment, multimodal legal dataset, traffic regulation system, video understanding

174. ❌ A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans

Authors: Javier Venema, Stefano De Luca, Pablo Mesejo, Óscar Ibáñez Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

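Scoring rationale: The paper focuses on a deep learning application to medical imaging (clavicle CT scans) for legal age estimation, an application of AI to science (forensic medicine). It does not involve large language model (LLM) techniques, so all LLM-related keywords (LLMs, MoE, SFT, RLHF, RAG, CoT, Agents, and so on) are irrelevant and score 0. The main relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to biomedicine and forensic medicine; it scores 8. 'Mechanistic Interpretability OR Explainable AI' is partly relevant because the paper mentions interpretability (attribution maps), but this is not central; it scores 5.

!!! tip deepseek-chat TL;DR

The study proposes an interpretable multi-stage deep learning framework for legal age estimation from clavicle CT scans; on a public forensic dataset it achieves a mean absolute error of 1.55 years, outperforming human experts and previous methods.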

Abstract

Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 ± 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years on the same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.

Keywords: legal age estimation, clavicle CT scans, forensic AI, multi-stage pipeline, conformal prediction, interpretable AI, medical imaging, deep learning
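The abstract mentions conformal prediction intervals with configurable coverage. As a rough illustration of the idea, here is a minimal sketch of split conformal prediction for regression using absolute residuals on a calibration set. The abstract does not say which conformal variant the authors use, so this choice, the function names, and the toy ages are all assumptions.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample quantile of absolute calibration residuals.

    For n calibration points and miscoverage level alpha, split conformal
    prediction takes the ceil((n + 1) * (1 - alpha))-th smallest residual.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic
    if k > n:
        return float("inf")  # too few calibration points for this coverage
    return sorted(residuals)[k - 1]


def predict_interval(point_estimate, q):
    """Symmetric interval around a point estimate with half-width q."""
    return (point_estimate - q, point_estimate + q)


# Toy calibration data (hypothetical ages in years): true vs. predicted.
cal_true = [18.0, 21.5, 19.2, 24.0, 17.1, 22.3, 20.0, 23.4, 18.9]
cal_pred = [19.0, 20.5, 19.0, 22.0, 18.1, 22.8, 21.5, 23.0, 18.4]
residuals = [abs(t - p) for t, p in zip(cal_true, cal_pred)]

q = conformal_quantile(residuals, alpha=0.2)   # target coverage 80%
low, high = predict_interval(20.0, q)          # interval for a new estimate of 20.0
print(f"half-width {q:.2f} years, interval [{low:.2f}, {high:.2f}]")
```

Lowering `alpha` widens the interval, which is how a coverage level could be tuned to a forensic requirement in a scheme like this.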

175. ❌ SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Authors: Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

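Scoring rationale: The paper focuses on UAV semantic segmentation in computer vision. It proposes a geometry-driven 2D-3D-2D paradigm that automatically generates pseudo-labels for the RGB and thermal modalities and builds the large-scale SegFly dataset. The keywords all concern natural language processing, the technical principles of large models, or general AI methods, whereas this is purely computer vision research. It has only a weak connection to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics' (as AI applied to science and engineering), and this is not the paper's core content, so that keyword scores 5; the other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high annotation cost and difficult cross-modal alignment in UAV RGB-thermal semantic segmentation. It proposes a geometry-driven 2D-3D-2D paradigm that automatically generates large-scale pseudo-labels, builds the SegFly dataset, and substantially improves segmentation model performance.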

Abstract

Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.

Keywords: semantic segmentation, aerial imagery, RGB-thermal, 2D-3D-2D paradigm, pseudo ground-truth, cross-modal registration, large-scale dataset, geometry-driven
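The "3D to 2D" half of the paradigm described in the abstract, reprojecting a labeled point cloud into a camera view, can be illustrated with a plain pinhole projection. This is only a sketch under stated assumptions: the pinhole model, the nearest-point-per-pixel rule for occlusion, and the function name `project_labels` are illustrative choices, not the SegFly pipeline itself.

```python
def project_labels(points, labels, K, R, t, width, height):
    """Project labeled 3D points into one camera view (pinhole model).

    points: list of (X, Y, Z) world coordinates
    labels: list of integer class ids, one per point
    K: (fx, fy, cx, cy) camera intrinsics
    R: 3x3 rotation as a list of rows; t: translation (tx, ty, tz)
    Returns {(u, v): label}, keeping the label of the nearest point per pixel.
    """
    fx, fy, cx, cy = K
    nearest_depth = {}
    pixel_label = {}
    for (X, Y, Z), label in zip(points, labels):
        # World coordinates -> camera coordinates.
        xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
        yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
        zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
        if zc <= 0:
            continue  # point is behind the camera
        # Camera coordinates -> pixel coordinates.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            # Simple z-buffer: the closest point wins the pixel.
            if (u, v) not in nearest_depth or zc < nearest_depth[(u, v)]:
                nearest_depth[(u, v)] = zc
                pixel_label[(u, v)] = label
    return pixel_label


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.2, 0.0, 2.0), (0.0, 0.0, -1.0)]
labels = [1, 2, 3, 4]
sparse = project_labels(points, labels, (100.0, 100.0, 50.0, 50.0),
                        identity, (0.0, 0.0, 0.0), width=100, height=100)
print(sparse)
```

The result is a sparse label map; producing the dense pseudo ground truth that the abstract describes would need additional steps that the abstract does not detail.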

176. ❌ Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference

Authors: Shima Yousefi, Saptarshi Debroy Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17914v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

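Scoring rationale: The paper studies adversarial attack detection for collaborative DNN inference in edge computing, using a variational autoencoder for anomaly detection. The keywords all focus on large language models, innovations in deep learning techniques, or AI applications in science, whereas this paper studies a security defense mechanism for conventional DNNs (not large models) in edge computing, so it has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper targets the vulnerability of collaborative DNN inference in edge computing to malicious data injection attacks. It proposes a noise-aware anomaly detection framework based on a variational autoencoder that achieves up to 90% AUROC under realistic noise conditions.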

Abstract

Collaborative inference of object classification deep neural networks (DNNs), where resource-constrained end-devices offload partially processed data to remote edge servers to complete end-to-end processing, is becoming a key enabler of edge-AI. However, such edge-offloading is vulnerable to malicious data injections leading to stealthy misclassifications that are tricky to detect, especially in the presence of environmental noise. In this paper, we propose a semi-gray-box and noise-aware anomaly detection framework fueled by a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation. The proposed framework incorporates a robust noise-aware feature that captures the characteristic behavior of environmental noise to improve detection accuracy while reducing false alarm rates. Our evaluation with popular object classification DNNs demonstrates the robustness of the proposed detection (up to 90% AUROC across DNN configurations) under realistic noisy conditions while revealing limitations caused by feature similarity and elevated noise levels.

Keywords: Collaborative DNN Inference, Edge-AI, Adversarial Attack Detection, Variational Autoencoder, Noise-aware Anomaly Detection, Misclassification Attack, Edge-offloading Security, Object Classification DNNs
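The headline metric in this entry is AUROC. As a reminder of what that number measures, the sketch below computes it directly from anomaly scores as the probability that an attack sample receives a higher score than a benign one, with ties counted as half. The scores are made up; treating them as something like a VAE reconstruction error is an assumption, since the abstract does not define the detector's score.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    scores: anomaly scores, higher means more suspicious
    labels: 1 = attack (anomalous), 0 = benign
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # attack correctly ranked above benign
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical anomaly scores for two attack and three benign samples.
scores = [0.9, 0.8, 0.3, 0.2, 0.8]
labels = [1, 1, 0, 0, 0]
print(f"AUROC = {auroc(scores, labels):.3f}")
```

The pairwise form is quadratic in the number of samples and is meant for reading, not for large evaluations; rank-based implementations give the same value faster.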

177. ❌ SpiderCam: Low-Power Snapshot Depth from Differential Defocus

Authors: Marcos A. Ferreira, Tianao Li, John Mamish, Josiah Hester, Yaman Sangar, Qi Guo, Emma Alexander Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17910v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

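Scoring rationale: The paper "SpiderCam: Low-Power Snapshot Depth from Differential Defocus" focuses on computer vision and hardware system design, specifically an FPGA-based low-power real-time depth camera system. It covers depth from differential defocus, FPGA hardware implementation, low-power design, and real-time processing, but does not touch the keyword areas of large language models, deep learning techniques, or AI for Science. None of the keywords is relevant, so every keyword has a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to design SpiderCam, an FPGA-based low-power real-time depth camera system that uses differential defocus to produce 480x400 sparse depth maps at 32.5 FPS over a 52 cm working range with a total power draw of only 624 mW, making it the first sub-Watt passive FPGA-based 3D camera reported in the literature.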

Abstract

We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.

Keywords: depth-from-defocus, FPGA, low-power, real-time, 3D camera, differential defocus, hardware acceleration, sparse depth maps

178. ❌ A Creative Agent is Worth a 64-Token Template

Authors: Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17895v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

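Scoring rationale: The paper studies a creative agent framework for text-to-image generation. Its core contribution is a creative tokenizer that encapsulates an agent's understanding of creativity as a reusable token template, improving generation efficiency and lowering cost. This is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow', because the paper explicitly proposes a creative agent framework and involves agent-driven generation. However, the paper focuses on text-to-image generation rather than innovations in large language models or deep learning techniques, and it does not cover the other keywords, such as large models, MoE, scaling laws, training methods, inference optimization, or scientific AI applications. Apart from the agent keyword, no keyword is directly related.

!!! tip deepseek-chat TL;DR

The paper addresses the limited creativity that fuzzy prompts cause in text-to-image generation. It proposes a creative agent tokenization framework that trains a creative tokenizer to generate reusable token templates that inject creative semantics, achieving a 3.7x speedup and a 4.8x reduction in computational cost while improving image quality and text alignment.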

Abstract

Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as “a creative vinyl record-inspired skyscraper”, these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes “creativity” costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents’ intrinsic understanding of “creativity” through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent’s latent creative representations. Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.

Keywords: Creative Agent, Text-to-Image Generation, Tokenization, Creative Tokenizer, Prompt Augmentation, Computational Efficiency, Semantic Disentanglement, Reusable Template

179. ❌ Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Authors: Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17889v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

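Scoring rationale: The paper studies an identity-aware joint audio-video generation framework, focusing on multimodal generation, identity personalization, data annotation, and training strategy; it belongs to computer vision and audio generation. The scoring keywords all relate directly to large-model technical principles, training methods, inference optimization, alignment techniques, agent systems, scientific AI applications, and the like, whereas this paper does not involve any innovation in large models or deep learning techniques and is not applied to a scientific field, so every keyword has a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a unified identity-aware joint audio-video generation framework that achieves fine-grained control over facial appearance and voice timbre for multiple identities through a data annotation pipeline and an identity injection mechanism, and uses a multi-stage training strategy to ensure cross-modal consistency.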

Abstract

Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: Identity-as-Presence (https://chen-yingjie.github.io/projects/Identity-as-Presence).

关键词: identity-aware generation, audio-video generation, personalization, multi-modal coherence, data curation, identity injection, cross-modal, facial appearance and vocal timbre

180. ❌ Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?

Authors: Guandong Li, Zhaobin Chu Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17876v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	2.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	6.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

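Scoring rationale: The paper studies edit spillover in image editing models and uses it as a probe of how well the models understand world knowledge. Most keywords do not apply, since they target the technical principles, training methods, and inference optimization of large language models (LLMs). Two keywords are meaningfully related: (1) "Mechanistic Interpretability OR Explainable AI" (8 points): the paper analyzes spillover behavior to understand the model's internal representation of world knowledge, which falls under explainable AI; (2) "World Models AND General World Models" (6 points): the paper asks whether image editing models implicitly understand world relations, which touches on world models, although general world models are not its core topic. A third keyword, "Instruction Tuning OR Alignment OR Value Alignment" (2 points), is only weakly related because the paper mentions instruction-following image editing models.

!!! tip deepseek-chat TL;DR

The paper studies edit spillover in image editing models, finds that spillover reflects genuine understanding of world relations rather than attention leakage, and proposes the EditSpilloverProbe framework to systematically evaluate the world knowledge of different models.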

Abstract

Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon – edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question – does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models’ world understanding capability – nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.

Keywords: image editing models, edit spillover, world understanding, semantic spillover, attention leakage, EditSpilloverProbe, benchmark dataset, model evaluation
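The figures in findings (1) and (3) can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the reported ratio and illustrates why a constant semantic share under exponential decay means semantic spillover does not fade faster than spillover overall. The decay length, base density, and the 0.5 share are illustrative assumptions, not values from the paper.

```python
import math

# Spillover rates reported in the abstract (percent).
min_rate, max_rate = 3.49, 11.46
ratio = max_rate / min_rate  # about 3.28, which the abstract rounds to 3.3x


def spillover_density(distance, base_density=1.0, decay_length=50.0):
    """Hypothetical exponential decay of spillover area density with distance."""
    return base_density * math.exp(-distance / decay_length)


def semantic_density(distance, semantic_share=0.5):
    """With a constant semantic share, semantic spillover decays at the same rate
    as total spillover instead of vanishing faster with distance."""
    return semantic_share * spillover_density(distance)


for d in (0.0, 50.0, 100.0):
    share = semantic_density(d) / spillover_density(d)
    print(f"distance={d:5.1f}  total={spillover_density(d):.3f}  semantic share={share:.2f}")
```

The assumed share of 0.5 sits inside the 40%-58% band reported in the abstract; any constant in that band gives the same qualitative picture.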

181. ❌ Revisiting foundation models for cell instance segmentation

Authors: Anwai Archit, Constantin Pape Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17845v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

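Scoring rationale: The paper focuses on foundation models for cell segmentation in computer vision (such as the SAM family), an AI for Science (bioinformatics) application, so it is highly relevant to "Foundation Models" and "AI for Science" (10 points each). It also covers model adaptation strategies, which relates to "Domain Adaptation" (8 points). The remaining keywords target language models, reasoning, alignment, optimization, and similar techniques, which are unrelated to this visual segmentation task (0 points).

!!! tip deepseek-chat TL;DR

The paper evaluates several foundation models for cell segmentation (such as the SAM family) on microscopy images and proposes an automatic prompt generation (APG) strategy to improve them, providing an effective model adaptation approach for microscopy image analysis.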

Abstract

Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced; virtually all of them are extensions of the Segment Anything Model (SAM), improving it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for μSAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at https://github.com/computational-cell-analytics/micro-sam.

Keywords: cell segmentation, foundation models, microscopy image analysis, Segment Anything Model (SAM), automatic prompt generation (APG), instance segmentation, model adaptation, bioinformatics

182. ❌ VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection

Authors: Byron Dowling, Eleanor Frederick, Jacob Piland, Adam Czajka Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17859v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

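Scoring rationale: The paper studies iris presentation attack detection (PAD) guided by human perceptual priors. It compares saliency sources (hand annotations, eye-gaze heatmaps, segmentation masks, and DINOv2 embeddings) on open-set iris PAD. Its core is a deep learning application in computer vision and biometric security. Although it involves deep learning training, it does not touch large language models (LLMs), large-model technical principles, AI alignment, inference optimization, agent systems, or any other large-model technique covered by the keywords. The keywords all concern large-model techniques or AI for Science (for example bioinformatics), whereas this paper addresses a specific computer vision task, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper compares different saliency sources (hand annotations, eye-tracking heatmaps, segmentation masks, DINOv2 embeddings) for open-set iris presentation attack detection and finds that denoised eye-tracking heatmaps perform best in terms of area under the ROC curve (AUROC) and attack presentation classification error rate (APCER).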

Abstract

Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains underexplored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in terms of Area Under the ROC curve (AUROC) and Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.

Keywords: iris presentation attack detection, open-set PAD, human saliency, eye tracking heatmaps, DINOv2 embeddings, deep learning, generalization improvement, AUROC
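The abstract reports APCER at a BPCER of 1%. As a rough illustration of what that operating point means, the sketch below picks the lowest decision threshold that keeps the bona fide error rate at or below the target and then reports the share of attacks that slip through. The score convention (higher means more attack-like), the function name, and the toy scores are assumptions for illustration; the paper's evaluation code may differ.

```python
def apcer_at_bpcer(bona_fide_scores, attack_scores, target_bpcer=0.01):
    """Return (threshold, bpcer, apcer) at the lowest threshold meeting the target.

    Assumed convention: a sample is flagged as an attack when score > threshold.
    BPCER = share of bona fide samples flagged as attacks.
    APCER = share of attack samples accepted as bona fide.
    """
    n_bona_fide = len(bona_fide_scores)
    n_attack = len(attack_scores)
    # Candidate thresholds are the observed bona fide scores, tried in ascending order.
    for threshold in sorted(set(bona_fide_scores)):
        bpcer = sum(s > threshold for s in bona_fide_scores) / n_bona_fide
        if bpcer <= target_bpcer:
            apcer = sum(s <= threshold for s in attack_scores) / n_attack
            return threshold, bpcer, apcer
    # Unreachable for non-empty input: at the maximum bona fide score, BPCER is 0.
    raise ValueError("no threshold satisfies the target BPCER")


# Toy data: 100 bona fide scores spread over [0, 0.99] and four attack scores.
bona_fide = [i / 100 for i in range(100)]
attacks = [0.5, 0.985, 0.995, 1.5]
threshold, bpcer, apcer = apcer_at_bpcer(bona_fide, attacks)
print(threshold, bpcer, apcer)
```

Scanning thresholds in ascending order matters: the first threshold that satisfies the BPCER target is also the one that lets the fewest attacks through.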

183. ❌ Video Understanding: From Geometry and Semantics to Unified Models

Authors: Zhaochong An, Zirui Li, Mingqiao Ye, Feng Qiao, Jiaang Li, Zongwei Wu, Vishal Thengane, Chengzu Li, Lei Li, Luc Van Gool, Guolei Sun, Serge Belongie Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

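Scoring rationale: This is a survey of video understanding, centered on video geometry understanding, semantic understanding, and unified models in computer vision. Although it involves AI techniques, the keywords all target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), and the paper does not address language models or text processing at all, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This survey reviews progress in video understanding, organizes the literature into three perspectives (geometry understanding, semantic understanding, and unified models), and summarizes the trend from task-specific models toward unified video foundation models, along with key design principles and open challenges.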

Abstract

Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.

Keywords: video understanding, computer vision, spatiotemporal reasoning, video geometry, semantic understanding, unified models, video foundation models, survey

184. ❌ Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Authors: Chen Liyi, Wang Pengfei, Zhang Guowen, Ma Zhiyuan, Zhang Lei Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

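Scoring rationale: The paper focuses on 3D editing and proposes Omni-3DEdit, a unified learning-based model. It builds a data pipeline, adopts the pre-trained generative model SEVA as its backbone, and introduces a dual-stream LoRA module to strengthen representation learning. Its core technical contribution is a general design for 3D editing together with an efficiency gain, which has no direct connection to most keywords (LLMs, MoE, Scaling Laws, RLHF, and so on). The only highly relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because the paper explicitly proposes a "dual-stream LoRA module" as a core innovation. "Pre-training OR Continual Pre-training OR Domain Adaptation" receives 5 points because the paper uses the pre-trained generative model SEVA as its backbone, although this is not the main innovation. No other keyword is involved, so the rest receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Omni-3DEdit, a unified learning-based model that combines a data pipeline, the pre-trained generative model SEVA, and a dual-stream LoRA module to generalize implicitly across 3D editing tasks, reducing inference time from tens of minutes to about two minutes.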

Abstract

Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model’s representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

Keywords: 3D editing, instruction-driven, unified model, LoRA, pre-trained generative model, inference acceleration, multi-view synthesis, one-pass editing
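For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update (frozen weight plus a scaled product of two small matrices) in plain Python. The `DualStreamLoRA` wrapper is only one possible reading of "dual-stream": it keeps a separate adapter per stream over a shared frozen weight. The class names, the stream labels `"source"` and `"target"`, and the routing rule are assumptions, since the abstract doesn't describe the module's internals.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter is a no-op at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def __call__(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


class DualStreamLoRA:
    """Hypothetical dual-stream variant: one adapter per stream, shared frozen weight."""

    def __init__(self, weight, rank):
        self.streams = {
            "source": LoRALinear(weight, rank, seed=0),
            "target": LoRALinear(weight, rank, seed=1),
        }

    def __call__(self, x, stream):
        return self.streams[stream](x)


identity = [[1.0, 0.0], [0.0, 1.0]]
layer = DualStreamLoRA(identity, rank=1)
print(layer([1.0, 2.0], "source"))  # equals the frozen layer's output while B is zero
```

Because `B` starts at zero, both streams reproduce the frozen backbone until training moves the adapters apart, which is the usual reason LoRA is initialized this way.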

185. ❌ TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Authors: Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, Liqiang Nie Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

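Scoring rationale: The paper studies attacks on concept erasure in text-to-image diffusion models, which belongs to computer vision and generative model security and has no direct connection to the keyword list (which centers on large language models and their related techniques, applications, and optimization methods). The keywords all concern LLM techniques, training methods, inference optimization, alignment, and application scenarios, whereas the paper focuses on a visual attack method for diffusion models, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Targeting concept erasure defenses for text-to-image diffusion models, the paper proposes TINA, a visual inversion attack that needs no text guidance. TINA regenerates erased concepts from models treated with state-of-the-art unlearning, showing that current defenses only obscure concepts without removing the internal visual knowledge.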

Abstract

Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.

Keywords: text-to-image diffusion models, concept erasure, inversion attack, DDIM inversion, visual knowledge, unlearning, adversarial probe, TINA
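The attack builds on DDIM inversion under a null-text condition. The sketch below shows the standard deterministic DDIM inversion update, which walks a clean latent toward noise by reusing the noise prediction at the current step. The noise predictor `eps_model(x, abar)` is a hypothetical stand-in for a diffusion network called with an empty prompt, and `abars` is an assumed cumulative noise schedule. TINA's optimization procedure is not included, because the abstract doesn't specify it.

```python
import math


def ddim_invert_step(x_t, eps, abar_t, abar_next):
    """One deterministic DDIM inversion step from noise level t to the noisier t+1."""
    # Predict the clean sample implied by the current latent and noise estimate.
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps)]
    # Re-noise that prediction to the next level with the same noise estimate.
    return [math.sqrt(abar_next) * x0 + math.sqrt(1.0 - abar_next) * e
            for x0, e in zip(x0_pred, eps)]


def ddim_invert(x0, eps_model, abars):
    """Invert a clean latent along a schedule of decreasing cumulative alphas.

    eps_model receives no text condition, mirroring a null-text setting.
    """
    x = list(x0)
    for abar_t, abar_next in zip(abars[:-1], abars[1:]):
        eps = eps_model(x, abar_t)
        x = ddim_invert_step(x, eps, abar_t, abar_next)
    return x


def constant_eps_model(x, abar):
    """Toy predictor that always returns the same noise, for illustration only."""
    return [0.5, -0.5]


latent = ddim_invert([1.0, 2.0], constant_eps_model, abars=[1.0, 0.9, 0.5])
print(latent)
```

Each step evaluates the noise predictor at the current latent rather than at the next, noisier one. That approximation is the usual source of accumulated inversion error, which is the kind of error the abstract says TINA's optimization addresses.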

186. ❌ Steering Video Diffusion Transformers with Massive Activations

Authors: Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17825v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

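Scoring rationale: The paper studies massive activations in video diffusion transformers and a method for steering them, which belongs to computer vision and video generation. The keywords all target large language models (LLMs) and related techniques, whereas the paper focuses on video diffusion models and does not involve LLM techniques, training methods, inference optimization, alignment techniques, or scientific AI applications. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the spatiotemporal pattern of massive activations in video diffusion transformers and proposes STAS, a training-free self-guidance-like method that improves video generation quality and temporal coherence.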

Abstract

Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.

Keywords: Video Diffusion Transformers, Massive Activations, Temporal Coherence, Self-guidance, Training-free Method, Video Generation, Hidden State Spikes, Structured Activation Steering
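The steering rule in the abstract (move massive-activation values at first-frame and boundary tokens toward a scaled global maximum) can be read literally as an interpolation. The sketch below is that literal reading on a flat list of magnitudes; the function name, the `scale` and `strength` parameters, and the choice of target positions are all hypothetical, and the actual STAS formulation may differ.

```python
def steer_massive_activations(magnitudes, target_positions, scale=0.9, strength=1.0):
    """Move selected activation magnitudes toward scale * (global maximum).

    magnitudes: per-token magnitudes of a massive-activation channel (assumed layout).
    target_positions: token indices to steer, e.g. first-frame and boundary tokens.
    strength: 0 leaves values unchanged, 1 sets them to the reference magnitude.
    """
    reference = scale * max(magnitudes)
    steered = list(magnitudes)
    for index in target_positions:
        steered[index] = magnitudes[index] + strength * (reference - magnitudes[index])
    return steered


# Toy example: token 0 plays the first frame, token 3 a chunk boundary.
tokens = [10.0, 4.0, 2.0, 6.0]
print(steer_massive_activations(tokens, [0, 3], scale=0.5, strength=1.0))
print(steer_massive_activations(tokens, [0, 3], scale=0.5, strength=0.5))
```

Nothing here requires gradients or extra forward passes, which is consistent with the abstract's description of the method as training-free with negligible overhead.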

187. ❌ ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

Authors: Romil Imtiaz, Dimitris K. Iakovidis Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

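Scoring rationale: The paper focuses on gastrointestinal video analysis with ResNet-50, covering multi-label classification, class-imbalance handling (through class weighting), and anatomy-guided temporal event decoding. It is unrelated to most keywords (LLMs, MoE, SFT, RAG, CoT, and so on), which concern large language models, reasoning techniques, and alignment methods; the paper neither uses nor mentions any large model or related technique. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to medical video analysis (within bioinformatics or AI for science), but that is not its core contribution, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper develops a ResNet-50-based gastrointestinal video analysis pipeline that addresses class imbalance with class weighting and uses anatomy-guided temporal decoding, raising temporal mAP from 0.3801 to 0.4303.

Abstract translation (Chinese)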

我们开发了一种基于ResNet-50帧分类器并结合解剖结构引导时序事件解码的多标签胃肠道视频分析流程。该系统从调整为336x336尺寸的帧图像中预测17个标签，包括5个解剖类别和12个病理类别。面临的主要挑战是严重的类别不平衡问题，尤其是罕见病理标签。为此，我们在训练损失函数中采用了截断式类别正样本加权方法，在保持优化稳定性的同时提升了对罕见类别的学习效果。在时序处理阶段，我们发现直接将帧结果转换为事件会导致与官方标注真值出现碎片化不匹配。因此最终提交方案结合了GT风格的逐帧事件合成、解剖类别投票平滑、基于解剖结构的病理门控机制以及保守的滞后解码器。该设计将挑战赛测试集上的最终时序平均精度均值从0.3801提升至0.4303。

Abstract

We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.

Keywords: gastrointestinal video analysis, ResNet-50, class imbalance, temporal event decoding, anatomy-guided, multi-label classification, clip weighting, hysteresis decoder
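Two ideas from this abstract, clipped class-wise positive weighting and hysteresis decoding of frame scores into events, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the negatives-to-positives weight formula, the clipping bound `max_weight`, and the `high`/`low` thresholds are hypothetical choices that the abstract does not specify.

```python
from typing import List, Tuple


def clipped_pos_weights(pos_counts: List[int], n_samples: int,
                        max_weight: float = 10.0) -> List[float]:
    """Per-class positive weight = negatives / positives, clipped to [1, max_weight]."""
    weights = []
    for pos in pos_counts:
        neg = n_samples - pos
        raw = neg / max(pos, 1)  # guard against classes with no positives
        weights.append(min(max(raw, 1.0), max_weight))
    return weights


def hysteresis_events(probs: List[float], high: float = 0.6,
                      low: float = 0.4) -> List[Tuple[int, int]]:
    """Convert per-frame probabilities into (start, end) frame intervals (end inclusive).

    An event opens when the probability reaches `high` and closes only when it
    drops below `low`, so brief dips between the thresholds do not split it.
    """
    events = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= high:
            start = i
        elif start is not None and p < low:
            events.append((start, i - 1))
            start = None
    if start is not None:  # event still open at the end of the video
        events.append((start, len(probs) - 1))
    return events


print(clipped_pos_weights([500, 5, 900], n_samples=1000))            # -> [1.0, 10.0, 1.0]
print(hysteresis_events([0.1, 0.7, 0.5, 0.45, 0.3, 0.65, 0.9]))      # -> [(1, 3), (5, 6)]
```

Clipping keeps the weight of a very rare class (199 before clipping in the example) from dominating the loss, which matches the abstract's stated goal of stable optimization. The two-threshold decoder is one way to avoid the fragmented events that direct frame-to-event conversion produces.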

188. ❌ M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

Authors: Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino, Guillermo Sapiro, Antoni B. Chan Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17813v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

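Scoring rationale: The paper belongs to computer vision: it improves Vision Foundation Models (VFMs) for dense point tracking with weakly-supervised learning. Although it involves pre-training and fine-tuning (some relation to keywords 5 and 6), all keywords target large language models (LLMs) and related techniques, whereas this paper studies vision foundation models (such as DINOv2/DINOv3), a different modality and technical area. Apart from the weak relation through the generic concepts of pre-training and fine-tuning, all other LLM-specific keywords are entirely unrelated.

!!! tip deepseek-chat TL;DR

The paper proposes Mask-to-Point (M2P), a weakly-supervised learning method that uses video object segmentation mask annotations to improve vision foundation models, significantly improving dense point tracking performance.

Abstract translation (Chinese)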

任意点追踪（Tracking Any Point, TAP）已成为视频理解的一项基础工具。现有方法通过离线微调或测试时优化来适配视觉基础模型（Vision Foundation Models, VFMs），例如DINOv2。然而，这些视觉基础模型依赖于静态图像预训练，本质上难以最优地捕捉视频中密集的时间对应关系。为解决此问题，我们提出了掩码到点（Mask-to-Point, M2P）学习方法，该方法利用丰富的视频对象分割（Video Object Segmentation, VOS）掩码标注来改进视觉基础模型，以实现密集点追踪。我们的M2P方法引入了三种基于掩码的弱监督表征学习约束。首先，我们提出了局部结构一致性损失，利用普氏分析（Procrustes analysis）对局部结构内点的内聚运动进行建模，从而实现更可靠的点对点匹配学习。其次，我们提出了掩码标签一致性（Mask Label Consistency, MLC）损失，强制要求采样的前景点在不同帧间严格匹配前景区域。所提出的MLC损失可视为一种正则化方法，能够稳定训练并防止收敛到平凡解。最后，我们应用掩码边界约束来显式监督边界点。研究表明，我们的弱监督M2P模型仅使用3.6K个VOS训练视频进行高效训练，其性能显著优于基线视觉基础模型。值得注意的是，在TAP-Vid-DAVIS基准测试中，M2P相比DINOv2-B/14和DINOv3-B/16分别实现了12.8%和14.6%的性能提升。此外，所提出的M2P模型可作为测试时优化和离线微调TAP任务的预训练骨干网络，这证明了其有潜力成为点追踪任务的通用预训练模型。代码将在论文被接受后公开。

Abstract

Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.

Keywords: Vision Foundation Models, Dense Point Tracking, Weakly-Supervised Learning, Mask-to-Point Learning, Video Object Segmentation, TAP-Vid Benchmark, Representation Learning, Pre-trained Backbones
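The abstract's local structure consistency loss relies on Procrustes analysis to model cohesive motion of nearby points. The sketch below shows the underlying idea for 2D points using the closed-form rigid fit (rotation plus translation). It is an illustration only: the function name `procrustes_residual` is hypothetical, and how the paper turns such a residual into a training loss is not stated in the abstract.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def procrustes_residual(src: List[Point], dst: List[Point]) -> float:
    """Mean squared error left after the best rigid fit of `src` onto `dst`.

    A small residual means the matched points moved together like a rigid
    local structure; a large one signals inconsistent matches.
    """
    n = len(src)
    # Remove translation by centering both point sets.
    cx_s, cy_s = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    cx_d, cy_d = sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n
    s = [(x - cx_s, y - cy_s) for x, y in src]
    d = [(x - cx_d, y - cy_d) for x, y in dst]
    # Closed-form optimal 2D rotation angle.
    dot = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(s, d))
    cross = sum(a[0] * b[1] - a[1] * b[0] for a, b in zip(s, d))
    theta = math.atan2(cross, dot)
    c, sn = math.cos(theta), math.sin(theta)
    err = 0.0
    for (ax, ay), (bx, by) in zip(s, d):
        rx, ry = c * ax - sn * ay, sn * ax + c * ay  # rotated source point
        err += (rx - bx) ** 2 + (ry - by) ** 2
    return err / n


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rotated = [(5.0, 5.0), (5.0, 6.0), (4.0, 5.0)]    # 90-degree rotation plus shift
distorted = [(5.0, 5.0), (5.0, 6.0), (4.0, 8.0)]  # one point drifted away
print(procrustes_residual(triangle, rotated) < 1e-9)    # True: rigid motion fits exactly
print(procrustes_residual(triangle, distorted) > 0.1)   # True: motion is not rigid
```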

189. ❌ Facial Movement Dynamics Reveal Workload During Complex Multitasking

Authors: Carter Sale, Melissa N. Stolar, Gaurav Patil, Michael J. Gostelow, Julia Wallier, Margaret C. Macpherson, Jan-Louis Kruger, Mark Dras, Simon G. Hosking, Rachel W. Kallen, Michael J. Richardson Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

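Scoring rationale: The paper studies real-time monitoring of cognitive workload from facial movement dynamics (velocity, acceleration, displacement, and recurrence quantification features) captured with a standard webcam, at the intersection of computer vision, human-computer interaction, and cognitive science. It does not involve large language models, deep learning fundamentals, AI for Science applications, or any technique named in the keywords (MoE, RLHF, RAG, quantization, and so on), so it is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper studies real-time monitoring of cognitive workload in complex multitasking environments using facial movement dynamics captured by a standard webcam. It finds that facial movement features are more sensitive than task performance metrics, but individual differences limit generalization across participants.

Abstract translation (Chinese)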

实时认知负荷监测在安全关键环境中至关重要，但现有测量方法存在侵入性、成本高昂或时间分辨率不足的局限。本研究验证了通过普通网络摄像头获取的面部运动动态特征能否成为一种低成本替代方案。72名参与者在不同负荷下完成多任务模拟（OpenMATB），同时使用OpenPose追踪面部关键点。研究提取了线性运动学特征（速度、加速度、位移）及递归量化特征。结果显示：负荷增加会跨时间尺度改变运动动态——运动幅度增大，时间组织先碎片化后重组为复杂模式，眼头协调性减弱。基于姿态运动学训练的随机森林分类器在识别准确率上优于任务绩效指标（85% vs. 55%），但跨被试泛化能力较差（43% vs. 33%随机水平）。针对个体训练的模型仅需极少校准（每种条件2分钟）即可达到50%准确率，并持续提升至73%而无平台期。面部运动动态能以简短校准灵敏追踪认知负荷，为使用消费级摄像头实现自适应界面提供了可能，但个体差异限制了跨被试泛化能力。

Abstract

Real-time cognitive workload monitoring is crucial in safety-critical environments, yet established measures are intrusive, expensive, or lack temporal resolution. We tested whether facial movement dynamics from a standard webcam could provide a low-cost alternative. Seventy-two participants completed a multitasking simulation (OpenMATB) under varied load while facial keypoints were tracked via OpenPose. Linear kinematics (velocity, acceleration, displacement) and recurrence quantification features were extracted. Increasing load altered dynamics across timescales: movement magnitudes rose, temporal organisation fragmented then reorganised into complex patterns, and eye-head coordination weakened. Random forest classifiers trained on pose kinematics outperformed task performance metrics (85% vs. 55% accuracy) but generalised poorly across participants (43% vs. 33% chance). Participant-specific models reached 50% accuracy with minimal calibration (2 minutes per condition), improving continuously to 73% without plateau. Facial movement dynamics sensitively track workload with brief calibration, enabling adaptive interfaces using commodity cameras, though individual differences limit cross-participant generalisation.

Keywords: facial movement dynamics, cognitive workload monitoring, multitasking simulation, OpenMATB, recurrence quantification, random forest classifiers, real-time monitoring, individual differences
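The linear kinematic features named in the abstract (velocity, acceleration, displacement) can be derived from a tracked keypoint by finite differences. The following is a minimal sketch, assuming a single 2D keypoint trajectory sampled at a fixed frame rate. The function name and the choice of summary statistics are hypothetical, and the recurrence quantification features are not covered here.

```python
import math
from typing import Dict, List, Tuple


def kinematic_features(track: List[Tuple[float, float]], fps: float) -> Dict[str, float]:
    """Summarize one keypoint trajectory (at least two frames) with simple kinematics."""
    dt = 1.0 / fps
    # Speed between consecutive frames (first finite difference of position).
    speeds = [math.dist(track[i + 1], track[i]) / dt for i in range(len(track) - 1)]
    # Magnitude of the change in speed (second finite difference).
    accels = [abs(speeds[i + 1] - speeds[i]) / dt for i in range(len(speeds) - 1)]
    return {
        "displacement": math.dist(track[-1], track[0]),  # net start-to-end movement
        "mean_speed": sum(speeds) / len(speeds),
        "mean_abs_accel": sum(accels) / len(accels) if accels else 0.0,
    }


print(kinematic_features([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)], fps=1.0))
# -> {'displacement': 10.0, 'mean_speed': 5.0, 'mean_abs_accel': 0.0}
```

Feature vectors of this kind, computed per time window, are what a classifier such as a random forest would consume. The windowing scheme used in the study is not given in the abstract.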

190. ❌ Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime

Authors: Haiyu Yang, Sumit Sharma, Enhong Liu, Miel Hostens Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17782v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

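Scoring rationale: The paper's core topic is parameter-efficient fine-tuning (PEFT) for agricultural image classification. It is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points) because it directly studies QLoRA and DoRA. It is relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points) because it targets precision livestock farming (agricultural science). It has some relation to 'Large Language Models OR LLMs OR Foundation Models', 'Pre-training OR Continual Pre-training OR Domain Adaptation', and 'Post-training OR Supervised Fine-tuning OR SFT' (5 points each) because it uses the DINOv3 foundation model and involves fine-tuning, although it focuses on vision models rather than language models. Other keywords such as MoE, SLMs, and RAG are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study explores fine-tuning the 6.7-billion-parameter DINOv3 foundation model with parameter-efficient fine-tuning (PEFT) methods (QLoRA and DoRA) on limited data (2,160 training images) for agricultural livestock behavior image classification. Under a 98:1 test-to-train ratio, PEFT substantially outperforms training from scratch and frozen feature extraction, with the best configuration reaching 83.16% test accuracy.

Abstract translation (Chinese)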

自动化行为分类对精准畜牧业至关重要，但面临计算成本高昂和标注数据有限等挑战。本研究系统比较了三种方法：从头训练（ResNet-18、ViT-Small）、冻结特征提取，以及基于DINOv3基础模型（67亿参数）的参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）。我们评估了QLoRA和DoRA在多种配置下的表现，包括不同秩值（8、16、64）和目标模块（仅q_proj层与全线性层）。

使用2,160张已验证的训练图像，我们在211,800个测试样本上评估了模型的泛化能力，测试集与训练集的比例高达98:1。结果表明，PEFT方法显著优于其他方案。其中最佳QLoRA配置（全线性层，秩=64）仅使用2.72%的参数（1.83亿），在5.8小时内实现了83.16%的测试准确率；而ResNet-18的准确率为72.87%（耗时16.8小时），ViT-Small为61.91%（18.7小时），冻结DINOv3为76.56%（17.5小时）。DoRA取得了相近的准确率（83.14%），但训练时间更长（11.0小时）。

值得注意的是，增加适配器容量持续提升了泛化性能，且未导致过拟合：将秩值从16降至8会使测试准确率从78.38%下降至77.17%，而将配置从仅q_proj层扩展至全线性层（秩=64）则使准确率从78.38%提升至83.16%。这表明，将基础模型适配至农业图像时，主要挑战是欠拟合而非过拟合。我们的研究为在农业畜牧应用中通过PEFT部署十亿级参数视觉模型提供了实践指导。

Abstract

Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.

Keywords: Parameter-efficient fine-tuning, PEFT, QLoRA, DoRA, DINOv3, Image classification, Agricultural livestock, Limited-data
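The headline numbers in this abstract can be cross-checked with a few lines of arithmetic, and the effect of rank on adapter size follows from the standard LoRA parameter count of rank × (d_in + d_out) per adapted linear layer. In the sketch below, the 4096-wide layer is a hypothetical example, not DINOv3's actual layer shape.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Figures quoted in the abstract.
train_images = 2_160
test_samples = 211_800
trainable = 183.0e6        # best QLoRA configuration
trainable_share = 0.0272   # "2.72% parameters"

ratio = test_samples / train_images
implied_total = trainable / trainable_share

print(round(ratio, 2))                 # -> 98.06, the reported 98:1 test-to-train ratio
print(round(implied_total / 1e9, 2))   # -> 6.73, consistent with the stated 6.7B parameters

# Hypothetical 4096 x 4096 projection: adapter size grows linearly with rank.
for rank in (8, 16, 64):
    print(rank, lora_params(4096, 4096, rank))
```

Because adapter size scales linearly with rank and with the number of adapted modules, moving from q_proj-only to all-linear layers at rank 64 increases capacity on both axes, which is the direction in which the abstract reports accuracy improving.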

191. ❌ CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image

Authors: Yizheng Song, Yiyu Zhuang, Qipeng Xu, Haixiang Wang, Jiahe Zhu, Jing Tian, Siyu Zhu, Hao Zhu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17779v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

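Scoring rationale: The paper focuses on 3D reconstruction in computer vision, using 3D Gaussian Splatting and diffusion models, and is unrelated to most LLM keywords. It has some relation to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because it adapts a pretrained large human model, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because 3D reconstruction can be viewed as an application of AI in science/vision, though not core bioinformatics or cheminformatics. No other keyword applies (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses reconstructing multi-person 3D Gaussian Splatting representations from a single image. It proposes the CrowdGaussian framework, which uses self-supervised adaptation and Self-Calibrated Learning to produce photorealistic, geometrically coherent reconstructions of multi-person scenes.

Abstract translation (Chinese)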

单视图三维人体重建近年来受到广泛关注。尽管已有诸多进展，先前研究主要集中于从清晰、近距离的单人图像重建三维模型，而在更普遍的多人物场景中往往表现欠佳。重建三维人群模型是一项高度复杂的任务，面临诸多挑战：1）严重遮挡，2）低清晰度，以及3）多样化的外观形态。针对此任务，我们提出CrowdGaussian——一个能够从单张图像输入直接重建多人三维高斯溅射（3D Gaussian Splatting，简称3DGS）表征的统一框架。为处理遮挡问题，我们设计了一种自监督适应管线，使预训练的大型人体模型能够从严重遮挡的输入中重建具有合理几何结构与外观的完整三维人体。此外，我们提出了自校准学习（Self-Calibrated Learning，SCL）策略。该训练方法通过融合身份保持样本与清晰/受损图像对，使单步扩散模型能够自适应地将粗糙渲染结果优化至最佳质量。其输出可被蒸馏回模型，以提升多人三维高斯溅射表征的质量。大量实验表明，CrowdGaussian能够生成具有照片级真实感且几何一致的多人物场景重建结果。

Abstract

Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.

Keywords: 3D Gaussian Splatting, single-view reconstruction, human crowd, occlusion handling, self-supervised adaptation, diffusion models, multi-person scenes, photorealistic rendering

192. ❌ Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Authors: Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

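Scoring rationale: The paper proposes SCEP, an image deepfake detection framework built on large vision-language models (LVLMs), an application of large models to computer vision security. Its core contribution is a fine-tuning-free inference method that improves detection through evidence-driven reasoning. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), since LVLMs are a kind of large multimodal model. The other keywords concern technical details of pure language models (MoE, SFT, RLHF, quantization), specific reasoning methods (CoT, MCTS), or specific scientific fields (bioinformatics); this paper applies vision-language models to image forgery detection and uses none of those techniques, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes SCEP, an image deepfake detection framework that requires no fine-tuning of the large vision-language model (LVLM). It mines suspicious image patches as evidence for reasoning and outperforms strong baselines on multiple benchmarks.

Abstract translation (Chinese)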

图像深度伪造检测（IDD）通过识别合成或篡改痕迹，将伪造图像与真实图像区分开来。尽管大型视觉语言模型（LVLMs）具备强大的图像理解能力，但将其适配于IDD通常需要昂贵的微调，且对多样且不断演化的篡改手法泛化能力有限。我们提出语义一致证据包（Semantic Consistent Evidence Pack, SCEP），这是一种免训练的LVLM框架，以证据驱动推理替代全图推断。SCEP挖掘一组紧凑的可疑图像块标记，这些标记能最有效地揭示篡改线索。该框架利用视觉编码器的CLS标记作为全局参考，将图像块特征聚类为语义连贯的组别，并通过融合度量对图像块进行评分——该度量结合了CLS引导的语义失配与基于频率及噪声的异常分析。为覆盖分散的篡改痕迹并避免冗余，SCEP从每个聚类中采样少量高置信度图像块，并应用基于网格的非极大值抑制，最终生成一个证据包，用于指导冻结的LVLM进行预测。在多种基准测试上的实验表明，SCEP在不进行LVLM微调的情况下，性能优于现有强基线方法。

Abstract

Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder’s CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.

Keywords: Large Vision-Language Models, Image Deepfake Detection, Training-free Framework, Evidence-driven Reasoning, Semantic Consistent Evidence Pack, Manipulation Detection, Cross-domain Generalization
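The grid-based NMS step mentioned in the abstract can be illustrated with a small sketch. It assumes one plausible reading, namely that the patch grid is divided into coarse cells and only the highest-scoring patch per cell is kept. The cell size, the `Patch` tuple layout, and the function name `grid_nms` are hypothetical, and the paper's actual suppression rule may differ.

```python
from typing import Dict, List, Tuple

Patch = Tuple[int, int, float]  # (row, col, suspicion score) on the patch grid


def grid_nms(patches: List[Patch], cell: int = 2) -> List[Patch]:
    """Keep the highest-scoring patch in each `cell` x `cell` block of the patch grid."""
    best: Dict[Tuple[int, int], Patch] = {}
    for row, col, score in patches:
        key = (row // cell, col // cell)  # coarse grid cell this patch falls into
        if key not in best or score > best[key][2]:
            best[key] = (row, col, score)
    # Return survivors ordered from most to least suspicious.
    return sorted(best.values(), key=lambda p: p[2], reverse=True)


candidates = [(0, 0, 0.9), (0, 1, 0.8), (1, 1, 0.95), (4, 4, 0.5)]
print(grid_nms(candidates, cell=2))
# -> [(1, 1, 0.95), (4, 4, 0.5)]
```

The three neighbouring candidates collapse into one, while the distant patch survives, which matches the stated goal of covering dispersed traces without redundancy.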

193. ❌ Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation

Authors: Haoyun Chen, Fenghe Tang, Wenxin Ma, Shaohua Kevin Zhou Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

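Scoring rationale: The paper proposes C2P, a framework for universal medical image segmentation. Its core innovation is using Multimodal Large Language Models (MLLMs) to extract medical concepts as Semantic Tokens, combined with Geometric Tokens, for segmentation. It is highly relevant to 'Large Language Models' (8 points) because MLLMs are a key component; highly relevant to 'AI for Science' (10 points) because it is a scientific application in medical image analysis; and somewhat related to 'Pre-training' and 'Post-training' (5 points each) because it involves model training and fine-tuning. Other keywords such as MoE, quantization, and inference acceleration are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study addresses the reliance on manual prompts and the cross-modal domain shift in universal medical image segmentation. It proposes the prompt-free C2P framework, which uses a multimodal large language model to extract Semantic Tokens alongside Geometric Tokens, and achieves superior segmentation performance and generalization on multiple datasets.

Abstract translation (Chinese)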

通用医学图像分割旨在利用单一基础模型处理多种成像模态下的不同任务。然而，现有方法通常严重依赖手动视觉提示或检索的参考图像，这限制了其自动化程度与鲁棒性。此外，跨模态的简单联合训练往往难以应对巨大的域偏移。为解决这些局限，我们提出了概念到像素（C2P），一种新颖的无提示通用分割框架。C2P将解剖知识显式解耦为两个组成部分：几何表征与语义表征。该框架利用多模态大语言模型（MLLMs）将抽象的、高层次的医学概念蒸馏为可学习的语义标记，并引入显式监督的几何标记以强化通用的物理与结构约束。这些解耦的标记与图像特征深度交互，生成针对输入特化的动态卷积核，从而实现精确的掩码预测。此外，我们提出了几何感知推理共识机制，该机制利用模型预测的几何约束来评估预测可靠性并抑制异常结果。在包含七种模态、八个数据集的统一基准上进行的大量实验与分析表明，相较于通用模型或单一模型方法，我们联合训练的方法具有显著优越性。值得注意的是，我们的统一模型展现出强大的泛化能力，不仅在涉及未见病例的零样本任务中取得优异结果，在相似任务的跨模态迁移中也表现突出。代码发布于：https://github.com/Yundi218/Concept-to-Pixel

Abstract

Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model’s predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universal- or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel

Keywords: Universal medical image segmentation, Multimodal Large Language Models, Prompt-free framework, Semantic Tokens, Geometric Tokens, Anatomical knowledge, Cross-modal transfer, Zero-shot generalization
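The "input-specific dynamic kernels" mentioned in the C2P abstract can be pictured as a per-input 1x1 convolution applied to the image feature map. The sketch below assumes the kernel has already been produced from the tokens and is simply passed in. It shows only the application step, with a hypothetical function name and toy shapes, and is not the C2P architecture.

```python
import math
from typing import List


def dynamic_mask(features: List[List[List[float]]], kernel: List[float]) -> List[List[float]]:
    """Apply a per-input 1x1 kernel to a C x H x W feature map; return H x W mask probabilities."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for h in range(height):
        row = []
        for w in range(width):
            # 1x1 convolution: dot product between the kernel and the pixel's channel vector.
            logit = sum(kernel[c] * features[c][h][w] for c in range(channels))
            row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid -> foreground probability
        mask.append(row)
    return mask


# Toy 2-channel, 1x2 feature map; the kernel favours channel 0 over channel 1.
toy_features = [[[1.0, 0.0]], [[0.0, 1.0]]]
toy_mask = dynamic_mask(toy_features, kernel=[2.0, -2.0])
print(toy_mask[0][0] > 0.5, toy_mask[0][1] < 0.5)  # True True
```

Because the kernel is an input to the function rather than a fixed weight, a different image (or a different set of tokens) yields a different kernel and therefore a different mask, which is the sense of "dynamic" assumed here.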

Authors: Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

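Scoring rationale: This paper focuses on 3D Visual Grounding and proposes PC-CrossDiff, a dual-level cross-modal differential attention architecture that handles 3D referring expression comprehension (3DREC) and segmentation (3DRES) in a unified way. Its core lies at the intersection of computer vision and natural language processing, specifically cross-modal interaction between 3D point clouds and text, covering attention mechanisms, spatial relation modeling, and vision-language alignment. However, it does not involve large language models (LLMs) or innovations in core deep learning techniques (such as MoE, scaling laws, training methods, alignment, reasoning, agents, or compression), nor is it applied to a scientific domain such as bioinformatics. All scoring keywords relate to large-model techniques, training methods, inference optimization, agent systems, or AI-for-science applications, whereas this paper is purely a 3D vision-language study, so every keyword receives a relevance of 0.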

!!! tip deepseek-chat TL;DR

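To address the performance drop of 3D visual grounding in complex multi-object scenes, this paper proposes the PC-CrossDiff framework, which uses point-level and cluster-level differential attention to parse implicit localization cues and suppress spatial interference, achieving state-of-the-art performance on benchmarks such as ScanRefer.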

Abstract

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.

Keywords: 3D Visual Grounding, Referring Expression Comprehension, Point Cloud, Cross-modal Attention, Differential Attention, Implicit Localization Cues, Spatial Interference Suppression, Unified Framework
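
To make the term "differential attention" in the abstract above more concrete, here is a minimal sketch of the general idea: a second softmax attention map, scaled by a weight `lam`, is subtracted from a first one, so attention mass that both maps place on the same (possibly distracting) items cancels out. Whether PC-CrossDiff's PLDA and CLDA modules take this exact form is an assumption; the function names, `lam`, and the toy vectors are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diff_attention(q1, q2, keys1, keys2, values, lam=0.5):
    """Single-query differential attention (illustrative only).

    weights = softmax(q1 . K1) - lam * softmax(q2 . K2)
    output  = weights . V
    In a trained model `lam` would be a learnable scalar; here it is fixed.
    """
    scale = 1.0 / math.sqrt(len(q1))
    a1 = softmax([dot(q1, k) * scale for k in keys1])
    a2 = softmax([dot(q2, k) * scale for k in keys2])
    weights = [w1 - lam * w2 for w1, w2 in zip(a1, a2)]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out


# Toy example: one text-token query attending over two point features.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, out = diff_attention([1.0, 0.0], [0.0, 1.0], keys, keys, values, lam=0.5)
```

Because each softmax map sums to 1, the combined weights sum to `1 - lam`, and individual weights can become negative, which is how the second map suppresses items the first map would otherwise attend to.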

195. ❌ TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos

Authors: Yan Zeng, Haoran Jiang, Kaixin Yao, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

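Scoring rationale: This paper focuses on the computer vision task of generating appearance from 3D geometry: it uses a video diffusion model to generate turntable videos, which are then used for 3D reconstruction. Although it involves a large model (a video generation model), its core is 3D vision, geometry processing, and video synthesis rather than large language models (LLMs) or innovations in core deep learning techniques. All keywords concern specific techniques such as large language models, alignment, reasoning, agents, and AI for science, none of which relate directly to the paper's topic, so every keyword is scored 0.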

!!! tip deepseek-chat TL;DR

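This paper proposes the TAPESTRY framework, which generates high-quality, consistent turntable videos via geometry-conditioned video diffusion, addressing the challenge of automatically generating realistic appearances for untextured 3D models and enabling reconstruction of complete 3D assets from the videos.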

Abstract

Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.

Keywords: 3D appearance generation, turntable videos, video diffusion models, geometry-conditioned generation, 3D reconstruction, neural rendering, texture synthesis, 3D-aware inpainting
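
For readers unfamiliar with turntable videos, the sketch below shows one simple way the viewpoints of a 360-degree turntable sequence can be laid out: cameras evenly spaced on a circle around the object. The y-up convention, the `radius` and `height` defaults, and the function name are assumptions for illustration; the abstract does not specify TAPESTRY's camera setup.

```python
import math


def turntable_cameras(num_views, radius=2.0, height=0.5):
    """Return (x, y, z) camera positions evenly spaced on a circle.

    Assumes a y-up world with the object at the origin; every camera is
    expected to look at the origin.
    """
    cameras = []
    for i in range(num_views):
        azimuth = 2.0 * math.pi * i / num_views
        cameras.append((radius * math.cos(azimuth), height, radius * math.sin(azimuth)))
    return cameras


# 36 views gives one frame every 10 degrees of rotation.
views = turntable_cameras(36)
```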

196. ❌ DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation

Authors: Yuhe Tian, Kun Zhang, Haoran Ma, Rui Yan, Yingtai Li, Rongsheng Wang, Shaohua Kevin Zhou Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

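Scoring rationale: DiffVP focuses on LLM-based CT report generation, an applied innovation of large models in biomedical imaging (AI for Science/Bioinformatics). The paper explicitly uses LLMs as its core generative model, so it is highly relevant to "Large Language Models" (10 points). The study also applies LLMs to medical image analysis, which falls under "AI for Science/Bioinformatics" (10 points). It does not address the techniques behind other keywords such as MoE, SFT, RAG, or inference acceleration, nor topics such as small models, alignment, or agents, so those keywords are scored 0.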
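
The table above reports a relevance of 10.0/10 for two keywords, yet the total is 0.0 because every weight is shown as 0.0. The sketch below reproduces that arithmetic. It assumes a per-keyword score of weight times relevance, which matches the rows but is not stated in the report (any formula proportional to the weight would give the same zeros). The passing threshold uses the formula from the report's configuration section, 5.0 + 0.8 × total weight, with a total weight of 27.0.

```python
def keyword_score(weight, relevance):
    # Assumption: score = weight * relevance (relevance on a 0-10 scale).
    return weight * relevance


def passing_score(total_weight, base=5.0, factor=0.8):
    # Formula from the report's configuration: 5.0 + 0.8 * total weight.
    return base + factor * total_weight


# The two non-zero-relevance rows of the DiffVP table: (keyword, weight, relevance).
rows = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0),
]

total = sum(keyword_score(weight, relevance) for _, weight, relevance in rows)
threshold = passing_score(27.0)   # about 26.6, the report's passing score
passed = total >= threshold
```

Under this assumption the paper fails the threshold regardless of its relevance values, since a weight of 0.0 removes each keyword's contribution.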

!!! tip deepseek-chat TL;DR

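To address the inability of existing LLM-based CT report generation methods to distinguish informative cues from redundant anatomical background, this paper proposes DiffVP, which conditions report generation on explicit, high-level semantic scan-to-reference differences, improving BLEU scores on two large-scale benchmarks and clinical efficacy on RadGenome-ChestCT.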

Abstract

While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at https://github.com/ArielTYH/DiffVP/.

Keywords: CT report generation, large language models, visual prompting, differential prompting, medical imaging, radiological cognitive subtraction, semantic differences, diagnostic accuracy

Authors: Jingzhi Huang, Junkai Huang, Haoyang Yang, Haoang Li, Yi Wang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17712v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

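Scoring rationale: The paper proposes the AERR-Nav framework for zero-shot object navigation; its core innovations are an adaptive Exploration-Recovery-Reminiscing strategy and Fast and Slow-Thinking modes. Relevance to LLMs is fairly high (8 points) because an MLLM serves as the decision-making framework; relevance to System 2 Thinking is high (10 points) because a Slow-Thinking mode is explicitly proposed; relevance to LLM Agents is high (10 points) because the work involves autonomous robot agents; and relevance to Chain of Thought is 8 points because navigation decisions involve multi-step reasoning. Other keywords such as MoE, SFT, and RAG are not addressed and receive 0.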

!!! tip deepseek-chat TL;DR

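For zero-shot object navigation in unknown multi-floor environments, this paper proposes the AERR-Nav framework, which uses an adaptive Exploration-Recovery-Reminiscing strategy and Fast and Slow-Thinking modes to achieve state-of-the-art performance among zero-shot methods on the HM3D and MP3D benchmarks.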

Abstract

Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on semantic value greedy waypoint selection, spatial topology-enhanced memory, and Multimodal Large Language Model (MLLM) as a decision-making framework, have led to improvements. However, these architectures struggle to balance exploration and exploitation for ZSON when encountering unseen environments, especially in multi-floor settings, such as robots getting stuck at narrow intersections, endlessly wandering, or failing to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot’s environment. Specifically, AERR-Nav has the following two key advantages: (1) An Adaptive Exploration-Recovery-Reminiscing Strategy, enables robots to dynamically transition between three states, facilitating specialized responses to diverse navigation scenarios. (2) An Adaptive Exploration State featuring Fast and Slow-Thinking modes helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that our AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of our proposed strategy and modules.

Keywords: Zero-Shot Object Navigation, Adaptive Exploration-Recovery-Reminiscing, Fast and Slow-Thinking, Multimodal Large Language Model, Autonomous Navigation, Multi-floor Environments, State Transition Strategy, Robot Navigation Framework
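
The AERR-Nav abstract above names three states (Exploration, Recovery, Reminiscing) but not the conditions that move the robot between them. Purely to make the idea of a three-state controller concrete, the sketch below wires the states together with made-up triggers. The `stuck` flag, the `steps_without_progress` counter, and the `patience` threshold are assumptions for illustration, not the paper's actual transition rules.

```python
from enum import Enum


class NavState(Enum):
    EXPLORATION = "exploration"
    RECOVERY = "recovery"
    REMINISCING = "reminiscing"


def next_state(state, stuck, steps_without_progress, patience=20):
    """Hypothetical transition rule between the three states.

    - A stuck robot always switches to RECOVERY.
    - An exploring robot that has made no progress for `patience` steps
      switches to REMINISCING (revisiting remembered locations).
    - Otherwise the robot returns to, or stays in, EXPLORATION.
    """
    if stuck:
        return NavState.RECOVERY
    if state is NavState.EXPLORATION and steps_without_progress >= patience:
        return NavState.REMINISCING
    return NavState.EXPLORATION


# Example: a robot wandering without progress, then getting stuck.
state = NavState.EXPLORATION
state = next_state(state, stuck=False, steps_without_progress=25)  # REMINISCING
state = next_state(state, stuck=True, steps_without_progress=0)    # RECOVERY
```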

198. ❌ DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

Authors: Haocheng Yuan, Adrien Bousseau, Hao Pan, Lei Zhong, Changjian Li Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17704v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

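Scoring rationale: The paper studies a lightweight, vision-based motion capture system that uses everyday objects as proxies to generate 3D character animation. Although it uses a generative motion model and human motion priors learned from large-scale datasets, its core is computer vision, animation generation, and human-computer interaction rather than large language models, core deep learning techniques, or AI applications in science. All keywords concern large language models, deep learning techniques, or specific AI-for-science applications and are unrelated to the paper's computer vision and animation generation topic.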

!!! tip deepseek-chat TL;DR

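The paper presents DancingBox, a system that uses everyday objects as physical proxies captured with a single camera, combined with a generative motion model and human motion priors, to provide lightweight character animation for novices and lower the barrier to animation production.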

Abstract

Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.

Keywords: motion capture, character animation, generative motion model, physical proxies, vision-based system, digital puppetry, human motion priors, bounding-box representations
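
The DancingBox abstract above says training pairs are synthesized by converting existing motion capture sequences into proxy representations. One simple way to picture that step is to reduce each frame's joints to a bounding box, as sketched below. Using a single axis-aligned box per frame is an assumption made here for brevity; the paper's actual proxy representation (for instance oriented or per-part boxes) may differ, and the function names are hypothetical.

```python
def joints_to_box(joints):
    """Axis-aligned bounding box (min_corner, max_corner) of one frame's 3D joints."""
    xs, ys, zs = zip(*joints)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def mocap_to_proxy(sequence):
    """Convert a mocap sequence (list of frames, each a list of (x, y, z) joints)
    into a list of per-frame bounding boxes."""
    return [joints_to_box(frame) for frame in sequence]


# Two toy frames with three joints each; the character shifts along x.
sequence = [
    [(0.0, 0.0, 0.0), (0.2, 1.0, 0.1), (-0.2, 1.7, 0.0)],
    [(0.5, 0.0, 0.0), (0.7, 1.0, 0.1), (0.3, 1.7, 0.0)],
]
proxy = mocap_to_proxy(sequence)
# Each (proxy[i], sequence[i]) pair could then serve as one training example.
```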

199. ❌ Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Authors: Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

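Scoring rationale: The paper studies temporal reasoning in video understanding and proposes SynRL, a post-training framework that uses synthetic videos to teach models temporal primitives. Keyword relevance: 1. "Post-training OR Supervised Fine-tuning OR SFT" receives 10 points because the paper explicitly proposes a "post-training framework" and trains on 7.7K CoT and 7K RL samples; 2. "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" receives 10 points because the paper explicitly uses 7.7K CoT samples and involves multi-step reasoning; 3. the other keywords receive 0 because the paper concentrates on video temporal understanding in vision-language models (VLMs) and does not cover large language model techniques, model architecture optimization, alignment methods, inference acceleration, or AI-for-science applications.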

!!! tip deepseek-chat TL;DR

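To address the shortcomings of existing video understanding models in temporal reasoning, this paper proposes SynRL, a post-training framework that trains models on programmatically generated synthetic videos to learn temporal primitives, substantially improving video temporal understanding across 15 benchmarks.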

Abstract

The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.

Keywords: video understanding, temporal reasoning, post-training, synthetic videos, temporal primitives, vision-language models, chain of thought, transfer learning
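
The SynRL abstract above describes code-based generation of synthetic videos of simple geometric shapes with ground-truth frame-level annotations. The sketch below shows what such a generator might look like in its simplest form: a square moving horizontally across a binary frame, with direction, speed, and position recorded per frame. The frame size, shape, label fields, and function name are assumptions for illustration, not SynRL's actual pipeline.

```python
def make_moving_square_clip(num_frames=8, size=16, square=3, speed=1, direction="right"):
    """Render a square moving horizontally and return (frames, labels).

    frames: list of size x size grids (lists of 0/1 ints).
    labels: one dict per frame with the ground-truth position and motion.
    """
    step = speed if direction == "right" else -speed
    x = 0 if direction == "right" else size - square
    y = (size - square) // 2  # keep the square vertically centered
    frames, labels = [], []
    for t in range(num_frames):
        frame = [[0] * size for _ in range(size)]
        for row in range(y, y + square):
            for col in range(x, x + square):
                if 0 <= col < size:  # clip pixels that leave the frame
                    frame[row][col] = 1
        frames.append(frame)
        labels.append({"frame": t, "x": x, "direction": direction, "speed": speed})
        x += step
    return frames, labels


frames, labels = make_moving_square_clip()
```

Because the clip is generated by code, questions such as "which direction does the square move?" or "how fast?" can be answered exactly from `labels`, without relying on annotations produced by another model.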

200. ❌ Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation

Authors: Haocheng Li, Juepeng Zheng, Shuangxi Miao, Ruibo Lu, Guosheng Cai, Haohuan Fu, Jianxi Huang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17705v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

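Scoring rationale: The paper focuses on multimodal remote sensing semantic segmentation in computer vision; its core contribution is MoBaNet, a parameter-efficient, modality-balanced fusion framework. It is unrelated to most keywords because it does not involve language models, reasoning methods, alignment techniques, and so on. Relevant keywords: 1) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" receives 10 points because the paper's core is parameter-efficient fine-tuning, proposing a Cross-modal Prompt-Injected Adapter and a Difference-Guided Gated Fusion Module; 2) "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT" each receive 5 points because the paper fine-tunes pretrained Vision Foundation Models; 3) "AI for Science OR Bioinformatics OR Cheminformatics" receives 8 points because remote sensing semantic segmentation is a scientific application, though not bioinformatics or cheminformatics. All other keywords receive 0 because the paper does not cover those techniques.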

!!! tip deepseek-chat TL;DR

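This paper proposes MoBaNet, a parameter-efficient, modality-balanced symmetric fusion framework for multimodal remote sensing semantic segmentation. Using a Cross-modal Prompt-Injected Adapter and a Difference-Guided Gated Fusion Module on a largely frozen vision foundation model, it achieves state-of-the-art performance while substantially reducing trainable parameters.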

Abstract

Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.

Keywords: multimodal remote sensing, semantic segmentation, parameter-efficient fine-tuning, vision foundation models, modality-balanced fusion, cross-modal prompt injection, difference-guided gated fusion, modality imbalance mitigation
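
The MoBaNet abstract above describes Modality-Conditional Random Masking (MCRM) as masking one modality, and only during training. A minimal sketch of that idea follows. The masking probability `p_mask`, the 50/50 choice between modalities, zero-filling as the masking operation, and the names `optical` and `auxiliary` are all assumptions for illustration; the abstract does not give these details.

```python
import random


def modality_conditional_mask(optical, auxiliary, training, p_mask=0.3, rng=random):
    """Zero out at most one modality, and only in training mode.

    Returns (optical, auxiliary, masked_name), where masked_name is
    "optical", "auxiliary", or None when nothing was masked.
    """
    if not training or rng.random() >= p_mask:
        return optical, auxiliary, None
    if rng.random() < 0.5:
        return [0.0] * len(optical), auxiliary, "optical"
    return optical, [0.0] * len(auxiliary), "auxiliary"


# At inference time both modalities always pass through unchanged.
opt, aux, masked = modality_conditional_mask([0.4, 0.9], [1.5, 2.0], training=False)
```

Never masking both inputs at once keeps every training sample informative, while forcing the network to cope with either modality alone.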

201. ❌ Does YOLO Really Need to See Every Training Image in Every Epoch?

Authors: Xingxing Xie, Jiahua Dong, Junwei Han, Gong Cheng Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

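Scoring rationale: The paper studies training-efficiency optimization for YOLO object detectors and proposes a dynamic sampling strategy (AFSS) that reduces the processing of redundant training samples. It is focused entirely on object detection in computer vision, covering training strategy, sampling methods, and detection performance. All scoring keywords concern large language models, innovations in the underlying principles of deep learning, or AI for Science, whereas this paper addresses training optimization for a conventional computer vision detector and has no direct connection to them. The paper does not involve large models, language models, AI for Science, or innovations in deep learning principles.

!!! tip deepseek-chat TL;DR

The paper asks whether YOLO detectors need to process every image in every training epoch and proposes an Anti-Forgetting Sampling Strategy (AFSS) that achieves more than a 1.43× training speedup on multiple datasets while improving detection accuracy.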

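Abstract (translation)

YOLO detectors are known for fast inference, yet their training is surprisingly time-consuming because the training pipeline processes every training image in every epoch, even when many images have already been learned sufficiently. This contrasts sharply with the efficiency implied by the "You Only Look Once" philosophy and naturally raises an important question: does YOLO really need to see every training image in every epoch? To investigate, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically decides which images to use and which to skip in each epoch, so that the detector learns more efficiently and effectively. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and on that basis dynamically assigns training images to easy, medium, or hard levels. Easy images are sparsely resampled during training in a continuous-review manner, prioritizing images that have gone unused for a long time, to reduce redundancy and prevent forgetting. Medium images are partially selected, prioritizing those not used recently and choosing the rest at random from the unselected images, to ensure coverage and prevent forgetting. Hard images are sampled in full in every epoch to ensure sufficient learning. The learning sufficiency of each image is updated periodically, so that the detector adaptively shifts its attention toward informative training images over time while progressively discarding redundant ones. On widely used natural-image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote-sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS delivers more than a $1.43\times$ training speedup for YOLO-series detectors while also improving detection accuracy.

Abstract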

YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift their focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\times$ training speedup for YOLO-series detectors while also improving accuracy.

Keywords: YOLO, object detection, training efficiency, sampling strategy, Anti-Forgetting Sampling Strategy, AFSS, dynamic sampling, training acceleration
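
The sampling rule summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the thresholds (`EASY_THRESHOLD`, `HARD_THRESHOLD`), the sampling fractions (`easy_frac`, `medium_frac`), the even split of the medium quota, and all function names are assumptions, since the abstract does not state them.

```python
import math
import random

# Hypothetical thresholds; the abstract does not give the actual values.
EASY_THRESHOLD = 0.85
HARD_THRESHOLD = 0.55


def learning_sufficiency(recall, precision):
    """Per-image sufficiency: the minimum of detection recall and precision."""
    return min(recall, precision)


def difficulty(score):
    """Map a sufficiency score to 'easy', 'medium', or 'hard'."""
    if score >= EASY_THRESHOLD:
        return "easy"
    if score < HARD_THRESHOLD:
        return "hard"
    return "medium"


def select_for_epoch(stats, last_used, epoch, easy_frac=0.1, medium_frac=0.5, seed=0):
    """Choose the image ids to train on in this epoch.

    stats:     {image_id: (recall, precision)}
    last_used: {image_id: epoch in which the image was last sampled}; updated in place
    """
    rng = random.Random(seed + epoch)
    groups = {"easy": [], "medium": [], "hard": []}
    for image_id, (recall, precision) in stats.items():
        groups[difficulty(learning_sufficiency(recall, precision))].append(image_id)

    def staleness(image_id):
        # Never-sampled images sort first (-1), then the longest-unused ones.
        return last_used.get(image_id, -1)

    # Hard images: sampled in full every epoch.
    selected = list(groups["hard"])

    # Easy images: a sparse subset, longest-unused first.
    easy = sorted(groups["easy"], key=staleness)
    selected += easy[: math.ceil(easy_frac * len(easy))]

    # Medium images: part of the quota goes to the longest-unused images,
    # the remainder is drawn at random from the medium images not yet chosen.
    medium = sorted(groups["medium"], key=staleness)
    quota = math.ceil(medium_frac * len(medium))
    stale_part = medium[: quota // 2]
    random_part = rng.sample(medium[quota // 2:], quota - quota // 2)
    selected += stale_part + random_part

    for image_id in selected:
        last_used[image_id] = epoch
    return selected
```

In a training loop, `stats` would be refreshed periodically from the detector's current predictions, so images can move between difficulty levels over time.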

202. ❌ Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging

Authors: Roja Sahoo, Anoop Namboodiri Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17679v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

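Scoring rationale: The paper studies contactless fingerprint spoof detection based on paired flash/non-flash imaging, which belongs to computer vision and biometrics. Its content does not touch on large language models, the underlying principles of deep learning, AI for Science, or any other keyword; it uses no large-model techniques and does not apply AI to scientific fields such as bioinformatics or cheminformatics. None of the keywords is relevant to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper studies contactless fingerprint spoof detection using paired flash/non-flash imaging. It distinguishes genuine fingerprints from spoof attacks by analyzing illumination-induced differences, showing potential to improve the robustness and interpretability of detection.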

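Abstract (translation)

Contactless fingerprint recognition enables hygienic and convenient biometric authentication, but the absence of physical contact and of traditional liveness cues poses new challenges for spoof detection. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. This work studies paired flash/non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we find that flash illumination accentuates material- and structure-related properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide baseline appearance context. We analyze illumination-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help distinguish genuine fingerprints from printed, digital, and molded presentation attacks. We further discuss the limitations of paired acquisition, including sensitivity to imaging settings, limited dataset scale, and the challenge of emerging high-fidelity spoofs. The results indicate that illumination-aware analysis has the potential to improve the robustness and interpretability of contactless fingerprint presentation attack detection, pointing to future work on paired acquisition and physics-informed feature design. Code is publicly available in the project repository.

Abstract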

Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash-non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material- and structure-dependent properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide a baseline appearance context. We analyze lighting-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high-fidelity spoofs. Our findings demonstrate the potential of illumination-aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics-informed feature design. Code is available in the repository.

Keywords: contactless fingerprint, spoof detection, flash-non-flash imaging, presentation attack detection, biometric authentication, illumination-aware analysis, material properties, interpretable metrics
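
Two of the interpretable metrics named in the abstract, inter-channel correlation and differential imaging, can be illustrated with a short sketch. The abstract gives no formulas, so the choices here are assumptions: Pearson correlation stands in for "inter-channel correlation", the differential image is a plain per-pixel subtraction, and images are represented as flat lists of intensities.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long intensity lists
    (an assumed definition of inter-channel correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant channel carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def differential_image(flash, no_flash):
    """Per-pixel difference between the flash and non-flash captures."""
    return [f - n for f, n in zip(flash, no_flash)]
```

For a paired capture, one might compute `pearson(red, green)` on each image and compare the values, or inspect `differential_image(flash_gray, no_flash_gray)` for specular highlights; both usages are illustrative only.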

203. ❌ DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation

Authors: Sarra Harrabi, Yichen Wu, Geoffrey H. Tison, Minhaj Ansari, Milos Vukadinovic, David Ouyang, Joshua P. Barrios, Jacques Delfrate, Robert Avram Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

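Scoring rationale: DeepCORO-CLIP is a multi-view foundation model for video-text analysis of coronary angiography; at its core it is a medical-imaging AI application. The highly relevant keywords are: (1) 'Large Language Models OR LLMs OR Foundation Models' (10 points), because the paper explicitly builds a 'foundation model' for medical imaging; (2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points), because the model is pre-trained with video-text contrastive learning and involves domain adaptation (external validation); (3) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because it is applied directly to biomedicine (cardiovascular disease diagnosis). Other keywords such as MoE, SFT, RAG, and inference acceleration are unrelated to the paper's computer vision and medical imaging focus and therefore score 0.

!!! tip deepseek-chat TL;DR

The study develops DeepCORO-CLIP, a multi-view foundation model pre-trained with video-text contrastive learning for automated coronary angiography analysis. It detects significant stenosis with an AUROC of 0.888 in internal validation and 0.89 in external validation, and it can predict cardiovascular events and disease progression.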

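Abstract (translation)

Coronary angiography is the gold standard for assessing coronary artery disease, yet its visual interpretation still varies between readers. Existing AI methods typically analyze single frames or a single projection and focus mainly on stenosis assessment, which limits comprehensive coronary analysis. We present DeepCORO-CLIP, a foundation model trained with multi-view video-text contrastive learning. The model was trained on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates information from multiple projections through attention-based pooling to produce study-level assessments covering diagnostic, prognostic, and disease-progression prediction tasks. For significant stenosis detection, the area under the receiver operating characteristic curve (AUROC) was 0.888 in internal validation and 0.89 in external validation. Against core-laboratory quantitative coronary angiography, its mean absolute error was 13.6%, lower than the 19.0% of clinical reports. The model also performed well in detecting chronic total occlusion, intracoronary thrombus, and coronary calcification. With transfer learning, it predicted one-year major adverse cardiovascular events (AUROC 0.79) and estimated left ventricular ejection fraction (mean absolute error 7.3%). Its embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure have been publicly released.

Abstract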

Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure are publicly released.

Keywords: foundation model, coronary angiography, video-text contrastive learning, multi-view analysis, external validation, cardiovascular disease, automated diagnosis, medical imaging AI
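
The abstract states that multiple projections are combined with attention-based pooling to obtain a study-level representation. A generic form of that operation is sketched below as an illustration only: the single `query` vector, the dot-product scoring, and the function names are assumptions, not details confirmed by the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(view_embeddings, query):
    """Pool per-view embeddings into one study-level embedding.

    view_embeddings: list of equal-length vectors, one per angiographic view
    query:           hypothetical learned vector that scores each view
    Returns (pooled_embedding, attention_weights).
    """
    scores = [sum(q * v for q, v in zip(query, emb)) for emb in view_embeddings]
    weights = softmax(scores)
    dim = len(view_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

The returned weights show how much each view contributes, which is one reason attention pooling is often preferred over plain averaging when views differ in diagnostic value.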

204. ❌ VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs

Authors: Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

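Scoring rationale: The paper focuses on world models for autonomous-driving simulation and is highly relevant to the keyword 'World Models AND General World Models' (10 points), because it proposes VectorWorld, a streaming world model that generates vector graphs of autonomous-driving scenes. However, it does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or applications of AI to science, so all other keywords score 0. The paper's core is autonomous-driving simulation technology rather than foundational innovation in large models or deep learning.

!!! tip deepseek-chat TL;DR

The paper proposes VectorWorld, a streaming world model for autonomous-driving simulation that uses diffusion flow on vector graphs for efficient closed-loop evaluation. On the Waymo and nuPlan datasets it improves map-structure fidelity and initialization validity and supports stable, real-time, long-distance closed-loop rollouts.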

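Abstract (translation)

Closed-loop evaluation of autonomous-driving policies requires interactive simulation that goes beyond log replay. However, existing generative world models often degrade in closed loop because of (i) history-free initialization that does not match policy inputs, (ii) multi-step sampling latency that violates real-time constraints, and (iii) kinematic infeasibility that accumulates over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane-agent vector-graph tiles during rollout. VectorWorld produces a policy-compatible interaction state with a motion-aware gated VAE, aligning initialization with history-conditioned policies. It achieves real-time outpainting with an edge-gated relational DiT (Diffusion Transformer), trained with interval-conditioned MeanFlow and JVP-based large-step supervision, which performs solver-free one-step masked completion. To stabilize long-horizon rollouts, we introduce $\Delta$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete-continuous actions and differentiable kinematic logit shaping. On the Waymo Open Motion and nuPlan datasets, VectorWorld improves map-structure fidelity and initialization validity and supports stable, real-time closed-loop rollouts of more than $1\mathrm{km}$ (code: https://github.com/jiangchaokang/VectorWorld).

Abstract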

Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane–agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce $\Delta$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete–continuous actions and differentiable kinematic logit shaping. On Waymo Open Motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time $1\mathrm{km}+$ closed-loop rollouts (code: https://github.com/jiangchaokang/VectorWorld).

Keywords: world model, autonomous driving, vector graph, closed-loop simulation, real-time rollout, diffusion flow, motion-aware VAE, physics-aligned policy

205. ❌ Few-Step Diffusion Sampling Through Instance-Aware Discretizations

Authors: Liangyu Yuan, Ruoyu Wang, Tong Zhao, Dingwen Fu, Mingkun Lei, Beier Zhu, Chi Zhang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17671v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

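Scoring rationale: The paper focuses on accelerating sampling for diffusion and flow matching models, in particular by optimizing timestep allocation through an instance-aware discretization strategy. Its core content concerns generative models (diffusion models, flow matching), the probability flow ODE, numerical solvers, and discretization strategies. All the given keywords relate directly to large language models (LLMs) and associated techniques (fine-tuning, alignment, reasoning, agents, compression, and so on) or to specific scientific AI applications (such as bioinformatics). Because the paper's topic is sampling methods for diffusion models rather than LLMs or related techniques, it has no direct connection to any keyword, and every keyword scores 0.

!!! tip deepseek-chat TL;DR

Addressing the problem that global timestep schedules in diffusion and flow matching models cannot adapt to instance-specific complexity, the paper proposes an instance-aware discretization framework that learns input-dependent priors to adjust timestep allocation. It consistently improves generation quality across a range of generative tasks with very low tuning cost and inference overhead.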

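Abstract (translation)

Diffusion and flow matching models generate high-fidelity data by simulating paths defined by ordinary or stochastic differential equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation makes it possible to accelerate sampling with advanced numerical solvers. Orthogonal to solver design, but equally critical, is the discretization strategy. Early methods used handcrafted heuristics and recent methods use optimization-based techniques, but most existing strategies impose a single globally shared timestep schedule on all samples. This uniform treatment ignores instance-specific complexity in the generative process and may limit performance. Motivated by controlled experiments on synthetic data showing that global schedules are suboptimal under instance-specific dynamics, we propose an instance-aware discretization framework. The method learns an adaptive timestep allocation based on input-dependent priors, extending gradient-based discretization search to conditional generation. Empirical results across synthetic data, pixel-space diffusion, and latent-space image and video flow matching models show that our method consistently improves generation quality at a tuning cost that is marginal compared with training, with negligible inference overhead.

Abstract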

Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.

Keywords: Diffusion Models, Flow Matching, Sampling Acceleration, Instance-aware Discretization, Probability Flow ODE, Numerical Solvers, Timestep Schedule, Generative Models
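
To make the difference between a global and an instance-specific timestep schedule concrete, here is a small sketch of an Euler sampler that accepts an arbitrary schedule. It is not the paper's method: the power-law schedule family, the `complexity` score, and the linear rule in `instance_schedule` are hypothetical stand-ins for the learned, input-dependent allocation described in the abstract, and time is taken to run from 0 to 1 for simplicity.

```python
def power_schedule(num_steps, rho):
    """Timesteps from 0 to 1; rho = 1 is uniform, rho > 1 packs steps near t = 0."""
    return [(i / num_steps) ** rho for i in range(num_steps + 1)]


def instance_schedule(num_steps, complexity):
    """Hypothetical stand-in for a learned predictor: map a per-instance
    complexity score in [0, 1] to a schedule exponent."""
    return power_schedule(num_steps, 1.0 + 2.0 * complexity)


def euler_sample(x0, velocity, timesteps):
    """Integrate dx/dt = velocity(x, t) along the given timesteps with Euler steps."""
    x = x0
    for t_cur, t_next in zip(timesteps, timesteps[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x
```

With the same step budget, two instances can receive different schedules, for example `instance_schedule(4, 0.0)` (uniform) and `instance_schedule(4, 0.5)` (denser near the start), while `euler_sample` itself stays unchanged.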

206. ❌ Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment

Authors: Dongqiang Gou, Xuming He Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

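Scoring rationale: The paper studies language-driven grounding of functional regions on 3D objects. Its core innovation is a two-stage cross-modal framework whose first stage uses large language models (LLMs) to generate part-aware instructions that recover missing semantics, which directly involves applying LLMs. Accordingly, only the keyword 'Large Language Models OR LLMs OR Foundation Models' is highly relevant (8 points), because LLMs serve as a key tool for enriching the semantic representation. Other keywords such as MoE, SFT, RAG, and Agents are either not mentioned in the abstract or unrelated to the paper's topic and score 0. The paper is an application of large models to a specific task (3D vision and language interaction), which meets the discretionary scoring criterion in the research background for 'research applications of large models in different domains', but it does not examine the technical principles of LLMs themselves in depth.

!!! tip deepseek-chat TL;DR

The paper proposes a novel two-stage cross-modal framework that tackles semantic and geometric alignment in open-vocabulary 3D affordance grounding by having a large language model generate part-aware instructions and by introducing geometric consistency modeling. It reports superior performance on multiple benchmarks.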

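Abstract (translation)

Grounding natural-language questions to functionally relevant regions of 3D objects, known as language-driven 3D affordance grounding, is essential for embodied intelligence and human-AI interaction. Existing methods have evolved from label-based to language-driven paradigms but still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that achieves open-vocabulary 3D affordance grounding by enhancing both semantic and geometric representations. In the first stage, a large language model generates part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. The second stage introduces two core components: an Affordance Prototype Aggregation (APA) module, which characterizes each affordance by capturing cross-object geometric consistency, and an Intra-Object Relational Modeling (IORM) module, which refines geometric differentiation within an object to achieve precise semantic alignment. Extensive experiments on a newly constructed benchmark and two existing benchmarks validate the method and show that it outperforms existing approaches.

Abstract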

Grounding natural language questions to functionally relevant regions in 3D objects – termed language-driven 3D affordance grounding – is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.

Keywords: 3D affordance grounding, open-vocabulary, large language models, cross-modal framework, semantic alignment, geometric alignment, part-aware instructions, embodied intelligence
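
The idea of an affordance prototype, a single representation per affordance that summarizes geometry shared across objects, can be illustrated with the simplest possible version: averaging point features per affordance label and matching new features by cosine similarity. This is an assumption-laden sketch, not the APA module itself, which the abstract describes only at a high level; the feature vectors, labels, and function names are hypothetical.

```python
import math


def build_prototypes(features, labels):
    """Average the feature vectors of each affordance label into one prototype."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def nearest_prototype(feature, prototypes):
    """Return the affordance label whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))
```

Features pooled from several objects that share an affordance contribute to the same prototype, which is what lets a prototype capture cross-object consistency in this simplified setting.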

207. ❌ DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis

Authors: Aleksander Ogonowski, Konrad Klimaszewski, Przemysław Rokita Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17637v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

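Scoring rationale: DSS-GAN focuses on architectural innovation for generative adversarial networks (GANs) in computer vision, using Mamba as the generator backbone for class-conditional image synthesis. Although Mamba is a state space model that performs well in sequence modeling, the paper's core contributions are the Directional Latent Routing (DLR) mechanism and the GAN architecture design, which are unrelated to the provided keyword list (centered on large language models, training methods, inference optimization, alignment techniques, agent systems, and so on). All keywords target natural language processing and LLM techniques, whereas this is purely computer vision and image generation research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes DSS-GAN, the first GAN to use Mamba as a hierarchical generator backbone, and introduces a Directional Latent Routing (DLR) mechanism for class-conditional image synthesis, achieving better image generation quality than StyleGAN2-ADA on multiple datasets.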

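Abstract (translation)

We propose DSS-GAN, the first generative adversarial network to use Mamba as a hierarchical generator backbone for noise-to-image synthesis. Its core contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific sub-vectors, each of which is projected jointly with the class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan path. Unlike conventional class conditioning, which injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map and is applied consistently at every generative scale. Across multiple test datasets, DSS-GAN achieves better FID, KID, and precision-recall results than StyleGAN2-ADA. Analysis of the latent space shows that the directional sub-vectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.

Abstract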

We present DSS-GAN, the first generative adversarial network to employ Mamba as a hierarchical generator backbone for noise-to-image synthesis. The central contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific subvectors, each jointly projected with a class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan. Unlike conventional class conditioning that injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map, applied consistently across all generative scales. DSS-GAN achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple tested datasets. Analysis of the latent space reveals that directional subvectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.

Keywords: DSS-GAN, Mamba, Generative Adversarial Network, Class-Conditional Image Synthesis, Directional Latent Routing, Hierarchical Generator, State Space Model, Image Generation
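
The conditioning mechanism described in the abstract (split the latent vector into direction-specific sub-vectors, project each jointly with the class embedding, and apply the result as a feature-wise affine modulation) can be sketched as follows. This is only an illustration of that description: the single linear projection, the `1 + ...` parameterization of the scale, the weight shapes, and every function name are assumptions rather than the published architecture.

```python
def split_latent(z, num_directions):
    """Split the latent vector into equal-sized, direction-specific sub-vectors."""
    if len(z) % num_directions != 0:
        raise ValueError("latent size must be divisible by the number of directions")
    step = len(z) // num_directions
    return [z[i * step:(i + 1) * step] for i in range(num_directions)]


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def direction_modulation(sub_z, class_embedding, w_gamma, w_beta):
    """Project one sub-vector jointly with the class embedding to (gamma, beta).

    w_gamma and w_beta are hypothetical learned matrices with one row per channel.
    """
    joint = list(sub_z) + list(class_embedding)
    gamma = [1.0 + g for g in linear(w_gamma, joint)]  # scale centered at 1 (assumed)
    beta = linear(w_beta, joint)
    return gamma, beta


def modulate(features, gamma, beta):
    """Apply the feature-wise affine modulation to every position of one scan.

    features: list of per-position channel vectors
    """
    return [[g * x + b for x, g, b in zip(position, gamma, beta)] for position in features]
```

Each scan direction would receive its own `(gamma, beta)` pair computed from its own sub-vector, which is what distinguishes this scheme from injecting one global class signal.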

208. ❌ A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning

Authors: Kundan Thota, Thorsten Schlachter, Veit Hagenmeyer Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

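Scoring rationale: The paper proposes a multi-agent LLM system for mapping building-age cohorts in cities to support urban energy planning. The system comprises three key agents (Zensus, OSM, Monument) that fuse heterogeneous data sources, and it introduces BuildingAgeCNN, a ConvNeXt-based satellite image classifier. The core contribution is the design and application of the multi-agent LLM system, so the paper is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination' (10 points each). It applies LLMs in a scientific domain, so it is somewhat related to 'Large Language Models OR LLMs OR Foundation Models' and 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points each). The remaining keywords (MoE, SFT, RAG, quantization, and so on) are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-agent LLM system that maps the age distribution of urban buildings by fusing heterogeneous data sources with satellite image classification, in support of sustainable energy planning; its classifier reaches 90.69% overall accuracy under spatial cross-validation.

Abstract translation (Chinese)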

确定城市建筑存量的年龄分布对于可持续市政供热规划与改造优先级划分至关重要。然而，现有方法通常依赖传感器或遥感技术收集的数据集，存在数据不一致和缺失问题。我们提出一个多智能体大语言模型系统，包含三个关键智能体：人口普查智能体、开放街道地图智能体和历史建筑智能体，用于融合多源异构数据。数据协调器与整合器通过地理编码和去重处理建筑轮廓数据。基于此融合的真实数据，我们提出了BuildingAgeCNN——一种纯卫星图像分类器，其以ConvNeXt架构为骨干网络，并集成了特征金字塔网络、坐标卷积空间通道和挤压激励模块。在空间交叉验证下，BuildingAgeCNN总体准确率达到90.69%，但宏观F1分数仅为67.25%，反映出严重的类别不平衡问题及相邻历史时期建筑群间的持续混淆。为降低规划应用风险，地址到预测的流程包含校准置信度估计，并对低置信度案例进行人工复核标记。该多智能体大语言模型系统不仅能协助收集结构化数据，还可帮助能源需求规划者优化区域供热网络，并助力低碳可持续能源系统的目标实现。

Abstract

Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi-agent LLM system comprising three key agents, the Zensus agent, the OSM agent, and the Monument agent, that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building imprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite-only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze-and-Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro-F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address-to-prediction pipeline includes calibrated confidence estimates and flags low-confidence cases for manual review. This multi-agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district-heating networks and target low-carbon sustainable energy systems.

Keywords: multi-agent system, LLM agents, urban building age mapping, satellite image classification, energy planning, data fusion, ConvNeXt, feature pyramid network
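
The abstract reports 90.69% overall accuracy next to a macro-F1 of only 67.25% and attributes the gap to class imbalance. A minimal sketch of how such a gap can arise is given below. The two cohort labels and the counts are hypothetical toy data, not the paper's results, and the helper name is illustrative.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro-F1 weights every class equally, regardless of how rare it is.
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced toy set: 90 "modern" buildings, 10 "historic" ones.
y_true = ["modern"] * 90 + ["historic"] * 10
# The classifier gets every modern building right but only 2 of 10 historic ones.
y_pred = ["modern"] * 98 + ["historic"] * 2

acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f}")
```

Because the rare class contributes half of the macro average in this toy setup, its poor F1 pulls the macro score well below the accuracy.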

209. ❌ S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models

Authors: Xinze Li, Pengxu Chen, Yiyuan Wang, Weifeng Su, Wentao Cheng Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

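Scoring rationale: S-VGGT focuses on improving the scalability of 3D foundation models, using structure-aware subscene decomposition to reduce the cost of global attention. All scoring keywords target large language models (LLMs) and related techniques (training, alignment, inference optimization, applications, and so on), whereas this paper studies 3D vision foundation models, which belong to computer vision and have no direct connection to text LLM techniques. The paper does not involve any LLM technique, training method, alignment technique, inference optimization, or AI-for-science application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the high cost of global attention in 3D foundation models, the paper proposes S-VGGT, a structure-aware subscene decomposition method. Using scene-graph-guided soft assignment and a shared reference frame, it substantially reduces computational cost while preserving reconstruction quality.

Abstract translation (Chinese)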

前馈式三维基础模型面临一个核心挑战：全局注意力机制带来的二次方计算成本，随着输入长度增加会严重限制模型的可扩展性。现有的并行加速方法（如令牌合并）在令牌级别进行操作。虽然这些方法能实现局部计算节省，但其所需的最近邻搜索会引入额外开销。因此，这些技术未能解决密集采集数据中占主导地位的结构性冗余这一根本问题。本文提出一种创新方法 S-VGGT，该方法在结构帧级别处理冗余问题，从根本上改变了优化重点。我们首先利用初始特征构建密集场景图，该图能表征场景的结构冗余并指导后续的场景划分。基于此图，我们将帧软分配至少量子场景中，确保组间平衡与几何平滑过渡。核心创新在于设计共享共同参考帧的子场景，建立并行几何桥梁，使得各子场景无需显式几何对齐即可实现独立高效处理。这种结构重组通过从源头削减全局注意力成本，提供了强大的内在加速能力。关键的是，S-VGGT 与令牌级加速方法完全正交，二者可无缝结合以实现叠加式加速，同时不损害重建保真度。代码发布于 https://github.com/Powertony102/S-VGGT。

Abstract

Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.

Keywords: 3D foundation models, scalability, global attention, computational cost, structure-aware decomposition, subscene partitioning, scene graph, acceleration
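
The abstract argues that splitting frames into subscenes cuts the quadratic global attention cost at its source. The arithmetic can be sketched as follows. This is a simplified model, not the paper's method: it assumes a hard, equal-sized split (the paper uses soft assignment guided by a scene graph), counts cost as the number of token pairs, and uses hypothetical frame and token counts.

```python
def global_attention_cost(num_frames, tokens_per_frame):
    """Token-pair count for full attention over all frames (quadratic)."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def subscene_attention_cost(num_frames, tokens_per_frame, num_subscenes):
    """Token-pair count when frames are split into equal subscenes.

    Assumption: one frame serves as the shared reference and is included in
    every subscene; the remaining frames are divided evenly (rounded up).
    """
    other_frames = num_frames - 1
    frames_per_subscene = -(-other_frames // num_subscenes) + 1  # ceil + reference
    return num_subscenes * global_attention_cost(frames_per_subscene, tokens_per_frame)


# Hypothetical workload: 101 frames, 1000 tokens each, 4 subscenes.
full = global_attention_cost(101, 1000)
split = subscene_attention_cost(101, 1000, 4)
print(f"split / full = {split / full:.3f}")
```

Under these assumptions the cost scales roughly as 1/K for K subscenes, with a small overhead from repeating the reference frame in each one.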

210. ❌ ReLaGS: Relational Language Gaussian Splatting

Authors: Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, Didier Stricker Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17605v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

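Scoring rationale: ReLaGS focuses on 3D scene understanding and proposes a framework that combines a language-distilled Gaussian scene with a 3D semantic scene graph for open-vocabulary 3D segmentation, retrieval, and relation understanding. Its core is computer vision, 3D representation, and scene graph construction, not innovation in large models or in the principles of deep learning techniques. All keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, Agents, and so on) concern large-model architecture, training, inference, alignment, application, or optimization, and have no direct connection to the paper's 3D vision task. The only marginally relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper involves a scientific application (3D perception), but not core bioinformatics or cheminformatics, hence 5 points (some relevance). The weighted total is only 5.0, far below the dynamic pass score of 26.6, which indicates that the paper is a poor match for the review's focus on large models and innovation in deep learning techniques.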
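
The rationale above quotes a weighted total of 5.0 against a pass score of 26.6. A minimal sketch of that arithmetic follows. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the rule that a keyword's score equals weight × relevance is an assumption inferred from the tables, and the function names are illustrative.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # cap stated in the report configuration


def pass_threshold(total_weight):
    """Pass score from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def weighted_score(rows):
    """Sum per-keyword scores; weight * relevance is an assumed scoring rule."""
    return sum(min(w * r, MAX_SCORE_PER_KEYWORD) for w, r in rows)


# 27 keywords: one with relevance 5.0, the rest 0.0, as in the table above.
relevances = [0.0] * 26 + [5.0]
score_unit_weights = weighted_score((1.0, r) for r in relevances)
score_zero_weights = weighted_score((0.0, r) for r in relevances)
threshold = pass_threshold(27.0)
print(score_unit_weights, score_zero_weights, round(threshold, 1))
```

Under this assumed rule, unit weights give the 5.0 total quoted in the rationale, while the 0.0 weights listed in the table give the 0.0 shown on the score line; both fall below the threshold.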

!!! tip deepseek-chat TL;DR

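The paper tackles unified perception and reasoning in 3D scenes. It proposes a framework that requires no scene-specific training and builds on a language-distilled Gaussian scene and a 3D semantic scene graph, enabling efficient open-vocabulary 3D segmentation, scene graph generation, and relation-guided retrieval.

Abstract translation (Chinese)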

在分割、检索与关系理解等任务中实现统一的3D感知与推理仍具挑战性，现有方法或局限于以物体为中心，或需依赖成本高昂的物体间关系训练。本文提出一种新颖框架，无需针对特定场景进行训练即可构建层次化语言蒸馏高斯场景及其3D语义场景图。通过高斯剪枝机制优化场景几何结构，并采用鲁棒的多视角语言对齐策略，将含噪声的2D特征聚合为精确的3D物体嵌入。在此层次结构基础上，我们结合视觉语言驱动的标注与基于图神经网络的关系推理，构建了开放词汇的3D场景图。该方法通过对层次化语义及物体内/物体间关系的联合建模，实现了高效且可扩展的开放词汇3D推理，并在开放词汇分割、场景图生成和关系引导检索等任务中得到验证。项目页面：https://dfki-av.github.io/ReLaGS/

Abstract

Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/

Keywords: 3D perception, Gaussian splatting, scene graph, open-vocabulary, relational reasoning, language alignment, segmentation, retrieval

211. ❌ Trust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification

Authors: Yan Liang, Ziyuan Yang, Zhuxin Lei, Mengyu Sun, Yingyu Chen, Yi Zhang Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

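Scoring rationale: The paper focuses on coreset selection for medical image classification and proposes DUCS, a strategy based on dynamic unreliability assessment. All keywords relate directly to large models, the principles of deep learning techniques, or specific applied techniques, whereas this paper studies sample selection during conventional neural network training and involves none of the keyword techniques: large models, LLMs, MoE, scaling laws, pre-training, alignment, inference optimization, agents, and so on. The only point of contact is 'AI for Science', since the paper is applied to medical image analysis (a field related to bioinformatics), but this is not its core content, hence 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

To address the low efficiency of coreset selection in medical image classification, the paper proposes Dynamic Unreliability-Driven Coreset Selection (DUCS), which selects samples by their confidence fluctuation and forgetting frequency during training. On public datasets it outperforms existing methods, particularly at high compression rates.

Abstract translation (Chinese)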

在有限资源下高效管理与利用大规模医学影像数据集存在显著挑战。尽管核心集选择方法有助于降低计算成本，但由于医学数据固有的复杂性（如类内差异大、类间相似度高），其实际效果仍受限。为解决这一问题，我们重新审视训练过程，发现神经网络在训练中始终能产生稳定的置信度预测，并对类中心附近的样本记忆更佳。然而，过度关注这些样本可能使决策边界建模复杂化。因此，我们认为越不可靠的样本实际上对构建决策边界越具信息价值。基于此，我们提出动态不可靠性驱动的核心集选择策略。具体而言，我们引入一种由内而后的不可靠性评估视角：1）内向自省：模型通过分析训练过程中置信度的演变进行自省，从而量化每个样本的不确定性；2）后向记忆追踪：模型通过追踪样本被遗忘的频率来反思其训练轨迹，从而评估对每个样本的保留能力。随后，我们选择那些在训练中表现出显著置信度波动且被反复遗忘的不可靠样本。这一选择过程确保所选样本靠近决策边界，从而帮助模型优化边界。在公共医学数据集上的大量实验表明，相较于现有最优方法，我们的策略展现出更优性能，尤其在高压缩率下更为显著。

Abstract

Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection (DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training trajectory by tracking the frequency of forgetting samples, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary. Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art (SOTA) methods, particularly at high compression rates.

Keywords: coreset selection, medical image classification, dynamic unreliability, confidence fluctuation, sample forgetting, decision boundary, inward-backward assessment, computational efficiency
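
The two signals described in the abstract, confidence fluctuation and forgetting frequency, can be illustrated with a small sketch. This is not the authors' implementation: measuring fluctuation as a standard deviation, counting a forgetting event as a correct-to-incorrect transition between consecutive epochs, and combining the two by a sum of max-normalized values are all assumptions made for illustration, as are the sample names and per-epoch histories.

```python
from statistics import pstdev


def forgetting_count(correct_flags):
    """Count epochs where a previously correct sample becomes misclassified."""
    return sum(1 for prev, cur in zip(correct_flags, correct_flags[1:])
               if prev and not cur)


def select_unreliable(history, keep):
    """Return the `keep` sample ids with the highest unreliability score.

    history maps sample id -> list of (confidence, was_correct), one per epoch.
    """
    stats = {}
    for sample_id, records in history.items():
        confidences = [conf for conf, _ in records]
        flags = [ok for _, ok in records]
        stats[sample_id] = (pstdev(confidences), forgetting_count(flags))

    # Assumed combination rule: normalize each signal by its maximum and add.
    max_fluct = max(f for f, _ in stats.values()) or 1.0
    max_forget = max(g for _, g in stats.values()) or 1
    score = {sid: f / max_fluct + g / max_forget for sid, (f, g) in stats.items()}
    return sorted(score, key=score.get, reverse=True)[:keep]


# Hypothetical per-epoch training histories for three samples.
history = {
    "easy": [(0.90, True), (0.95, True), (0.97, True), (0.98, True)],
    "boundary": [(0.60, True), (0.40, False), (0.70, True), (0.30, False)],
    "noisy": [(0.20, False), (0.30, False), (0.25, False), (0.20, False)],
}
print(select_unreliable(history, keep=1))
```

In this toy history the sample whose confidence oscillates and which is forgotten twice ranks first, which matches the abstract's intuition that such samples sit near the decision boundary.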

Authors: Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Pérez Toro, Tomas Arias Vergara, Andreas Maier Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	15.0/10	0.0

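Scoring rationale: LoGSAM proposes a parameter-efficient cross-modal framework for medical image segmentation. Its core innovations are: 1) using a pretrained Whisper ASR model and clinical NLP to turn radiologist speech into text prompts; 2) using a LoRA-adapted Grounding DINO for text-conditioned tumor localization (updating only 5% of the parameters); 3) using a frozen MedSAM to generate segmentation masks. The work is highly relevant to 'PEFT/LoRA/Parameter-efficient Fine-tuning' (core method, 15 points), 'AI for Science/Bioinformatics/Cheminformatics' (medical AI application, 15 points), 'Large Language Models/Foundation Models' (uses foundation models, 10 points), and 'Pre-training/Domain Adaptation' (uses and adapts pretrained models, 10 points). Other keywords such as MoE, SLMs, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The study proposes LoGSAM, a parameter-efficient framework that converts radiologist speech into text prompts to guide foundation models in localizing and segmenting brain tumors in MRI. It reaches a Dice score of 80.32% on the BRISC 2025 dataset and 91.7% case-level accuracy on unseen MRI scans with German dictations.

Abstract translation (Chinese)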

利用磁共振成像（MRI）对脑肿瘤进行精确定位与勾画，对于制定治疗方案和指导手术决策至关重要。然而，现有方法大多依赖于特定任务的监督模型，并受限于标注数据的有限性。为此，我们提出LoGSAM，一个参数高效的、检测驱动的框架，该框架将放射科医师的口述报告转化为文本提示，用于基于基础模型的定位与分割。首先使用预训练的Whisper自动语音识别模型对放射科医师的语音进行转录和翻译，随后通过具备否定识别能力的临床自然语言处理技术提取肿瘤相关的文本提示。这些提示通过一个经LoRA适配的视觉语言检测模型——Grounding DINO（GDINO），引导文本条件下的肿瘤定位。LoRA适配仅更新模型5%的参数，从而在保留预训练跨模态知识的同时，实现了计算高效的领域适应。预测的边界框被用作MedSAM的提示，以生成像素级的肿瘤掩膜，而无需任何额外的微调。将冻结的MedSAM模型以LoGSAM生成的先验信息为条件，在BRISC 2025数据集上取得了80.32%的最新Dice分数。此外，我们使用一位获得委员会认证的放射科医师提供的12例未见过的MRI扫描的德语口述报告，对整个流程进行了评估，实现了91.7%的病例级准确率。这些结果凸显了通过智能利用预训练基础模型并以最少的参数更新，构建一个模块化的、从语音到分割的流程的可行性。

Abstract

Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.

Keywords: Parameter-efficient fine-tuning, LoRA, Foundation models, Medical image segmentation, MRI, Cross-modal grounding, Speech-to-segmentation, Domain adaptation
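
The abstract states that the LoRA adaptation touches about 5% of the model parameters. A rough sketch of where such a fraction comes from is shown below. LoRA adds two low-rank matrices, A (r × d_in) and B (d_out × r), next to a frozen weight of shape d_out × d_in, so each adapted layer gains r × (d_in + d_out) trainable parameters. The layer shapes and rank used here are hypothetical and are not taken from the paper, which may count the fraction differently.

```python
def lora_trainable_fraction(layer_shapes, rank):
    """Fraction of parameters that are trainable when every listed layer
    gets a LoRA adapter of the given rank and the base weights stay frozen.

    layer_shapes: list of (d_out, d_in) tuples for the adapted weight matrices.
    """
    frozen = sum(d_out * d_in for d_out, d_in in layer_shapes)
    trainable = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return trainable / (frozen + trainable)


# Hypothetical model: four 256 x 256 projection matrices, LoRA rank 8.
fraction = lora_trainable_fraction([(256, 256)] * 4, rank=8)
print(f"trainable fraction = {fraction:.2%}")
```

With these toy numbers the trainable share is a little under 6%; a lower rank or larger base matrices would shrink it further.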

213. ❌ PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

Authors: Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, Yujiao Shi Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

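Scoring rationale: PanoVGGT focuses on 3D reconstruction from panoramic images, a computer vision problem, and proposes a Transformer-based framework for pose estimation and 3D reconstruction with panoramic cameras. Although the paper uses a Transformer architecture, its subject matter is entirely unrelated to the scoring keywords, which all center on large language models, the principles of deep learning techniques, and AI applications in science. The paper does not involve language models, model training techniques, reasoning methods, AI agents, model optimization, or AI applications in a specific scientific field.

!!! tip deepseek-chat TL;DR

The paper proposes PanoVGGT, a Transformer framework that addresses camera pose estimation and 3D reconstruction from panoramic images, and creates the PanoCity dataset to validate the method's accuracy and generalization.

Abstract translation (Chinese)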

全景图像提供完整的360°视场，在消费级设备中日益普及。然而，其引入的非针孔畸变对联合位姿估计与三维重建提出了挑战。现有为透视相机设计的前馈模型在此场景下泛化能力较差。我们提出PanoVGGT——一种置换等变的Transformer框架，可在单次前向传播中从一个或多个全景图联合预测相机位姿、深度图与三维点云。该模型融合了球面感知的位置编码及全景专用的三轴SO(3)旋转增强策略，实现了球面域内有效的几何推理。为消除固有的全局坐标系歧义，我们进一步在训练中引入了随机锚定策略。此外，我们构建了PanoCity数据集，这是一个包含稠密深度与六自由度位姿标注的大规模室外全景数据集。在PanoCity及标准基准上的大量实验表明，PanoVGGT在精度、鲁棒性及跨域泛化能力方面均达到优异水平。代码与数据集将公开。

Abstract

Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.

Keywords: panoramic imagery, 3D reconstruction, camera pose estimation, Transformer framework, spherical domain, PanoCity dataset, feed-forward model, geometric reasoning

214. ❌ Face anonymization preserving facial expressions and photometric realism

Authors: Luigi Celona, Simone Bianco, Raimondo Schettini Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

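Scoring rationale: The paper focuses on privacy protection in computer vision, specifically face anonymization, and improves the DeepPrivacy framework to preserve facial expressions and photometric consistency. Its content is entirely unrelated to the scoring keywords, which all concern large models, the principles of deep learning techniques, and AI applications in science; it involves no large-model technique, training method, inference optimization, agent system, or AI-for-science application.

!!! tip deepseek-chat TL;DR

The paper proposes a feature-preserving face anonymization framework. By integrating dense facial landmarks and lightweight post-processing modules, it protects identity privacy while better preserving facial expressions, lighting direction, and skin tone consistency, and on CelebA-HQ it achieves higher realism and feature fidelity than existing methods.

Abstract translation (Chinese)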

社交媒体平台和大规模数据集中人脸图像的广泛共享引发了紧迫的隐私担忧，因为生物特征标识符可能在未经同意的情况下被滥用。人脸匿名化旨在生成逼真的人脸图像，在不可逆地隐藏主体身份的同时，保持其对下游任务的实用性。然而，现有的大多数生成方法主要关注身份移除和图像真实感，往往忽略了面部表情以及光度一致性——特别是光照和肤色等属性——这些属性对于重光照、颜色恒常性以及医学或情感分析等应用至关重要。在本研究中，我们提出了一种特征保留的匿名化框架，该框架通过整合密集面部关键点以更好地保留表情，并引入轻量级后处理模块以确保光照方向和肤色的一致性，从而扩展了DeepPrivacy。我们进一步建立了专门设计的评估指标，用于量化表情保真度、光照一致性和颜色保留，以补充图像真实感、姿态准确性和抗重识别能力等标准度量。在CelebA-HQ数据集上的实验表明，与现有先进基线方法相比，我们的方法生成的匿名化人脸具有更高的真实感，并且在表情、光照和肤色方面的保真度显著提升。这些结果强调了特征感知匿名化的重要性，它是迈向更有用、更公平、更可信的隐私保护人脸数据的关键一步。

Abstract

The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject’s identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency – specifically attributes such as illumination and skin tone – that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.

Keywords: face anonymization, facial expression preservation, photometric realism, privacy protection, DeepPrivacy extension, lighting consistency, skin tone preservation, feature-aware anonymization

215. ❌ Prompt-Free Universal Region Proposal Network

Authors: Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

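Rationale: The paper "Prompt-Free Universal Region Proposal Network" addresses object detection in computer vision and proposes a universal region proposal network that needs no external prompts. Its core contributions are vision architecture components (Sparse Image-Aware Adapter, Cascade Self-Prompt, Centerness-Guided Query Selection) built on sparse adapters, cascading, and query selection. None of this relates to the scored keywords, which target large language models (LLMs), their underlying techniques (e.g., MoE, scaling laws, RLHF), or AI applications in science (e.g., bioinformatics). The paper is computer vision work that involves neither large language models nor scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes PF-RPN, a universal region proposal network that needs no external prompts. Using a Sparse Image-Aware Adapter, Cascade Self-Prompt, and Centerness-Guided Query Selection, it is trained on limited data and applies without fine-tuning to cross-domain object detection such as underwater, industrial defect, and remote sensing imagery.

Abstract (Chinese translation)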

识别潜在目标对于各类计算机视觉应用中的目标识别与分析至关重要。现有方法通常依赖示例图像、预定义类别或文本描述来定位潜在目标。然而，其对图像和文本提示的依赖往往限制了灵活性，制约了在真实场景中的适应能力。本文提出了一种新颖的无提示通用区域建议网络（Prompt-Free Universal Region Proposal Network, PF-RPN），该网络无需依赖外部提示即可识别潜在目标。首先，稀疏图像感知适配器（Sparse Image-Aware Adapter, SIA）模块利用可学习的查询嵌入进行潜在目标的初始定位，该嵌入通过视觉特征动态更新。接着，级联自提示（Cascade Self-Prompt, CSP）模块利用自提示的可学习嵌入识别剩余的潜在目标，以级联方式自主聚合信息丰富的视觉特征。最后，中心度引导查询选择（Centerness-Guided Query Selection, CG-QS）模块借助中心度评分网络促进高质量查询嵌入的筛选。本方法仅需有限数据（例如MS COCO数据的5%）即可优化，并可直接应用于多种目标检测应用领域以识别潜在目标，无需微调，例如水下目标检测、工业缺陷检测和遥感图像目标检测。在19个数据集上的实验结果验证了本方法的有效性。代码发布于https://github.com/tangqh03/PF-RPN。

Abstract

Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.

Keywords: Prompt-Free, Universal Region Proposal Network, Object Detection, Sparse Image-Aware Adapter, Cascade Self-Prompt, Centerness-Guided Query Selection, Cross-domain Application, Limited Data Optimization
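
The abstract above does not define the centerness score or how queries are ranked. As a rough illustration only, the sketch below uses the box centerness definition popularized by FCOS (an assumption, not PF-RPN's learned scoring network) and keeps the top-k queries by that score. The function names and the sample boxes are hypothetical.

```python
import math

def centerness(point, box):
    """FCOS-style centerness of `point` inside `box` = (x1, y1, x2, y2).

    Assumed definition, not taken from the paper. Returns a value in [0, 1]:
    1.0 at the box centre, 0.0 on or outside the border.
    """
    x, y = point
    x1, y1, x2, y2 = box
    left, right = x - x1, x2 - x
    top, bottom = y - y1, y2 - y
    if min(left, right, top, bottom) <= 0:
        return 0.0
    return math.sqrt(
        (min(left, right) / max(left, right)) * (min(top, bottom) / max(top, bottom))
    )

def select_queries(queries, k):
    """Keep the k queries with the highest centerness score.

    `queries` is a list of (query_id, reference_point, predicted_box) tuples.
    """
    scored = [(centerness(pt, box), qid) for qid, pt, box in queries]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [qid for _, qid in scored[:k]]

# Hypothetical queries: one centred, one near the edge, one slightly off-centre.
queries = [
    ("q0", (50, 50), (0, 0, 100, 100)),
    ("q1", (10, 50), (0, 0, 100, 100)),
    ("q2", (40, 45), (0, 0, 100, 100)),
]
selected = select_queries(queries, k=2)
print(selected)
```

In the paper the score comes from a trained network, so a fixed geometric formula like this one only conveys the idea of ranking queries by how well they sit on an object.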

216. ❌ ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling

Authors: Daowen Li, Ruixiao Dong, Ying Chen, Kai Li, Ding Ding, Li Li Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

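Rationale: ProGVC addresses video compression, proposing a generative video compression framework built on progressive transmission and autoregressive context modeling. Although it uses a Transformer for probability estimation, the work is unrelated to all scored keywords, which mainly cover LLM techniques, training methods, inference optimization, alignment, and agent systems. It involves neither the application of large language models nor innovation in their underlying techniques, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ProGVC, a progressive generative video compression framework that uses autoregressive context modeling to unify progressive transmission, efficient entropy coding, and detail synthesis. It achieves promising perceptual compression performance at low bitrates and offers practical scalability.

Abstract (Chinese translation)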

感知视频压缩利用生成先验在低码率下重建逼真的纹理与运动。然而，现有感知编解码器通常缺乏对可变码率与渐进式传输的原生支持，且其生成模块与熵编码弱耦合，限制了码率的进一步降低。受视觉自回归（VAR）模型中下一尺度预测的启发，我们提出ProGVC，一种基于渐进式的生成视频压缩框架，将渐进传输、高效熵编码与细节合成统一于单一编解码器中。ProGVC将视频编码为层次化的多尺度残差令牌图，通过渐进式传输从粗到细的尺度子集实现灵活的码率适配。基于Transformer的多尺度自回归上下文模型估计令牌概率，该概率既用于已传输令牌的高效熵编码，也在解码器中用于预测被截断的精细尺度令牌以恢复感知细节。大量实验表明，作为一种新的编码范式，ProGVC在低码率下取得了颇具前景的感知压缩性能，同时提供了实用的可扩展性。

Abstract

Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.

Keywords: Generative Video Compression, Progressive Transmission, Auto-Regressive Context Modeling, Multi-scale Residual Tokens, Transformer-based Model, Entropy Coding, Perceptual Compression, Low Bitrate
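
The coarse-to-fine residual idea in the abstract above can be illustrated on a toy 1D signal. This is a minimal sketch under strong simplifications: it uses average pooling and nearest-neighbour upsampling, skips quantization and entropy coding, and does not predict truncated scales with a context model as ProGVC does. All function names and the sample signal are hypothetical.

```python
def downsample(signal, size):
    """Average-pool a 1D signal to `size` bins (len(signal) must be divisible by size)."""
    step = len(signal) // size
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(size)]

def upsample(coarse, length):
    """Nearest-neighbour upsample a 1D signal to `length` samples."""
    step = length // len(coarse)
    return [coarse[i // step] for i in range(length)]

def encode(signal, scales):
    """Split `signal` into coarse-to-fine residual maps, one per scale."""
    residual = list(signal)
    maps = []
    for size in scales:
        coarse = downsample(residual, size)
        maps.append(coarse)
        approx = upsample(coarse, len(signal))
        # What remains to be explained by the finer scales.
        residual = [r - a for r, a in zip(residual, approx)]
    return maps

def decode(maps, length):
    """Reconstruct from however many scales were received."""
    recon = [0.0] * length
    for coarse in maps:
        recon = [r + u for r, u in zip(recon, upsample(coarse, length))]
    return recon

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 9.0, 7.0]
maps = encode(signal, scales=[1, 2, 4, 8])

errors = []
for n in range(1, len(maps) + 1):
    recon = decode(maps[:n], len(signal))
    mse = sum((s - r) ** 2 for s, r in zip(signal, recon)) / len(signal)
    errors.append(mse)
    print(f"scales received: {n}, MSE: {mse:.3f}")
```

Stopping after any prefix of the scales still yields a valid reconstruction, which is the property that makes rate adaptation by truncation possible.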

217. ❌ Deep Learning-Based Airway Segmentation in Systemic Lupus Erythematosus Patients with Interstitial Lung Disease (SLE-ILD): A Comparative High-Resolution CT Analysis

Authors: Sirong Piao, Ying Ming, Ruijie Zhao, Jiaru Wang, Ran Xiao, Rui Zhao, Zicheng Liao, Qiqi Xu, Shaoze Luo, Bing Li, Lin Li, Zhuangfei Ma, Fuling Zheng, Wei Song Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17547v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

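Rationale: The paper uses a U-Net-based deep learning framework for medical image segmentation, applied to a clinical study of interstitial lung disease associated with systemic lupus erythematosus. It is unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, agents, and so on), because those target large language models and related techniques, whereas this paper uses a conventional convolutional neural network (CNN) for image segmentation. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (specifically medical image analysis), which falls under "AI for Science", but its core contribution is not a large-model technique, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study develops a U-Net-based deep learning framework that automatically segments airways on high-resolution CT, and finds significantly larger upper-lobe airway volumes in SLE patients with interstitial lung disease, pointing to a distinct topographic phenotype of the disease.

Abstract (Chinese translation)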

本研究旨在基于深度学习算法，利用非增强胸部高分辨率CT（HRCT）图像，系统性比较伴间质性肺病（ILD）与不伴间质性肺病（non-ILD）的系统性红斑狼疮（SLE）患者在肺叶及肺段水平的气道容积差异。方法：对106例接受HRCT检查的SLE患者（27例SLE-ILD，79例SLE-non-ILD）进行回顾性分析。我们开发了一种基于U-Net架构的定制化深度学习框架，用于在HRCT图像上自动分割肺叶及肺段水平的气道结构。基于分割结果计算各肺叶及肺段的容积测量值，并采用两样本t检验对两组数据进行统计学比较（显著性阈值：p < 0.05）。结果：在肺叶水平，与SLE-non-ILD患者相比，SLE-ILD患者的右上叶（p=0.009）和左上叶（p=0.039）气道容积显著增大。在肺段水平，包括R1（p=0.016）、R3（p<0.001）和L3（p=0.038）在内的多个肺段存在显著差异，其中上肺区改变最为明显，而下肺区仅呈现不显著的趋势。结论：我们的研究表明，基于深度学习的自动化方法能够有效量化HRCT扫描中的气道容积，并揭示与不伴ILD的患者相比，SLE-ILD患者存在显著的、区域特异性的气道扩张。主要累及上肺叶及特定肺段的受累模式，凸显了SLE-ILD一种独特的区域分布表型，并提示气道结构改变可作为疾病存在的潜在生物标志物。这种由人工智能驱动的定量影像学生物标志物，有望增强SLE人群中ILD的早期检测与监测能力，最终为更个性化的患者管理提供支持。

Abstract

To characterize lobar and segmental airway volume differences between systemic lupus erythematosus (SLE) patients with interstitial lung disease (ILD) and those without ILD (non-ILD) using a deep learning-based approach on non-contrast chest high-resolution CT (HRCT). Methods: A retrospective analysis was conducted on 106 SLE patients (27 SLE-ILD, 79 SLE-non-ILD) who underwent HRCT. A customized deep learning framework based on the U-Net architecture was developed to automatically segment airway structures at the lobar and segmental levels via HRCT. Volumetric measurements of lung lobes and segments derived from the segmentations were statistically compared between the two groups using two-sample t-tests (significance threshold: p < 0.05). Results: At lobar level, significant airway volume enlargement in SLE-ILD patients was observed in the right upper lobe (p=0.009) and left upper lobe (p=0.039) compared to SLE-non-ILD. At the segmental level, significant differences were found in segments including R1 (p=0.016), R3 (p<0.001), and L3 (p=0.038), with the most marked changes in the upper lung zones, while lower zones showed non-significant trends. Conclusion: Our study demonstrates that an automated deep learning-based approach can effectively quantify airway volumes on HRCT scans and reveal significant, region-specific airway dilation in patients with SLE-ILD compared to those without ILD. The pattern of involvement, predominantly affecting the upper lobes and specific segments, highlights a distinct topographic phenotype of SLE-ILD and implicates airway structural alterations as a potential biomarker for disease presence. This AI-powered quantitative imaging biomarker holds promise for enhancing the early detection and monitoring of ILD in the SLE population, ultimately contributing to more personalized patient management.

Keywords: deep learning, airway segmentation, systemic lupus erythematosus, interstitial lung disease, high-resolution CT, U-Net, quantitative imaging biomarker, medical image analysis
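
The group comparison described in the abstract above relies on two-sample t-tests. The sketch below computes the test statistic with only the standard library, assuming the pooled-variance (Student) form; the abstract does not say whether the pooled or Welch variant was used. The airway volumes are made-up numbers for illustration, not data from the study.

```python
import math
import statistics

def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical upper-lobe airway volumes (mL), invented for this example.
ild = [5.1, 5.6, 4.9, 6.0, 5.4]
non_ild = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]

t_stat, dof = two_sample_t(ild, non_ild)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

Turning the statistic into a p-value needs the t distribution, which the standard library lacks; where SciPy is installed, `scipy.stats.ttest_ind` returns both the statistic and the p-value.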

218. ❌ MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Authors: Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17528v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

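Rationale: MM-OVSeg addresses multimodal (optical and SAR) open-vocabulary segmentation in remote sensing; its core contribution is a cross-modal fusion framework that makes segmentation more robust under adverse weather. The keywords all target LLM techniques, training methods, inference optimization, and application paradigms, whereas this is a computer vision task that involves no language-model or text-generation technique. The only slightly related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since remote sensing is an Earth science application; the paper does not explicitly mention "AI for Science", so it receives 5 points (some relevance) on domain association alone. All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles open-vocabulary segmentation of remote sensing imagery under adverse weather and proposes MM-OVSeg, a multimodal optical-SAR fusion framework that achieves more robust segmentation through cross-modal unification and dual-encoder fusion.

Abstract (Chinese translation)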

开放词汇分割技术能够根据开放的文本类别集合实现像素级识别，从而突破固定类别的限制实现泛化。尽管该技术在遥感领域潜力巨大，但相关进展目前主要局限于晴空光学数据，在云层覆盖或雾霾干扰条件下性能仍面临挑战。本文提出MM-OVSeg——一种面向恶劣天气条件的多模态光学-SAR融合开放词汇分割框架。该框架充分发挥两种数据模态的互补优势：光学影像提供丰富的光谱语义信息，而合成孔径雷达（SAR）则提供穿透云层的结构特征。针对跨模态域差异以及当前视觉语言模型密集预测能力有限的问题，我们提出两项核心设计：用于多传感器表征对齐的跨模态统一处理流程，以及集成多视觉基础模型层级特征的双编码器融合模块，从而实现文本对齐的多模态分割。大量实验表明，MM-OVSeg在不同云况条件下均展现出卓越的鲁棒性和泛化能力。源数据集与代码已公开。

Abstract

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.

Keywords: open-vocabulary segmentation, multimodal fusion, optical-SAR, remote sensing, adverse weather, cross-modal alignment, vision-language models, robust segmentation
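
Open-vocabulary segmentation, as described in the abstract above, labels each pixel by matching its feature against text embeddings of arbitrary class names. The sketch below shows that matching step with cosine similarity and an argmax, which is a common formulation and not necessarily the one MM-OVSeg uses. The three-dimensional embeddings are toy values invented for the example; real embeddings would come from a vision-language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def label_pixels(pixel_features, text_embeddings):
    """Assign each pixel the class whose text embedding is most similar."""
    labels = []
    for feat in pixel_features:
        best = max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))
        labels.append(best)
    return labels

# Toy text embeddings: the class list can be changed at inference time.
text_embeddings = {
    "water": [1.0, 0.0, 0.1],
    "building": [0.0, 1.0, 0.2],
    "forest": [0.1, 0.2, 1.0],
}
# Toy per-pixel features (for example, after fusing optical and SAR encoders).
pixel_features = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.3, 0.9]]

labels = label_pixels(pixel_features, text_embeddings)
print(labels)
```

Because classes enter only through `text_embeddings`, adding a category means adding one more entry rather than retraining a classifier head.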

219. ❌ PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

Authors: Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu, Yazhou Yao, Fumin Shen Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17520v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

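Rationale: The paper studies open-vocabulary semantic and part segmentation (OSPS) in computer vision, proposing a parallel cost aggregation paradigm (PCA-Seg) and an expert-driven perceptual learning (EPL) module. Its core innovation lies in applying vision-language models (VLMs) to segmentation, in particular extracting complementary features with a multi-expert parser. This was judged highly relevant to "Mixture of Experts OR MoE OR Sparse Models", since MoE involves multiple expert models working together. The paper does not address large language models (LLMs), their underlying techniques (scaling laws, training methods, inference optimization, and so on), or AI applications in science, so all other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses knowledge interference in cost aggregation for open-vocabulary semantic and part segmentation. It proposes a parallel cost aggregation paradigm (PCA-Seg) that, with an expert-driven perceptual learning module and a feature orthogonalization decoupling strategy, achieves state-of-the-art performance while adding only a small number of parameters.

Abstract (Chinese translation)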

视觉语言模型在开放词汇语义与部件分割领域的最新进展已引起广泛关注。然而，现有方法通过空间与类别聚合的串行结构从代价卷中提取图文对齐线索，导致类别级语义与空间上下文之间存在知识干扰。为此，本文提出一种简单而有效的并行代价聚合范式以缓解上述问题，使模型能够从代价卷中捕获更丰富的视觉语言对齐信息。具体而言，我们设计了专家驱动感知学习模块，该模块能高效整合语义流与上下文流。其通过多专家解析器从多视角提取互补特征，并引入系数映射器自适应学习各特征在像素层面的权重，从而将互补知识融合为统一且鲁棒的特征嵌入。此外，我们提出特征正交化解耦策略以减少语义流与上下文流间的冗余性，使专家驱动感知学习模块能够从正交化特征中学习多样化知识。在八个基准数据集上的大量实验表明，并行代价聚合范式中的每个并行块仅增加0.35M参数，即实现了最先进的开放词汇语义与部件分割性能。

Abstract

Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.

Keywords: open-vocabulary semantic segmentation, part segmentation, vision-language models, cost aggregation, multi-expert parser, feature orthogonalization, parallel architecture, parameter-efficient
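
The abstract above says a coefficient mapper learns pixel-specific weights for each expert feature and merges them into one embedding. The sketch below shows one plausible reading for a single pixel: softmax-normalised weights followed by a weighted sum. Softmax normalisation is an assumption, and the logits stand in for whatever the trained coefficient mapper would output; none of the names come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_pixel(expert_features, coefficient_logits):
    """Blend one pixel's expert features with softmax-normalised weights.

    expert_features: one feature vector per expert, all the same length.
    coefficient_logits: one unnormalised weight per expert for this pixel.
    """
    weights = softmax(coefficient_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, expert_features))
        for d in range(dim)
    ]
    return fused, weights

# Three hypothetical experts producing 2-dimensional features for one pixel.
expert_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

uniform_fused, uniform_weights = fuse_pixel(expert_features, [0.0, 0.0, 0.0])
skewed_fused, skewed_weights = fuse_pixel(expert_features, [5.0, 0.0, 0.0])
print(uniform_fused, skewed_fused)
```

Equal logits reduce to a plain average, while a large logit lets one expert dominate that pixel, which is the behaviour a per-pixel weighting is meant to allow.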

220. ❌ UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

Authors: Guibiao Liao, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

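Rationale: UniSem addresses semantic-aware 3D reconstruction with 3D Gaussian Splatting (3DGS), proposing a unified framework that improves depth accuracy and semantic generalization. The work belongs to computer vision and 3D reconstruction and uses deep learning, but it is unrelated to all of the given keywords, which center on large language models, training techniques, reasoning methods, alignment, compression, and agents. The paper mentions no language model, training paradigm, or reasoning technique from the keyword list, nor a specific AI for Science application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses unstable geometry and incomplete semantics in semantic-aware 3D reconstruction from sparse, unposed images. Its UniSem framework, comprising Error-aware Gaussian Dropout and a Mix-training Curriculum, improves depth prediction accuracy and open-vocabulary 3D segmentation performance.

Abstract (Chinese translation)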

基于稀疏、无位姿图像进行语义感知的三维重建对于前馈式三维高斯溅射（3DGS）而言仍具挑战。现有方法通常在稀疏视角监督下预测一组过度完备的高斯基元，导致几何结构不稳定且深度质量较差。同时，这些方法仅依赖二维分割器特征进行语义提升，其提供的三维层面监督较弱且泛化能力有限，导致在新场景中出现三维语义不完整的问题。为解决这些难题，我们提出统一框架UniSem，通过两个关键组件联合提升深度精度与语义泛化能力。首先，误差感知高斯丢弃（EGD）模块利用渲染误差线索抑制易冗余的高斯单元，实现误差引导的容量控制，从而生成具有几何稳定性的有意义高斯表示以改进深度估计。其次，我们提出混合训练课程（MTC），通过对象级原型对齐逐步融合二维分割器提升的语义与模型自身涌现的三维语义先验，以增强语义连贯性与完整性。在ScanNet和Replica数据集上的大量实验表明，UniSem在不同数量输入视角下均实现了深度预测与开放词汇三维分割的优越性能。值得注意的是，在16视角输入时，UniSem将深度相对误差（Rel）降低15.2%，并将开放词汇分割平均精度（mAcc）较现有强基线提升3.7%。

Abstract

Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model’s own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.

Keywords: 3D Gaussian Splatting, semantic 3D reconstruction, sparse unposed images, depth estimation, open-vocabulary segmentation, Error-aware Gaussian Dropout, Mix-training Curriculum, geometric stability
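
The abstract above reports a 15.2% reduction in depth Rel. Assuming "Rel" means the absolute relative depth error that is standard in depth estimation (the abstract does not spell it out), the sketch below shows how the metric and a relative reduction are computed. The depth values are invented and do not reproduce the paper's numbers.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers an error metric relative to `baseline`."""
    return (baseline - ours) / baseline

# Hypothetical depths in metres; 0.0 marks a pixel without valid ground truth.
gt = [2.0, 4.0, 0.0, 5.0]
baseline_pred = [2.4, 3.6, 1.0, 5.5]
ours_pred = [2.2, 3.8, 1.0, 5.25]

baseline_err = abs_rel(baseline_pred, gt)
ours_err = abs_rel(ours_pred, gt)
reduction = relative_reduction(baseline_err, ours_err)
print(f"baseline {baseline_err:.4f}, ours {ours_err:.4f}, reduction {reduction:.1%}")
```

A relative reduction is measured against the baseline error, so a reported 15.2% means the error shrank to 84.8% of the baseline value, not that it fell by 15.2 percentage points.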

221. ❌ EI: Early Intervention for Multimodal Imaging based Disease Recognition

Authors: Qijie Wei, Hailan Lin, Xirong Li Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17514v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

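Rationale: The paper addresses multimodal disease recognition in medical imaging, proposing the Early Intervention framework and the Mixture of Low-varied-Ranks Adaptation method. It is unrelated to most keywords because it does not cover language models, reasoning, alignment, agents, or similar topics. Only two keywords are relevant: (1) "PEFT OR LoRA OR Parameter-efficient Fine-tuning": the proposed MoR is a parameter-efficient fine-tuning technique that uses low-rank adapters and a router, which is highly relevant to PEFT (10 points). (2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies to medical image analysis (retinal disease, skin lesions, knee abnormalities), which was classed under the bioinformatics branch of AI for Science (10 points). Other keywords such as Foundation Models, MoE, and Domain Adaptation are related in a broad sense, but the paper either does not address them explicitly or mentions only VFMs (vision foundation models) rather than language models, so they score 0.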
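
The table above pairs relevance values of 10.0/10 with a weight of 0.0, which is consistent with the total of 0.0. The sketch below makes that arithmetic explicit. The pass mark formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report configuration; the per-keyword formula itself (weight × relevance / 10 × 15) is a guess, since the report does not state it.

```python
MAX_PER_KEYWORD = 15.0   # from the report configuration
RELEVANCE_SCALE = 10.0   # relevance is reported as x/10

def keyword_score(weight, relevance):
    """Assumed formula: scale relevance to the per-keyword maximum, then apply the weight."""
    return weight * relevance / RELEVANCE_SCALE * MAX_PER_KEYWORD

def pass_threshold(total_weight):
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight

# The two non-zero-relevance rows of the table above: (keyword, weight, relevance).
rows = [
    ("PEFT OR LoRA OR Parameter-efficient Fine-tuning", 0.0, 10.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0),
]

total = sum(keyword_score(w, r) for _, w, r in rows)
threshold = pass_threshold(27.0)  # 27 keywords, total weight 27.0
print(f"total {total:.1f} / threshold {threshold:.1f}, pass: {total >= threshold}")
```

Under this assumed formula, the same two rows at the configured weight of 1.0 would contribute 15 points each, so the zero weights shown in the table are what keep the paper at 0.0.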

!!! tip deepseek-chat TL;DR

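The paper addresses insufficient information fusion and difficult domain transfer in multimodal medical imaging based disease recognition. It proposes the Early Intervention framework and the parameter-efficient MoR fine-tuning method, and verifies their effectiveness on three public datasets.

Abstract (Chinese translation)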

当前基于多模态医学影像的疾病识别方法面临两大挑战。首先，主流的“单模态图像嵌入后融合”范式无法充分利用多模态数据中的互补与关联信息。其次，标记多模态医学图像的稀缺性，加之其与自然图像存在显著领域偏移，阻碍了前沿视觉基础模型（Vision Foundation Models, VFMs）在医学图像嵌入中的应用。为协同应对这些挑战，我们提出了一种新颖的早期干预（Early Intervention, EI）框架。该框架将一种模态视作目标模态，其余作为参考模态，利用参考模态的高层语义标记作为干预标记，在早期阶段引导目标模态的嵌入过程。此外，我们引入了混合多秩低秩适配（Mixture of Low-varied-Ranks Adaptation, MoR），这是一种参数高效的微调方法，它采用一组不同秩的低秩适配器和一个权重松弛路由器，以实现对视觉基础模型的适配。在视网膜疾病、皮肤病变和膝关节异常分类三个公开数据集上的大量实验表明，所提方法相较于一系列竞争性基线模型具有显著有效性。

Abstract

Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing “fusion after unimodal image embedding” paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality’s embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.

Keywords: multimodal medical imaging, disease recognition, early intervention, parameter-efficient fine-tuning, vision foundation models, low-rank adaptation, medical image embedding, domain adaptation
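
The abstract above describes MoR as a set of low-rank adapters with varied ranks combined by a router. The sketch below is a minimal, dependency-free illustration of that structure on a single frozen linear layer. It assumes LoRA-style adapters (an update of the form B·A with B initialised to zero) and a plain softmax gate with fixed logits; the paper's "weight-relaxed router" is not specified in the abstract and is not reproduced here. All class and function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LowRankAdapter:
    """LoRA-style update delta_W = B @ A, A: (rank, d_in), B: (d_out, rank)."""

    def __init__(self, d_in, d_out, rank, rng):
        self.a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]  # zero init: starts as a no-op

    def delta(self, x):
        return matvec(self.b, matvec(self.a, x))

    def num_params(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])

def adapted_forward(x, frozen_w, adapters, router_logits):
    """Frozen layer output plus a gated sum of the adapters' low-rank updates."""
    out = matvec(frozen_w, x)
    for gate, adapter in zip(softmax(router_logits), adapters):
        out = [o + gate * d for o, d in zip(out, adapter.delta(x))]
    return out

rng = random.Random(0)
d = 32
frozen_w = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
adapters = [LowRankAdapter(d, d, rank, rng) for rank in (1, 2, 4)]  # varied ranks
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

base = matvec(frozen_w, x)
out = adapted_forward(x, frozen_w, adapters, router_logits=[0.0, 0.0, 0.0])
trainable = sum(a.num_params() for a in adapters)
print(f"trainable adapter params: {trainable}, frozen params: {d * d}")
```

With B initialised to zero the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behaviour. In a real model the router logits would depend on the input rather than being constants.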

222. ❌ Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Authors: Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

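Scoring rationale: The paper evaluates how well Large Multimodal Models (LMMs) perform image-to-code generation, which is highly relevant to "Large Language Models" (8 points) because LMMs extend LLMs. It focuses on model errors in complex visual perception and code generation, which relates to "Hallucination Mitigation" (8 points) because the paper stresses that perceptual hallucinations cause the task to fail. The task requires deep reasoning and structured understanding, giving some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each). The evaluation framework is designed to expose structural failures, which relates to "Explainable AI" (5 points). The paper covers applications such as scientific visualization, giving some relevance to "AI for Science" (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the Omni-I2C benchmark for evaluating how well Large Multimodal Models convert complex digital graphics into executable code, and finds that current models show a significant performance gap in preserving structural integrity and avoiding perceptual hallucinations.

Abstract translation (Chinese)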

我们提出Omni-I2C，这是一个旨在评估大型多模态模型（Large Multimodal Models, LMMs）将复杂结构化数字图形转换为可执行代码能力的综合性基准。我们认为，该任务对当前一代LMMs构成了一个重大挑战：它要求高保真视觉感知——以解析复杂的空间层次结构和符号细节——与精确的生成表达——以合成语法正确且逻辑一致的代码——之间实现前所未有的协同。与传统描述性任务不同，Omni-I2C需要整体性理解，任何微小的感知幻觉或编码错误都会导致视觉重建的完全失败。Omni-I2C包含1080个精心策划的样本，其特点体现在涵盖主题、图像模态和编程语言的广度上。通过纳入真实用户来源的案例，该基准覆盖了从科学可视化到复杂符号表示等广泛的数字内容，每个案例均配有可执行的参考代码。为补充这种多样性，我们的评估框架提供了必要的深度：通过将性能解耦为感知保真度和符号精确度，它超越了表面准确性，揭示了当前LMMs的细粒度结构缺陷和推理瓶颈。我们的评估揭示了领先LMMs之间存在显著的性能差距；即使在复杂场景中，最先进的模型也难以保持结构完整性，这凸显了多模态代码生成仍然是一个艰巨的挑战。数据和代码可在https://github.com/MiliLab/Omni-I2C获取。

Abstract

We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception – to parse intricate spatial hierarchies and symbolic details – and precise generative expression – to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content – from scientific visualizations to complex symbolic notations – each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.

Keywords: Large Multimodal Models, Image-to-Code Generation, Benchmark, Visual Perception, Code Synthesis, Hallucination Mitigation, Structural Integrity, Multimodal Reasoning

223. ❌ UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection

Authors: Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li, Feng Gao, Fan Yang, Qingyao Wu, Ke Chen Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

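Scoring rationale: The paper addresses UAV detection, a computer vision task, and proposes a new RGB-T dataset and a detection network based on local frequency modeling. The keywords all concern large language models, deep learning techniques, or AI for Science, and the paper is unrelated to nearly all of them (LLM, MoE, SFT, RLHF, RAG, CoT, Agents, and so on). It has some relevance only to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because UAV detection can be seen as an application of AI in science or engineering, although it is not core bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

To address UAV detection against complex backgrounds, the paper builds a new RGB-T dataset, UAV-CB, and proposes LFBNet, a detection network based on local frequency modeling that achieves state-of-the-art detection performance under camouflaged and cluttered conditions.

Abstract translation (Chinese)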

在低空环境中检测无人机对感知与防御系统至关重要，但由于复杂背景、伪装及多模态干扰，该任务仍极具挑战性。在实际场景中，无人机常与建筑物、植被、电线等周围结构在视觉上融合，导致目标对比度低、边界弱化，并与杂乱背景纹理产生严重混淆。现有的无人机检测数据集虽具多样性，但并未专门针对捕捉此类伪装与复杂背景挑战而设计，这限制了鲁棒现实世界感知能力的发展。为填补这一空白，我们构建了UAV-CB——一个精心设计的新型RGB-T无人机检测数据集，其重点突出低空复杂背景与伪装特性。此外，我们提出了局部频率桥接网络，该网络在局部频率空间中对特征进行建模，以弥合RGB-T融合中频率-空间融合差距与跨模态差异差距。在UAV-CB及公开基准上的大量实验表明，LFBNet在伪装与杂乱条件下实现了最先进的检测性能与强鲁棒性，为现实应用中的多模态无人机感知提供了频率感知的新视角。

Abstract

Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.

Keywords: UAV detection, RGB-T dataset, complex background, camouflage, frequency modeling, multimodal fusion, LFBNet, robust perception

224. ❌ AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization

Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

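Scoring rationale: The paper targets video generation rather than large language models. Its core contribution is the AR-CoPO framework for aligning autoregressive video generators. It is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points) because it explicitly proposes a method that improves RLHF alignment and addresses the challenges that existing methods such as GRPO face in video generation. It has some relevance to "Instruction Tuning OR Alignment OR Value Alignment" (8 points) because it deals with alignment, although specifically human-preference alignment for video generation rather than instruction tuning or value alignment. All other keywords are unrelated (0 points), because the paper does not involve large language models, model architecture, training techniques, reasoning methods, agent systems, model compression, or scientific AI applications.

!!! tip deepseek-chat TL;DR

To address the difficulty of aligning autoregressive video generators via reinforcement learning from human feedback (RLHF), the paper proposes the AR-CoPO framework, which improves generation quality and alignment through chunk-level alignment and semi-on-policy training.

Abstract translation (Chinese)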

结合少步蒸馏的流式自回归视频生成器能够实现低延迟、高质量合成，但仍难以通过人类反馈强化学习进行对齐。现有的基于随机微分方程的GRPO方法在此场景下面临挑战：少步常微分方程和一致性模型采样器偏离了标准的流匹配常微分方程，且其短促、低随机性的轨迹对初始噪声高度敏感，导致中间随机微分方程探索失效。我们提出AR-CoPO（自回归对比策略优化）框架，将邻域GRPO的对比视角适配于流式自回归生成。AR-CoPO通过分块机制引入块级对齐：在随机选择的片段处构建邻域候选序列，分配序列级奖励，并执行局部化的GRPO更新。我们进一步提出半在线策略训练策略，通过利用参考轨迹回放池进行探索与利用的互补，提升了跨领域的生成质量。在Self-Forcing上的实验表明，AR-CoPO在领域外泛化能力和领域内人类偏好对齐方面均优于基线，证明了其实现了真实对齐而非奖励欺骗。

Abstract

Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.

Keywords: Autoregressive Video Generation, RLHF, Alignment, Contrastive Policy Optimization, GRPO, Human Preference, Streaming Generation, Semi-on-policy Training
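
The abstract above says AR-CoPO assigns sequence-level rewards to the neighborhood candidates forked at a chunk and then performs localized GRPO updates. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO-style methods commonly use: each candidate's reward is standardized against the other candidates in its group. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration and are not taken from the paper's code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates (GRPO-style).

    Hypothetical helper: each reward is centered on the group mean and
    divided by the group standard deviation (plus eps for stability).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up sequence-level rewards for four candidates forked at one chunk.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean receive positive advantages and would be reinforced. When every candidate earns the same reward, all advantages are zero, so that group contributes no update.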

225. ❌ FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Authors: Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu, Lei Zhang, Xiaojun Chang, Yongdong Zhang Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

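Scoring rationale: The paper studies Emotional Video Captioning (EVC) and proposes FACE-net, a retrieval-enhanced framework that tackles factual-emotional bias through factual calibration and emotion augmentation. It is unrelated to most of the large-model keywords because its core is a task at the intersection of computer vision and natural language processing, not large-model research. It is fairly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (8 points) because it explicitly retrieves relevant sentences from an external repository to augment semantic information. It has some relevance to "Hallucination Mitigation OR Factuality OR Truthfulness" (8 points) because its factual calibration module (uncertainty estimation and triplet refinement) aims to keep the generated captions factually accurate. It is weakly related to "Self-Correction OR Self-Improvement OR Self-Reflection" (5 points) because the paper mentions a "self-refines" mechanism, but this is not its core content. No other keywords are covered.

!!! tip deepseek-chat TL;DR

The paper proposes FACE-net, a retrieval-enhanced framework whose factual calibration and emotion augmentation modules address factual-emotional bias in emotional video captioning, improving the factual accuracy and emotional adaptivity of the generated captions.

Abstract translation (Chinese)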

情感视频描述（Emotional Video Captioning, EVC）是一项新兴任务，其目标在于结合视频所表达的内在情感来描述事实性内容。现有研究通常感知全局情感线索，随后将其与视频内容结合以生成描述。然而，由于在生成过程中对事实与情感线索的挖掘与协调不足，现有方法难以处理事实-情感偏差问题，即不同样本在生成时对事实性与情感性的需求存在差异。为此，我们提出了一种检索增强框架，该框架融合了事实校准与情感增强（FACE-net），通过统一的架构协同挖掘事实-情感语义，并为生成过程提供自适应且精准的指导，从而突破传统方法在所有样本学习中事实与情感描述相互妥协的倾向。在技术上，我们首先引入外部知识库，检索与视频内容最相关的句子以增强语义信息。随后，我们通过不确定性估计模块进行事实校准，将检索到的信息分解为主谓宾三元组，并借助视频内容进行自校准与交叉校准，以有效挖掘事实语义；同时，我们的渐进式视觉情感增强模块将已校准的事实语义作为专家知识，与视频内容及情感词典进行交互，生成视觉查询与候选情感，进而将其聚合以自适应地为每个事实语义增强情感表达。此外，为缓解事实-情感偏差，我们设计了一种动态偏差调整路由模块，用于预测并调整样本的偏差程度。

Abstract

Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine with video content to generate descriptions. However, insufficient factual and emotional cues mining and coordination during generation make their methods difficult to deal with the factual-emotional bias, which refers to the factual and emotional requirements being different in different samples on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all sample learning. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions to each factual semantics. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.

Keywords: Emotional Video Captioning, retrieval-enhanced framework, factual calibration, emotion augmentation, factual-emotional bias, uncertainty estimation, subject-predicate-object triplets, dynamic bias adjustment

226. ❌ ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Authors: Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

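Scoring rationale: The paper focuses on Interactive Head Generation (IHG) in computer vision and multimedia, studying how to generate avatar head videos that are contextually appropriate and emotionally rational. Although it uses deep learning, every keyword relates directly to large language models (LLMs) and associated techniques (MoE, Scaling Laws, RLHF, RAG, and so on) or to AI for Science applications, and the paper mentions neither LLM techniques nor scientific AI applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ECHO framework, which uses long-range contextual understanding and a spatial-aware decoupled cross-attention module to address insufficient contextual appropriateness and impaired lip synchronization in interactive head generation, significantly improving the visual fidelity and emotional rationality of avatars.

Abstract translation (Chinese)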

在自然面对面交互中，参与者流畅地在说话与倾听间切换，产生的面部行为（FBs）受到长程情境的精细调控，自然地展现出情境恰当性与情感合理性。交互式头部生成（IHG）旨在合成模拟此类能力的逼真虚拟人头部视频。现有IHG方法通常基于短时窗口内的双轨信号（即人类用户的行为与虚拟人的预定义音频）进行条件生成，共同驱动虚拟人音频对齐的唇部动作与非言语面部行为的合成。然而，这些方法仍存在两大挑战：（i）依赖短片段行为线索而缺乏长程情境建模，导致生成的面部行为缺失情境恰当性；（ii）双轨信号以纠缠式、角色无关的方式融合，经验上会引入跨信号干扰，可能损害说话时的唇部区域同步性。为此，我们提出ECHO，一种新颖的IHG框架，包含两个关键组件：长程情境理解（LCU）组件，促进对基于行为的动态变化与语言驱动的情感语义的情境理解，以提升合成虚拟人面部行为的情境恰当性与情感合理性；以及分块式空间感知解耦交叉注意力调制（SDCM）模块，在保持自音频驱动的唇部动作的同时，自适应地整合用户情境行为线索以驱动非唇部面部区域，辅以我们设计的两阶段训练范式，共同提升唇部同步与视觉保真度。大量实验验证了所提出组件的有效性及ECHO在IHG任务上的优越性能。

Abstract

In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user’s behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar’s audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO’s superior IHG performance.

Keywords: Interactive Head Generation, facial behaviors, contextual appropriateness, emotional rationality, long-range contextual modeling, lip synchronization, avatar video synthesis, cross-attention modulation

227. ❌ SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

Authors: Xi Ye, Wenjia Yang, Yangyang Xu, Xiaoyang Liu, Duo Su, Mengfei Xia, Jun Zhu Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

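Scoring rationale: The paper addresses motion alignment in video diffusion models and proposes a hybrid fine-tuning framework named SHIFT. It bears directly on the keyword "Post-training OR Supervised Fine-tuning OR SFT" because it explicitly studies supervised fine-tuning in the post-training stage and proposes an improved fine-tuning method. The other keywords mainly concern large language models (LLMs), reasoning, alignment, compression, and scientific AI applications, whereas this paper studies video-generation diffusion models in computer vision and has no direct link to them.

!!! tip deepseek-chat TL;DR

The paper studies the loss of motion fidelity in video diffusion models after fine-tuning and proposes SHIFT, a framework that combines supervised fine-tuning with advantage-weighted fine-tuning, effectively resolving dynamic-degree collapse and speeding up convergence.

Abstract translation (Chinese)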

图像条件视频扩散模型在视觉真实感方面取得了显著成就，但常面临运动保真度下降的问题，例如运动动态减弱或长期时间连贯性退化，这种现象在微调后尤为明显。本研究探讨了视频扩散模型在训练后的运动对齐问题。为此，我们引入了基于像素通量动态的像素运动奖励机制，该机制能够同时捕捉瞬时与长期的运动一致性。我们进一步提出了平滑混合微调（SHIFT），这是一个可扩展的、奖励驱动的视频扩散模型微调框架。SHIFT将常规监督微调与优势加权微调融合为一个统一框架。得益于新颖的对抗性优势设计，SHIFT提升了收敛速度并缓解了奖励滥用问题。实验表明，我们的方法能有效解决现代视频扩散模型在监督微调中出现的动态程度塌缩问题。

Abstract

Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses the normal supervised fine-tuning and advantage weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse in supervised fine-tuning of modern video diffusion models.

Keywords: Video Diffusion Models, Motion Alignment, Fine-tuning, Supervised Fine-tuning, Reward-driven Fine-tuning, Pixel-motion Rewards, Adversarial Advantages, Temporal Coherence
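
The abstract describes SHIFT as fusing ordinary supervised fine-tuning with advantage-weighted fine-tuning, without giving the formula. One generic way to picture such a blend, offered here purely as an assumption, is to mix a uniform SFT weight with an exponential advantage weight in the style of advantage-weighted regression. The names `hybrid_weights`, `weighted_nll`, `mix`, `beta`, and `max_weight` are hypothetical, the numbers are made up, and the paper's adversarial advantage design is not modeled.

```python
import math


def hybrid_weights(advantages, mix=0.5, beta=1.0, max_weight=20.0):
    """Blend a uniform SFT weight (1.0) with a clipped exp(advantage / beta) weight.

    mix=0.0 reduces to plain SFT; mix=1.0 is purely advantage-weighted.
    """
    weights = []
    for adv in advantages:
        adv_weight = min(math.exp(adv / beta), max_weight)
        weights.append((1.0 - mix) * 1.0 + mix * adv_weight)
    return weights


def weighted_nll(log_probs, weights):
    """Per-sample weighted negative log-likelihood, averaged over the batch."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Made-up advantages and per-sample log-likelihoods for three training clips.
advantages = [1.0, 0.0, -1.0]
log_probs = [-0.7, -1.2, -2.3]
weights = hybrid_weights(advantages)
loss = weighted_nll(log_probs, weights)
```

In this toy form, samples with positive advantage pull harder on the loss than plain SFT would, samples with negative advantage pull less, and the clip on the exponential keeps a single high-reward sample from dominating the batch.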

228. ❌ Structured SIR: Efficient and Expressive Importance-Weighted Inference for High-Dimensional Image Registration

Authors: Ivor J. A. Simpson, Neill D. F. Campbell Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

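Scoring rationale: The paper focuses on a probabilistic inference method (Structured SIR) for medical image registration, which belongs to computer vision and medical image analysis. The keywords all concern large models, deep learning techniques, or AI applications in science, but the paper touches on no large models, language models, training techniques, inference optimization, alignment, or agent systems. The only point of contact is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to brain MRI registration (a biomedical domain); this is not its core innovation, hence 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Structured SIR, an efficient probabilistic inference method for high-dimensional 3D brain MRI registration that uses a novel memory-efficient covariance parameterization to produce better-calibrated uncertainty than variational inference at equivalent or better accuracy.

Abstract translation (Chinese)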

图像配准是一个不适定的密集视觉任务，其存在多个能实现相似损失值的解，这促使了概率推断的应用。先前已有研究采用变分推断来捕捉这些分布，但对后验形式的限制性假设可能导致表征能力不足、过度自信以及低质量样本。更灵活的后验分布通常受限于密集三维图像配准所需的高维协方差矩阵的复杂性。

在本研究中，我们提出了一种内存和计算高效的概率推断方法——结构化重要性重采样法，该方法能够通过高质量样本实现表达力强、多模态的不确定性表征。我们提出采用一种结合了新型内存高效高维协方差参数化的重要性重采样算法，该参数化将协方差表示为低秩协方差与稀疏、空间结构化的Cholesky精度因子之和。这种结构能够在保持计算可行性的同时，捕捉复杂的空间相关性。

我们在脑部MRI数据的三维密集图像配准这一超高维问题中评估了该方法的有效性。实验表明，我们提出的方法所产生的不确定性估计，其校准效果显著优于变分方法所得结果，同时达到相当或更高的精度。关键的是，我们证明了该模型能够生成高度结构化的多模态后验分布，从而实现高效且有效的不确定性量化。

Abstract

Image registration is an ill-posed dense vision task, where multiple solutions achieve similar loss values, motivating probabilistic inference. Variational inference has previously been employed to capture these distributions; however, restrictive assumptions about the posterior form can lead to poor characterisation, overconfidence and low-quality samples. More flexible posteriors are typically bottlenecked by the complexity of high-dimensional covariance matrices required for dense 3D image registration. In this work, we present a memory and computationally efficient inference method, Structured SIR, that enables expressive, multi-modal characterisation of uncertainty with high quality samples. We propose the use of a Sampled Importance Resampling (SIR) algorithm with a novel memory-efficient high-dimensional covariance parameterisation as the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This structure enables capturing complex spatial correlations while remaining computationally tractable. We evaluate the efficacy of this approach in 3D dense image registration of brain MRI data, which is a very high-dimensional problem. We demonstrate that our proposed method produces uncertainty estimates that are significantly better calibrated than those produced by variational methods, achieving equivalent or better accuracy. Crucially, we show that the model yields highly structured multi-modal posterior distributions, enabling effective and efficient uncertainty quantification.

Keywords: Image Registration, Probabilistic Inference, Uncertainty Quantification, High-dimensional Covariance, Sampled Importance Resampling, Brain MRI, 3D Dense Registration, Variational Inference
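
The abstract builds on Sampled Importance Resampling (SIR). For readers unfamiliar with it, the sketch below shows plain SIR on a one-dimensional toy problem: draw from a proposal, weight each draw by target density over proposal density, then resample in proportion to the weights. It is a minimal sketch under stated assumptions and says nothing about the paper's structured covariance parameterisation; the function name, the Gaussian proposal and target, and the sample counts are all hypothetical.

```python
import math
import random


def sampled_importance_resampling(sample_proposal, log_proposal, log_target,
                                  n_draws, n_keep, rng):
    """Generic SIR: weight proposal draws by target/proposal, then resample.

    The log densities may be unnormalized, because resampling only uses
    the weights relative to each other.
    """
    draws = [sample_proposal(rng) for _ in range(n_draws)]
    log_w = [log_target(x) - log_proposal(x) for x in draws]
    max_log_w = max(log_w)  # subtract the max before exponentiating for stability
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    return rng.choices(draws, weights=weights, k=n_keep)


# Toy setup: a broad proposal N(0, 2^2) and a narrower target N(1, 0.5^2).
rng = random.Random(0)
samples = sampled_importance_resampling(
    sample_proposal=lambda r: r.gauss(0.0, 2.0),
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    log_target=lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2,
    n_draws=5000,
    n_keep=1000,
    rng=rng,
)
posterior_mean = sum(samples) / len(samples)
```

The resampled points concentrate around the target mean even though the proposal is centered elsewhere. How well this works depends on the proposal covering the target, which is why the quality of the proposal covariance matters in high dimensions.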

229. ❌ Towards Motion-aware Referring Image Segmentation

Authors: Chaeyun Kim, Seunghoon Yi, Yejin Kim, Yohan Jo, Joonseok Lee Journal/source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17413v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

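Scoring rationale: The paper studies Referring Image Segmentation (RIS), focusing on segmentation driven by motion descriptions. It proposes a data augmentation scheme and Multimodal Radial Contrastive Learning, and creates the M-Bench benchmark. The scoring keywords all target specific topics such as large models, the technical principles of deep learning, and AI for science, whereas this paper is a multimodal image segmentation task in computer vision and touches none of the keyword areas (large-model architecture, training methods, inference optimization, alignment, agent systems, model compression). Although the paper uses deep learning, its content is unrelated to the specific technical directions in the keyword list.

!!! tip deepseek-chat TL;DR

The paper addresses the weak performance of Referring Image Segmentation on motion-based queries by proposing a data augmentation scheme and Multimodal Radial Contrastive Learning, which substantially improve segmentation on motion-related queries.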

Abstract

Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions. Codes are available at https://github.com/snuviplab/MRaCL

Keywords: Referring Image Segmentation, motion-aware, multimodal learning, contrastive learning, data augmentation, benchmark evaluation, vision-language models, object segmentation

230. ❌ Mutually Causal Semantic Distillation Network for Zero-Shot Learning

Authors: Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17412v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

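Scoring rationale: The paper studies a visual-attribute semantic distillation network for zero-shot learning (ZSL). It belongs to computer vision and machine learning but does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications of large models across domains. All scoring keywords relate to large models, deep learning techniques, or AI for Science, whereas this paper focuses on conventional ZSL methods and neither uses nor mentions any large-model technique, training method, inference optimization, or scientific application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a mutually causal semantic distillation network (MSDN++) to address the insufficient discovery of intrinsic semantic knowledge between visual and attribute features in zero-shot learning, and reports experiments on three benchmark datasets showing that it outperforms existing methods.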

Abstract

Zero-shot learning (ZSL) aims to recognize the unseen classes in the open world guided by the side-information (e.g., attributes). Its key task is how to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention in a weakly supervised manner to learn the spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantic) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute$\rightarrow$visual causal attention sub-net that learns attribute-based visual features, and a visual$\rightarrow$attribute causal attention sub-net that learns visual-based attribute features. The causal attention encourages the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on three widely-used benchmark datasets (e.g., CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performances.

Keywords: Zero-shot Learning, Semantic Distillation, Causal Attention, Attribute Features, Visual Features, Mutual Learning, State-of-the-art Performance

231. ❌ Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

Authors: Rui Hong, Shuxue Quan Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17398v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

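Scoring rationale: The paper focuses on video generation and proposes a parameter-efficient method built on Stable Diffusion whose core contribution is a motion-adaptive temporal attention mechanism. It is unrelated to the vast majority of keywords, which mainly concern the technical principles, training methods, inference optimization, alignment, and agent systems of large language models (LLMs). The only relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning": the paper explicitly describes "parameter-efficient video generation" and injects lightweight temporal attention modules (adding trainable parameters equal to only 2.9% of the base UNet) to achieve efficient fine-tuning, which closely matches the core idea of parameter-efficient fine-tuning (PEFT), hence a relevance of 10. Other keywords such as AI for Science do involve scientific applications, but the paper belongs to computer vision and generative modeling and has no direct connection to specific scientific fields such as bioinformatics.

!!! tip deepseek-chat TL;DR

The paper proposes a motion-adaptive temporal attention mechanism for parameter-efficient video generation on a frozen Stable Diffusion model; by dynamically adjusting the temporal attention receptive field and injecting lightweight modules, it achieves competitive results on WebVid while adding only a small number of parameters.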

Abstract

We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy – global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.

Keywords: video generation, Stable Diffusion, temporal attention, motion-adaptive, parameter-efficient, lightweight modules, noise initialization, temporal consistency
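
The abstract describes temporal attention whose receptive field shrinks for high-motion clips and widens for low-motion clips. A minimal sketch of that idea as a frame-to-frame attention mask follows. The `motion` score in [0, 1] and the linear mapping from motion to window radius are hypothetical choices for illustration, not the rule used in the paper.

```python
def temporal_attention_mask(num_frames, motion):
    """Return a num_frames x num_frames boolean mask (True = may attend).

    `motion` is an assumed motion score in [0, 1]; the linear mapping from
    motion to window radius is illustrative only.
    """
    if not 0.0 <= motion <= 1.0:
        raise ValueError("motion must lie in [0, 1]")
    full = num_frames - 1  # radius that covers every frame
    # High motion -> radius 1 (neighbouring frames); low motion -> full radius.
    radius = max(1, round(full * (1.0 - motion)))
    return [[abs(i - j) <= radius for j in range(num_frames)]
            for i in range(num_frames)]


low = temporal_attention_mask(8, motion=0.0)   # every frame sees every frame
high = temporal_attention_mask(8, motion=1.0)  # each frame sees its neighbours
```

In a real model such a mask would typically be applied to the attention logits before the softmax, so that masked frame pairs contribute nothing to the output.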

232. ❌ Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Authors: Rui Hong, Jana Kosecka Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17396v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

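Scoring rationale: The paper focuses on 3D hand pose estimation in computer vision, using a Transformer architecture and pretraining. It is unrelated to most large-model keywords; it is highly relevant only to "Pre-training" (the core method includes gesture-aware pretraining) and somewhat relevant to "AI for Science" (an application of AI to science and engineering).

!!! tip deepseek-chat TL;DR

The paper proposes a two-stage framework that uses gesture semantics as an inductive bias, combining gesture-aware pretraining with Transformer-based per-joint token fusion, and significantly improves the accuracy of 3D hand pose estimation from single RGB images.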

Abstract

Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.

Keywords: 3D hand pose estimation, gesture-aware pretraining, Transformer, MANO parameters, InterHand2.6M, human-computer interaction, token fusion, inductive bias

233. ❌ Harnessing the Power of Foundation Models for Accurate Material Classification

Authors: Qingran Lin, Fengwei Yang, Chaolun Zhu Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

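Scoring rationale: The paper's core contribution is using vision-language foundation models (VLMs) to tackle data scarcity in material classification, proposing an image generation and auto-labeling pipeline and a prior incorporation strategy. It is highly relevant to "Foundation Models" (10) because it explicitly uses and improves foundation models; relevant to "Pre-training" and "SFT" (8) because it fine-tunes pretrained models; relevant to "AI for Science" (8) as a scientific AI application; and somewhat relevant to "Scaling Laws AND Data Quality" (5) because it improves data quality. Other keywords such as MoE, SLMs, and RLHF are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes a new framework that uses vision-language foundation models to address data scarcity in material classification; through an image generation and auto-labeling pipeline and a prior incorporation strategy, it significantly improves classification accuracy on multiple datasets.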

Abstract

Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.

Keywords: Foundation Models, Material Classification, Vision-Language Models, Data Scarcity, Auto-labeling, Fine-tuning, AI for Science, Computer Vision

234. ❌ Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

Authors: Rui Hong, Jana Kosecka Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17388v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

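Scoring rationale: The paper studies sign language motion generation using diffusion models and conditional generation, focusing on computer vision, motion generation, and sign language linguistics. It does not involve large language models (LLMs), innovations in the technical principles of deep learning, or specific AI for Science applications (such as bioinformatics or cheminformatics). None of the keywords, which cover large-model techniques, training methods, inference optimization, alignment, agent systems, and model compression, applies, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies generating 3D sign language motion conditioned on text input and sign language phonological attributes. By establishing a diffusion baseline and analyzing different text encoders and conditioning modes, it finds that mapping symbolic phonological attributes to natural language is essential for the CLIP encoder, and its best model outperforms existing methods on all metrics.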

Abstract

Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs continues to be very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using a Human Motion MDM-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.

Keywords: Sign Language Motion Generation, Diffusion Model, Phonological Attribute Conditioning, Text Conditioning, CLIP, T5, ASL-LEX, SMPL-X

235. ❌ VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

Authors: Hongbo Lu, Liang Yao, Chenghao He, Fan Liu, Wenlong Liao, Tao He, Pai Peng Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17382v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

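Scoring rationale: VisionNVS focuses on novel view synthesis (NVS) in computer vision, specifically for autonomous driving. It proposes a camera-only framework based on self-supervised inpainting and a virtual-shift paradigm; its core contribution is reformulating view synthesis as an inpainting problem, using monocular depth proxies and a pseudo-3D seam synthesis strategy. All scoring keywords relate directly to large language models (LLMs), the technical principles of deep learning, or AI applications in science, whereas this paper is pure computer vision and involves no LLM technique, model training method (such as pretraining, fine-tuning, or alignment), inference optimization, agent system, or scientific AI application. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the supervision gap in novel view synthesis (NVS) for autonomous driving. Its VisionNVS framework reformulates view synthesis as a self-supervised inpainting task and introduces a virtual-shift strategy and pseudo-3D seam synthesis, achieving higher geometric fidelity and visual quality than LiDAR-dependent baselines.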

Abstract

A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.

Keywords: Novel View Synthesis, Autonomous Driving, Self-Supervised Inpainting, Virtual-Shift Paradigm, Monocular Depth, Pseudo-3D Seam Synthesis, Camera-Only Framework, Geometric Fidelity

236. ❌ Stereo World Model: Camera-Guided Stereo Video Generation

Authors: Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17375v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

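Scoring rationale: StereoWorld focuses on a world model for stereo video generation and is unrelated to most keywords. The only highly relevant keyword is "World Models AND General World Models" (10), because the paper explicitly proposes a "stereo world model". "Pre-training OR Continual Pre-training OR Domain Adaptation" receives 5 because the abstract mentions "preserving pretrained video priors", indicating that pretraining is used. Other keywords (such as LLMs, MoE, and RLHF) are unrelated to the paper's focus on computer vision and video generation.

!!! tip deepseek-chat TL;DR

The paper proposes StereoWorld, a camera-guided stereo world model that achieves end-to-end stereo video generation through a unified camera-frame RoPE and a stereo-aware attention decomposition; it outperforms existing methods in stereo consistency, disparity accuracy, and camera-motion fidelity, and supports VR rendering and embodied policy learning.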

Abstract

We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.

Keywords: stereo world model, stereo video generation, camera-conditioned, binocular geometry, disparity, RoPE, attention decomposition, end-to-end

237. ❌ Shot-Aware Frame Sampling for Video Understanding

作者: Mengyu Zhao, Di Fu, Yongyu Xie, Jiaxing Zhang, Zhigang Yuan, Shirin Jalali, Yong Cao 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17374v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

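评分理由: The paper studies a frame-sampling method for video understanding (InfoShot). It belongs to computer vision and mainly concerns video processing, frame-sampling algorithms, and the use of vision-language models (VLMs). All scoring keywords target large language models (LLMs) and related techniques (training, alignment, inference optimization, agent systems, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper does not address LLMs, MoE, SLMs, training techniques, reasoning methods, agent systems, or model compression, and it does not apply LLMs in scientific fields such as bioinformatics. All keyword relevance scores are therefore 0.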

!!! tip deepseek-chat TL;DR

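Frame samplers for long-video understanding struggle to balance overall video coverage against brief but critical events when only a few frames can be kept. The paper proposes InfoShot, a task-agnostic, shot-aware frame sampler that selects complementary keyframes under an information-theoretic objective. Experiments show that it improves anomaly hit rate and Video-QA accuracy under frame-count constraints.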

摘要翻译

视频帧采样对于利用视觉语言模型（VLMs）实现高效的长视频理解至关重要，因为密集输入成本高昂且常超出上下文限制。然而，当只能保留少量帧时，现有采样方法往往难以兼顾广泛的视频覆盖度与短暂但关键的事件，这可能导致下游预测不可靠。为解决此问题，我们提出InfoShot，一种任务无关、镜头感知的长视频理解帧采样器。InfoShot首先将视频分割为语义一致的镜头（shots），随后从每个镜头中选取两帧互补的关键帧：一帧用于表征主要内容，另一帧用于捕捉镜头内不寻常的变化。该设计受到信息论目标的指导，旨在促使采样集合同时保留关于镜头结构与稀疏镜头内偏差的高信息量。通过这种方式，它在无需任何重新训练的情况下，提高了同时保留整体视频上下文与短暂决策关键时刻的可能性。为更好地评估此类短暂事件，我们进一步引入SynFlash，这是一个具有可控亚秒级异常模式与帧级真实标注的合成基准数据集，同时我们也在现有异常数据集和通用视频理解任务上评估InfoShot。实验表明，在帧数限制下，InfoShot提升了异常命中率与下游视频问答（Video-QA）准确率，并在标准视频理解基准测试中达到或优于现有强基线方法。

摘要 (Abstract)

Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.

关键词: Video frame sampling, Long-video understanding, Vision-Language Models (VLMs), Shot-aware sampler, Information-theoretic objective, Anomaly detection, Video-QA accuracy, Task-agnostic
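
The two-stage procedure described in the abstract (split the video into shots, then keep one representative and one deviating frame per shot) can be sketched in a few lines. This is only an illustration under stated assumptions, not InfoShot itself: frames are assumed to be given as precomputed feature vectors, shot boundaries are found with a hypothetical distance threshold, and the information-theoretic objective is replaced by a simple "closest to / farthest from the shot mean" heuristic. The function names `split_into_shots`, `pick_keyframes`, and `sample_frames` are made up for this sketch.

```python
from math import dist


def split_into_shots(features, threshold):
    """Group frame indices into shots; a new shot starts when two
    consecutive feature vectors are farther apart than `threshold`."""
    if not features:
        return []
    shots, current = [], [0]
    for i in range(1, len(features)):
        if dist(features[i - 1], features[i]) > threshold:
            shots.append(current)
            current = []
        current.append(i)
    shots.append(current)
    return shots


def pick_keyframes(features, shot):
    """Return up to two frame indices for one shot: the frame closest to
    the shot mean (main content) and the frame farthest from it
    (unusual within-shot change)."""
    dim = len(features[shot[0]])
    mean = [sum(features[i][d] for i in shot) / len(shot) for d in range(dim)]
    representative = min(shot, key=lambda i: dist(features[i], mean))
    if len(shot) == 1:
        return [representative]
    others = [i for i in shot if i != representative]
    outlier = max(others, key=lambda i: dist(features[i], mean))
    return sorted([representative, outlier])


def sample_frames(features, threshold):
    """Select at most two keyframes per shot, in temporal order."""
    selected = []
    for shot in split_into_shots(features, threshold):
        selected.extend(pick_keyframes(features, shot))
    return selected


# Toy 1-D features: a calm shot with one brief spike, then a second shot.
toy = [[0.0], [0.1], [0.0], [0.9], [5.0], [5.1], [5.0]]
print(sample_frames(toy, threshold=2.0))
```

In the toy input the spike at index 3 is kept alongside a representative frame of the first shot, which is the behavior the abstract motivates: a uniform sampler with the same budget could easily miss it.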

238. ❌ Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

作者: Ben S. Southworth, Stephen Thomas 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17970v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

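评分理由: The paper proposes a new optimizer, MUD, for faster Transformer training, an innovation in the technical foundations of large models. Relevance to the keywords: (1) 'Large Language Models' (5 points), because the experiments use GPT-2 and the ESM-2 protein language model; (2) 'Pre-training' (5 points), because the optimizer is applied directly to model pre-training; (3) 'AI for Science' (5 points), because the paper uses the ESM-2 protein language model, which falls under bioinformatics. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.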

!!! tip deepseek-chat TL;DR

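The paper proposes MUD, a new optimizer that replaces Muon's polar-decomposition update with a triangular whitening step. It keeps convergence comparable to Muon while substantially reducing optimizer overhead, giving 10-50% wall-clock improvements in time-to-perplexity when training GPT-2 and a protein language model.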

摘要翻译

正交化动量优化器（如Muon）通过短极分解迭代对矩阵值动量更新进行近似白化/正交化，从而改进Transformer训练。然而，极因子近似通常需要多次大型矩阵乘法，由此产生的开销可能相当显著且依赖于硬件。我们提出MUD（动量去相关），这是一种互补的白化方法，它利用受经典格拉姆-施密特和高斯-赛德尔思想启发的三角（类Cholesky）白化替代方案，取代了Muon中的极更新。我们证明行正交矩阵是MUD映射的不动点，将内部步骤与格拉姆矩阵的对称高斯-赛德尔预处理联系起来，并证明了在不动点附近具有局部二次收敛性。在达到困惑度的时间效率方面，相较于调优后的AdamW和Muon，MUD在挂钟时间上实现了稳定的10-50%提升——虽然每步收敛速度通常略慢于Muon，但优化器开销显著降低：相对于Muon，在大多数设置下MUD将峰值令牌处理速度提升了约$1.3-2.6$倍，在A100上训练GPT-2大模型时提升接近$3$倍。我们还演示了训练ESM-2 1.5亿参数蛋白质语言模型的结果，其中MUD在显著更短的挂钟时间内达到了与Muon相当的验证困惑度水平。

摘要 (Abstract)

Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon’s polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram–Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. In terms of time-to-perplexity, MUD yields consistent 10-50% wall-clock improvements over tuned AdamW and Muon, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead: relative to Muon, MUD improves peak tokens/s by roughly $1.3-2.6\times$ across most settings and up to nearly $3\times$ on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.

关键词: Transformer training, optimizer, momentum decorrelation, wall-clock improvement, protein language model, time-to-perplexity, whitening, Gauss-Seidel
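
To make the phrase "triangular (Cholesky-like) whitening" concrete, the sketch below whitens the rows of a small matrix with an exact Cholesky factor of its Gram matrix. This is an assumption-laden illustration, not the MUD update: the abstract describes a cheaper surrogate whose details are not given here, whereas this code computes the exact factorization $G = M M^\top = L L^\top$ and returns $W = L^{-1} M$, so that $W W^\top = I$. It does show the fixed-point property the abstract states: a row-orthonormal input is returned unchanged. The helper names are hypothetical and the input is assumed to have full row rank.

```python
from math import sqrt


def gram(m):
    """Gram matrix of the rows of m: G[i][j] = <m_i, m_j>."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in m] for ri in m]


def cholesky(g):
    """Lower-triangular L with L L^T = g (g must be symmetric positive definite)."""
    n = len(g)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = sqrt(g[i][i] - s)
            else:
                l[i][j] = (g[i][j] - s) / l[j][j]
    return l


def triangular_whiten(m):
    """Solve L W = m by forward substitution, where L is the Cholesky
    factor of the row Gram matrix of m. The rows of W are orthonormal."""
    l = cholesky(gram(m))
    w = []
    for i, row in enumerate(m):
        new_row = [
            (row[c] - sum(l[i][k] * w[k][c] for k in range(i))) / l[i][i]
            for c in range(len(row))
        ]
        w.append(new_row)
    return w


momentum = [[3.0, 0.0, 4.0], [1.0, 2.0, 0.0]]  # toy 2x3 "momentum" matrix
whitened = triangular_whiten(momentum)
print(gram(whitened))  # close to the 2x2 identity, up to rounding
```

The forward substitution processes rows in order and subtracts each row's components along the rows already whitened, which is the same arithmetic as classical Gram-Schmidt on the rows. Only triangular solves are involved, in contrast to the repeated dense matrix products of a polar-decomposition iteration.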

239. ❌ Unified Policy Value Decomposition for Rapid Adaptation

作者: Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17947v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

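评分理由: The paper focuses on rapid adaptation in reinforcement learning and proposes a policy-value framework based on goal embeddings and a bilinear decomposition for multi-directional locomotion control in MuJoCo. All keywords relate to large language models, the technical foundations of deep learning, or AI applications in science, whereas this paper studies a conventional reinforcement learning algorithm (Soft Actor-Critic) and robot control. It does not address large models, deep learning architectures, training techniques, inference optimization, AI agents, or scientific applications, so all keyword relevance scores are 0.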

!!! tip deepseek-chat TL;DR

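The paper proposes a policy-value framework based on a shared low-dimensional goal embedding and a bilinear decomposition. It lets a reinforcement learning agent in the MuJoCo Ant environment adapt zero-shot, without gradient updates, to new continuous goal directions by freezing the basis functions and estimating the goal coefficients.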

摘要翻译

复杂控制系统中的快速适应仍是强化学习的核心挑战。我们提出一个框架，其中策略函数与价值函数共享一个低维系数向量——目标嵌入（goal embedding）——该向量捕获任务身份，并能在不重新训练表征的情况下实现对新任务的即时适应。在预训练阶段，我们通过双线性行动者-评论家分解联合学习结构化价值基函数与兼容的策略基函数。评论家部分被分解为 Q = sum_k G_k(g) y_k(s,a)，其中 G_k(g) 是目标条件系数向量，y_k(s,a) 是习得的价值基函数。这种乘法门控机制——即上下文信号对一组状态依赖基函数进行缩放——类似于在第五层锥体神经元中观察到的增益调制现象，其中自上而下的输入在不改变其调谐特性的情况下调节感觉驱动响应的增益。基于后继特征（Successor Features）方法，我们将该分解扩展至行动者部分，通过同一组系数 G_k(g) 加权组合一组原始策略。在测试阶段，基函数被固定，G_k(g) 通过单次前向传播进行零样本估计，从而实现无需梯度更新的新任务即时适应。我们在 MuJoCo Ant 环境中以多方向运动为目标训练软行动者-评论家（Soft Actor-Critic）智能体，要求智能体沿八个由连续目标向量指定的方向行走。双线性结构使每个策略头能够专注于特定方向子集，而共享的系数层则实现跨方向泛化，通过在目标嵌入空间中进行插值来适应新方向。我们的研究表明，共享低维目标嵌入为高维控制中的快速结构化适应提供了通用机制，并揭示了一种在复杂强化学习系统中实现高效迁移的潜在生物学可行原理。

摘要 (Abstract)

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector (a goal embedding) that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as $Q = \sum_k G_k(g)\, y_k(s,a)$, where $G_k(g)$ is a goal-conditioned coefficient vector and $y_k(s,a)$ are learned value basis functions. This multiplicative gating, where a context signal scales a set of state-dependent bases, is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients $G_k(g)$. At test time the bases are frozen and $G_k(g)$ is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.

关键词: reinforcement learning, rapid adaptation, goal embedding, bilinear decomposition, policy-value decomposition, zero-shot adaptation, Soft Actor-Critic, MuJoCo Ant
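
The critic factorization in the abstract, Q = sum_k G_k(g) y_k(s,a), and the reuse of the same coefficients in the actor can be written out directly. The sketch below is a toy under explicit assumptions: the goal encoder is a single hypothetical linear layer (`goal_coefficients`), the basis values and primitive actions are passed in as plain numbers instead of coming from trained networks, and the actor is assumed to combine primitive action vectors linearly, which the abstract does not specify.

```python
def goal_coefficients(goal, weights):
    """Hypothetical goal encoder: one linear layer mapping a goal
    vector g to K coefficients G_1(g), ..., G_K(g)."""
    return [sum(w * g for w, g in zip(row, goal)) for row in weights]


def q_value(coeffs, basis_values):
    """Bilinear critic: Q = sum_k G_k(g) * y_k(s, a)."""
    return sum(c * y for c, y in zip(coeffs, basis_values))


def compose_action(coeffs, primitive_actions):
    """Actor (assumed form): primitive action vectors weighted by the
    same coefficients G_k(g) that gate the critic."""
    dim = len(primitive_actions[0])
    return [
        sum(c * a[d] for c, a in zip(coeffs, primitive_actions))
        for d in range(dim)
    ]


# Two frozen bases, loosely "move along x" and "move along y".
encoder_weights = [[1.0, 0.0], [0.0, 1.0]]
basis_values = [2.0, 4.0]                    # y_k(s, a) for some fixed (s, a)
primitive_actions = [[1.0, 0.0], [0.0, 1.0]]

# A goal between the two training directions needs only a forward pass.
coeffs = goal_coefficients([0.5, 0.5], encoder_weights)
print(q_value(coeffs, basis_values), compose_action(coeffs, primitive_actions))
```

Because the bases stay fixed, adapting to a new goal changes only the K coefficients. This is why a goal lying between training directions can be handled by interpolation in the embedding space, with no gradient step.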

240. ❌ Multi-Armed Sequential Hypothesis Testing by Betting

作者: Ricardo J. Sandoval, Ian Waudby-Smith, Michael I. Jordan 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17925v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

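评分理由: The paper studies statistical methods for multi-armed sequential hypothesis testing, which belongs to classical statistical inference and is unrelated to all scoring keywords (which center on large models, deep learning techniques, and their applications). Although it uses an upper-confidence-bound-like bandit algorithm, it does not involve large-model or deep learning techniques, nor does it discuss applications of large models in science.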

!!! tip deepseek-chat TL;DR

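The paper studies multi-armed sequential hypothesis testing. Without knowing which arm provides the strongest evidence, it designs e-processes and sequential tests that perform nearly as well as an oracle, and derives matching lower and upper bounds.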

摘要翻译

我们考虑一种序列检验的变体，其中统计学家在每个时间步面临多个数据源（臂），并通过选择其中一个臂来获取数据。我们考虑复合全局零假设 $\mathscr{P}$，即所有臂在某种意义下均为零（例如所有治疗剂量均无效），并关注拒绝 $\mathscr{P}$ 而支持复合备择假设 $\mathscr{Q}$，即至少有一个臂非零（例如存在有效的治疗剂量）。我们提出一种最优性目标，其非正式描述如下：即使存在多个非零臂，我们寻求的 $e$-过程与序列检验的性能，应尽可能接近那些具有先知知识（即知道哪个臂能产生最强反证据）的检验方法。形式上，我们将对数最优性与期望拒绝时间最优性推广至多臂情形，并得到了两者的匹配上下界。该最优性分析中的一个关键技术工具是一种改进的类上置信界算法，适用于不可观测但充分“可估计”的收益。在该算法设计中，我们基于凯利[1956]意义下的最优财富增长率，推导了非渐近集中不等式。这些结论可能具有独立的学术价值。

摘要 (Abstract)

We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of the ones that have oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently “estimable” rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.

关键词: sequential hypothesis testing, multi-armed bandits, e-processes, optimality bounds, concentration inequalities, statistical inference, sequential analysis
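
A minimal sketch of "testing by betting" with several arms may help fix ideas. Everything specific here is an assumption for illustration and is not the paper's algorithm: observations are assumed to lie in [0, 1], each arm's null is taken to be "mean at most 1/2", the bet size `lam` is fixed instead of tuned, and arms are chosen with a plain UCB1-style index on the empirical log-growth. Under that global null the wealth is a nonnegative supermartingale whatever the (past-dependent) arm choice, so by Ville's inequality rejecting once wealth reaches 1/alpha keeps the type-I error at most alpha.

```python
from math import log, sqrt


def sequential_test(pull, n_arms, alpha=0.05, lam=1.0, max_steps=1000):
    """Return (rejected, steps_used). `pull(arm)` must return an
    observation in [0, 1]; 0 < lam < 2 keeps every bet's payoff positive."""
    if not 0.0 < lam < 2.0:
        raise ValueError("lam must lie strictly between 0 and 2")
    wealth = 1.0
    counts = [0] * n_arms
    growth_sums = [0.0] * n_arms  # running sum of log-payoffs per arm
    for t in range(1, max_steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise
        else:
            # UCB1-style index: empirical log-growth plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: growth_sums[a] / counts[a]
                + sqrt(2.0 * log(t) / counts[a]),
            )
        x = pull(arm)
        payoff = 1.0 + lam * (x - 0.5)  # expected value <= 1 under the null
        wealth *= payoff
        counts[arm] += 1
        growth_sums[arm] += log(payoff)
        if wealth >= 1.0 / alpha:
            return True, t
    return False, max_steps


# Deterministic toy: arm 0 is null, arm 1 always pays out.
print(sequential_test(lambda arm: 1.0 if arm == 1 else 0.5, n_arms=2))
```

The index steers pulls toward the arm whose bets grow wealth fastest, which is the intuition behind comparing against an oracle that knows the best arm in advance. The paper's contribution is the optimality analysis for that comparison, which this sketch does not attempt.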

241. ❌ A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models

作者: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17896v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

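评分理由: The paper studies statistical-to-computational gaps in high-dimensional statistics, focusing on theoretical analysis of single-index and multi-index models, and has no direct connection to any scoring keyword (all of which involve large models, deep learning techniques, and their applications). The paper does not mention large models, deep learning techniques, AI applications, or related methods. It is purely theoretical statistical machine learning research, so all keyword relevance scores are 0.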

!!! tip deepseek-chat TL;DR

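The paper studies statistical-to-computational gaps in single-index and multi-index models in high-dimensional statistics. It finds that a noise sensitivity exponent governs these gaps and links noise robustness, computational hardness, and feature specialization.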

摘要翻译

理解学习在统计上可能而计算上困难的条件，是高维统计学的核心挑战。本研究在单索引模型与多索引模型背景下探讨这一问题，这些函数类作为衡量机器学习方法在高维数据中发现特征能力的基准被广泛研究。我们的主要贡献在于证明：噪声敏感指数——一个由激活函数决定的简单量——在这些模型的广泛范围内主导着统计-计算间隙的存在与大小。我们首先证明，在具有较大加性噪声的单索引模型中，计算瓶颈的出现完全由噪声敏感指数表征。随后我们论证，该指数同样控制着大型可分离多索引模型在专业化转变中的统计-计算间隙，此时各分量变得可学习。最后，在层次化多索引模型中，我们表明噪声敏感指数决定了不同方向被顺序学习的最优计算速率。综上所述，我们的研究将噪声敏感指数确立为连接高维学习中噪声鲁棒性、计算困难性与特征专业化的统一属性。

摘要 (Abstract)

Understanding when learning is statistically possible yet computationally hard is a central challenge in high-dimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE) - a simple quantity determined by the activation function - governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Finally, in hierarchical multi-index models, we show that the NSE governs the optimal computational rate in which different directions are sequentially learned. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.

关键词: statistical-to-computational gaps, single-index models, multi-index models, noise sensitivity exponent, high-dimensional statistics, computational hardness, feature specialization, activation function

242. ❌ Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

作者: Abhishek Gupta, Aditya Mahajan 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17875v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

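评分理由: The paper studies operator-theoretic foundations and policy gradient methods for general Markov decision processes (MDPs) in reinforcement learning, focusing on theoretical extensions and algorithm development for the unbounded-cost case. All keywords involve large models, deep learning techniques, or specific AI application areas (such as bioinformatics). This is purely theoretical reinforcement learning work that does not involve large-model techniques, deep learning architectures, training methods, inference optimization, or AI applications in specific scientific fields, so all keyword relevance scores are 0.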

!!! tip deepseek-chat TL;DR

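The paper proposes a framework based on the perturbation theory of linear operators that treats Markov decision processes with general state and action spaces as optimization problems over linear operators on function spaces. This generalizes classical reinforcement learning results and yields low-complexity PPO-type algorithms for general MDPs.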

摘要翻译

马尔可夫决策过程（MDPs）可被视为在一般函数空间上特定线性算子对目标函数的优化问题。借助成熟的线性算子扰动理论，这一视角使得我们能够将目标函数的导数识别为线性算子的函数。由此，强化学习中许多经典结论得以推广至具有广义状态与动作空间的情形。此前此类结果仅建立在有限状态-有限动作的MDP设定中，或某些线性函数逼近的设定下。该框架还催生出适用于广义状态与动作空间MDP的新型低复杂度PPO类强化学习算法。

摘要 (Abstract)

Markov decision processes (MDPs) are viewed as an optimization of an objective function over certain linear operators over general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalization of many well-known results in reinforcement learning to cases with general state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.

关键词: Markov decision processes, operator-theoretic foundations, policy gradient methods, unbounded costs, general state and action spaces, perturbation theory, PPO-type algorithms, reinforcement learning

243. ❌ RHYME-XT: A Neural Operator for Spatiotemporal Control Systems

作者: Marijn Ruiter, Miguel Aguiar, Jake Rap, Karl H. Johansson, Amritam Das 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17867v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

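评分理由: RHYME-XT focuses on neural operators for modeling spatiotemporal control systems, in particular systems governed by input-affine nonlinear partial integro-differential equations (PIDEs). It applies deep learning to scientific computing but does not involve large language models (LLMs) or related techniques. It is therefore unrelated to most keywords (LLMs, MoE, SFT, RLHF, RAG, CoT, Agents, and so on), which score 0. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI in a scientific domain (specifically control theory and partial differential equation modeling), though not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). The weighted total is computed as 5.0 (5 points × weight 1.0). The author list does not include any of the designated experts.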

!!! tip deepseek-chat TL;DR

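The paper proposes RHYME-XT, a neural operator framework for modeling spatiotemporal control systems with localized rhythmic behavior. Using a Galerkin projection and flow-map learning, it outperforms a state-of-the-art neural operator in neural field PIDE experiments and transfers knowledge across datasets through fine-tuning.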

摘要翻译

我们提出RHYME-XT，一种用于时空控制系统代理建模的算子学习框架，该系统由具有局部节律行为的输入仿射非线性偏积分微分方程（PIDEs）所控制。RHYME-XT采用伽辽金投影法，在通过神经网络参数化的空间基函数所张成的学习有限维子空间上近似无限维PIDE，从而得到一个由投影输入驱动的投影常微分方程组。我们并未对该非自治系统进行积分，而是利用一种学习流函数的架构直接学习其流映射，从而在获得连续时间且与离散化无关的表示的同时避免了高昂的计算成本。在神经场PIDE上的实验表明，RHYME-XT的性能优于当前最先进的神经算子，并能够通过微调过程，在不同数据集训练的模型之间有效地迁移知识。

摘要 (Abstract)

We propose RHYME-XT, an operator-learning framework for surrogate modeling of spatiotemporal control systems governed by input-affine nonlinear partial integro-differential equations (PIDEs) with localized rhythmic behavior. RHYME-XT uses a Galerkin projection to approximate the infinite-dimensional PIDE on a learned finite-dimensional subspace with spatial basis functions parameterized by a neural network. This yields a projected system of ODEs driven by projected inputs. Instead of integrating this non-autonomous system, we directly learn its flow map using an architecture for learning flow functions, avoiding costly computations while obtaining a continuous-time and discretization-invariant representation. Experiments on a neural field PIDE show that RHYME-XT outperforms a state-of-the-art neural operator and is able to transfer knowledge effectively across models trained on different datasets, through a fine-tuning process.

关键词: neural operator, spatiotemporal control systems, partial integro-differential equations, Galerkin projection, flow map learning, surrogate modeling, continuous-time representation, fine-tuning

244. ❌ Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation

作者: William Thorossian 期刊/来源: arxiv 发布日期: 2026-03-18 arXiv链接: http://arxiv.org/abs/2603.17855v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

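评分理由: The paper focuses on machine learning applications for seismic and volcanic signal analysis. It belongs to AI for Science and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It mentions 'interpretable AI', which has some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points). The other keywords mainly concern large-model foundations, training methods, inference optimization, and agent systems, none of which the paper addresses, so they score 0.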

!!! tip deepseek-chat TL;DR

该论文综述了机器学习在地震和火山信号分析中的应用，强调物理约束、自监督学习和可解释性对实现稳健、可维护的AI辅助监测的重要性。

摘要翻译

现代地震与火山监测日益受到持续多传感器观测的影响，并亟需从非平稳、含噪的波场中提取可指导行动的信息。在此背景下，机器学习已从一种研究探索转变为处理流程中的实用组成部分，用于实现检测、震相拾取、分类、去噪及异常追踪等任务。然而，仅在固定数据集上提升精度尚不足以满足实际业务需求。模型必须在领域偏移（如新增台站、噪声变化、火山活动演变）下保持可靠性，提供支持决策的不确定性估计，并将其输出与具有物理意义的约束条件相关联。本文系统梳理并归纳了近期用于地震与火山信号分析的机器学习方法，重点探讨了经典信号处理如何提供不可或缺的归纳偏置，自监督与生成建模如何降低对标注数据的依赖，以及哪些评估协议最能反映模型跨区域的迁移能力。最后，我们提出了构建鲁棒、可解释且可维护的人工智能辅助监测系统所面临的开放挑战。

摘要 (Abstract)

Modern seismic and volcanic monitoring is increasingly shaped by continuous, multi-sensor observations and by the need to extract actionable information from nonstationary, noisy wavefields. In this context, machine learning has moved from a research curiosity to a practical ingredient of processing chains for detection, phase picking, classification, denoising, and anomaly tracking. However, improved accuracy on a fixed dataset is not sufficient for operational use. Models must remain reliable under domain shift (new stations, changing noise, evolving volcanic activity), provide uncertainty that supports decision-making, and connect their outputs to physically meaningful constraints. This paper surveys and organizes recent ML approaches for seismic and volcanic signal analysis, highlighting where classical signal processing provides indispensable inductive bias, how self-supervision and generative modeling can reduce dependence on labels, and which evaluation protocols best reflect transfer across regions. We conclude with open challenges for robust, interpretable, and maintainable AI-assisted monitoring.

Keywords: seismic signal analysis, volcanic monitoring, machine learning, physics-aware AI, domain shift, self-supervision, interpretable AI, generative modeling

245. ❌ Verification and Validation of Physics-Informed Surrogate Component Models for Dynamic Power-System Simulation

Authors: Petros Ellinas, Indrajit Chaudhuri, Johanna Vorwerk, Spyros Chatzivasileiadis Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17836v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

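Scoring rationale: The paper studies verification and validation of physics-informed machine learning surrogate models in dynamic power-system simulation. This falls under AI for Science, so it has some relevance to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not address large language models (LLMs), innovations in deep learning fundamentals, model training methods (such as pre-training, fine-tuning, or alignment), inference optimization, or agent systems, so it is unrelated to all other keywords (0 points).

!!! tip deepseek-chat TL;DR

This paper studies the verification and validation of physics-informed machine learning surrogate models in dynamic power-system simulation. It finds that good stand-alone surrogate accuracy does not guarantee accurate in-simulator behavior, and that the largest discrepancies concentrate in stressed operating regions.

Abstract (translated)

Physics-informed machine learning surrogate models are being explored ever more widely to accelerate dynamic simulation of generators, converters, and other grid components. The key question, however, is not only whether a surrogate matches the stand-alone component model on average, but whether it stays accurate once embedded in a differential-algebraic simulator, where the surrogate's outputs enter the algebraic equations that couple the component to the rest of the system. This paper frames this in-simulator use as a verification and validation problem. It derives a finite-horizon bound that relates the allowable component-output error to algebraic-coupling sensitivity, dynamic error amplification, and the simulation horizon. It then studies two complementary settings: model-based verification against a reference component solver, and data-based validation through conformal calibration of the component-output variables exchanged with the simulator. The framework is general, but the case study focuses on physics-informed neural-network surrogates of second-, fourth-, and sixth-order synchronous-machine models. The results show that good stand-alone surrogate accuracy does not by itself guarantee accurate in-simulator behavior, that the largest deviations concentrate in highly stressed operating regions, and that small equation residuals do not necessarily mean small state-trajectory errors.

Abstract (original)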

Physics-informed machine learning surrogates are increasingly explored to accelerate dynamic simulation of generators, converters, and other power grid components. The key question, however, is not only whether a surrogate matches a stand-alone component model on average, but whether it remains accurate after insertion into a differential-algebraic simulator, where the surrogate outputs enter the algebraic equations coupling the component to the rest of the system. This paper formulates that in-simulator use as a verification and validation (V&V) problem. A finite-horizon bound is derived that links allowable component-output error to algebraic-coupling sensitivity, dynamic error amplification, and the simulation horizon. Two complementary settings are then studied: model-based verification against a reference component solver, and data-based validation through conformal calibration of the component-output variables exchanged with the simulator. The framework is general, but the case study focuses on physics-informed neural-network surrogates of second-, fourth-, and sixth-order synchronous-machine models. Results show that good stand-alone surrogate accuracy does not by itself guarantee accurate in-simulator behavior, that the largest discrepancies concentrate in stressed operating regions, and that small equation residuals do not necessarily imply small state-trajectory errors.

Keywords: physics-informed machine learning, surrogate models, dynamic simulation, power systems, verification and validation, synchronous-machine models, differential-algebraic simulator, conformal calibration
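
The data-based validation setting in this entry relies on conformal calibration. As a rough illustration only, generic split conformal calibration of a scalar surrogate output might look like the sketch below. The function names, the absolute-residual score, and the symmetric interval are assumptions made for illustration; the paper's own calibration procedure may differ.

```python
import math


def conformal_radius(residuals, alpha):
    """Split-conformal radius from held-out absolute residuals.

    Picks the ceil((n + 1) * (1 - alpha))-th smallest residual, the standard
    finite-sample quantile used in split conformal prediction.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(residuals)[k - 1]


def interval(prediction, radius):
    """Symmetric prediction interval around a surrogate output."""
    return (prediction - radius, prediction + radius)


# Hypothetical calibration data: reference solver output vs. surrogate output.
reference = [1.00, 0.98, 1.02, 0.95, 1.05]
surrogate = [1.01, 0.97, 1.00, 0.97, 1.01]
residuals = [abs(r - s) for r, s in zip(reference, surrogate)]
radius = conformal_radius(residuals, alpha=0.2)
```

The radius is then attached to each new surrogate output before it is exchanged with the simulator; a wider radius signals that the surrogate is less trustworthy on data like the calibration set.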

246. ❌ Symmetry-Reduced Physics-Informed Learning of Tensegrity Dynamics

Authors: Jing Qin, Muhao Chen Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17824v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

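Scoring rationale: The paper focuses on applying a symmetry-reduced physics-informed neural network (SymPINN) to tensegrity dynamics. This falls under AI for Science, so it has some relevance only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not address large language models, innovations in deep learning fundamentals, or any large-model technique covered by the other keywords, so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes a symmetry-reduced physics-informed neural network (SymPINN) framework that embeds group-theoretic symmetry to predict the dynamics of tensegrity structures, significantly improving prediction accuracy and computational efficiency over standard physics-informed models.

Abstract (translated)

Tensegrity structures have intrinsic geometric symmetries that govern their dynamic behavior. However, most existing physics-informed neural network methods for tensegrity dynamics do not exploit these symmetries explicitly, which leads to high computational complexity and unstable optimization. This study proposes a symmetry-reduced physics-informed neural network framework that embeds group-theory-based symmetry directly into the solution expression and the neural network architecture to predict tensegrity dynamics. By decomposing nodes into symmetry orbits and representing free nodal coordinates in a symmetry basis, the method builds a reduced coordinate representation that preserves the structure's geometric symmetry. The full coordinates are then recovered by applying symmetry transformations to the reduced solution learned by the network, so that predicted configurations satisfy the symmetry constraints automatically. Within the framework, equivariance is enforced through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. The framework also improves training effectiveness by encoding initial conditions as hard constraints, adding Fourier feature encoding to better represent dynamic motion, and using a two-stage optimization strategy. Extensive numerical experiments on symmetric T-bar and lander structures show significant gains in prediction accuracy and computational efficiency over standard physics-informed models, which points to the considerable potential of symmetry-aware learning for structure-preserving modeling of tensegrity dynamics.

Abstract (original)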

Tensegrity structures possess intrinsic geometric symmetries that govern their dynamic behavior. However, most existing physics-informed neural network (PINN) approaches for tensegrity dynamics do not explicitly exploit these symmetries, leading to high computational complexity and unstable optimization. In this work, we propose a symmetry-reduced physics-informed neural network (SymPINN) framework that embeds group-theory-based symmetry directly into both the solution expression and the neural network architecture to predict tensegrity dynamics. By decomposing nodes into symmetry orbits and representing free nodal coordinates using a symmetry basis, the proposed method constructs a reduced coordinate representation that preserves geometric symmetry of the structure. The full coordinates are then recovered via symmetry transformations of the reduced solution learned by the network, ensuring that the predicted configurations automatically satisfy the symmetry constraints. In this framework, equivariance is enforced through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. In addition, SymPINN improves training effectiveness by encoding initial conditions as hard constraints, incorporating Fourier feature encoding to enhance the representation of dynamic motions, and employing a two-stage optimization strategy. Extensive numerical experiments on symmetric T-bars and lander structures demonstrate significantly improved prediction accuracy and computational efficiency compared to standard physics-informed models, indicating the great potential of symmetry-aware learning for structure-preserving modeling of tensegrity dynamics.

Keywords: Symmetry-reduced physics-informed neural network, Tensegrity dynamics, Group theory symmetry, Equivariance, Physics-informed learning, Computational efficiency, Structure-preserving modeling
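
The orbit-based idea in this entry (store one representative node per symmetry orbit, then recover the full configuration by symmetry transformations) can be sketched for the simplest case. The example below assumes an n-fold rotational symmetry about the z-axis; the choice of group, the function names, and the coordinates are hypothetical and are not taken from SymPINN itself.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def expand_orbit(representative, n_fold):
    """Recover all nodes of one orbit from its single representative node."""
    return [
        rotate_z(representative, 2 * math.pi * k / n_fold)
        for k in range(n_fold)
    ]


# Reduced coordinates: one representative node per orbit (hypothetical values).
reduced = [(1.0, 0.0, 0.0), (0.5, 0.5, 1.0)]

# Full coordinates: 2 orbits x 3 nodes each, symmetric by construction.
full = [node for rep in reduced for node in expand_orbit(rep, n_fold=3)]
```

Because every node is generated from its representative, a model that predicts only the reduced coordinates cannot produce a configuration that violates the assumed rotational symmetry.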

247. ❌ Federated Distributional Reinforcement Learning with Distributional Critic Regularization

Authors: David Millard, Cecilia Alm, Rashid Ali, Pengcheng Shi, Ali Baheri Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17820v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

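Scoring rationale: The paper 'Federated Distributional Reinforcement Learning with Distributional Critic Regularization' combines federated reinforcement learning with distributional reinforcement learning (distributional RL), proposing a federated distributional RL formulation (FedDistRL) and the TR-FedDistRL algorithm. Its core aim is to address the loss of distributional information caused by parameter averaging in federated learning, using quantile value function critics and Wasserstein barycenters to preserve statistical multimodality and tail behavior and so improve performance in safety-critical environments. All scoring keywords relate directly to large language models (LLMs), deep learning fundamentals, or applications of AI in science, whereas this paper studies reinforcement learning (specifically federated and distributional RL), a subfield of machine learning with no direct link to large-model techniques, innovations in deep learning fundamentals, or AI for Science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper proposes a federated distributional reinforcement learning formulation (FedDistRL) and the TR-FedDistRL algorithm. Using quantile value function critics and a Wasserstein barycenter constraint, it addresses the loss of distributional information caused by parameter averaging in federated learning, which in safety-critical environments reduces mean-smearing, improves safety proxies, and lowers critic/policy drift.

Abstract (translated)

Federated reinforcement learning usually aggregates value functions or policies by parameter averaging, an approach that focuses on expected return but can mask the statistical multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), in which clients parameterize quantile value function critics and federate only these networks. We also propose TR-FedDistRL, which builds a risk-aware Wasserstein barycenter over a temporal buffer for each client. This local barycenter provides a reference region that constrains the parameter-averaged critic, so that essential distributional information is not averaged away during federation. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in the probe-set Wasserstein metric used for evaluation. Experiments on a bandit, a multi-agent gridworld, and a continuous highway environment show that, compared with mean-oriented and non-federated baselines, the method reduces mean-smearing, improves safety proxies (catastrophe/accident rate), and lowers critic/policy drift.

Abstract (original)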

Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure statistical multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), where clients parametrize quantile value function critics and federate these networks only. We also propose TR-FedDistRL, which builds a per client, risk-aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region to constrain the parameter averaged critic, ensuring necessary distributional information is not averaged out during the federation process. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe-set Wasserstein metric under evaluation. Experiments on a bandit, multi-agent gridworld, and continuous highway environment show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift versus mean-oriented and non-federated baselines.

Keywords: Federated Reinforcement Learning, Distributional Reinforcement Learning, Quantile Value Function, Wasserstein Barycenter, Safety-critical Settings, Parameter Averaging, Critic Regularization, Multi-agent Gridworld
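
For one-dimensional return distributions stored as quantile values at shared levels, the Wasserstein-2 barycenter is obtained by averaging the quantile values level by level. The sketch below illustrates only that building block; the risk-aware weighting, temporal buffer handling, and shrink-squash step of TR-FedDistRL are not reproduced, and all names and numbers are hypothetical.

```python
def quantile_barycenter(quantile_sets, weights=None):
    """Level-wise weighted average of quantile vectors.

    Each inner list holds one distribution's quantile values, sorted in
    ascending order and evaluated at the same quantile levels.
    """
    m = len(quantile_sets)
    if weights is None:
        weights = [1.0 / m] * m
    n_levels = len(quantile_sets[0])
    if any(len(q) != n_levels for q in quantile_sets):
        raise ValueError("all distributions must use the same quantile levels")
    return [
        sum(w * q[i] for w, q in zip(weights, quantile_sets))
        for i in range(n_levels)
    ]


# Two buffered return distributions with a rare catastrophic outcome
# in the lowest quantile (hypothetical values).
buffer = [
    [-10.0, 0.0, 1.0, 2.0],
    [-12.0, 0.0, 1.0, 2.0],
]
reference = quantile_barycenter(buffer)

# A scalar summary hides the tail that the quantile vector keeps.
expected_return = sum(reference) / len(reference)
```

Here the barycenter keeps the catastrophic lower quantile visible, whereas the expected return alone collapses it into a single moderate number.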

248. ❌ The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery

Authors: Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17790v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

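Scoring rationale: The paper mainly discusses the convergence of HPC, ML, and QC in drug discovery. It mentions ML foundation models (such as FeNNix-Bio1) for quantum-accurate simulation, which gives it some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points): foundation models belong to the large-model category, but the paper does not discuss specific LLM techniques in depth. The paper's core is the application of AI in science, particularly bioinformatics and cheminformatics, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), as it deals directly with drug discovery and quantum chemistry simulation. Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are neither mentioned in nor related to the paper, so they are scored 0.

!!! tip deepseek-chat TL;DR

This paper examines how the convergence of high-performance computing, machine learning, and quantum computing can resolve the computational bottleneck of quantum chemistry simulation in drug discovery. It positions quantum-enhanced sampling as an approach beyond the GPU frontier, with the goal of optimizing the drug discovery pipeline and reaching chemical accuracy.

Abstract (translated)

Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial and error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics simulation has historically forced a compromise between chemical accuracy and computational scalability. This paper argues that the convergence of high-performance computing, machine learning, and quantum computing is the definitive solution to this bottleneck. Although machine learning foundation models such as FeNNix-Bio1 enable quantum-accurate simulation, they remain bound by the inherent limits of classical data generation. We detail how high-performance quantum computing, using hybrid QPU-GPU architectures, will become the ultimate accelerator for quantum chemistry data. By exploiting Hilbert space mapping, these systems can bypass the heuristics of classical approximation methods and reach true chemical accuracy. We show how this three-way convergence optimizes the drug discovery pipeline, from initial system preparation to machine-learning-driven high-fidelity simulation. Finally, we position quantum-enhanced sampling as a frontier beyond GPU capability for modeling reactive cellular systems and pioneering next-generation materials.

Abstract (original)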

Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.

Keywords: Drug Discovery, Quantum Computing, Machine Learning, High-Performance Computing, Quantum Chemistry, Molecular Dynamics, Foundation Models, Quantum-enhanced Sampling

249. ❌ Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems

Authors: Qi Liu, Laure Zanna, Joan Bruna Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17750v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

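Scoring rationale: The paper studies the long-term consistency of neural surrogate models in dynamical-system simulation and proposes a self-refining model (SNS). It is unrelated to most large-model keywords but highly relevant to 'Self-Correction OR Self-Improvement OR Self-Reflection' (10 points), because its core idea is a model that refines its own outputs to correct accumulated errors. It is also substantially relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): the paper is an AI application in scientific computing, though it does not explicitly involve bioinformatics or cheminformatics. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

This paper tackles the distribution drift that error accumulation causes in autoregressive neural surrogate models during long-horizon simulation of dynamical systems. It proposes a self-refining neural surrogate model (SNS), implemented as a conditional diffusion model that balances short-time fidelity and long-time consistency, enabling high-fidelity simulation over arbitrarily long horizons.

Abstract (translated)

Recent advances in autoregressive neural surrogate models have delivered orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models commonly suffer from distribution drift: errors that accumulate during autoregressive rollouts severely degrade generation quality over long time horizons. Existing work tries to address this through hyperparameter tuning, implicitly exploiting the inherent trade-off between short-time accuracy and long-time consistency. This paper introduces a unifying mathematical framework that makes this trade-off explicit, formalizing and generalizing the hyperparameter-based strategies of existing methods. Within this framework, we propose a robust, hyperparameter-free model, implemented as a conditional diffusion model, that balances short-time fidelity and long-time consistency by construction. Our self-refining neural surrogate model (SNS) can serve either as a standalone model that refines its own autoregressive outputs or as a complementary module that ensures long-time consistency for existing neural surrogates. We further demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.

Abstract (original)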

Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In this work, we introduce a unifying mathematical framework that makes this tradeoff explicit, formalizing and generalizing hyperparameter-based strategies in existing approaches. Within this framework, we propose a robust, hyperparameter-free model implemented as a conditional diffusion model that balances short-time fidelity with long-time consistency by construction. Our model, Self-refining Neural Surrogate model (SNS), can be implemented as a standalone model that refines its own autoregressive outputs or as a complementary model to existing neural surrogates to ensure long-time consistency. We also demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.

Keywords: neural surrogate models, dynamical systems, autoregressive models, distribution drift, self-refining model, conditional diffusion model, long-time consistency, high-fidelity simulations
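
The compounding-error problem this entry describes can be seen in a toy setting. The sketch below assumes a scalar system whose true dynamics conserve the state and a surrogate with a 1% one-step error; both step functions are hypothetical and have no connection to the SNS model, which is not implemented here.

```python
def rollout(step, x0, n_steps):
    """Autoregressive rollout: each prediction becomes the next input."""
    states = [x0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


def true_step(x):
    """Toy ground-truth dynamics: the state is conserved."""
    return x


def surrogate_step(x):
    """Toy surrogate with a small multiplicative one-step error of 1%."""
    return 1.01 * x


truth = rollout(true_step, 1.0, 100)
predicted = rollout(surrogate_step, 1.0, 100)
errors = [abs(p - t) for p, t in zip(predicted, truth)]
```

The error after one step is about 0.01, but after 100 steps it exceeds 1.5 times the true state, which is why one-step accuracy alone says little about long-horizon behavior.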

250. ❌ Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

Authors: Oksana Kolomenko, Ricardo Knauer, Erik Rodner Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17737v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

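Scoring rationale: The paper's core topic is the use of LLM embeddings in tabular prediction tasks. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because the abstract explicitly mentions 'large language models (LLMs)' and studies their embedding methods. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference techniques, agent systems, compression techniques, and AI for Science, are not mentioned in the title or abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

This paper systematically studies how to design effective LLM-based embedding pipelines for tabular prediction tasks, finding that the embedding concatenation strategy, the embedding model size, and the choice of downstream model all strongly affect performance.

Abstract (translated)

Embeddings are an effective way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). However, evidence on how to design effective LLM-based embedding pipelines for tabular prediction tasks is limited. In this study, we systematically benchmark 256 pipeline configurations covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. The results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends heavily on the specific pipeline design. In general, concatenating embeddings with the original features tends to outperform replacing the original columns with embeddings. Larger embedding models usually yield better results, while public leaderboard rankings and model popularity are poor predictors of performance. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings give researchers and practitioners guidance for building more effective embedding pipelines for tabular prediction.

Abstract (original)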

Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that it strongly depends on the specific pipeline design whether incorporating the prior knowledge of LLMs improves the predictive performance. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.

Keywords: LLM embeddings, tabular prediction, embedding pipeline design, world knowledge, gradient boosting decision trees, benchmarking, predictive performance, machine learning models
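
Two points from this entry lend themselves to a short sketch: the size of the benchmark grid (8 × 16 × 2 = 256 configurations) and the difference between concatenating embeddings and replacing the original column. Everything named below is a placeholder; in particular, `toy_embed` is a deterministic stand-in for an LLM embedding model and is not part of the paper.

```python
from itertools import product

# Placeholder names for the three pipeline dimensions reported in the abstract.
preprocessing = [f"prep_{i}" for i in range(8)]
embedders = [f"embedder_{i}" for i in range(16)]
downstream = ["model_a", "model_b"]
configs = list(product(preprocessing, embedders, downstream))


def toy_embed(text, dim=4):
    """Hypothetical stand-in for an LLM embedding model (deterministic toy)."""
    total = sum(ord(ch) for ch in text)
    return [float((total * (i + 1)) % 7) for i in range(dim)]


def build_features(row, text_column, mode):
    """Build one feature dict in 'concat' or 'replace' mode.

    'concat' keeps the original column and appends its embedding;
    'replace' drops the original column and keeps only the embedding.
    """
    if mode not in ("concat", "replace"):
        raise ValueError(f"unknown mode: {mode}")
    features = dict(row)
    if mode == "replace":
        del features[text_column]
    for i, value in enumerate(toy_embed(str(row[text_column]))):
        features[f"{text_column}_emb_{i}"] = value
    return features


row = {"age": 42, "occupation": "nurse"}
concat_features = build_features(row, "occupation", "concat")
replace_features = build_features(row, "occupation", "replace")
```

In concat mode the downstream model still sees the raw column next to its embedding, which is the variant the abstract reports as tending to perform better.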

251. ❌ Predicting Trajectories of Long COVID in Adult Women: The Critical Role of Causal Disentanglement

Authors: Jing Wang, Jie Shen, Yiming Luo, Amar Sra, Qiaomin Xie, Jeremy C. Weiss Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17722v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

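Scoring rationale: The paper explicitly uses a Large Language Model to build a causal network for predicting Long COVID trajectories, an application of large models in biomedicine, so it is highly relevant to both 'Large Language Models' and 'AI for Science' (10 points each). Other keywords, such as MoE, SLMs, training techniques, inference optimization, and agent systems, are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

This study develops a causal network framework based on a large language model that integrates clinical data and wearable-device data to predict the severity trajectory of Long COVID in adult women, reaching a precision of 86.7%.

Abstract (translated)

Early prediction of the severity of Post-Acute Sequelae of SARS-CoV-2 (PASC) is a key challenge in women's health, especially given the diagnostic overlap between PASC and the symptoms of common hormonal transitions such as menopause. Identifying and untangling these confounders is essential for accurately predicting long-term trajectories. Using the NIH RECOVER dataset, this study performs a retrospective analysis of 1,155 women (mean age 61). By integrating static clinical profiles with four consecutive weeks of longitudinal wearable data (monitoring cardiac activity and sleep), we built a causal network based on a large language model to predict future PASC scores. The framework reached a precision of 86.7% in clinical severity prediction. Causal attribution analysis shows that the model separates active pathology from baseline noise: direct indicators such as breathlessness and malaise reached maximum saliency (1.00), while the saliency scores of confounders such as menopause and diabetes were successfully suppressed below 0.27.

Abstract (original)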

Early prediction of Post-Acute Sequelae of SARS-CoV-2 severity is a critical challenge for women’s health, particularly given the diagnostic overlap between PASC and common hormonal transitions such as menopause. Identifying and accounting for these confounding factors is essential for accurate long-term trajectory prediction. We conducted a retrospective study of 1,155 women (mean age 61) from the NIH RECOVER dataset. By integrating static clinical profiles with four weeks of longitudinal wearable data (monitoring cardiac activity and sleep), we developed a causal network based on a Large Language Model to predict future PASC scores. Our framework achieved a precision of 86.7% in clinical severity prediction. Our causal attribution analysis demonstrate the model’s ability to differentiate between active pathology and baseline noise: direct indicators such as breathlessness and malaise reached maximum saliency (1.00), while confounding factors like menopause and diabetes were successfully suppressed with saliency scores below 0.27.

Keywords: Long COVID, PASC, causal disentanglement, Large Language Model, wearable data, clinical prediction, women's health, RECOVER dataset

252. ❌ Stochastic set-valued optimization and its application to robust learning

Authors: Tommaso Giovannelli, Jingfu Tan, Luis Nunes Vicente Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17691v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

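Scoring rationale: The paper focuses on a stochastic set-valued optimization (SVO) framework and its application to robust machine learning, which places it in mathematical optimization and machine learning theory. Every keyword targets large models, the principles of deep learning techniques, or AI applications in science, and the paper touches none of these, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a stochastic set-valued optimization framework that improves robust machine learning through hyperbox sets and multi-objective optimization. Under distributional shift it shows better robustness and lower variability than empirical risk minimization.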

Abstract (Chinese translation)

本文提出了一种专为鲁棒机器学习设计的随机集值优化框架。在该框架中，每个决策变量被映射为一组目标值，并通过集合关系定义最优性。我们重点研究具有超盒集合的集值优化问题，此类问题可重构为具有有限多个目标的多目标优化问题，并为表示或逼近更一般的映射集合奠定了基础。超盒值优化的两种特例是区间值优化与矩形值优化。我们构建了随机区间值/矩形值优化模型，将次分位数与超分位数纳入多目标优化重构的目标函数中，从而为次分位数提供了新的表征方式。这些模型通过捕捉损失分布的下尾与上尾行为，提供了可解释的权衡机制，从而超越了标准的经验风险最小化与经典鲁棒模型。为求解所得的多目标优化问题，我们采用随机多梯度算法并选择帕累托膝部解。数值实验表明，在分布偏移条件下，采用该选择策略的所提算法相较于经验风险最小化方法，在测试重复中表现出更强的鲁棒性与更低的波动性，同时保持了具有竞争力的准确度。

Abstract

In this paper, we develop a stochastic set-valued optimization (SVO) framework tailored for robust machine learning. In the SVO setting, each decision variable is mapped to a set of objective values, and optimality is defined via set relations. We focus on SVO problems with hyperbox sets, which can be reformulated as multi-objective optimization (MOO) problems with finitely many objectives and serve as a foundation for representing or approximating more general mapped sets. Two special cases of hyperbox-valued optimization (HVO) are interval-valued (IVO) and rectangle-valued (RVO) optimization. We construct stochastic IVO/RVO formulations that incorporate subquantiles and superquantiles into the objective functions of the MOO reformulations, providing a new characterization for subquantiles. These formulations provide interpretable trade-offs by capturing both lower- and upper-tail behaviors of loss distributions, thereby going beyond standard empirical risk minimization and classical robust models. To solve the resulting multi-objective problems, we adopt stochastic multi-gradient algorithms and select a Pareto knee solution. In numerical experiments, the proposed algorithms with this selection strategy exhibit improved robustness and reduced variability across test replications under distributional shift compared with empirical risk minimization, while maintaining competitive accuracy.

Keywords: stochastic set-valued optimization, robust machine learning, multi-objective optimization, hyperbox sets, distributional shift, empirical risk minimization, Pareto knee solution, stochastic multi-gradient algorithms
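The abstract above builds interval-valued objectives from subquantiles and superquantiles of the loss distribution. As a rough illustration only, the sketch below computes empirical versions of both from a loss sample, assuming the common definitions: the subquantile at level `alpha` is the mean of the lowest `alpha` fraction of losses, and the superquantile (also known as CVaR) is the mean of the remaining upper tail. The function name `tail_means` and the rounding rule for the split index are assumptions of this sketch, not taken from the paper.

```python
def tail_means(losses, alpha):
    """Return (subquantile, superquantile) of a loss sample at level alpha.

    Assumed definitions (not from the paper):
      subquantile   = mean of the lowest alpha fraction of losses
      superquantile = mean of the highest (1 - alpha) fraction (CVaR)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(losses) < 2:
        raise ValueError("need at least two losses to form two tails")
    ordered = sorted(losses)
    n = len(ordered)
    # Split index, clamped so that both tails contain at least one loss.
    k = max(1, min(n - 1, round(alpha * n)))
    lower, upper = ordered[:k], ordered[k:]
    return sum(lower) / len(lower), sum(upper) / len(upper)


# The pair can be read as an interval [sub, super] summarizing both tails.
sub, sup = tail_means([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alpha=0.8)
```

A wide gap between the two values indicates a heavy upper tail, which is the kind of trade-off an interval-valued objective would expose.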

253. ❌ Flow Matching Policy with Entropy Regularization

Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

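Scoring rationale: The paper focuses on improving diffusion policies in reinforcement learning and proposes an ODE-based flow matching policy framework. Its core content covers reinforcement learning, probabilistic modeling, optimal transport, and entropy regularization. The scoring keywords all concern large language models, model training techniques, inference optimization, AI agents, and similar topics, none of which the paper addresses, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes FMER, a reinforcement learning policy framework based on flow matching and entropy regularization, which addresses the difficulty of entropy control and the high computational cost of diffusion policies. It outperforms existing methods on sparse multi-goal tasks and cuts training time by 7x relative to the heavy diffusion baseline QVPO.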

Abstract (Chinese translation)

基于扩散的策略因其表征复杂非高斯分布的能力，在强化学习（RL）中获得了广泛关注。基于随机微分方程（Stochastic Differential Equation, SDE）的扩散策略常因精确熵难以计算而依赖间接熵控制，同时其策略梯度需通过迭代去噪链计算，导致计算成本过高。为克服这些问题，我们提出了带熵正则化的流匹配策略（Flow Matching Policy with Entropy Regularization, FMER），这是一个基于常微分方程（Ordinary Differential Equation, ODE）的在线强化学习框架。FMER通过流匹配对策略进行参数化，并受最优传输理论启发，沿直线概率路径采样动作。FMER利用模型的生成特性，从候选集中构建优势加权的目标速度场，从而引导策略更新朝向高价值区域。通过推导出一个可处理的熵目标，FMER实现了有理论依据的最大熵优化，以增强探索能力。在稀疏多目标FrankaKitchen基准测试上的实验表明，FMER优于现有先进方法，同时在标准MuJoco基准测试上保持竞争力。此外，与计算量大的扩散基线（QVPO）相比，FMER将训练时间缩短了7倍，与高效变体相比也减少了10-15%。

Abstract

Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model’s generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.

Keywords: Flow Matching Policy, Entropy Regularization, Reinforcement Learning, Diffusion Policies, Optimal Transport, Online RL, SDE-based Diffusion, ODE-based Framework
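The abstract above mentions sampling along a straight probability path and an advantage-weighted target velocity built from a candidate set. The sketch below shows the standard straight-path interpolation used in flow matching, x_t = (1 - t) x_0 + t x_1 with constant velocity x_1 - x_0, plus one plausible way to weight candidate actions by a softmax over their advantages. The softmax weighting, the `temperature` parameter, and all function names are assumptions made for illustration. The paper's actual construction may differ.

```python
import math


def straight_path_point(x0, x1, t):
    """Point at time t on the straight path from noise x0 to action x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def straight_path_velocity(x0, x1):
    """Velocity along the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def advantage_weighted_velocity(x0, candidates, advantages, temperature=1.0):
    """Blend per-candidate straight-path velocities with softmax weights.

    Assumption of this sketch: weights are softmax(advantage / temperature).
    """
    top = max(advantages)  # subtract the max for numerical stability
    raw = [math.exp((a - top) / temperature) for a in advantages]
    total = sum(raw)
    weights = [r / total for r in raw]
    return [
        sum(w * (c[d] - x0[d]) for w, c in zip(weights, candidates))
        for d in range(len(x0))
    ]


midpoint = straight_path_point([0.0, 0.0], [2.0, 4.0], 0.5)
blended = advantage_weighted_velocity(
    [0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]], advantages=[0.0, 0.0]
)
```

With equal advantages the blended velocity points at the mean of the candidates. As one advantage grows, the velocity tilts toward that candidate, which is the intended steering toward high-value regions.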

254. ❌ Atomic Trajectory Modeling with State Space Models for Biomolecular Dynamics

Authors: Liang Shi, Jiarui Lu, Junqi Liu, Chence Shi, Zhi Yang, Jian Tang Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17633v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

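Scoring rationale: The paper focuses on a deep learning application in biomolecular dynamics simulation and proposes ATMOS, a generative framework based on State Space Models (SSM) for generating atom-level molecular dynamics trajectories. Its content is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, alignment techniques, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques, whereas this paper studies a deep learning model for one scientific domain (biomolecular dynamics), not an LLM. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper is clearly an application of AI to bioinformatics and science, and that application is its core content, so this keyword receives 10 points.

!!! tip deepseek-chat TL;DR

To address the high computational cost of biomolecular dynamics simulation, the paper proposes ATMOS, a generative framework based on state space models that efficiently generates atom-level motion trajectories for protein monomers and protein-ligand complexes, and reports state-of-the-art performance.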

Abstract (Chinese translation)

理解生物分子的动态行为对于阐明生物学功能和促进药物发现至关重要。虽然分子动力学（MD）模拟为研究这些动态提供了严格的物理基础，但在长时间尺度上其计算成本仍然高昂。相反，近期的深度生成模型加速了构象生成，但通常要么无法建模时间关系，要么仅适用于单体蛋白质。为弥补这一差距，我们提出了ATMOS，一种基于状态空间模型（SSM）的新型生成框架，旨在为生物分子系统生成原子级别的分子动力学轨迹。ATMOS集成了基于Pairformer的状态转移机制以捕获长程时间依赖性，并采用一个基于扩散的模块以自回归方式解码轨迹帧。ATMOS使用来自蛋白质数据库（PDB）的晶体结构以及来自大规模分子动力学模拟数据集（包括mdCATH和MISATO）的构象轨迹进行训练。我们证明，ATMOS在生成蛋白质单体以及复杂的蛋白质-配体系统的构象轨迹方面均达到了最先进的性能。通过实现对原子运动轨迹的高效推断，这项工作为生物分子动力学建模奠定了有前景的基础。

Abstract

Understanding the dynamic behavior of biomolecules is fundamental to elucidating biological function and facilitating drug discovery. While Molecular Dynamics (MD) simulations provide a rigorous physical basis for studying these dynamics, they remain computationally expensive for long timescales. Conversely, recent deep generative models accelerate conformation generation but typically either fail to model temporal relationships or are built only for monomeric proteins. To bridge this gap, we introduce ATMOS, a novel generative framework based on State Space Models (SSM) designed to generate atom-level MD trajectories for biomolecular systems. ATMOS integrates a Pairformer-based state transition mechanism to capture long-range temporal dependencies, with a diffusion-based module to decode trajectory frames in an autoregressive manner. ATMOS is trained on crystal structures from the PDB and conformation trajectories from large-scale MD simulation datasets including mdCATH and MISATO. We demonstrate that ATMOS achieves state-of-the-art performance in generating conformation trajectories for both protein monomers and complex protein-ligand systems. By enabling efficient inference of atomic motion trajectories, this work establishes a promising foundation for modeling biomolecular dynamics.

Keywords: biomolecular dynamics, state space models, molecular dynamics simulations, generative framework, protein-ligand systems, trajectory generation, deep learning, ATMOS

255. ❌ ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Authors: Zirui Gong, Leo Yu Zhang, Yanjun Zhang, Viet Vo, Tianqing Zhu, Shirui Pan, Cong Wang Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17623v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

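Scoring rationale: The paper studies gradient inversion attacks in federated learning (the ARES attack) and concentrates on privacy leakage and attack methods. It is unrelated to all scoring keywords, which concern large-model and deep learning principles, applications, or AI for science. The paper does not address large models, MoE, small models, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes ARES, a scalable gradient inversion attack that reconstructs training samples with high fidelity from large training batches in federated learning without architectural modifications, revealing a serious privacy risk posed by intermediate activations.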

Abstract (Chinese translation)

联邦学习（Federated Learning, FL）通过共享模型更新而非原始数据来实现协同模型训练，旨在保护用户隐私。然而，近期研究表明，这些共享的更新可能通过梯度反演攻击（Gradient Inversion Attacks, GIAs）无意中泄露敏感的训练数据。其中，主动式梯度反演攻击尤为强大，即使在大批量训练条件下，也能实现单个样本的高保真重建。然而，现有方法通常需要对模型架构进行修改，这限制了其实际适用性。在本研究中，我们通过提出一种名为“基于稀疏反演的激活恢复”（Activation REcovery via Sparse inversion, ARES）的攻击方法，填补了这一空白。ARES是一种主动式梯度反演攻击，旨在无需修改架构的情况下，从大批量训练数据中重建训练样本。具体而言，我们将恢复问题建模为噪声稀疏恢复任务，并采用广义最小绝对收缩与选择算子（Least Absolute Shrinkage and Selection Operator, Lasso）进行求解。为了将攻击扩展至多样本恢复，ARES引入了印记方法以解耦激活值，从而实现可扩展的逐样本重建。我们进一步建立了预期恢复率，并推导了重建误差的上界，为ARES攻击提供了理论保证。在卷积神经网络（CNNs）和多层感知机（MLPs）上的大量实验表明，ARES能够在多种数据集上实现高保真重建，在大批量训练和现实联邦学习设置下显著优于先前的梯度反演攻击方法。我们的研究结果揭示，中间激活值在联邦学习中构成了严重且被低估的隐私风险，凸显了加强防御措施的迫切性。

Abstract

Federated Learning (FL) enables collaborative model training by sharing model updates instead of raw data, aiming to protect user privacy. However, recent studies reveal that these shared updates can inadvertently leak sensitive training data through gradient inversion attacks (GIAs). Among them, active GIAs are particularly powerful, enabling high-fidelity reconstruction of individual samples even under large batch sizes. Nevertheless, existing approaches often require architectural modifications, which limit their practical applicability. In this work, we bridge this gap by introducing the Activation REcovery via Sparse inversion (ARES) attack, an active GIA designed to reconstruct training samples from large training batches without requiring architectural modifications. Specifically, we formulate the recovery problem as a noisy sparse recovery task and solve it using the generalized Least Absolute Shrinkage and Selection Operator (Lasso). To extend the attack to multi-sample recovery, ARES incorporates the imprint method to disentangle activations, enabling scalable per-sample reconstruction. We further establish the expected recovery rate and derive an upper bound on the reconstruction error, providing theoretical guarantees for the ARES attack. Extensive experiments on CNNs and MLPs demonstrate that ARES achieves high-fidelity reconstruction across diverse datasets, significantly outperforming prior GIAs under large batch sizes and realistic FL settings. Our results highlight that intermediate activations pose a serious and underestimated privacy risk in FL, underscoring the urgent need for stronger defenses.

Keywords: Federated Learning, Gradient Inversion Attack, Privacy Leakage, Activation Recovery, Sparse Recovery, Large Batch Training, CNN, MLP
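The abstract above frames activation recovery as a noisy sparse recovery task solved with a generalized Lasso. To make the idea concrete, the sketch below solves the plain Lasso problem, minimizing 0.5 * ||A x - y||^2 + lam * ||x||_1, with iterative soft-thresholding (ISTA). This is a simpler stand-in and not the generalized Lasso or the solver used by ARES. The matrix `A`, the observation `y`, and the function names are hypothetical.

```python
import math


def soft_threshold(values, tau):
    """Shrink each entry toward zero by tau; entries within tau become zero."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in values]


def ista_lasso(A, y, lam, step, iters=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by ISTA.

    `step` should not exceed 1 / L, where L is the largest eigenvalue
    of A^T A; the caller supplies it in this sketch.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual A x - y, then gradient A^T (A x - y) of the smooth term.
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the L1 proximal (soft-threshold) step.
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x


# Toy problem: identity sensing matrix, one small noisy entry in y.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
recovered = ista_lasso(identity, [3.0, 0.05, -2.0], lam=0.1, step=1.0)
```

In the toy problem the small entry 0.05 falls below the threshold and is set to zero, while the two large entries are kept with a slight shrinkage. That is the sparsity-promoting behavior a sparse recovery formulation relies on.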

256. ❌ AdaMuS: Adaptive Multi-view Sparsity Learning for Dimensionally Unbalanced Data

Authors: Cai Xu, Changhao Sun, Ziyu Guan, Wei Zhao Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

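Scoring rationale: The paper studies dimensional imbalance in multi-view learning and proposes an adaptive multi-view sparsity learning framework involving sparsity, parameter pruning, and self-supervised learning. However, every keyword targets large-model or deep learning principles or scientific applications, while this paper concentrates on conventional multi-view machine learning and does not address large models, deep learning, scientific AI applications, or any of the specific techniques listed, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

In multi-view learning, severe dimensional imbalance across views biases models toward high-dimensional data and makes representations hard to align. This paper proposes an adaptive multi-view sparsity learning framework that combines view-specific encoders, parameter-free pruning, and sparse fusion, achieving superior performance on synthetic and real-world datasets with strong generalization.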

Abstract (Chinese translation)

多视图学习的主要目标是通过融合多种特征来全面描述数据。大多数先前研究隐含假设不同视图具有相似的维度。然而在实际应用中，不同视图间常存在严重的维度差异，导致多视图学习的不平衡问题。例如在情感识别任务中，视频帧维度常高达$10^6$，而生理信号仅包含$10^1$维度。针对该问题，现有方法通常面临两大挑战：（1）它们往往偏向高维数据，忽视低维视图；（2）在极端维度不平衡条件下难以有效对齐表征，导致低维视图引入严重冗余。为解决这些问题，我们提出自适应多视图稀疏学习（Adaptive Multi-view Sparsity Learning, AdaMuS）框架。首先，为避免忽略低维视图信息，我们构建视图特定编码器将其映射到统一维度空间。鉴于将低维数据映射到高维空间常引发严重过拟合，我们设计了一种无参数剪枝方法来自适应消除编码器中的冗余参数。进一步，我们提出稀疏融合范式，能灵活抑制冗余维度并有效对齐各视图。此外，为学习具有更强泛化能力的表征，我们提出自监督学习范式，通过构建相似度图获取监督信息。在合成玩具数据集和七个真实世界基准上的大量实验表明，AdaMuS在分类与语义分割任务中均能持续取得优越性能，并展现出强大的泛化能力。

Abstract

Multi-view learning primarily aims to fuse multiple features to describe data comprehensively. Most prior studies implicitly assume that different views share similar dimensions. In practice, however, severe dimensional disparities often exist among different views, leading to the unbalanced multi-view learning issue. For example, in emotion recognition tasks, video frames often reach dimensions of $10^6$, while physiological signals comprise only $10^1$ dimensions. Existing methods typically face two main challenges for this problem: (1) They often bias towards high-dimensional data, overlooking the low-dimensional views. (2) They struggle to effectively align representations under extreme dimensional imbalance, which introduces severe redundancy into the low-dimensional ones. To address these issues, we propose the Adaptive Multi-view Sparsity Learning (AdaMuS) framework. First, to prevent ignoring the information of low-dimensional views, we construct view-specific encoders to map them into a unified dimensional space. Given that mapping low-dimensional data to a high-dimensional space often causes severe overfitting, we design a parameter-free pruning method to adaptively remove redundant parameters in the encoders. Furthermore, we propose a sparse fusion paradigm that flexibly suppresses redundant dimensions and effectively aligns each view. Additionally, to learn representations with stronger generalization, we propose a self-supervised learning paradigm that obtains supervision information by constructing similarity graphs. Extensive evaluations on a synthetic toy dataset and seven real-world benchmarks demonstrate that AdaMuS consistently achieves superior performance and exhibits strong generalization across both classification and semantic segmentation tasks.

Keywords: multi-view learning, dimensional imbalance, sparse learning, parameter-free pruning, representation alignment, self-supervised learning, generalization

257. ❌ End-to-end data-driven prediction of urban airflow and pollutant dispersion

Authors: Nishant Kumar, Franck Kerhervé, Lionel Agostini, Laurent Cordier Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17606v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

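Scoring rationale: The paper studies the prediction of urban airflow and pollutant dispersion with a data-driven approach that combines spectral proper orthogonal decomposition, autoencoders, LSTM, and convolutional neural networks. Its content is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models and related techniques. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to science (specifically environmental science and computational fluid dynamics) but not to bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This study develops an end-to-end data-driven model for predicting airflow and pollutant dispersion in an urban street canyon by combining reduced-order modeling with deep learning. The results show that the model effectively predicts both instantaneous and statistically stationary flow fields.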

Abstract (Chinese translation)

气候变化与城市人口的快速增长正加剧城市内部的环境压力，使得城市大气流动行为成为影响公共卫生、能源使用及整体宜居性的关键因素。本研究旨在开发快速准确的城市污染物扩散模型，以支持决策者及时且经济高效地实施缓解措施。为实现这一目标，本文提出一种端到端的数据驱动方法，用于模拟和预测处于掠流态的城市街谷内的气流与污染物扩散。研究数据库来源于大涡模拟（Large Eddy Simulation, LES）获取的一系列时间解析快照。所提出的框架基于四个基本步骤：首先，通过对数据库进行谱本征正交分解（Spectral Proper Orthogonal Decomposition, SPOD）获得降阶基；将时间序列快照数据投影至SPOD模态（时域方法）可得到动力学的时间系数。其次，利用自编码器对时间系数进行非线性压缩，以进一步降低问题的维度。随后，在潜空间中使用长短期记忆（Long Short-Term Memory, LSTM）网络学习降阶模型（Reduced-Order Model, ROM）。最后，通过卷积神经网络将预测的速度场映射至污染物场，从而估算污染物扩散。结果表明，该模型在长时间尺度上对瞬时场及统计稳态场的预测均具有显著效能。

Abstract

Climate change and the rapid growth of urban populations are intensifying environmental stresses within cities, making the behavior of urban atmospheric flows a critical factor in public health, energy use, and overall livability. This study aims to develop fast and accurate models of urban pollutant dispersion to support decision-makers, enabling them to implement mitigation measures in a timely and cost-effective manner. To reach this goal, an end-to-end data-driven approach is proposed to model and predict the airflow and pollutant dispersion in a street canyon in the skimming flow regime. A series of time-resolved snapshots obtained from large eddy simulation (LES) serves as the database. The proposed framework is based on four fundamental steps. Firstly, a reduced basis is obtained by spectral proper orthogonal decomposition (SPOD) of the database. The projection of the time series snapshot data onto the SPOD modes (time-domain approach) provides the temporal coefficients of the dynamics. Secondly, a nonlinear compression of the temporal coefficients is performed by an autoencoder to further reduce the dimensionality of the problem. Thirdly, a reduced-order model (ROM) is learned in the latent space using Long Short-Term Memory (LSTM) networks. Finally, the pollutant dispersion is estimated from the predicted velocity field through a convolutional neural network that maps the velocity field to the pollutant field. The results demonstrate the efficacy of the model in predicting the instantaneous as well as statistically stationary fields over a long time horizon.

Keywords: urban airflow, pollutant dispersion, data-driven modeling, reduced-order model, LSTM, convolutional neural network, large eddy simulation, street canyon

258. ❌ One-Step Sampler for Boltzmann Distributions via Drifting

Authors: Wenhan Cao, Keyu Yan, Lin Zhao Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

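Scoring rationale: The paper proposes a drifting-based sampling framework for Boltzmann distributions, a methodological study at the intersection of statistical physics and machine learning. The keywords on large models, deep learning principles, and specific applications do not apply to it. Only "AI for Science" receives 5 points (some relevance) because the work touches on scientific computing. The paper does not address language models, training techniques, inference optimization, alignment, agent systems, or related topics.

!!! tip deepseek-chat TL;DR

The paper proposes a drifting-based framework that amortizes sampling from Boltzmann distributions by training a one-step neural generator, so that test-time sampling takes a single forward pass.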

Abstract (Chinese translation)

本文提出一种基于漂移的框架，用于对由能量函数定义的玻尔兹曼分布进行摊销采样。该方法通过将样本沿高斯平滑的得分场从当前模型分布向目标玻尔兹曼分布投影，训练一步式神经生成器。对于仅已知未归一化常数的目标分布，我们从平滑能量推导出实用的目标侧漂移项，并使用两种估计器：局部重要性采样的均值漂移估计器与二阶曲率校正近似。结合采样器侧平滑得分的小批量高斯均值漂移估计，该方法构建了简单的停梯度目标，实现稳定的一步训练。在四峰高斯混合玻尔兹曼目标上，我们的采样器实现了均值误差 $0.0754$、协方差误差 $0.0425$ 和径向基函数（RBF）最大均值差异（MMD）$0.0020$。在双势阱和香蕉形目标上的进一步实验表明，同一框架也能处理非凸和弯曲的低能量几何结构。总体而言，结果证明漂移法是一种有效方法，可将玻尔兹曼分布的迭代采样在测试时摊销为单次前向计算。

Abstract

We present a drifting-based framework for amortized sampling of Boltzmann distributions defined by energy functions. The method trains a one-step neural generator by projecting samples along a Gaussian-smoothed score field from the current model distribution toward the target Boltzmann distribution. For targets specified only up to an unknown normalization constant, we derive a practical target-side drift from a smoothed energy and use two estimators: a local importance-sampling mean-shift estimator and a second-order curvature-corrected approximation. Combined with a mini-batch Gaussian mean-shift estimate of the sampler-side smoothed score, this yields a simple stop-gradient objective for stable one-step training. On a four-mode Gaussian-mixture Boltzmann target, our sampler achieves mean error $0.0754$, covariance error $0.0425$, and RBF MMD $0.0020$. Additional double-well and banana targets show that the same formulation also handles nonconvex and curved low-energy geometries. Overall, the results support drifting as an effective way to amortize iterative sampling from Boltzmann distributions into a single forward pass at test time.

Keywords: Boltzmann distributions, amortized sampling, drifting framework, one-step neural generator, Gaussian-smoothed score field, energy functions, importance-sampling, RBF MMD
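The abstract above relies on a mini-batch Gaussian mean-shift estimate of a smoothed score. A minimal sketch of that general idea follows, assuming the textbook relation for a Gaussian kernel density estimate with bandwidth `sigma`: the score of the smoothed density at `x` equals the mean-shift vector divided by `sigma**2`. The function names are hypothetical, and this covers only the sample-based (sampler-side) estimate, not the paper's target-side drift or its training objective.

```python
import math


def gaussian_mean_shift(x, samples, sigma):
    """Mean-shift vector at x: kernel-weighted sample mean minus x."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2))
        for s in samples
    ]
    total = sum(weights)  # may underflow to 0 if x is far from every sample
    mean = [
        sum(w * s[d] for w, s in zip(weights, samples)) / total
        for d in range(len(x))
    ]
    return [m - xi for m, xi in zip(mean, x)]


def smoothed_score(x, samples, sigma):
    """Estimate of grad log p_sigma(x) for the Gaussian-smoothed sample density."""
    return [v / sigma ** 2 for v in gaussian_mean_shift(x, samples, sigma)]


# Symmetric samples around x pull equally in both directions.
balanced = gaussian_mean_shift([0.0], [[-1.0], [1.0]], sigma=1.0)
# A single sample pulls x straight toward itself.
pull = smoothed_score([0.0], [[1.0]], sigma=0.5)
```

The estimate needs no gradients of a density model, only samples, which is why it suits a mini-batch setting.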

259. ❌ HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

Authors: Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
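
As a rough guide to reading these tables, the sketch below recomputes the pass mark from the configuration at the top of the report (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula in `keyword_score` is an assumption, since the report does not state it. Under that assumption, the 0.0 weights listed in the table produce a total of 0.0 regardless of relevance.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration

def pass_threshold(total_weight):
    # Pass-mark formula stated in the report configuration
    return 5.0 + 0.8 * total_weight

def keyword_score(weight, relevance):
    # ASSUMPTION: weight x (relevance / 10) x max per keyword; not stated in the report
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD

# (weight, relevance) for the three rows of this table with non-zero relevance
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)
passed = total >= threshold
```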

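Scoring rationale: The paper's core topic is optimizing Speculative Decoding for Vision-Language-Action models, so it is highly relevant to 'Speculative Decoding OR Inference Acceleration' (10 points). It involves robot control, which gives it some connection to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points). VLA models are typically built on large-model techniques, which gives an indirect connection to 'Large Language Models OR LLMs OR Foundation Models' (5 points). The remaining keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the slow inference of Vision-Language-Action models, the paper proposes HeiSD, a hybrid speculative decoding framework. It optimizes retrieval-based speculative decoding and uses a kinematics-based metric to determine the hybrid boundary, reaching speedups of up to 2.45x in simulation and 2.06x to 2.41x in real-world scenarios while maintaining a high task success rate.

Abstract (Chinese translation)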

视觉-语言-动作（Vision-Language-Action, VLA）模型已成为机器人控制的主流解决方案，但其推理速度较慢。推测解码（Speculative Decoding, SD）是一种有前景的加速方法，可分为两类：基于草稿器的SD和基于检索的SD。现有方法未能分析这两类SD在VLA模型中的优缺点，导致其仅被单独应用或优化。本文分析了VLA模型控制的机器人轨迹模式，并得出一个关键结论：两类SD应以混合方式使用。然而，在VLA模型中实现混合SD面临若干挑战：（1）基于检索的SD中存在草稿拒绝和持续错误；（2）难以确定混合边界。为解决这些问题，我们提出了HeiSD框架。在HeiSD中，我们提出了一种基于检索的SD优化方法，包含验证-跳过机制和序列级松弛接受策略。此外，我们提出了一种基于运动学的融合度量，以自动确定混合边界。实验结果表明，HeiSD在仿真基准测试中实现了最高2.45倍的加速，在真实场景中实现了2.06倍至2.41倍的加速，同时保持了较高的任务成功率。

Abstract

Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, so the two types have been applied or optimized only in isolation. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. We propose a retrieval-based SD optimization method in HeiSD, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we propose a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x~2.41x in real-world scenarios, while sustaining a high task success rate.

Keywords: Speculative Decoding, Vision-Language-Action Models, Inference Acceleration, Robot Control, Hybrid Decoding, Kinematic Awareness, Retrieval-based SD, Drafter-based SD
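
Drafter-based and retrieval-based SD share a draft-then-verify loop and differ mainly in where the draft comes from. A minimal sketch of one greedy step of that generic loop follows. The toy `draft_next` and `target_next` functions are hypothetical stand-ins, exact-match acceptance is an assumption, and HeiSD's verify-skip mechanism and relaxed acceptance strategy are not modeled.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-then-verify step with greedy exact-match acceptance."""
    # 1) Draft k tokens with the cheap proposer
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2) Verify against the target model. A real system scores all k positions
    #    in one batched pass; here the target is called per position for clarity.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # every draft accepted: one extra target token
    return accepted

def target_next(ctx):
    # Hypothetical "slow" model: counts upward modulo 5
    return (ctx[-1] + 1) % 5

def draft_next(ctx):
    # Hypothetical "fast" drafter: agrees with the target except after token 2
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 5

out = speculative_step([0], draft_next, target_next, k=4)
```

Here two drafted tokens are accepted and the third is corrected, so a single verification step yields three tokens. A retrieval-based variant would replace `draft_next` with a lookup into stored trajectories.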

260. ❌ Gaussian Process Limit Reveals Structural Benefits of Graph Transformers

Authors: Nil Ayday, Lingchu Yang, Debarghya Ghoshdastidar Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

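Scoring rationale: The paper is a theoretical analysis of Graph Transformers, which falls under graph neural networks (GNNs) and graph attention mechanisms, and has no direct connection to the provided keywords (which mainly target large language models and related techniques). It does not involve large models, applications of deep learning to science, or innovations in the technical principles of large models, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Using Gaussian process limit theory, the paper analyzes the structural advantages of graph transformers on node-level prediction tasks and proves that, compared with graph convolutional networks, they better preserve community information and prevent oversmoothing.

Abstract (Chinese translation)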

图变换器是当前从图结构数据中学习的最先进方法，经验表明其能够避免消息传递架构的若干缺陷。然而，关于这些模型在实践中表现优异的理论分析仍较为有限。在本研究中，我们证明了在节点级预测任务背景下，基于注意力的架构相较于图卷积网络具有结构优势。具体而言，我们研究了具有无限宽度和无限注意力头的图变换器（GAT、Graphormer、Specformer）的神经网络高斯过程极限，并推导了跨层的节点级与边级核函数。我们的结果刻画了节点特征和图结构如何通过图注意力层进行传播。作为一个具体示例，我们证明了图变换器在结构上能够保持社区信息，即使在深层网络中也能维持具有区分度的节点表示，从而防止过度平滑现象。我们通过合成图和真实世界图上的实验证据验证了理论见解，例如整合信息先验和位置编码能够提升深层图变换器的性能。

Abstract

Graph transformers are the state-of-the-art for learning from graph-structured data and are empirically known to avoid several pitfalls of message-passing architectures. However, there is limited theoretical analysis on why these models perform well in practice. In this work, we prove that attention-based architectures have structural benefits over graph convolutional networks in the context of node-level prediction tasks. Specifically, we study the neural network Gaussian process limits of graph transformers (GAT, Graphormer, Specformer) with infinite width and infinite heads, and derive the node-level and edge-level kernels across the layers. Our results characterise how the node features and the graph structure propagate through the graph attention layers. As a specific example, we prove that graph transformers structurally preserve community information and maintain discriminative node representations even in deep layers, thereby preventing oversmoothing. We provide empirical evidence on synthetic and real-world graphs that validates our theoretical insights, for example that integrating informative priors and positional encoding can improve the performance of deep graph transformers.

Keywords: Graph Transformers, Gaussian Process Limit, Node-level Prediction, Attention-based Architectures, Graph Convolutional Networks, Community Information Preservation, Oversmoothing Prevention, Theoretical Analysis
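
The oversmoothing pitfall mentioned in the abstract can be seen in a toy setting. The sketch below repeatedly applies plain neighborhood averaging, a simplified message-passing layer with no learned weights or nonlinearity (an assumption made for brevity), to a hypothetical two-community graph. It illustrates the pitfall only and is not the paper's kernel analysis.

```python
def mean_aggregate(features, neighbors):
    """One layer of mean aggregation over each node and its neighbors (self-loop included)."""
    out = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        out.append(sum(features[j] for j in group) / len(group))
    return out

# Two triangles (communities {0,1,2} and {3,4,5}) joined by the single edge 2-3
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
features = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # community label as a scalar feature

spread_before = max(features) - min(features)
for _ in range(50):                             # 50 stacked averaging layers
    features = mean_aggregate(features, neighbors)
spread_after = max(features) - min(features)
```

After many layers the two communities become nearly indistinguishable: the spread between node features shrinks from 2.0 to well below 0.01.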

261. ❌ In Trust We Survive: Emergent Trust Learning

Authors: Qianpu Chen, Giulio Barbero, Mike Preuss, Derya Soydaner Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

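Scoring rationale: The paper studies a trust learning algorithm among AI agents (Emergent Trust Learning), which belongs to the multi-agent systems field. It is entirely unrelated to the vast majority of the keywords (which concern large-model principles, training methods, inference optimization, and so on). It relates only to 'Multi-agent Systems OR Agent Coordination' (8 points), because it studies cooperation and coordination mechanisms among multiple agents in competitive environments, but it does not involve large models or deep learning techniques. The paper does not mention any concrete application of large models, deep learning, or AI for Science, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight trust learning algorithm (ETL) that enables AI agents to share resources and cooperate in competitive game environments, and validates it in a grid-based resource world, a hierarchical Tower environment, and the Iterated Prisoner's Dilemma.

Abstract (Chinese translation)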

本文提出“涌现式信任学习”（Emergent Trust Learning，ETL），这是一种轻量级、基于信任的控制算法，可嵌入现有智能体中使用。该算法使得智能体能够在共享资源的竞争性游戏环境中实现合作。每个智能体维持一个简洁的内部信任状态，该状态调节其记忆、探索与行动选择。ETL仅需个体奖励与局部观测信息，且产生的计算与通信开销可忽略不计。

我们在三种环境中评估ETL：在基于网格的资源世界中，基于信任的智能体减少了冲突，防止了长期资源枯竭，同时获得了有竞争力的个体收益；在具有强烈社会困境和随机楼层分配的分层塔楼环境中，ETL维持了高存活率，即使在经历长期的强制贪婪阶段后仍能恢复合作；在迭代囚徒困境中，该算法可推广至策略元博弈，在与互惠型对手保持合作的同时，避免长期被背叛者利用。代码将在论文发表时公开。

Abstract

We introduce Emergent Trust Learning (ETL), a lightweight, trust-based control algorithm that can be plugged into existing AI agents. It enables these to reach cooperation in competitive game environments under shared resources. Each agent maintains a compact internal trust state, which modulates memory, exploration, and action selection. ETL requires only individual rewards and local observations and incurs negligible computational and communication overhead. We evaluate ETL in three environments: In a grid-based resource world, trust-based agents reduce conflicts and prevent long-term resource depletion while achieving competitive individual returns. In a hierarchical Tower environment with strong social dilemmas and randomised floor assignments, ETL sustains high survival rates and recovers cooperation even after extended phases of enforced greed. In the Iterated Prisoner’s Dilemma, the algorithm generalises to a strategic meta-game, maintaining cooperation with reciprocal opponents while avoiding long-term exploitation by defectors. Code will be released upon publication.

Keywords: Emergent Trust Learning, AI agents, cooperation, multi-agent systems, trust-based control, competitive environments, resource sharing, Prisoner’s Dilemma

262. ❌ Consistency of the $k$-Nearest Neighbor Regressor under Complex Survey Designs

Authors: Caren Hasler Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

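Scoring rationale: The paper studies the theoretical properties (consistency and convergence rate) of a traditional statistical learning method, the k-nearest neighbor regressor, under complex survey designs, which belongs to classical statistical machine learning. Its content does not touch on large models, deep learning, AI for Science, or any modern large-model technique (such as pre-training, fine-tuning, alignment, inference optimization, or agents). All keywords concern large-model techniques, whereas the paper focuses on a traditional nonparametric regression method, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the consistency of the k-nearest neighbor regressor under complex survey designs. It proves consistency under regularity conditions on the sampling design and the data distribution, and derives lower bounds on the convergence rate that exhibit the curse of dimensionality.

Abstract (Chinese translation)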

我们研究了复杂抽样设计下$k$近邻回归估计量的一致性。尽管该算法在独立同分布数据下的一致性结果已得到充分确立，但针对复杂抽样数据的相应研究尚属空白。我们证明，在抽样设计和数据分布满足一定正则性条件时，$k$近邻回归估计量具有一致性。我们推导了收敛速率的下界，并证明这些下界与独立同分布情形类似，同样呈现出维度灾难现象。基于模拟数据和实际数据的实证研究验证了我们的理论结果。

Abstract

We study the consistency of the $k$-nearest neighbor regressor under complex survey designs. While consistency results for this algorithm are well established for independent and identically distributed data, corresponding results for complex survey data are lacking. We show that the $k$-nearest neighbor regressor is consistent under regularity conditions on the sampling design and the distribution of the data. We derive lower bounds for the rate of convergence and show that these bounds exhibit the curse of dimensionality, as in the independent and identically distributed setting. Empirical studies based on simulated and real data illustrate our theoretical findings.

Keywords: k-nearest neighbor regressor, complex survey designs, consistency, convergence rate, curse of dimensionality, nonparametric regression, sampling design, statistical learning theory
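
For reference, a minimal sketch of a $k$-nearest neighbor regressor follows. The unweighted branch is the classical estimator for i.i.d. data. The optional `weights` argument, read here as survey design weights such as inverse inclusion probabilities, is an assumption: the abstract does not say how the paper's estimator incorporates the sampling design.

```python
def knn_regress(x_query, xs, ys, k, weights=None):
    """Predict y at x_query as the (optionally weighted) mean of the k nearest responses."""
    if weights is None:
        weights = [1.0] * len(xs)   # unweighted case: classical i.i.d. k-NN regressor
    # Rank sample indices by squared Euclidean distance to the query point
    order = sorted(
        range(len(xs)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(xs[i], x_query)),
    )
    nearest = order[:k]
    total_w = sum(weights[i] for i in nearest)
    return sum(weights[i] * ys[i] for i in nearest) / total_w

xs = [(0.0,), (1.0,), (2.0,), (10.0,)]
ys = [0.0, 1.0, 2.0, 10.0]
pred_plain = knn_regress((1.2,), xs, ys, k=2)   # neighbors at x=1 and x=2
# Hypothetical design weights: the unit at x=1 represents three population units
pred_weighted = knn_regress((1.2,), xs, ys, k=2, weights=[1.0, 3.0, 1.0, 1.0])
```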

263. ❌ Conditional Inverse Learning of Time-Varying Reproduction Numbers Inference

Authors: Lanlan Yu, Quan-Hui Liu, Haoyue Zheng, Xinfu Yang Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

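Scoring rationale: The paper focuses on estimating time-varying reproduction numbers in infectious disease epidemiology and proposes a conditional inverse learning framework (CIRL). It covers statistical modeling, epidemiological models, and inverse problem solving, and is an application of AI to science. However, it does not involve large language models (LLMs), deep learning principles, or related keywords (such as MoE, SFT, RAG, or quantization). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies AI to epidemiology (a field adjacent to bioinformatics), but this is not its core content, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a conditional inverse learning framework (CIRL) for estimating time-varying reproduction numbers from epidemic incidence data. It addresses the limitations of traditional methods under non-stationary transmission dynamics and is validated on synthetic and real-world data.

Abstract (Chinese translation)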

从流行病发病率数据中估计时变再生数是传染病监测的核心任务，但这本质上是一个不适定的反问题。现有方法通常依赖于从流行病学模型衍生的强结构假设，这可能限制其适应由干预措施或行为变化引起的非平稳传播动态的能力，导致对状态转变的检测延迟以及估计精度下降。在本研究中，我们提出了一种条件逆再生学习框架（Conditional Inverse Reproduction Learning, CIRL），该框架通过学习从历史发病模式与显式时间信息到潜在再生数的条件映射来解决这一反问题。CIRL不强加严格的参数约束，而是通过基于灵活似然的统计建模，将流行病学结构进行软性整合，并以更新方程作为前向算子来保证动力学一致性。该框架结合了基于流行病学的约束与数据驱动的时间表征，能够生成对观测噪声稳健的再生数估计，同时保持对突发传播变化和零膨胀发病率观测的响应能力。在具有受控状态变化的合成流行病数据以及真实世界的SARS和COVID-19数据上的实验验证了所提方法的有效性。

Abstract

Estimating time-varying reproduction numbers from epidemic incidence data is a central task in infectious disease surveillance, yet it poses an inherently ill-posed inverse problem. Existing approaches often rely on strong structural assumptions derived from epidemiological models, which can limit their ability to adapt to non-stationary transmission dynamics induced by interventions or behavioral changes, leading to delayed detection of regime shifts and degraded estimation accuracy. In this work, we propose a Conditional Inverse Reproduction Learning framework (CIRL) that addresses the inverse problem by learning a conditional mapping from historical incidence patterns and explicit time information to latent reproduction numbers. Rather than imposing strongly enforced parametric constraints, CIRL softly integrates epidemiological structure with flexible likelihood-based statistical modeling, using the renewal equation as a forward operator to enforce dynamical consistency. The resulting framework combines epidemiologically grounded constraints with data-driven temporal representations, producing reproduction number estimates that are robust to observation noise while remaining responsive to abrupt transmission changes and zero-inflated incidence observations. Experiments on synthetic epidemics with controlled regime changes and real-world SARS and COVID-19 data demonstrate the effectiveness of the proposed approach.

Keywords: time-varying reproduction numbers, inverse problem, epidemic incidence data, conditional inverse learning, renewal equation, SARS, COVID-19, regime shifts
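
The renewal equation used as the forward operator is commonly written as $I_t = R_t \sum_{s \ge 1} w_s I_{t-s}$, where $w_s$ is the generation-interval distribution. A minimal sketch of that forward map and of its naive pointwise inversion follows. The generation-interval weights and the reproduction-number trajectory are hypothetical, the data are noise-free, and this is not the CIRL model itself.

```python
def infection_pressure(incidence, t, gen_interval):
    """Sum over s of w_s * I_{t-s}: past incidence weighted by the generation interval."""
    return sum(
        w * incidence[t - s]
        for s, w in enumerate(gen_interval, start=1)
        if t - s >= 0
    )

def simulate(r_values, gen_interval, seed_cases):
    """Forward operator: expected incidence I_t = R_t * sum_s w_s * I_{t-s}."""
    incidence = [seed_cases]
    for t in range(1, len(r_values)):
        incidence.append(r_values[t] * infection_pressure(incidence, t, gen_interval))
    return incidence

def naive_r(incidence, gen_interval):
    """Pointwise inversion R_t = I_t / pressure (assumes the pressure is non-zero)."""
    out = [None]   # R_0 is undefined: there is no past incidence at t = 0
    for t in range(1, len(incidence)):
        out.append(incidence[t] / infection_pressure(incidence, t, gen_interval))
    return out

w = [0.5, 0.3, 0.2]                          # hypothetical generation-interval weights
r_true = [None, 2.0, 2.0, 2.0, 0.5, 0.5]     # abrupt regime change at t = 4
cases = simulate(r_true, w, seed_cases=10.0)
r_hat = naive_r(cases, w)
```

On noise-free data the ratio recovers the trajectory, including the regime change. With noisy or zero-inflated counts the same ratio becomes unstable or undefined, which is one way to see why the abstract calls the problem ill-posed.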

264. ❌ CA-Based Interpretable Knowledge Representation and Analysis of Geometric Design Parameters

Authors: Alexander Köhler, Michael Breuß Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17535v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

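Scoring rationale: The paper focuses on interpretable knowledge representation and analysis of CAD geometric design parameters, using principal component analysis (PCA) for dimension reduction and parameter estimation. Its content belongs entirely to traditional computer-aided design (CAD) and engineering optimization, and does not involve large models, deep learning, AI for Science, or related technical principles. All keywords concern large models, deep learning, AI applications, or related techniques, whereas the paper studies traditional engineering methods, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to accurately estimate CAD design parameters from PCA-based geometric representations, analyzes the limitations of an existing approach, and presents reasonable conditions under which interpretable parameter estimation can be obtained.

Abstract (Chinese translation)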

在许多基于CAD的应用中，复杂几何结构由大量设计参数定义，这导致设计空间维度极高，对仿真、优化和设计探索等下游工程流程构成挑战。因此，常采用主成分分析（PCA）等降维方法。PCA能够识别几何变化的主导模式，并生成几何结构的紧凑表示。虽然经典PCA在紧凑表示方面表现优异，但无法直接还原生成几何的底层设计参数。本研究致力于解决从基于PCA的表示中估计设计参数的问题。通过分析近期针对本应用领域提出的PCA改进方法，我们证明其结果实际上与标准PCA相同。我们探讨了该方法的局限性，并提出了能够实现准确、可解释参数估计的合理条件。借助专项实验，我们对PCA各阶段及其过程中几何形态的可能变化进行了深入考察。

Abstract

In many CAD-based applications, complex geometries are defined by a high number of design parameters. This leads to high-dimensional design spaces that are challenging for downstream engineering processes like simulations, optimization, and design exploration tasks. Therefore, dimension reduction methods such as principal component analysis (PCA) are used. The PCA identifies dominant modes of geometric variation and yields a compact representation of the geometry. While classical PCA excels in the compact representation part, it does not directly recover underlying design parameters of a generated geometry. In this work, we deal with the problem of estimating design parameters from PCA-based representations. Analyzing a recent modification of the PCA dedicated to our field of application, we show that the results are actually identical to the standard PCA. We investigate limitations of this approach and present reasonable conditions under which accurate, interpretable parameter estimation can be obtained. With the help of dedicated experiments, we take a more in-depth look at every stage of the PCA and the possible changes of the geometry during these processes.

Keywords: CAD, geometric design parameters, principal component analysis, dimension reduction, parameter estimation, interpretable representation, design space, engineering optimization
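
To make the parameter-estimation question concrete, the sketch below runs a closed-form 2-D PCA on a hypothetical one-parameter family of geometries and maps the first principal-component score back to the design parameter. It assumes the geometry depends linearly on a single parameter and that two designs with known parameters are available for calibration. The paper's own conditions are not given in the abstract, so this is an illustration only.

```python
import math

def pca_first_component(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]]
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * b, a - c)   # angle of the leading eigenvector
    return (mx, my), (math.cos(theta), math.sin(theta))

def pc_score(point, mean, direction):
    """Coordinate of a point along the first principal component."""
    return (point[0] - mean[0]) * direction[0] + (point[1] - mean[1]) * direction[1]

# Hypothetical design family: width t and height 2t for design parameter t
params = [1.0, 2.0, 3.0, 4.0]
geoms = [(t, 2.0 * t) for t in params]
mean, direction = pca_first_component(geoms)
scores = [pc_score(g, mean, direction) for g in geoms]

# Calibrate an affine map score -> parameter from the first and last design
slope = (params[-1] - params[0]) / (scores[-1] - scores[0])
recovered = [params[0] + slope * (s - scores[0]) for s in scores]
```

In this linear case the score is an affine function of the parameter, so the intermediate designs are recovered from their scores. With nonlinear or coupled parameters that simple relationship would generally no longer hold.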

265. ❌ Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify

Authors: Edoardo D’Amico, Marco De Nadai, Praveen Chandar, Divita Vohra, Shawn Lin, Max Lefarov, Paul Gigioli, Gustavo Penha, Ilya Kopysitsky, Ivo Joel Senese, Darren Mei, Francesco Fabbri, Oguz Semerci, Yu Zhao, Vincent Tang, Brian St. Thomas, Alexandra Ranieri, Matthew N. K. Smith, Aaron Bernkopf, Bryan Leung, Ghazal Fazelnia, Mark VanMiddlesworth, Timothy Christopher Heath, Petter Pehrson Skiden, Alice Y. Wang, Doug J. Cole, Andreas Damianou, Maya Hristakeva, Reid Wilbur, Tarun Chillara, Vladan Radosavljevic, Pooja Chitkara, Sainath Adapa, Juan Elenter, Bernd Huber, Jacqueline Wood, Saaketh Vedantam, Jan Stypka, Sandeep Ghael, Martin D. Gould, David Murgatroyd, Yves Raimond, Mounia Lalmas, Paul N. Bennett Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

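Scoring rationale: The paper explicitly mentions using LLMs for semantic reasoning and contextual conditioning and adopts an instruction-following task formulation, so it is highly relevant to 'Large Language Models' and 'Instruction Tuning' (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, and RAG are not mentioned in the abstract or are unrelated to the paper, and receive 0 points. The paper applies large models to recommender systems, which fits the research background.

!!! tip deepseek-chat TL;DR

The paper studies how to use large language models (LLMs) and an instruction-following task formulation to build GLIDE, a production-scale generative recommender at Spotify, addressing the challenge of balancing stable user preferences with changing intent in podcast discovery. In large-scale online A/B tests, the system increases non-habitual podcast streaming by up to 5.4% and new-show discovery by up to 14.3%.

Abstract (Chinese translation)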

播客收听行为通常基于一系列固定喜爱的节目，而听众的收听意图可能随时间变化。这种稳定偏好与动态意图的结合，催生了需要同时兼顾熟悉度与探索性的推荐方法。传统推荐系统通常侧重于长期交互模式，较少显式设计以融入丰富的上下文信号或灵活的、意图感知的发现目标。在此背景下，能够对语义、上下文和用户状态进行联合推理的模型展现出重要潜力。大语言模型（LLMs）为面向发现的推荐提供了强大的语义推理和上下文条件化能力，但将其部署于生产环境时，面临着目录锚定、用户级个性化以及低延迟服务等方面的挑战。

我们通过GLIDE应对这些挑战——这是Spotify面向播客发现的生产级生成式推荐系统。GLIDE利用语义标识符（Semantic IDs）将离散化目录中的推荐任务构建为指令跟随任务，从而实现对海量库存的锚定生成。该模型以近期收听历史和轻量级用户上下文为条件，同时将长期用户嵌入作为软提示注入，以在严格推理约束下捕捉稳定偏好。我们通过离线检索指标、人工评估和基于LLM的评估对GLIDE进行评测，并通过大规模在线A/B测试验证其实际影响。在涉及数百万用户的实验中，GLIDE将Spotify首页界面上非习惯性播客流媒体播放量提升了最高5.4%，新节目发现量提升了最高14.3%，同时满足了生产环境对成本和延迟的约束要求。

Abstract

Podcast listening is often grounded in a set of favorite shows, while listener intent can evolve over time. This combination of stable preferences and changing intent motivates recommendation approaches that support both familiarity and exploration. Traditional recommender systems typically emphasize long-term interaction patterns, and are less explicitly designed to incorporate rich contextual signals or flexible, intent-aware discovery objectives. In this setting, models that can jointly reason over semantics, context, and user state offer a promising direction. Large Language Models (LLMs) provide strong semantic reasoning and contextual conditioning for discovery-oriented recommendation, but deploying them in production introduces challenges in catalog grounding, user-level personalization, and latency-critical serving. We address these challenges with GLIDE, a production-scale generative recommender for podcast discovery at Spotify. GLIDE formulates recommendation as an instruction-following task over a discretized catalog using Semantic IDs, enabling grounded generation over a large inventory. The model conditions on recent listening history and lightweight user context, while injecting long-term user embeddings as soft prompts to capture stable preferences under strict inference constraints. We evaluate GLIDE using offline retrieval metrics, human judgments, and LLM-based evaluation, and validate its impact through large-scale online A/B testing. Across experiments involving millions of users, GLIDE increases non-habitual podcast streaming on Spotify home surface by up to 5.4% and new-show discovery by up to 14.3%, while meeting production cost and latency constraints.

Keywords: Large Language Models, Generative Recommender, Semantic IDs, Instruction-following, Podcast Discovery, Personalization, A/B Testing, Production Deployment

266. ❌ A Unified Language Model for Large Scale Search, Recommendation, and Reasoning

Authors: Marco De Nadai, Edoardo D’Amico, Max Lefarov, Alexandre Tamborrino, Divita Vohra, Mark VanMiddlesworth, Shawn Lin, Jacqueline Wood, Jan Stypka, Eliza Klyce, Keshi Dai, Timothy Christopher Heath, Martin D. Gould, Yves Raimond, Sandeep Ghael, Tony Jebara, Andreas Damianou, Vladan Radosavljevic, Paul N. Bennett, Mounia Lalmas, Praveen Chandar Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

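Scoring rationale: The paper's core topic is the unified application of LLMs to recommendation, search, and reasoning tasks, so it is highly relevant to 'Large Language Models' and 'Instruction Tuning' (10 points each), because it instruction-tunes a pre-trained LLM to achieve task control. It has some connection to 'Pre-training' and 'Post-training' (5 points each), since it involves adapting and fine-tuning a pre-trained model. It is partially relevant to 'Chain of Thought' and 'Tool Use' (5 points each), since it involves reasoning and a design that avoids tool use. Other keywords such as MoE, SLMs, and RAG are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how to adapt a pre-trained large language model into a unified, tool-free generative model that supports large-scale recommendation, search, and reasoning tasks together. In offline experiments on a real-world catalog, the proposed NEO framework outperforms task-specific baselines and exhibits cross-task transfer.

Abstract (Chinese translation)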

大型语言模型正日益应用于推荐、检索与推理任务，然而部署一个能够在大规模异构目录上联合支持这些行为的单一端到端模型仍具挑战性。此类系统必须生成对真实项目的明确引用，处理多种实体类型，并在严格的延迟与可靠性约束下运行——这些要求仅凭纯文本生成难以满足。虽然工具增强的推荐系统解决了部分问题，但它们引入了编排复杂性并限制了端到端优化。我们将此场景视为一个更广泛研究问题的实例：如何使LLMs以完全自包含的方式，在多个领域实体、用户与语言上联合进行推理。为此，我们提出NEO框架，它将预训练的仅解码器LLM适配为一个无需工具、基于目录的生成器。NEO将项目表示为结构化标识符（SIDs），并训练单一模型在共享序列中交错生成自然语言与类型化项目标识符。文本提示控制任务、目标实体类型和输出格式（标识符、文本或混合形式），而约束解码则保证生成目录有效的项目，同时不限制自由文本的生成。我们将这种指令条件控制能力称为语言可操控性。我们将SIDs视为一种独立模态，并通过分阶段对齐与指令微调，研究了将离散实体表示整合进LLMs的设计方案。我们在一个包含多种媒体类型、超过1000万个项目的真实世界目录上，对NEO进行了大规模评估，涵盖推荐、搜索和用户理解等多种发现任务。离线实验表明，NEO持续优于强大的任务专用基线模型，并展现出跨任务迁移能力，为将大规模发现能力整合至单一语言可操控生成模型提供了一条可行路径。

Abstract

LLMs are increasingly applied to recommendation, retrieval, and reasoning, yet deploying a single end-to-end model that can jointly support these behaviors over large, heterogeneous catalogs remains challenging. Such systems must generate unambiguous references to real items, handle multiple entity types, and operate under strict latency and reliability constraints, requirements that are difficult to satisfy with text-only generation. While tool-augmented recommender systems address parts of this problem, they introduce orchestration complexity and limit end-to-end optimization. We view this setting as an instance of a broader research problem: how to adapt LLMs to reason jointly over multiple-domain entities, users, and language in a fully self-contained manner. To this end, we introduce NEO, a framework that adapts a pre-trained decoder-only LLM into a tool-free, catalog-grounded generator. NEO represents items as SIDs and trains a single model to interleave natural language and typed item identifiers within a shared sequence. Text prompts control the task, target entity type, and output format (IDs, text, or mixed), while constrained decoding guarantees catalog-valid item generation without restricting free-form text. We refer to this instruction-conditioned controllability as language-steerability. We treat SIDs as a distinct modality and study design choices for integrating discrete entity representations into LLMs via staged alignment and instruction tuning. We evaluate NEO at scale on a real-world catalog of over 10M items across multiple media types and discovery tasks, including recommendation, search, and user understanding. In offline experiments, NEO consistently outperforms strong task-specific baselines and exhibits cross-task transfer, demonstrating a practical path toward consolidating large-scale discovery capabilities into a single language-steerable generative model.

Keywords: Large Language Models, Unified Model, Recommendation, Search, Reasoning, Instruction Tuning, Catalog-grounded Generation, Cross-task Transfer

267. ❌ Mirror Descent on Riemannian Manifolds

Authors: Jiaxin Jiang, Lei Shi, Jiyuan Tan Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17527v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

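Scoring rationale: The paper studies mirror descent optimization on Riemannian manifolds, which belongs to mathematical optimization theory and is unrelated to the scoring keywords (all of which concern large models, deep learning techniques, and their applications). Apart from a passing mention of neural network training as an application area of mirror descent, the paper does not address large models, deep learning, or AI applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper generalizes mirror descent to optimization on Riemannian manifolds, proposing a Riemannian Mirror Descent (RMD) framework and a stochastic variant with non-asymptotic convergence guarantees. On the Stiefel manifold, RMD reduces to the Curvilinear Gradient Descent (CGD) method.

Abstract Translation (Chinese)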

镜像下降法是一种可扩展的一阶优化方法，广泛应用于大规模优化问题中，在图像处理、策略优化和神经网络训练等领域均有应用。本文将镜像下降法推广至黎曼流形上的优化问题。具体而言，我们通过重参数化技术发展了黎曼镜像下降法框架，并进一步提出了其随机变体。同时，我们为黎曼镜像下降法及其随机版本建立了非渐近收敛性保证。作为在Stiefel流形上的应用，我们的黎曼镜像下降法框架可简化为文献[26]提出的曲线梯度下降法（Curvilinear Gradient Descent, CGD）。此外，将随机黎曼镜像下降法框架特化到Stiefel流形场景时，我们得到了曲线梯度下降法的随机扩展版本，该方法能有效处理大规模流形优化问题。

Abstract

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.

Keywords: Mirror Descent, Riemannian manifolds, optimization, stochastic optimization, convergence guarantees, Stiefel manifold, Curvilinear Gradient Descent
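For readers unfamiliar with the Euclidean method that the paper generalizes, the following is a minimal sketch of classical mirror descent on the probability simplex with the negative-entropy mirror map (the exponentiated-gradient update). It is not the paper's Riemannian Mirror Descent; the objective, step size, and function name are illustrative assumptions.

```python
import math

def mirror_descent_simplex(grad, x0, step=0.5, iters=200):
    """Classical mirror descent on the simplex with the negative-entropy mirror map.

    The update x_i <- x_i * exp(-step * g_i), followed by normalization, is the
    closed form of the mirror step for this mirror map (exponentiated gradient).
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x

# Hypothetical objective: the linear cost f(x) = <c, x>, which is minimized over
# the simplex by putting all mass on the smallest entry of c.
c = [3.0, 1.0, 2.0]
x_star = mirror_descent_simplex(lambda x: c, [1 / 3, 1 / 3, 1 / 3])
```

The iterate stays on the simplex by construction, which is the practical appeal of choosing a mirror map matched to the constraint set.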

268. ❌ Anisotropic Permeability Tensor Prediction from Porous Media Microstructure via Physics-Informed Progressive Transfer Learning with Hybrid CNN-Transformer

Authors: Mohammad Nooraiepour Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

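Scoring rationale: The paper focuses on predicting permeability tensors of porous media with deep learning (specifically a hybrid CNN-Transformer architecture), which is an AI-for-science application. It is unrelated to most keywords (LLMs, MoE, RLHF, and so on), which mainly concern large language models and related techniques. Two keywords are relevant: 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 10), because the paper explicitly applies AI to a scientific problem (subsurface flow modeling), and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (relevance 5), because the paper uses ImageNet pre-training and progressive transfer learning but does not involve pre-training or domain adaptation of large language models. No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes a physics-informed deep learning framework that combines the MaxViT hybrid CNN-Transformer architecture with progressive transfer learning to predict permeability tensors from porous-media microstructure images, reaching a variance-weighted R² = 0.9960 on the test set.

Abstract Translation (Chinese)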

基于孔隙尺度微观结构图像精确预测渗透率张量是地下渗流建模的关键，但直接数值模拟每个样本需耗时数小时，从根本上制约了大规模不确定性量化与储层优化工作流程。本文提出一种物理信息深度学习框架，通过将MaxViT混合CNN-Transformer架构与渐进式迁移学习及可微分物理约束相结合，解决了这一瓶颈问题。MaxViT的多轴注意力机制通过块局部操作解析颗粒级孔喉几何形态，并通过网格全局操作获取表征体元尺度的连通性统计特征，提供了渗透率张量预测所需的物理空间层次结构。基于涵盖三个数量级渗透率的20000个合成多孔介质样本进行训练，采用三阶段渐进式学习策略：从经过D4等变性增强与张量变换的ImageNet预训练基线出发，通过优先考虑非对角耦合项的组分加权损失函数，最终推进至采用特征线性调制（FiLM）进行孔隙度条件约束的冻结主干网络迁移学习。昂萨格互易性与正定性通过可微分惩罚项强制实施。在包含4000个样本的独立测试集上，该框架实现了方差加权R2 = 0.9960（R2_Kxx = 0.9967，R2_Kxy = 0.9758），相较于监督基线模型未解释方差降低33%。研究结果为物理信息科学机器学习提供了三项可迁移原则：大规模视觉预训练能够有效跨越领域边界迁移；物理约束作为可微分架构组件集成时最具鲁棒性；基于诊断性失效模式分析的渐进式训练，能够明确归因各方法阶段对性能提升的贡献。

Abstract

Accurate prediction of permeability tensors from pore-scale microstructure images is essential for subsurface flow modeling, yet direct numerical simulation requires hours per sample, fundamentally limiting large-scale uncertainty quantification and reservoir optimization workflows. A physics-informed deep learning framework is presented that resolves this bottleneck by combining a MaxViT hybrid CNN-Transformer architecture with progressive transfer learning and differentiable physical constraints. MaxViT’s multi-axis attention mechanism simultaneously resolves grain-scale pore-throat geometry via block-local operations and REV-scale connectivity statistics through grid-global operations, providing the spatial hierarchy that permeability tensor prediction physically requires. Training on 20000 synthetic porous media samples spanning three orders of magnitude in permeability, a three-phase progressive curriculum advances from an ImageNet-pretrained baseline with D4-equivariant augmentation and tensor transformation, through component-weighted loss prioritizing off-diagonal coupling, to frozen-backbone transfer learning with porosity conditioning via Feature-wise Linear Modulation (FiLM). Onsager reciprocity and positive definiteness are enforced via differentiable penalty terms. On a held-out test set of 4000 samples, the framework achieves variance-weighted R² = 0.9960 (R²_Kxx = 0.9967, R²_Kxy = 0.9758), a 33% reduction in unexplained variance over the supervised baseline. The results offer three transferable principles for physics-informed scientific machine learning: large-scale visual pretraining transfers effectively across domain boundaries; physical constraints are most robustly integrated as differentiable architectural components; and progressive training guided by diagnostic failure-mode analysis enables unambiguous attribution of performance gains across methodological stages.

Keywords: permeability tensor prediction, porous media microstructure, physics-informed deep learning, CNN-Transformer architecture, progressive transfer learning, MaxViT, differentiable physical constraints, subsurface flow modeling
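The abstract states that Onsager reciprocity and positive definiteness are enforced through penalty terms but does not give their form. One plausible shape for such penalties on a 2x2 tensor is sketched below in plain Python; the function name, the squared-hinge form, and the example tensors are assumptions for illustration, and in training these terms would be written in an autodiff framework.

```python
import math

def physics_penalties(k):
    """Penalties for a predicted 2x2 permeability tensor k = [[kxx, kxy], [kyx, kyy]].

    Returns (symmetry_penalty, spd_penalty). Both are zero for a symmetric
    positive-definite tensor and grow as the constraints are violated.
    """
    (kxx, kxy), (kyx, kyy) = k
    # Onsager reciprocity: the tensor should be symmetric, kxy == kyx.
    symmetry_penalty = (kxy - kyx) ** 2
    # Positive definiteness: penalize a negative smallest eigenvalue of the
    # symmetric part (closed form for a 2x2 symmetric matrix).
    off = 0.5 * (kxy + kyx)
    mean = 0.5 * (kxx + kyy)
    radius = math.sqrt((0.5 * (kxx - kyy)) ** 2 + off ** 2)
    min_eig = mean - radius
    spd_penalty = max(0.0, -min_eig) ** 2
    return symmetry_penalty, spd_penalty

good = physics_penalties([[2.0, 0.3], [0.3, 1.0]])  # symmetric, positive definite
bad = physics_penalties([[1.0, 2.0], [1.0, 1.0]])   # asymmetric, indefinite
```

Adding such terms to the data loss with small weights nudges predictions toward physically admissible tensors without hard-coding the constraint into the output layer.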

269. ❌ Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model

Authors: Luca Pellegrini Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

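Scoring rationale: The paper studies Neural Operators applied to the FitzHugh-Nagumo partial differential equation model, an application of deep learning to scientific computing. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to large language models (LLMs) and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies deep learning to a biophysical model of excitable cells, which falls under AI for Science, but it is not a core match (bioinformatics and cheminformatics are not explicitly mentioned), so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The study evaluates how well seven neural operator architectures capture the stiff spatio-temporal dynamics of the FitzHugh-Nagumo model (a partial differential equation model of excitable cells), with a focus on translation invariance. Convolutional Neural Operators perform well on the translated test set but are costly to train, while Fourier Neural Operators achieve the lowest training error but have the longest inference time and generalize poorly.

Abstract Translation (Chinese)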

神经算子是一种强大的深度学习框架，旨在学习由偏微分方程导出的解算子。本研究探讨了神经算子捕捉FitzHugh-Nagumo模型（一种描述可兴奋细胞的模型）中刚性时空动力学的能力。本工作的一个关键贡献在于通过一种新颖的训练策略来评估其平移不变性。我们使用在固定时间施加不同空间位置和强度的电流对神经算子进行训练，并在测试集中引入更具挑战性的分布外场景，其中施加的电流在时间和空间上均发生平移。该方法显著降低了数据集生成的计算成本。此外，我们对七种神经算子架构进行了基准测试：卷积神经算子、深度算子网络、带CNN编码器的深度算子网络、本征正交分解深度算子网络、傅里叶神经算子、Tucker张量化傅里叶神经算子以及局部化神经算子。我们根据训练与测试精度、效率以及推理速度对这些模型进行了评估。我们的结果表明，卷积神经算子在平移后的测试动力学上表现良好，但其训练成本较高，尽管其在训练集上的性能与其他考虑的架构相似。相比之下，傅里叶神经算子实现了最低的训练误差，但推理时间最长。对于平移后的动力学，傅里叶神经算子及其变体的预测精度较低。最后，深度算子网络及其变体在训练和推理方面均表现出高效率，但其对测试集的泛化能力不佳。这些发现凸显了神经算子在捕捉复杂离子模型动力学方面的当前能力与局限，并提供了一个全面的基准，包括其在涉及平移动力学的场景中的应用。

Abstract

Neural Operators (NOs) are a powerful deep learning framework designed to learn the solution operators that arise from partial differential equations. This study investigates NOs' ability to capture the stiff spatio-temporal dynamics of the FitzHugh-Nagumo model, which describes excitable cells. A key contribution of this work is evaluating translation invariance using a novel training strategy. NOs are trained using an applied current with varying spatial locations and intensities at a fixed time, and the test set introduces a more challenging out-of-distribution scenario in which the applied current is translated in both time and space. This approach significantly reduces the computational cost of dataset generation. Moreover, we benchmark seven NO architectures: Convolutional Neural Operators (CNOs), Deep Operator Networks (DONs), DONs with CNN encoder (DONs-CNN), Proper Orthogonal Decomposition DONs (POD-DONs), Fourier Neural Operators (FNOs), Tucker Tensorized FNOs (TFNOs), and Localized Neural Operators (LocalNOs). We evaluated these models based on training and test accuracy, efficiency, and inference speed. Our results reveal that CNOs perform well on translated test dynamics. However, they require higher training costs, though their performance on the training set is similar to that of the other considered architectures. In contrast, FNOs achieve the lowest training error but have the highest inference time. Regarding the translated dynamics, FNOs and their variants provide less accurate predictions. Finally, DONs and their variants demonstrate high efficiency in both training and inference; however, they do not generalize well to the test set. These findings highlight the current capabilities and limitations of NOs in capturing complex ionic model dynamics and provide a comprehensive benchmark including their application to scenarios involving translated dynamics.

Keywords: Neural Operators, FitzHugh-Nagumo model, translation invariance, partial differential equations, deep learning, benchmark, spatio-temporal dynamics, excitable cells
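The translation test can be pictured with a small numerical experiment: on a periodic grid, shifting the applied current in space shifts the reference solution by the same amount, which is the behavior a translation-aware surrogate would need to reproduce. The sketch below uses a common textbook form of the FitzHugh-Nagumo equations with an explicit-Euler solver; the grid, parameter values, and stimulus shape are assumptions, not the paper's setup.

```python
def simulate_fhn(stim_center, n=64, steps=400, stim_steps=100, dt=0.05, dx=1.0,
                 diff=1.0, eps=0.08, a=0.7, b=0.8, amp=1.0):
    """Explicit-Euler solve of a 1D FitzHugh-Nagumo system on a periodic grid.

    dv/dt = D * v_xx + v - v^3/3 - w + I(x, t),   dw/dt = eps * (v + a - b * w)
    The applied current I is a 5-cell box centred at `stim_center`, switched
    on for the first `stim_steps` time steps only.
    """
    v = [-1.2] * n
    w = [-0.6] * n
    box = [amp if min((i - stim_center) % n, (stim_center - i) % n) <= 2 else 0.0
           for i in range(n)]
    for step in range(steps):
        on = 1.0 if step < stim_steps else 0.0
        # Periodic second difference; v[i - 1] wraps around for i == 0.
        lap = [(v[i - 1] - 2.0 * v[i] + v[(i + 1) % n]) / dx ** 2 for i in range(n)]
        v_new = [v[i] + dt * (diff * lap[i] + v[i] - v[i] ** 3 / 3.0 - w[i] + on * box[i])
                 for i in range(n)]
        w = [w[i] + dt * eps * (v[i] + a - b * w[i]) for i in range(n)]
        v = v_new
    return v

base = simulate_fhn(stim_center=20)
shifted = simulate_fhn(stim_center=30)
# Largest mismatch after undoing the 10-cell shift of the stimulus.
shift_error = max(abs(shifted[(i + 10) % 64] - base[i]) for i in range(64))
```

A learned operator can be scored the same way: feed it the shifted stimulus and compare its output against the shifted reference solution.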

270. ❌ Efficient Soft Actor-Critic with LLM-Based Action-Level Guidance for Continuous Control

Authors: Hao Ma, Zhiqiang Pu, Xiaolin Ai, Huimu Wang Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

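Scoring rationale: The paper proposes GuidedSAC, a new reinforcement learning algorithm that uses large language models (LLMs) as intelligent supervisors to provide action-level guidance for Soft Actor-Critic (SAC), enabling efficient exploration in vast state-action spaces. Its core contribution is applying LLMs to the exploration problem in reinforcement learning, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (relevance 10). It does not address the specific techniques described by the other keywords (MoE, quantization, RAG, alignment, and so on) or specific application domains (such as bioinformatics), so those keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes GuidedSAC, a new reinforcement learning algorithm that uses large language models to provide action-level guidance for Soft Actor-Critic, achieving higher sample efficiency and final performance than standard SAC and other state-of-the-art exploration-enhanced variants in both discrete and continuous control environments.

Abstract Translation (Chinese)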

本文提出GuidedSAC，一种新颖的强化学习（Reinforcement Learning, RL）算法，旨在促进广阔状态-动作空间中的高效探索。GuidedSAC利用大语言模型（Large Language Models, LLMs）作为智能监督器，为软演员-评论家（Soft Actor-Critic, SAC）算法提供动作层面的指导。基于LLM的监督器通过状态信息和视觉回放分析最近的轨迹，提供动作层面的干预，从而实现有目标的探索。此外，我们对GuidedSAC进行了理论分析，证明其在保持SAC收敛性保证的同时，提高了收敛速度。通过在离散和连续控制环境（包括玩具文本任务和复杂的MuJoCo基准测试）中的实验，我们证明GuidedSAC在样本效率和最终性能方面均持续优于标准SAC以及最先进的探索增强变体（例如RND、ICM和E3B）。

Abstract

We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.

Keywords: GuidedSAC, Reinforcement Learning, Large Language Models, Soft Actor-Critic, Exploration, Action-Level Guidance, Sample Efficiency, Continuous Control

271. ❌ ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

Authors: Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiaowen Chu Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17435v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

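Scoring rationale: The paper's core topic is LLM inference acceleration and memory optimization. It is highly relevant to 'Large Language Models' (relevance 10) and directly concerns 'Quantization/Model Compression' (relevance 10) and 'Speculative Decoding/Inference Acceleration' (relevance 10). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper presents ZipServ, a framework that uses hardware-aware lossless compression to address memory and bandwidth bottlenecks in LLM inference, reducing model size by up to 30% while achieving up to 2.21x kernel-level speedup.

Abstract Translation (Chinese)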

无损模型压缩在缓解精确比特级大语言模型（LLM）服务中的内存与带宽瓶颈方面具有巨大潜力。然而，现有方法常因与GPU架构存在根本设计不匹配而导致显著的推理减速：在核心层面，传统熵编码器产生的变长比特流破坏了SIMT并行性；在系统层面，解耦的流水线导致了冗余的内存传输。本文提出ZipServ，一个专为高效LLM推理协同设计的无损压缩框架。ZipServ引入了张量核心感知三重位图编码（Tensor-Core-Aware Triple Bitmap Encoding, TCA-TBE），这是一种新颖的定长格式，支持恒定时间并行解码，并结合了融合解压缩-通用矩阵乘法（ZipGEMM）内核，可将权重即时解压至张量核心寄存器。这种“加载压缩数据、计算时解压”的设计消除了中间缓冲区，并最大化计算强度。实验表明，ZipServ将模型大小减少高达30%，在核心层面相比NVIDIA cuBLAS实现最高2.21倍的加速，并在端到端推理中平均比vLLM提速1.22倍。ZipServ是首个在GPU上为LLM推理同时提供存储节省与显著加速的无损压缩系统。

Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This “load-compressed, compute-decompressed” design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA’s cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.

Keywords: LLM inference, lossless compression, GPU acceleration, Tensor Core, memory efficiency, model compression, inference optimization, hardware-aware design
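The abstract contrasts variable-length entropy codes with a fixed-length, bitmap-based format but does not specify the TCA-TBE layout. The toy below is therefore not ZipServ's encoding; it is a hypothetical illustration of the general idea that a bitmap plus fixed-width fields can be lossless and decodable without parsing a variable-length bitstream. The field split, function names, and sample values are all assumptions.

```python
from collections import Counter

def encode(words):
    """Toy lossless split of 16-bit BF16 patterns into fixed-length fields.

    Each word is sign (1 bit) | exponent (8 bits) | mantissa (7 bits). A bitmap
    marks words whose exponent equals the most common one; only the other
    exponents are stored explicitly. Assumes `words` is non-empty.
    """
    exps = [(w >> 7) & 0xFF for w in words]
    base = Counter(exps).most_common(1)[0][0]
    bitmap = [1 if e == base else 0 for e in exps]
    rest = [((w >> 15) << 7) | (w & 0x7F) for w in words]  # sign + mantissa, 8 bits
    exceptions = [e for e in exps if e != base]
    return base, bitmap, rest, exceptions

def decode(base, bitmap, rest, exceptions):
    """Rebuild the original 16-bit words exactly."""
    out, k = [], 0
    for bit, r in zip(bitmap, rest):
        if bit:
            e = base
        else:
            e = exceptions[k]  # k equals the number of 0 bits seen so far
            k += 1
        out.append(((r >> 7) << 15) | (e << 7) | (r & 0x7F))
    return out

words = [0x3F80, 0x3F00, 0xBF40, 0x3FC0, 0x4000, 0x3F10]
packed = encode(words)
restored = decode(*packed)
```

Because every field has a fixed width, the position of each element is known in advance, and the index into the exception list is a prefix count over the bitmap, which is the kind of operation that maps well onto parallel hardware.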

272. ❌ Data-driven model order reduction for structures with piecewise linear nonlinearity using dynamic mode decomposition

Authors: Akira Saito, Masato Tanaka Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

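Scoring rationale: The paper studies model order reduction in engineering structural dynamics, using dynamic mode decomposition (DMD) to handle systems with piecewise-linear nonlinearity. All keywords concern large language models, deep learning techniques, or AI applications in science, none of which this paper touches; it belongs to traditional engineering computation, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a data-driven model order reduction method based on dynamic mode decomposition for predicting the dynamic response of structural systems with piecewise-linear nonlinearity, and verifies its accuracy on two engineering examples.

Abstract Translation (Chinese)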

分段线性非线性系统广泛存在于诸多工程学科中。从实践与理论角度而言，对此类系统动态行为的预测具有重要意义。本文提出一种基于动态模态分解（Dynamic Mode Decomposition, DMD）的数据驱动模型降阶方法，适用于分段线性系统。文中概述了DMD的基本概念，并阐述了其在基于伽辽金投影的非线性系统模型降阶中的应用。所提方法利用系统的脉冲响应来获取状态变量的快照数据，随后通过这些快照提取动态模态，用以构建投影基向量。将原始全阶系统运动方程所描述的动力学特性投影到由基向量张成的子空间上，从而得到一个自由度数量大幅缩减的系统。该方法应用于两个典型的分段线性系统示例：端部受弹性阻挡的悬臂梁，以及存在局部脱粘的粘接板组件。通过单独使用DMD模态，或结合DMD模态与一组经典约束模态以有效处理接触非线性，对运动方程进行伽辽金投影，构建了这些系统的降阶模型。所获得的降阶模型被用于系统在谐波载荷下的非线性受迫响应分析。结果表明，采用本方法构建的降阶模型能够产生精确的受迫响应结果。

Abstract

Piecewise-linear nonlinear systems appear in many engineering disciplines. Prediction of the dynamic behavior of such systems is of great importance from both practical and theoretical viewpoints. In this paper, a data-driven model order reduction method for piecewise-linear systems is proposed, which is based on dynamic mode decomposition (DMD). An overview of the concept of DMD is provided, and its application to model order reduction for nonlinear systems based on Galerkin projection is explained. The proposed approach uses impulse responses of the system to obtain snapshots of the state variables. The snapshots are then used to extract the dynamic modes that are used to form the projection basis vectors. The dynamics described by the equations of motion of the original full-order system are then projected onto the subspace spanned by the basis vectors. This produces a system with a much smaller number of degrees of freedom (DOFs). The proposed method is applied to two representative examples of piecewise-linear systems: a cantilevered beam subjected to an elastic stop at its end, and a bonded-plate assembly with partial debonding. The reduced order models (ROMs) of these systems are constructed by using the Galerkin projection of the equations of motion with DMD modes alone, or with DMD modes and a set of classical constraint modes to handle the contact nonlinearity efficiently. The obtained ROMs are used for the nonlinear forced response analysis of the systems under harmonic loading. It is shown that the ROMs constructed by the proposed method produce accurate forced response results.

Keywords: model order reduction, piecewise-linear nonlinearity, dynamic mode decomposition, Galerkin projection, reduced order models, forced response analysis, structural dynamics, contact nonlinearity
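As background for the snapshot-to-modes step described in the abstract, here is a minimal sketch of the standard exact DMD algorithm using NumPy. It does not reproduce the paper's impulse-response sampling, constraint modes, or treatment of the contact nonlinearity; the test system, rank, and function name are assumptions chosen so the result can be checked against a known linear map.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD: fit a rank-`rank` linear map advancing one snapshot to the next.

    snapshots: array of shape (n_states, n_times), columns ordered in time.
    Returns (eigenvalues, modes) of the reduced operator.
    """
    x, y = snapshots[:, :-1], snapshots[:, 1:]
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    u, s, vh = u[:, :rank], s[:rank], vh[:rank]
    # Project the one-step map onto the leading singular subspace of the data.
    a_tilde = u.conj().T @ y @ vh.conj().T / s
    eigvals, w = np.linalg.eig(a_tilde)
    modes = y @ vh.conj().T / s @ w
    return eigvals, modes

# Hypothetical test system: a damped rotation x_{k+1} = A x_k.
theta = 0.3
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(20):
    states.append(A @ states[-1])
eigvals, modes = dmd(np.stack(states, axis=1), rank=2)
```

For data generated by a linear map, the DMD eigenvalues coincide with the eigenvalues of that map, which gives a quick sanity check before applying the method to measured or simulated responses.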

273. ❌ Causal Representation Learning on High-Dimensional Data: Benchmarks, Reproducibility, and Evaluation Metrics

Authors: Alireza Sadeghi, Wael AbdAlmageed Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

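Scoring rationale: The paper focuses on causal representation learning (CRL), examining CRL models for high-dimensional data, benchmark datasets, evaluation metrics, and reproducibility. Its content is unrelated to all scoring keywords (which center on large models, deep learning techniques, and their applications): it does not involve large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications. The paper belongs to the causal-inference subfield of machine learning rather than large-model research.

!!! tip deepseek-chat TL;DR

The paper analyzes the limitations of existing datasets and evaluation methods in causal representation learning, proposes key characteristics that suitable datasets should have, introduces an aggregate metric that combines multiple evaluation directions into a single score per model, and assesses the reproducibility of existing implementations.

Abstract Translation (Chinese)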

因果表征学习（CRL）模型旨在将高维数据转换到潜在空间，从而能够基于潜在变量间的因果关系实施干预以生成反事实样本或修改现有数据。为促进此类模型的开发与评估，学界已提出多种合成与真实世界数据集，它们各具优势与局限。在实际应用中，CRL模型必须在多个评估方向上均表现出稳健性能，包括重构、解耦、因果发现与反事实推理，且每个方向需采用恰当的度量指标。然而，这种多方向评估可能使模型比较复杂化，因为模型可能在某个方向上表现优异而在其他方向上欠佳。该领域的另一重大挑战是可复现性：已发表结果对应的源代码必须公开，且重复实验应获得与原始报告一致的性能。本研究批判性地分析了当前文献中使用的合成与真实世界数据集，指出其局限性，并为适用于CRL模型开发的数据集提出了一组必要特征。我们还引入了一个单一聚合指标，该指标整合了所有评估方向的性能，为每个模型提供综合评分。最后，我们回顾了文献中的现有实现方案，并从可复现性角度对其进行了评估，从而指出了该领域的不足与最佳实践。

Abstract

Causal representation learning (CRL) models aim to transform high-dimensional data into a latent space, enabling interventions to generate counterfactual samples or modify existing data based on the causal relationships among latent variables. To facilitate the development and evaluation of these models, a variety of synthetic and real-world datasets have been proposed, each with distinct advantages and limitations. For practical applications, CRL models must perform robustly across multiple evaluation directions, including reconstruction, disentanglement, causal discovery, and counterfactual reasoning, using appropriate metrics for each direction. However, this multi-directional evaluation can complicate model comparison, as a model may excel in some directions while under-performing in others. Another significant challenge in this field is reproducibility: the source code corresponding to published results must be publicly available, and repeated runs should yield performance consistent with the original reports. In this study, we critically analyzed the synthetic and real-world datasets currently employed in the literature, highlighting their limitations and proposing a set of essential characteristics for suitable datasets in CRL model development. We also introduce a single aggregate metric that consolidates performance across all evaluation directions, providing a comprehensive score for each model. Finally, we reviewed existing implementations from the literature and assessed them in terms of reproducibility, identifying gaps and best practices in the field.

Keywords: Causal representation learning, High-dimensional data, Benchmarks, Reproducibility, Evaluation metrics, Counterfactual reasoning, Latent variables, Model comparison

274. ❌ Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching

Authors: Yaozhong Shi, Grigorios Lavrentiadis, Konstantinos Tsalouchidis, Zachary E. Ross, David McCallen, Caifeng Zou, Kamyar Azizzadenesheli, Domniki Asimaki Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

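Scoring rationale: The paper studies a physics-inspired generative model (GMFlow) for earthquake ground-motion synthesis. It is an application of AI to science (earthquake engineering), so it is moderately related to the keyword “AI for Science OR Bioinformatics OR Cheminformatics” (5 points). It does not involve large models, innovations in the principles of deep learning techniques, or any of the other specific technical keywords, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

To address the high computational cost of generating large-scale ground-motion time histories with physics-based simulation for earthquake hazard analysis, the paper proposes GMFlow, a physics-inspired latent operator flow matching framework. It generates spatially coherent ground motion across more than 9 million grid points in seconds, a 10,000-fold speedup over the simulation workflow, opening a path toward rapid, uncertainty-aware hazard assessment for distributed infrastructure.

Abstract translation (Chinese)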

针对电网与能源管网等空间分布式基础设施的地震危险性分析及设计，需要具备真实频率特征与时空相干性的特定场景地震动时程。然而，基于物理模拟生成不确定性量化所需的大规模时程集合计算成本高昂，难以适用于工程实践流程。为应对这一挑战，我们提出了地震动流（GMFlow）——一种受物理学启发的隐式算子流匹配框架，该框架能够依据物理参数生成真实、大规模区域性的地震动时程。通过在旧金山湾区模拟地震场景中的验证，GMFlow可在数秒内生成覆盖超过900万个网格点的时空相干地震动，相比模拟工作流程实现了10,000倍的加速，为分布式基础设施的快速且考虑不确定性的危险性评估开辟了新路径。更广泛而言，GMFlow推动了与网格无关的泛函生成建模的发展，并有望扩展至不同科学领域的大规模时空物理场合成。

摘要 (Abstract)

Earthquake hazard analysis and design of spatially distributed infrastructure, such as power grids and energy pipeline networks, require scenario-specific ground-motion time histories with realistic frequency content and spatiotemporal coherence. However, producing the large ensembles needed for uncertainty quantification with physics-based simulations is computationally intensive and impractical for engineering workflows. To address this challenge, we introduce Ground-Motion Flow (GMFlow), a physics-inspired latent operator flow matching framework that generates realistic, large-scale regional ground-motion time-histories conditioned on physical parameters. Validated on simulated earthquake scenarios in the San Francisco Bay Area, GMFlow generates spatially coherent ground motion across more than 9 million grid points in seconds, achieving a 10,000-fold speedup over the simulation workflow, which opens a path toward rapid and uncertainty-aware hazard assessment for distributed infrastructure. More broadly, GMFlow advances mesh-agnostic functional generative modeling and could potentially be extended to the synthesis of large-scale spatiotemporal physical fields in diverse scientific domains.

Keywords: ground-motion synthesis, physics-inspired generative modeling, latent operator flow matching, earthquake hazard analysis, spatiotemporal coherence, uncertainty quantification, mesh-agnostic functional generation, large-scale physical fields

275. ❌ Bootstrapping Coding Agents: The Specification Is the Program

Authors: Martin Monperrus Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
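
The table above lists relevances of 8.0, 10.0, and 10.0, yet the total is 0.0 because every weight in the table is 0.0. The arithmetic can be sketched as follows. The pass mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report header; the per-keyword rule `weight * relevance` and all function names are assumptions made for illustration, since the report does not state how a row's score is computed.

```python
# Hypothetical sketch of the report's scoring arithmetic; names are illustrative.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report header


def passing_score(total_weight: float) -> float:
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum per-keyword scores, ASSUMING score = weight * relevance, capped at 15."""
    return sum(min(MAX_PER_KEYWORD, weight * relevance) for weight, relevance in rows)


# (weight, relevance) rows for this paper: three non-zero relevances,
# every weight listed as 0.0 in the table, and 24 rows with zero relevance.
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 24

threshold = passing_score(27 * 1.0)  # 27 keywords at weight 1.0 in the header
total = paper_score(rows)
passed = total >= threshold
```

Under this assumed rule, a zero weight cancels any relevance value, which is consistent with the 0.0 total shown for this paper.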

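Scoring rationale: The paper studies the self-bootstrapping ability of AI coding agents. Its core concerns LLM agents and self-improvement (Self-Correction/Self-Improvement), so it is highly related to both keywords (10 points each). It uses Claude Code (LLM-based) as the base agent, so it has some relation to LLMs/Foundation Models (8 points). Other keywords such as MoE, SFT, RAG, and inference acceleration are not mentioned in the abstract and are unrelated to the paper's topic (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how AI coding agents can bootstrap from a specification: starting from an initial implementation produced by an existing agent, a new agent can correctly re-implement the same specification from scratch, showing that the specification, not the implementation, is the stable artifact of record.

Abstract translation (Chinese)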

编码智能体能够实现自我引导。从一份926字的规范说明和由现有智能体（Claude Code）生成的初始实现出发，一个新生成的智能体能够从零开始正确地对同一规范进行重新实现。这在人工智能编码智能体领域复现了编译器构造中经典的引导序列，并实例化了Lisp语言中已知的元循环特性。这一结果具有实际意义：规范说明而非具体实现，才是记录中稳定的产物。改进一个智能体意味着改进其规范说明；而具体实现原则上可在任何时候重新生成。

摘要 (Abstract)

A coding agent can bootstrap itself. Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch. This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp. The result carries a practical implication: the specification, not the implementation, is the stable artifact of record. Improving an agent means improving its specification; the implementation is, in principle, regenerable at any time.

Keywords: coding agents, bootstrapping, specification, AI coding, self-improvement, Claude Code, meta-circular, regenerable implementation

276. ❌ Rapid Neural Network Prediction of Linear Block Copolymer Free Energies

Authors: Ian Chen, Alfredo Alexander-Katz Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17391v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

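Scoring rationale: The paper uses feed-forward neural networks to predict free energies of linear diblock copolymer systems. It is an application of AI to science (specifically polymer science), so it is moderately related only to the keyword “AI for Science OR Bioinformatics OR Cheminformatics” (5 points). It does not involve large models, innovations in the principles of deep learning techniques, or the large-model techniques covered by the other keywords, which are all entirely unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper develops a machine learning framework that uses feed-forward neural networks to rapidly predict free energies of linear diblock copolymer systems from simulation-derived energetic descriptors, serving as an efficient substitute for computationally expensive conventional free-energy calculations.

Abstract translation (Chinese)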

自由能是决定聚合物体系相行为与热力学稳定性的基本物理量，但其精确计算通常需要大量模拟以及诸如贝内特接受比（Bennett Acceptance Ratio, BAR）等后处理技术。虽然BAR在热力学状态相近时能提供可靠估计，但在相互作用强度变化较大的情况下评估自由能，通常需要一系列中间模拟以维持足够的相空间重叠，这显著增加了计算成本。本研究开发了一种机器学习框架，用于依据模拟得出的能量描述符快速预测线性二嵌段共聚物体系的过量自由能。通过对自由连接链聚合物进行耗散粒子动力学模拟，我们构建了一个包含单链能量统计量的数据集，其中包括异质相互作用能、同质相互作用能以及键合弹簧能，并训练前馈神经网络学习这些描述符与通过分层BAR程序计算所得自由能之间的关系。所得模型能够准确复现一系列链长、组成和密度下的参考自由能，包括训练集未包含的聚合物构型。在直接使用暴力BAR估计因相空间重叠不足而不可靠的区域，神经网络的预测结果仍与参考值保持一致。这些结果表明，基于物理信息的机器学习模型可作为昂贵自由能计算的高效替代方案，并为加速聚合物体系的热力学分析提供了一种前景广阔的方法。

摘要 (Abstract)

Free energies are fundamental quantities governing phase behavior and thermodynamic stability in polymer systems, yet their accurate computation often requires extensive simulations and post-processing techniques such as the Bennett Acceptance Ratio (BAR). While BAR provides reliable estimates when applied between closely related thermodynamic states, evaluating free energies across large changes in interaction strength typically requires a sequence of intermediate simulations to maintain sufficient phase-space overlap, substantially increasing computational cost. In this work we develop a machine learning framework for rapidly predicting excess free energies of linear diblock copolymer systems from simulation-derived energetic descriptors. Using dissipative particle dynamics simulations of freely-jointed chain polymers, we construct a dataset of per-chain energetic statistics, including heterogeneous interaction energies, homogeneous interaction energies, and bonded spring energies, and train feed-forward neural networks to learn the relationship between these descriptors and free energies computed using a stratified BAR procedure. The resulting models accurately reproduce the reference free energies across a range of chain lengths, compositions, and densities, including polymer architectures held out from training. In regimes where direct, brute-force BAR estimates become unreliable due to poor phase-space overlap, the neural network predictions remain consistent with the reference values. These results demonstrate that physically informed machine learning models can serve as efficient surrogates for expensive free-energy calculations and provide a promising approach for accelerating thermodynamic analysis of polymer systems.

Keywords: free energy prediction, neural networks, linear diblock copolymers, dissipative particle dynamics, Bennett Acceptance Ratio, thermodynamic analysis, polymer systems, machine learning framework
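
For readers unfamiliar with the Bennett Acceptance Ratio mentioned in the abstract, the following is a minimal sketch of the standard BAR self-consistency equation solved by bisection. It is not the paper's stratified procedure. The function names, the convention that work values are given in units of kT, and the toy inputs are all assumptions made for illustration.

```python
import math


def fermi(x):
    """Numerically stable 1 / (1 + exp(x))."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_delta_f(w_forward, w_reverse, tol=1e-10):
    """Solve the BAR equation for the free-energy difference (units of kT).

    w_forward: U1 - U0 evaluated on samples drawn from state 0.
    w_reverse: U0 - U1 evaluated on samples drawn from state 1.
    """
    m = math.log(len(w_forward) / len(w_reverse))

    def g(df):
        # Monotonically increasing in df; its root is the BAR estimate.
        lhs = sum(fermi(m + w - df) for w in w_forward)
        rhs = sum(fermi(-m + w + df) for w in w_reverse)
        return lhs - rhs

    # Expand the bracket until it contains the root, then bisect.
    lo, hi = -1.0, 1.0
    while g(lo) > 0:
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy check: with constant work values the estimate equals that constant.
delta_f = bar_delta_f([2.0] * 5, [-2.0] * 5)
```

When the forward and reverse work distributions barely overlap, the two sums are dominated by a few rare samples, which is the poor phase-space overlap regime the abstract describes.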

277. ❌ The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

Authors: Rui Wu, Hong Xie, Yongjun Li Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

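Scoring rationale: The paper focuses on the mathematical foundations of causal inference (manifold geometry, topology) and on algorithm development (Geometry-Aware Causal Flow), applied to single-cell RNA sequencing data. None of the keywords on large-model or deep learning principles and applications are relevant; only “AI for Science OR Bioinformatics OR Cheminformatics” receives 5 points (moderately related) because of the paper's bioinformatics application (scRNA-seq).

!!! tip deepseek-chat TL;DR

The paper studies the geometric limits of causal interventions in continuous generative models, introduces the Counterfactual Event Horizon and the Manifold Tearing Theorem, and develops the Geometry-Aware Causal Flow algorithm, validated on single-cell RNA sequencing data.

Abstract translation (Chinese)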

Judea Pearl的do-演算为因果推断奠定了基础，但其向连续生成模型的转化仍面临诸多几何挑战。本文界定了此类干预的基本极限。我们定义了反事实事件视界，并证明了流形撕裂定理：确定性流在极端干预下必然产生有限时间奇点。我们建立了因果不确定性原理，以描述干预强度与身份保持之间的权衡关系。最后，我们提出了几何感知因果流——一种可扩展的算法，该算法利用拓扑雷达规避流形撕裂现象，并在高维单细胞RNA测序数据上得到验证。

摘要 (Abstract)

Judea Pearl’s do-calculus provides a foundation for causal inference, but its translation to continuous generative models remains fraught with geometric challenges. We establish the fundamental limits of such interventions. We define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem: deterministic flows inevitably develop finite-time singularities under extreme interventions. We establish the Causal Uncertainty Principle for the trade-off between intervention extremity and identity preservation. Finally, we introduce Geometry-Aware Causal Flow (GACF), a scalable algorithm that utilizes a topological radar to bypass manifold tearing, validated on high-dimensional scRNA-seq data.

Keywords: causal inference, manifold tearing, counterfactual interventions, topological limits, Geometry-Aware Causal Flow, scRNA-seq, generative models, do-calculus

278. ❌ Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models

Authors: Rui Wu, Hong Xie, Yongjun Li Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

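Scoring rationale: The paper focuses on algebraic topology, causal modeling, and the theoretical foundations of generative models, and is entirely unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, and so on). The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper is empirically validated on scRNA-seq (single-cell RNA sequencing) bioinformatics data, an application of AI to science, although that is not its core focus.

!!! tip deepseek-chat TL;DR

The paper proves that when the causal graph has non-trivial homology, locally consistent causal mechanisms cannot yield globally coherent counterfactuals. It proposes an algebraic-topological framework based on cellular sheaves over Wasserstein spaces to formalize this obstruction, and finally develops a topology-aware causal discovery method.

Abstract translation (Chinese)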

当前连续生成模型（如扩散模型、流匹配）隐含假设局部一致的因果机制自然能产生全局连贯的反事实。本文证明，当因果图呈现非平凡同调（如结构冲突或隐藏混杂因子）时，该假设从根本上失效。我们将结构因果模型形式化为Wasserstein空间上的层胞腔，为测度空间中的上同调障碍提供了严格的代数拓扑定义。为确保计算可处理性并避免确定性奇点（我们将其定义为流形撕裂），我们引入熵正则化并推导出熵正则化Wasserstein因果层拉普拉斯算子——一种新型耦合非线性福克-普朗克方程组。关键性地，我们证明了前推测度一阶变分的熵拉回引理。通过将其与Sinkhorn最优性条件的隐函数定理相结合，我们建立了与自动微分（向量-雅可比积）的直接算法桥梁，实现了严格独立于迭代步长的O(1)内存反向模式梯度计算。实证中，我们的框架成功利用热力学噪声穿越高维单细胞RNA测序反事实中的拓扑障碍（“熵隧穿效应”）。最后，我们反转该理论框架，提出拓扑因果评分，证明我们的层拉普拉斯算子可作为拓扑感知因果发现的高灵敏度代数检测器。

摘要 (Abstract)

Current continuous generative models (e.g., Diffusion Models, Flow Matching) implicitly assume that locally consistent causal mechanisms naturally yield globally coherent counterfactuals. In this paper, we prove that this assumption fails fundamentally when the causal graph exhibits non-trivial homology (e.g., structural conflicts or hidden confounders). We formalize structural causal models as cellular sheaves over Wasserstein spaces, providing a strict algebraic topological definition of cohomological obstructions in measure spaces. To ensure computational tractability and avoid deterministic singularities (which we define as manifold tearing), we introduce entropic regularization and derive the Entropic Wasserstein Causal Sheaf Laplacian, a novel system of coupled non-linear Fokker-Planck equations. Crucially, we prove an entropic pullback lemma for the first variation of pushforward measures. By integrating this with the Implicit Function Theorem (IFT) on Sinkhorn optimality conditions, we establish a direct algorithmic bridge to automatic differentiation (VJP), achieving O(1)-memory reverse-mode gradients strictly independent of the iteration horizon. Empirically, our framework successfully leverages thermodynamic noise to navigate topological barriers (“entropic tunneling”) in high-dimensional scRNA-seq counterfactuals. Finally, we invert this theoretical framework to introduce the Topological Causal Score, demonstrating that our Sheaf Laplacian acts as a highly sensitive algebraic detector for topology-aware causal discovery.

Keywords: Causal Models, Counterfactuals, Sheaf Theory, Wasserstein Spaces, Cohomological Obstructions, Entropic Regularization, Topological Causal Score, scRNA-seq
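
The abstract relies on entropic regularization and on Sinkhorn optimality conditions. As background only, here is a minimal sketch of the generic Sinkhorn iteration for entropy-regularized optimal transport between two small histograms. It does not implement the paper's sheaf Laplacian or its implicit-differentiation scheme, and the function name, the regularization strength `eps`, and the toy inputs are assumptions.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropic optimal-transport plan between histograms a and b (each sums to 1)."""
    n, m = len(a), len(b)
    # Gibbs kernel induced by the cost matrix and the regularization strength.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost)

row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(a))) for j in range(len(b))]
```

At convergence the row and column sums of `plan` match `a` and `b`. Smaller values of `eps` bring the plan closer to unregularized transport but slow convergence and can underflow in this naive form.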

279. ❌ Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning

Authors: Ziran Liu Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17365v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

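Scoring rationale: The paper studies the design of internal noise in deep neural networks (Gaussian Chaos Noise), a low-level innovation in the principles of deep learning techniques, but all scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, application frameworks, and so on). The paper does not involve LLMs, MoE, SLMs, scaling laws, training and tuning methods (pre-training, SFT, RLHF, etc.), inference acceleration, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Variational Kernel Design framework for designing internal noise in deep neural networks; the resulting Gaussian Chaos Noise improves model calibration and, under distribution shift, negative log-likelihood.

Abstract translation (Chinese)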

深度网络中的内部噪声通常源于启发式方法，如丢弃法、硬掩码或加性扰动。我们提出两个问题：内部噪声应具有何种相关几何结构？所实现的扰动是否与其作用的表征兼容？我们通过变分核设计框架来回答这些问题，该框架中噪声机制由分布族、相关核和注入算子定义，并源自学习目标。在一个已求解的空间子族中，基于隐式对数场的二次最大熵原理导出了以狄利克雷拉普拉斯算子的精度矩阵为参数的高斯优化器，因此诱导的几何结构为狄利克雷格林核。维克归一化随后产生一个规范的正均值单门控——高斯混沌噪声。针对实践中使用的样本级门控，我们严格证明了其对数比成对变形的高斯控制性、对边际敏感的排序稳定性，以及精确的期望内在粗糙度预算；而硬二值掩码则会在正相干表征上引发奇异或相干放大的畸变。在ImageNet和ImageNet-C数据集上，高斯混沌噪声在保持竞争力的准确率的同时，持续改善了校准性能，并在分布偏移下进一步提升了负对数似然指标。

摘要 (Abstract)

Internal noise in deep networks is usually inherited from heuristics such as dropout, hard masking, or additive perturbation. We ask two questions: what correlation geometry should internal noise have, and is the implemented perturbation compatible with the representations it acts on? We answer these questions through Variational Kernel Design (VKD), a framework in which a noise mechanism is specified by a law family, a correlation kernel, and an injection operator, and is derived from learning desiderata. In a solved spatial subfamily, a quadratic maximum-entropy principle over latent log-fields yields a Gaussian optimizer with precision given by the Dirichlet Laplacian, so the induced geometry is the Dirichlet Green kernel. Wick normalization then gives a canonical positive mean-one gate, Gaussian Chaos Noise (GCh). For the sample-wise gate used in practice, we prove exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget; hard binary masks instead induce singular or coherence-amplified distortions on positive coherent representations. On ImageNet and ImageNet-C, GCh consistently improves calibration and under shift also improves NLL at competitive accuracy.

Keywords: Variational Kernel Design, Internal Noise, Gaussian Chaos Noise, Deep Learning, Calibration, ImageNet, Representation Compatibility, Dirichlet Laplacian

280. ❌ WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation

Authors: Zahin Sufiyan, Shadan Golestan, Yoshihiro Mitsuka, Shotaro Miwa, Osmar Zaiane Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17301v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

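Scoring rationale: The paper studies Generative Flow Networks applied to robotic control tasks, an application of AI to science and engineering, so it is moderately related to “AI for Science” (5 points). It notes that the conventional approach relies on a pre-trained retrieval network, whereas the new method reduces this reliance through warm-up and co-training, so it is moderately related to “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points). It does not involve large language models, MoE, reasoning, alignment, compression, or the other keywords, which all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the WINFlowNets framework, which co-trains the flow network and the retrieval network to remove the reliance of conventional Generative Flow Networks on pre-training in robotic control. In simulated environments it achieves higher reward and training stability, and it shows strong adaptability in fault environments.

Abstract translation (Chinese)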

面向连续场景的生成流网络（CFlowNets）通过利用流网络和检索网络学习随机策略，在解决序列决策任务中展现出潜力。尽管与先进的强化学习（RL）算法相比，其已证明具有高效性，但在机器人控制任务中的实际应用受到对检索网络预训练的依赖所限制。这种依赖性在动态机器人环境中带来了挑战，因为预训练数据可能不易获取或无法代表当前环境。本文提出了WINFlowNets，一种新颖的CFlowNets框架，能够实现流网络与检索网络的协同训练。WINFlowNets首先通过检索网络的预热阶段来引导其策略，随后采用共享训练架构和共享经验回放缓冲区对两个网络进行协同训练。在模拟机器人环境中的实验表明，WINFlowNets在平均奖励和训练稳定性方面均超越了CFlowNets以及先进的RL算法。此外，WINFlowNets在故障环境中展现出强大的适应能力，使其适用于需要有限样本数据快速适应的任务。这些发现凸显了WINFlowNets在动态且易发生故障的机器人系统中部署的潜力，而传统的预训练或样本效率低下的数据收集方法在此类系统中可能并不适用。

摘要 (Abstract)

Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets’ potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.

Keywords: Generative Flow Networks, CFlowNets, robotic control, sequential decision-making, co-training, fault adaptation, reinforcement learning, retrieval network

281. ❌ Classifier Pooling for Modern Ordinal Classification

Authors: Noam H. Rotenberg, Andreia V. Faria, Brian Caffo Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17278v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

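Scoring rationale: The paper focuses on developing a model-agnostic method for ordinal classification data and provides an open-source software implementation. The keywords all concern large language models, principles of deep learning techniques, or specific AI application areas, whereas this paper is conventional machine learning methods research. It has only a weak connection to “AI for Science OR Bioinformatics OR Cheminformatics” (it mentions applications to clinical data); the other keywords are entirely unrelated.

!!! tip deepseek-chat TL;DR

The paper proposes a model-agnostic ordinal classification method and develops an open-source Python package. On multiple real-world datasets, the method often outperforms non-ordinal classification methods, especially in small-sample or many-class settings.

Abstract translation (Chinese)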

序数数据在临床及其他领域广泛存在，但目前仍缺乏基于现代机器学习的方法及公开可用的处理软件。本文提出一种与模型无关的序数分类方法，该方法能够以序数方式应用任何非序数分类算法。同时，我们以Python软件包的形式提供了这些算法的开源实现。我们在多个真实世界数据集上应用这些模型，以展示其跨领域的性能。结果表明，这些方法通常优于非序数分类方法，特别是在数据点数量相对较少或结果类别较多的情况下。本研究及所开发的软件有助于推动使用更强大的现代机器学习算法来处理序数数据。

摘要 (Abstract)

Ordinal data is widely prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning based methods and publicly available software to address it. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models on multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.

Keywords: ordinal classification, model-agnostic method, machine learning, clinical data, open-source software, Python package, real-world datasets, small sample performance
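
The abstract does not spell out how the binary classifiers are pooled. One well-known model-agnostic reduction, due to Frank and Hall, trains one binary classifier per threshold to estimate P(y > k) and then combines the estimates into class probabilities. The sketch below shows only that combination step and is offered as background, not as the paper's method; the function name and the example probabilities are hypothetical.

```python
def ordinal_probs(p_greater):
    """Combine threshold estimates into class probabilities.

    p_greater[k] is an estimate of P(y > k) for k = 0..K-2, for example the
    output of any binary classifier trained on the target "label > k".
    Returns a list of K class probabilities.
    """
    probs = [1.0 - p_greater[0]]
    for k in range(1, len(p_greater)):
        probs.append(p_greater[k - 1] - p_greater[k])
    probs.append(p_greater[-1])
    # Independently trained models can disagree, so clip negatives and renormalize.
    probs = [max(p, 0.0) for p in probs]
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical threshold estimates for a four-class ordinal outcome.
probs = ordinal_probs([0.9, 0.6, 0.2])
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because each threshold model sees the full dataset with a coarser binary target, this kind of reduction shares information across adjacent classes, which is one plausible reason ordinal methods help when data are scarce or classes are many.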

282. ❌ Wasserstein-type Gaussian Process Regressions for Input Measurement Uncertainty

Authors: Hengrui Luo, Xiaoye S. Li, Yang Liu, Marcus Noack, Ji Qiang, Mark D. Risser Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

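Scoring rationale: The paper studies an improved Gaussian process regression method under input measurement error and proposes a Wasserstein-distance-based kernel to handle input uncertainty. All scoring keywords concern large models, deep learning techniques, or AI applications in science, whereas this paper focuses on Gaussian process regression within traditional statistical machine learning and is unrelated to the large-model techniques, training methods, inference optimization, and AI-for-science topics that the keywords cover.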

!!! tip deepseek-chat TL;DR

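The paper addresses the problem that input measurement error makes Gaussian process regression produce overly optimistic posterior intervals and biased decisions. It proposes a deterministic projected Wasserstein ARD kernel, built on Wasserstein distances, that handles input noise more transparently and robustly.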

Abstract

Gaussian process (GP) regression is widely used for uncertainty quantification, yet the standard formulation assumes noise-free covariates. When inputs are measured with error, this errors-in-variables (EIV) setting can lead to optimistically narrow posterior intervals and biased decisions. We study GP regression under input measurement uncertainty by representing each noisy input as a probability measure and defining covariance through Wasserstein distances between these measures. Building on this perspective, we instantiate a deterministic projected Wasserstein ARD (PWA) kernel whose one-dimensional components admit closed-form expressions and whose product structure yields a scalable, positive-definite kernel on distributions. Unlike latent-input GP models, PWA-based GPs (PWAGPs) handle input noise without introducing unobserved covariates or Monte Carlo projections, making uncertainty quantification more transparent and robust.
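
To make the idea of a closed-form, product-structured kernel on distributions concrete, here is a minimal sketch. It assumes each noisy input coordinate is modeled as a one-dimensional Gaussian, for which the squared 2-Wasserstein distance has the closed form (m1 - m2)^2 + (s1 - s2)^2, and it assumes a squared-exponential form with one lengthscale per coordinate. The abstract does not give the exact PWA kernel, so the function names and this particular kernel form are illustrative assumptions, not the paper's definition.

```python
import math
from typing import Sequence, Tuple

# Hypothetical representation: one noisy coordinate = (mean, standard deviation)
Gaussian1D = Tuple[float, float]


def w2_squared_1d(a: Gaussian1D, b: Gaussian1D) -> float:
    """Squared 2-Wasserstein distance between two 1-D Gaussians (closed form)."""
    (m1, s1), (m2, s2) = a, b
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def product_wasserstein_kernel(
    x: Sequence[Gaussian1D],
    y: Sequence[Gaussian1D],
    lengthscales: Sequence[float],
) -> float:
    """Product over coordinates of exp(-W2^2 / (2 * lengthscale^2)).

    Assumed squared-exponential form with ARD-style per-coordinate
    lengthscales; not taken from the paper.
    """
    log_k = 0.0
    for a, b, ell in zip(x, y, lengthscales):
        log_k -= w2_squared_1d(a, b) / (2.0 * ell ** 2)
    return math.exp(log_k)
```

Under the Gaussian assumption each coordinate's Wasserstein distance is a Euclidean distance in (mean, standard deviation) space, so this particular form is positive definite, and for noise-free inputs (standard deviation 0) it reduces to the ordinary ARD squared-exponential kernel.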

Keywords: Gaussian process regression, input measurement uncertainty, errors-in-variables, Wasserstein distance, covariance kernel, uncertainty quantification, PWA kernel, PWAGPs

283. ❌ Variational Rectification Inference for Learning with Noisy Labels

Authors: Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu, Fan Liu, Hehe Fan, Yilong Yin Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17255v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

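Scoring rationale: The paper addresses learning with noisy labels in deep learning and proposes VRI, a loss rectification method based on variational inference and meta-learning. Its content centers on robustness in conventional deep learning training and does not involve large language models (LLMs), large-model techniques or applications, or any of the specific techniques in the scoring keywords (such as MoE, RLHF, or RAG). All keywords relate to large models, whereas this paper studies the training of generic deep models, so every keyword receives a relevance of 0.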

!!! tip deepseek-chat TL;DR

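The paper proposes variational rectification inference (VRI), which uses a hierarchical Bayesian framework and meta-learning to counter deep models overfitting to noisy labels, significantly improving generalization performance.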

Abstract

Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise for deep models, effective strategies (e.g., re-weighting, or loss rectification) have been broadly applied in prevailing approaches, which have been generally learned under the meta-learning scenario. Despite the robustness of noise achieved by the probabilistic meta-learning models, they usually suffer from model collapse that degenerates generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification for loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes by treating the rectifying vector as a latent variable, which can rectify the loss of the noisy sample with the extra randomness regularization and is, therefore, more robust to label noise. To achieve the inference of the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within the bi-level optimization programming. Besides, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.

Keywords: noisy labels, variational inference, meta-learning, loss rectification, robust learning, deep models, generalization performance, open-set noise

284. ❌ Pathology-Aware Multi-View Contrastive Learning for Patient-Independent ECG Reconstruction

Authors: Youssef Youssef, Jitin Singla Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

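Scoring rationale: The paper uses deep learning (specifically contrastive learning) for electrocardiogram (ECG) reconstruction, which places it in medical AI applications. It does not involve any large language model (LLM) techniques such as pre-training, fine-tuning, inference optimization, or agent systems. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because ECG analysis can be viewed as an application of bioinformatics or AI for science; since the paper's core is a specific medical task rather than general large-model technology, it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.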

!!! tip deepseek-chat TL;DR

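The paper proposes a pathology-aware multi-view contrastive learning framework for reconstructing the 12-lead ECG from a reduced lead set. In the patient-independent setting it reduces RMSE by about 76% relative to existing methods and shows strong cross-dataset generalization.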

Abstract

Reconstructing a 12-lead electrocardiogram (ECG) from a reduced lead set is an ill-posed inverse problem due to anatomical variability. Standard deep learning methods often ignore underlying cardiac pathology losing vital morphology in precordial leads. We propose Pathology-Aware Multi-View Contrastive Learning, a framework that regularizes the latent space through a pathological manifold. Our architecture integrates high-fidelity time-domain waveforms with pathology-aware embeddings learned via supervised contrastive alignment. By maximizing mutual information between latent representations and clinical labels, the framework learns to filter anatomical “nuisance” variables. On the PTB-XL dataset, our method achieves approx. 76% reduction in RMSE compared to state-of-the-art model in patient-independent setting. Cross-dataset evaluation on the PTB Diagnostic Database confirms superior generalization, bridging the gap between hardware portability and diagnostic-grade reconstruction.

Keywords: ECG reconstruction, contrastive learning, pathology-aware, multi-view learning, patient-independent, PTB-XL dataset, supervised contrastive alignment, anatomical variability

285. ❌ Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization

Authors: Truong-Son Hy Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

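Scoring rationale: The paper mainly studies AI applications in protein engineering and is unrelated to most of the large-model technique keywords. The only highly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics" (10 points), because the paper uses pretrained protein language models for protein representation learning, which is a bioinformatics application. "Pre-training OR Continual Pre-training OR Domain Adaptation" receives 5 points because the paper uses pretrained protein language models but does not address pre-training techniques for large language models. The remaining keywords are not covered and score 0.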

!!! tip deepseek-chat TL;DR

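The paper proposes the Q-BIOLAT framework, which uses pretrained protein language models to encode protein sequences as binary latent representations and applies combinatorial optimization to identify high-fitness protein variants, opening a new direction for quantum-assisted protein engineering.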

Abstract

We propose Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms. On the ProteinGym benchmark, we demonstrate that Q-BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high-fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher-dimensional latent spaces and local search remaining competitive in preserving realistic sequences. Beyond its empirical performance, Q-BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum-assisted protein engineering. Our implementation is publicly available at: https://github.com/HySonLab/Q-BIOLAT
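
As a rough illustration of what searching a QUBO model with simulated annealing looks like, the sketch below minimizes the energy x^T Q x over binary vectors using single-bit flips and a geometric cooling schedule. It is a generic textbook version under stated assumptions, not the Q-BIOLAT implementation: the matrix `Q`, the function names, and all hyperparameters are hypothetical, and maximizing a fitness surrogate would correspond to minimizing its negation.

```python
import math
import random
from typing import List, Sequence, Tuple


def qubo_energy(Q: Sequence[Sequence[float]], x: Sequence[int]) -> float:
    """Energy x^T Q x of a binary vector x under the QUBO matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def simulated_annealing(
    Q: Sequence[Sequence[float]],
    steps: int = 2000,
    t_start: float = 2.0,
    t_end: float = 0.01,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Minimize the QUBO energy with single-bit-flip simulated annealing."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = qubo_energy(Q, x)
    best_x, best_e = x[:], energy
    for step in range(steps):
        # Geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        new_e = qubo_energy(Q, x)
        # Metropolis rule: always accept improvements, sometimes accept worse moves
        if new_e <= energy or rng.random() < math.exp((energy - new_e) / t):
            energy = new_e
            if energy < best_e:
                best_x, best_e = x[:], energy
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e
```

Recomputing the full energy after every flip keeps the sketch short; a practical implementation would update the energy incrementally from the flipped bit's row and column of `Q`.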

Keywords: protein fitness landscapes, binary latent representations, pretrained protein language models, QUBO optimization, quantum annealing, protein engineering, combinatorial optimization, ProteinGym benchmark

286. ❌ From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs

Authors: Boyong Wu, Sanghwan Kim, Zeynep Akata Journal/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

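Scoring rationale: The paper presents a mechanistic analysis of multimodal large language models (MLLMs) on segmentation tasks, focusing on the visual processing ability and interpretability of LLMs. It is highly relevant to "Large Language Models" (10 points) because it explicitly studies MLLMs, and highly relevant to "Mechanistic Interpretability" (10 points) because it uses layerwise linear probing and attention intervention analysis to understand model mechanisms. Other keywords, such as MoE, SLMs, training methods, inference techniques, and agent systems, are not covered and score 0.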

!!! tip deepseek-chat TL;DR

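Using layerwise linear probing and attention intervention analysis, the paper shows that in segmentation tasks the visual representations of multimodal large language models first degrade and are then recovered through attention, offering insights for designing segmentation-capable models.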

Abstract

Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.

Keywords: Multimodal Large Language Models, segmentation, mechanistic analysis, attention knockout, visual representations, layerwise linear probing, spatial understanding, MLLMs

287. ❌ Intermitotic timing and motility patterns in the cell division of the diatom Seminavis robusta

Authors: Jonas Ziebarth, Thomas Fuhrmann-Lieker Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16984v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

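Scoring rationale: The paper studies the biology of diatom cell division and uses machine learning for cell detection, but its core content is traditional biology and cell behavior research, unrelated to the keywords on large models, deep learning techniques, and novel methods. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", receives 5 points (some relevance) because the work applies machine learning in science; the paper does not involve large models or deep learning innovations.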

!!! tip deepseek-chat TL;DR

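Using cell tracking and a machine-learning detection algorithm, the study examines whether the size difference between daughter cells of the diatom Seminavis robusta affects intermitotic time or cell motility. It finds no significant difference in intermitotic times, a weak coupling to the day-night cycle, and higher motility in the smaller daughter cell.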

Abstract

Many diatoms follow a size diminution - size restoration cycle in their vegetative phase, leading to daughter cells that differ in size. For the diatom Seminavis robusta, we investigated by cell tracking over several generations whether the size difference is also reflected in different intermitotic times or in the mobility of the cells. A tracking setup and machine-learning based detection algorithm were developed that revealed no significant difference in intermitotic times, a weak coupling to the day-night cycle, and a higher motility of the hypothecal, smaller daughter cell.

Keywords: diatom, cell division, intermitotic timing, cell motility, machine learning, cell tracking, Seminavis robusta, size diminution

288. ❌ Non-perturbative Bacterial Identification Directly from Solid Agar Plates Using Raman

Authors: Jeong Hee Kim, Jia Dong, Marissa Morales, Loza Tadesse Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

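Scoring rationale: The paper uses Raman spectroscopy and machine learning for bacterial identification, an application of AI in bioinformatics and science. It has some relevance only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it applies machine learning to a biomedical problem. All other keywords concern large models, deep learning techniques, or specific AI methods (such as LLMs, MoE, fine-tuning, inference optimization, and agents), none of which the paper addresses; its core is conventional spectral analysis with basic machine learning classification, so they score 0.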

!!! tip deepseek-chat TL;DR

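The study addresses the extra sample preparation that Raman spectroscopy normally requires for microbial identification. By taking Raman measurements directly from closed agar plates and combining density functional theory with machine learning analysis, it achieves non-perturbative, high-accuracy bacterial identification with classification accuracy above 97.7%.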

Abstract

Raman spectroscopy is a promising tool for microbial identification, yet its implementation in microbiology and clinical workflow is still restricted due to the accompanying additional preparation required to focus on microbial signals. Here, we demonstrate Raman-based bacterial identification directly from unopened, inverted agar plates, the same conditions used during incubation. Our approach enabled identification with single gene-level sensitivity using two Escherichia coli variants, differing only in green fluorescent protein (GFP) expression, across diverse media and substrate material conditions, despite the interrogation path traversing 3-4 mm thick background material. We integrated traditional density functional theory (DFT)-based material computation with machine learning analysis, achieving over 97.7% classification accuracy, surpassing the performance of standard measurements from opened plates by 10.8% higher mean accuracy and 0.76% less variance. We further demonstrated Raman mapping-based colony identification via Raman peaks characteristic to GFPmut3 chromophore structure generated by DFT. Our approach is robust to changes in algorithms or substrate materials and promises real-time, non-perturbative monitoring of bacterial growth, biofilm formation, and antimicrobial resistance development.

Keywords: Raman spectroscopy, bacterial identification, agar plates, machine learning, density functional theory, non-perturbative monitoring, classification accuracy, Escherichia coli

289. ❌ Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15080v3

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

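Scoring rationale: The paper's core contribution is building large-scale biomedical knowledge graphs (KGs) and an access system for LLM agents. Keyword relevance: 1) Highly relevant (10 points): "LLM Agents" (the paper develops a schema-driven MCP server for LLM agent access), "Tool Use" (the MCP server provides tool-calling functionality), and "AI for Science" (applied to the biomedical domain). 2) Moderately relevant (8 points): "Large Language Models" (GPT-4o is used as a baseline for comparison) and "Retrieval-Augmented Generation" (LLM answers are augmented with the knowledge graphs, similar to the RAG paradigm). 3) Unrelated (0 points): other keywords such as MoE, Scaling Laws, and RLHF are not covered in the paper.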

!!! tip deepseek-chat TL;DR

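The paper addresses the fragmentation of biomedical knowledge across separate databases by building three large-scale open knowledge graphs and an access system for LLM agents, reaching 98% accuracy on the BiomedQA benchmark.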

Abstract

Biomedical knowledge is fragmented across siloed databases – Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, SIDER for side effects. We present three open-source biomedical knowledge graphs – Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources) – built on Samyama, a high-performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross-KG federation: loading all three snapshots into a single graph tenant enables property-based joins across datasets. Third, we introduce schema-driven MCP server generation for LLM agent access, evaluated on a new BiomedQA benchmark (40 pharmacology questions): domain-specific MCP tools achieve 98% accuracy vs. 85% for schema-aware text-to-Cypher and 75% for standalone GPT-4o, with zero schema errors. All data sources are open-license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, with single-KG queries completing in 80-100ms and cross-KG federation joins in 1-4s.

Keywords: biomedical knowledge graphs, graph database, LLM agents, MCP server, knowledge federation, ETL pipeline, BiomedQA benchmark, pharmacology questions

290. ❌ UNICORN: Ultrasound Nakagami Imaging via Score Matching and Adaptation for Assessing Hepatic Steatosis

Authors: Kwanyoung Kim, Jaa-Yeon Lee, Youngjun Ko, GunWoo Lee, Jong Chul Ye Journal/Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

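Scoring rationale: The paper focuses on medical ultrasound imaging and proposes UNICORN, a new Nakagami parameter estimation method based on score matching and adaptation, for assessing hepatic steatosis. Its content is unrelated to the vast majority of the keywords (large models, deep learning principles, training methods, inference optimization, agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to biomedical imaging (which can be viewed as bioinformatics or scientific AI). However, the paper itself does not emphasize its AI aspect and leans toward signal processing and medical image analysis, so it receives 5 points (some relevance). The weighted total is only 5.0, far below the dynamic passing score of 26.6, indicating that the paper has little relevance to the core large-model and deep-learning topics under review.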
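
The threshold and weighted total mentioned in this rationale can be reproduced with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) and the 15-point cap come from the report's configuration; the per-keyword rule `weight × relevance` and the function names are assumptions made for illustration, not the report generator's actual code.

```python
# Minimal sketch of the report's scoring arithmetic.
# Assumption: per-keyword score = weight * relevance, capped at MAX_PER_KEYWORD.

MAX_PER_KEYWORD = 15.0   # "max score per keyword" from the report configuration
BASE = 5.0               # constant term of the passing-score formula
SLOPE = 0.8              # multiplier on the total keyword weight


def passing_score(weights):
    """Passing score = 5.0 + 0.8 * (sum of keyword weights)."""
    return BASE + SLOPE * sum(weights)


def weighted_total(weights, relevances):
    """Sum of capped per-keyword scores (hypothetical scoring rule)."""
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in zip(weights, relevances))


# 27 keywords, each with weight 1.0, as in the report configuration.
weights = [1.0] * 27
# Only the last keyword ("AI for Science ...") is rated 5.0/10 for this paper.
relevances = [0.0] * 26 + [5.0]

threshold = passing_score(weights)
total = weighted_total(weights, relevances)
print(f"passing score: {threshold:.1f}")
print(f"weighted total: {total:.1f}, passes: {total >= threshold}")
```

Under these assumptions the threshold evaluates to 26.6 and the weighted total to 5.0, so the paper does not pass.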

!!! tip deepseek-chat TL;DR

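To address the limitations of conventional ultrasound Nakagami imaging in window-size selection and estimator stability, the paper proposes UNICORN, a closed-form estimator based on the score function of the ultrasound envelope signal. It yields high-resolution, pixel-wise parameter maps and is effective for clinical assessment and visual differentiation of hepatic steatosis.

Abstract (translated)

Ultrasound imaging is an important first-line tool for assessing hepatic steatosis. Conventional B-mode ultrasound is limited in the tissue detail it can provide, whereas ultrasound Nakagami imaging promises to visualize and quantify tissue scattering from backscattered signals, with potential use in fat fraction analysis. However, existing Nakagami imaging methods struggle to select an optimal window size, and estimator instability degrades image resolution. To address these problems, we propose UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), which provides an accurate closed-form estimator of the Nakagami parameter based on the score function of the ultrasound envelope signal. Unlike methods that visualize only specific regions of interest (ROI) and estimate parameters within a fixed window size, our method provides a pixel-by-pixel estimator that gives comprehensive parameter mapping and therefore high-resolution imaging. We show that the proposed estimator effectively assesses hepatic steatosis and provides visual distinction in the backscatter statistics associated with the condition. Through extensive experiments on real envelope data from patients, we verify that UNICORN enables clinical detection of hepatic steatosis and is robust and generalizable.

Abstract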

Ultrasound imaging is an essential first-line tool for assessing hepatic steatosis. While conventional B-mode ultrasound imaging has limitations in providing detailed tissue characterization, ultrasound Nakagami imaging holds promise for visualizing and quantifying tissue scattering in backscattered signals, with potential applications in fat fraction analysis. However, existing methods for Nakagami imaging struggle with optimal window size selection and suffer from estimator instability, leading to degraded image resolution. To address these challenges, we propose a novel method called UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), which offers an accurate, closed-form estimator for Nakagami parameter estimation based on the score function of the ultrasound envelope signal. Unlike methods that visualize only specific regions of interest (ROI) and estimate parameters within fixed window sizes, our approach provides comprehensive parameter mapping by providing a pixel-by-pixel estimator, resulting in high-resolution imaging. We demonstrated that our proposed estimator effectively assesses hepatic steatosis and provides visual distinction in the backscattered statistics associated with this condition. Through extensive experiments using real envelope data from patient, we validated that UNICORN enables clinical detection of hepatic steatosis and exhibits robustness and generalizability.

Keywords: Ultrasound Nakagami Imaging, Hepatic Steatosis, Score Matching, Parameter Estimation, Backscattered Signals, High-resolution Imaging, Clinical Detection
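
For context, the conventional windowed approach that the abstract contrasts with UNICORN can be sketched with the classical moment-based (inverse normalized variance) Nakagami estimator. This is the textbook baseline, not the paper's score-matching estimator; the function names, window handling, and synthetic data below are illustrative assumptions.

```python
import random
import statistics


def nakagami_moment_estimate(envelope):
    """Classical moment-based Nakagami estimates from envelope samples.

    m     = (E[R^2])^2 / Var(R^2)   (shape parameter)
    omega = E[R^2]                  (scale / mean power)
    """
    power = [r * r for r in envelope]
    omega = statistics.fmean(power)
    var = statistics.pvariance(power, mu=omega)
    return omega * omega / var, omega


def sliding_window_m(envelope, window):
    """Estimate m in each non-overlapping window of a 1-D envelope signal."""
    return [
        nakagami_moment_estimate(envelope[i:i + window])[0]
        for i in range(0, len(envelope) - window + 1, window)
    ]


def sample_nakagami(m, omega, n, rng):
    """Draw Nakagami samples: R = sqrt(G), G ~ Gamma(shape=m, scale=omega/m)."""
    return [rng.gammavariate(m, omega / m) ** 0.5 for _ in range(n)]


rng = random.Random(0)
signal = sample_nakagami(m=2.0, omega=1.5, n=20_000, rng=rng)

m_hat, omega_hat = nakagami_moment_estimate(signal)
print(f"full signal: m ~ {m_hat:.2f}, omega ~ {omega_hat:.2f}")

# Compare how much the per-window estimates of m scatter for several
# window sizes (the window-size choice that the abstract says is difficult).
for window in (50, 500, 5000):
    ms = sliding_window_m(signal, window)
    print(f"window={window}: spread of m estimates = {statistics.pstdev(ms):.3f}")
```

The loop is meant only to show the trade-off in miniature: a small window localizes the estimate but makes it noisy, while a large window stabilizes it at the cost of spatial resolution.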

291. ❌ Rotational excitation of asymmetric-top molecular ions by electron impact: application to H$_2$O$^+$, HDO$^+$, and D$_2$O$^+$

Authors: Joshua Forer Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

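Scoring rationale: The paper studies the rotational excitation of molecular ions, which belongs to computational chemistry and molecular physics. It relies on conventional physics methods such as R-matrix scattering theory and multichannel quantum-defect theory and involves no large models, deep learning, AI techniques, or any concept named in the scoring keywords, so the relevance of every keyword is 0.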

!!! tip deepseek-chat TL;DR

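Using an electron-molecule scattering framework, the paper computes state-resolved cross sections and kinetic rate coefficients for transitions from the rotational ground state of the molecular ions H2O+, HDO+, and D2O+.

Abstract (translated)

This work theoretically studies the rotational excitation of three asymmetric-top molecular ion isotopologues, H$_2$O$^+$, HDO$^+$, and D$_2$O$^+$, using a framework that combines electron-molecule R-matrix scattering theory, multichannel quantum-defect theory, frame transformation theory, and the Coulomb-Born approximation. The latter two methods have been adapted for asymmetric-top rotors. State-resolved cross sections and kinetic rate coefficients are given for transitions from the rotational ground state of the ions. State-resolved rate coefficients for all calculated transitions ($N=0\ldots4$) are included as supplementary material and will be made publicly available through the EMAA database.

Abstract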

The rotational excitation of the three asymmetric-top molecular ion isotopologues H$_2$O$^+$, HDO$^+$, and D$_2$O$^+$ is studied theoretically using a combined framework of electron-molecule R-matrix scattering theory, multichannel quantum-defect theory, frame transformation theory, and the Coulomb-Born approximation. The latter two have been adapted here for asymmetric-top rotors. State-resolved cross sections and kinetic rate coefficients for transitions from the rotational ground state of the ions are presented. State-resolved rate coefficients for all calculated transitions $N=0\ldots4$ are included as supplementary material and will be made available through the EMAA database.

Keywords: rotational excitation, asymmetric-top molecular ions, electron impact, R-matrix scattering theory, multichannel quantum-defect theory, cross sections, rate coefficients, H2O+ HDO+ D2O+

292. ❌ Mechanistic Insights into Enhanced Alkaline Oxygen Evolution on Zn-Al Alloy Electrodes

Authors: Abdul Ahad Mamun, Rokon Uddin Mahmud, Shahin Aziz, Muhammad Shahriar Bashar, Ahmed Sharif, Muhammad Anisuzzaman Talukder Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17904v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

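Scoring rationale: The paper studies the electrochemical performance of Zn-Al alloy electrodes in alkaline water electrolysis, which belongs to materials science and electrochemistry. It does not involve large models, deep learning, artificial intelligence, or any computer-science technique. All keywords concern large models, deep learning techniques, and their applications, whereas this paper is purely experimental materials research, so the relevance of every keyword is 0.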

!!! tip deepseek-chat TL;DR

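Through experiments and theoretical calculations, the study finds that Zn-Al alloy electrodes with 10-15 wt.% Al added to Zn markedly improve electrocatalytic performance for the oxygen evolution reaction in alkaline water electrolysis, lowering the overpotential and improving reaction kinetics.

Abstract (translated)

Electrochemical water electrolysis lowers carbon emissions by producing clean energy carriers, but alkaline water splitting lacks suitable low-cost electrodes for an efficient oxygen evolution reaction (OER). To address this challenge, this study prepared Zn-Al alloy electrodes with Al contents of up to 20 wt.% by powder metallurgy and evaluated their catalytic performance through electrochemical OER measurements in alkaline solution. First-principles calculations were also used to study the thermodynamic phase stability and electronic structure of the materials. Theory and experiment both show that when ≥20 wt.% Al is incorporated, the Zn matrix exhibits thermodynamic phase instability and secondary-phase segregation in Al-rich regions, which limits reaction kinetics and lowers catalytic efficiency. Although the alloy with 5 wt.% Al shows favorable thermodynamic and electronic characteristics, it has too few reactive sites on its surface, resulting in poor electrochemical performance. In contrast, the alloys with 10 wt.% and 15 wt.% Al raise the anodic exchange current density by roughly three and two times, respectively, relative to pure Zn. In addition, the anodic overpotential losses (η₀,ₐ) measured at a current density of 12 mA cm⁻² are 0.240 V for Zn₀.₉Al₀.₁ and 0.5603 V for Zn₀.₈₅Al₀.₁₅, significantly lower than that of pure Zn (η₀,ₐ = 1.086 V). Although Zn₀.₉Al₀.₁ and Zn₀.₈₅Al₀.₁₅ show similar charge transfer resistance (RCT), Zn₀.₉Al₀.₁ exhibits better reaction kinetics and a lower η₀,ₐ among all tested samples. Further comparison shows that the improved kinetics and reduced overpotential of the Zn-Al alloys compare favorably with other transition-metal-based catalysts, including Fe-Co-Ni-Mo alloys and Fe-doped CuO.

Abstract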

Electrochemical water electrolysis, which produces clean energy carriers to mitigate carbon emissions, lacks suitable, low-cost electrodes for efficient oxygen evolution reaction (OER) in alkaline water splitting. To address this challenge, we developed Zn-Al alloy electrodes with varying Al contents up to 20 wt.% via powder metallurgy method and conducted electrochemical measurements of the OER in alkaline solution to investigate their catalytic performance. We also performed first-principles calculations to examine their thermodynamic phase stability and electronic structures. Both theoretical and experimental results indicated that incorporating $\geq 20$ wt.% Al into Zn led to thermodynamic phase instability and secondary-phase segregation in Al-rich regions, limiting reaction kinetics and reducing catalytic efficiency. Although the Al content of 5 wt.% into Zn exhibited favorable thermodynamic and electronic characteristics, but its electrochemical performance was inefficient and poor due to inadequate reaction active sites on the surface. In contrast, the 10 wt.% and 15 wt.% Al into Zn showed approximately three- and two-fold increases in anodic exchange current density relative to pure Zn, respectively. Additionally, the anodic overpotential losses ($η_{0,a}$) measured at a current density of 12 mA cm$^{-2}$ were 0.240 V for Zn$_{0.9}$Al$_{0.1}$ and 0.5603 V for Zn$_{0.85}$Al$_{0.15}$, significantly lower than that of pure Zn ($η_{0,a} = 1.086$ V). While Zn$_{0.9}$Al$_{0.1}$ and Zn$_{0.85}$Al$_{0.15}$ showed similar charge transfer resistance ($R_{\rm CT}$), Zn$_{0.9}$Al$_{0.1}$ demonstrated superior reaction kinetics and lower $η_{0,a}$ across all samples tested. Furthermore, the improved kinetics and reduced overpotential of the Zn-Al alloys favorably compare with those of other transition-metal-based catalysts, including Fe-Co-Ni-Mo alloys and Fe-doped CuO.

Keywords: Zn-Al alloy electrodes, oxygen evolution reaction, alkaline water splitting, electrochemical performance, first-principles calculations, anodic exchange current density, overpotential, reaction kinetics

293. ❌ In-phase current and temperature oscillations reduce PEM fuel cell resistivity: A modeling study

Authors: Andrei Kulikovsky Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

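Scoring rationale: The paper studies a non-isothermal impedance model of the cathode catalyst layer in a PEM fuel cell, in which in-phase current and temperature oscillations lower the resistivity. All scoring keywords concern large models, deep learning, AI techniques, and their applications, whereas this paper belongs to electrochemical engineering and physical modeling of fuel cells and has no direct connection to AI, large models, or deep learning. The relevance of every keyword is therefore 0.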

!!! tip deepseek-chat TL;DR

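The study develops a non-isothermal impedance model of the cathode catalyst layer in a PEM fuel cell and finds that in-phase current and temperature oscillations lower proton transport losses, reducing both the impedance and the static polarization resistivity.

Abstract (translated)

We have developed a non-isothermal analytical model for the impedance of the cathode catalyst layer (CCL) of a PEM fuel cell. Because proton transport losses are lowered, in-phase harmonic perturbations of the current density and temperature reduce the impedance and the static polarization resistivity of the CCL. With a special choice of the current and temperature perturbation amplitudes, these losses can be eliminated completely.

Abstract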

We have developed a non-isothermal analytical model for the impedance of the cathode catalyst layer (CCL) in a PEM fuel cell. In-phase harmonic perturbations to the current density and temperature reduce the impedance and the static polarisation resistivity of the CCL due to lowering proton transport losses. A special selection of the current and temperature perturbation amplitudes allows for complete elimination of these losses.

Keywords: PEM fuel cell, cathode catalyst layer, non-isothermal model, impedance, current density, temperature oscillations, proton transport losses, polarization resistivity

294. ❌ TENSO: Software Package for Numerically Exact Open Quantum Dynamics Based on Efficient Tree Tensor Network Decomposition of the Hierarchical Equations of Motion

Authors: Juan C. Rodriguez Betancourt, Michelle C. Anderson, Luchang Niu, Xinxian Chen, Ignacio Franco Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

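Scoring rationale: The TENSO paper focuses on developing a numerical software package for quantum dynamics simulation, using tree tensor network decomposition to solve the non-Markovian dynamics of open quantum systems. All keywords relate directly to large language models, deep learning principles, or AI applications, and the paper does not touch these areas. The only possible link is 'AI for Science', since the paper describes a scientific computing tool for computational chemistry and quantum physics, but it uses no AI methods, so it receives a minimal relevance score of 5.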

!!! tip deepseek-chat TL;DR

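The paper presents TENSO, an open-source software package that efficiently solves the hierarchical equations of motion through tree tensor network decomposition, enabling exact non-Markovian dynamics simulations of open quantum systems in complex environments.

Abstract (translated)

TENSO is a versatile and powerful open-source software package for numerically exact simulation of the dynamics of quantum systems immersed in structured thermal environments. The software is based on a tree tensor network decomposition of the hierarchical equations of motion (HEOM), which effectively curbs the curse of dimensionality that arises as environment complexity grows. As a result, TENSO enables exact non-Markovian open quantum dynamics simulations even for the complex environments typical of chemistry and quantum information science. TENSO allows time-dependent driving of the system as well as non-commuting fluctuations. More broadly, TENSO efficiently propagates the dynamics of any method whose dynamical generator can be written in sum-of-products form, including HEOM and the multi-layer multiconfigurational time-dependent Hartree method. TENSO supports simulations with tensor trees and tensor trains of arbitrary order and implements three propagation strategies for the coupled master equations: two fixed-rank methods, which need a constant memory footprint during the dynamics, and one adaptive-rank method, whose variable memory footprint is controlled by the target level of computational error. Unlike the companion theory and algorithm paper [J. Chem. Phys. 163, 104109 (2025)], this paper focuses on the practical use and applications of TENSO and introduces the underlying theoretical concepts only where needed.

Abstract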

TENSO is a versatile and powerful open-source software package for numerically exact simulations of the dynamics of quantum systems immersed in structured thermal environments. It is based on a tree tensor network decomposition of the hierarchical equations of motion (HEOM) that efficiently curbs its curse of dimensionality with bath complexity. As such, TENSO enables exact non-Markovian open quantum dynamics simulations even with complex environments typical of chemistry and quantum information science. TENSO allows for time-dependent drive in the system, and for non-commuting fluctuations. More generally, TENSO efficiently propagates the dynamics for any method with a generator of the dynamics that can be expressed in a sum-of-products form, including the HEOM and multi-layer multiconfigurational time-dependent Hartree methods. TENSO enables simulations using tensor trees and trains of arbitrary order, and implements three propagation strategies for the coupled master equations; two fixed-rank methods that require a constant memory footprint during the dynamics and one adaptive rank method with a variable memory footprint controlled by the target level of computational error. In contrast to the accompanying theory and algorithmic paper [J. Chem. Phys. 163, 104109 (2025)] the focus here is on the practical usage and applications of TENSO with underlying theoretical concepts introduced only as needed.

Keywords: open quantum dynamics, hierarchical equations of motion, tree tensor network, non-Markovian dynamics, numerical simulation, tensor decomposition, quantum systems, computational chemistry

295. ❌ Interface-dependent Phase Transitions and Ultrafast Hydrogen Superionic Diffusion of H2O Ice

Authors: Pengfei Hou, Yumiao Tian, Zifeng Liu, Junwen Duan, Hanyu Liu, Xing Meng, Russell J. Hemley, Yanming Ma Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

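Scoring rationale: The paper studies interface effects, phase transitions, and superionic diffusion of H2O ice under high pressure, combining artificial neural networks (ANNs) and active learning with molecular dynamics simulations. Its topic belongs to computational materials science and physical chemistry and is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It has some relevance (5 points) only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it uses ANNs for scientific simulation, which is an application of AI in science. However, it is not core bioinformatics or cheminformatics and contributes no innovation in large models or deep learning principles.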

!!! tip deepseek-chat TL;DR

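By combining artificial neural networks with molecular dynamics simulations, the study shows how the diamond-anvil interface significantly affects the superionic transition temperature of high-pressure H2O ice, induces phase transitions, and redefines the stability phase diagram of ice.

Abstract (translated)

High-pressure experiments with diamond anvils have revealed novel properties and phase behavior of H2O under extreme conditions. In a diamond-anvil cell, the H2O sample is usually in direct contact with the anvil surface. However, the extent to which this interface affects the measured pressure-induced properties and behavior, including the coexistence lines of ice phases, remains unknown. By combining artificial neural network methods and active learning strategies with large-scale molecular dynamics simulations, we clarify the effect of the interface on several properties of high-pressure ice phases, including superionic states, solid-solid phase transitions, and melting. The results show that the presence of the interface can significantly lower the hydrogen superionic transition temperature. Notably, the interface can also induce a spontaneous transition from body-centered cubic (bcc) based ice to face-centered cubic (fcc) based ice through the inverse Bain mechanism. In addition, we redraw the stability regions of bcc and fcc ice below the melting line and predict that fcc ice can exist at pressures far lower than previously thought. More broadly, these results underline the importance of interface effects for understanding many phenomena in experimental studies of high-pressure ice, including discrepancies between theoretical and experimental results for this fundamental system.

Abstract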

High-pressure experiments using diamond anvils have revealed novel properties and phase behavior of H2O under extreme conditions. When contained in diamond-anvil cells, the H2O samples are usually in direct contact with the diamond anvil. However, the extent to which this interface affects measured pressure-induced properties and behavior, including coexistence lines of ice phases, remains unknown. Combining artificial neural network methods and active learning schemes with large-scale molecular dynamics simulations, we elucidate the interfacial effects on various properties of high-pressure ice phases, including superionic states, solid-solid phase transitions, and melting. The results reveal that the presence of this interface can significantly lower the hydrogen superionic transition temperature. Remarkably, the interface can also induce a spontaneous transition from bcc- to fcc-based ice following the inverse Bain mechanism. Further, we redefined a stability field of bcc and fcc ice below the melting line and predicted the existence of fcc ice at much lower pressures than previously thought. More broadly, the results emphasize the importance of interface effects in understanding a wide range of phenomena reported in experimental studies of ice under pressure, including inconsistencies between theoretical and experimental results of this fundamental system.

Keywords: H2O ice, high-pressure, interface effects, phase transitions, superionic diffusion, molecular dynamics simulations, artificial neural network, diamond anvil cell

296. ❌ Free-Energy Analysis of Bubble Nucleation on Electrocatalytic Surfaces

Authors: Qingguang Xie, Paolo Malgaretti, Othmane Aouane, Simon Thiele, Jens Harting Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17486v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

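Scoring rationale: The paper presents a free-energy analysis of bubble nucleation on electrocatalytic surfaces, which belongs to physical chemistry and electrochemical engineering and is unrelated to all of the large-model and deep-learning keywords. It covers theoretical modeling, experimental validation, and engineering application, but uses no artificial intelligence, machine learning, or large-model techniques.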

!!! tip deepseek-chat TL;DR

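By building a free-energy model, the paper quantitatively predicts the activation energy and critical nucleus size of bubble nucleation on electrocatalytic surfaces, providing theoretical guidance for electrolyzer catalyst layer design.

Abstract (translated)

Bubble nucleation on catalyst surfaces is critical to electrolyzer operation. However, controlled bubble nucleation remains challenging because the underlying mechanisms are poorly understood. This paper proposes a free-energy model that quantitatively predicts the activation energy and critical nucleus size of bubbles at a given supersaturation, temperature, pressure, and surface wettability. We find that the activation energy $ΔG_{max}$ decreases as the supersaturation $ζ$ increases, following the power-law scaling $ΔG_{max} \sim ζ^{-2}$, while the critical nucleus radius $R_c$ follows $R_c\sim ζ^{-1}$. Our theoretical predictions of the critical nucleus radius of hydrogen, oxygen, and nitrogen bubbles agree quantitatively with experimental measurements. Finally, we propose a simple model that couples gas diffusion with electrochemical reaction kinetics to determine the maximum gas supersaturation at a given current density. Our results deepen the fundamental understanding of bubble nucleation on catalyst surfaces and offer practical guidance for optimizing catalyst layer design to improve electrolyzer performance.

Abstract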

Bubble nucleation at catalyst surfaces plays a critical role in the operation of electrolyzers. However, achieving controlled bubble nucleation remains challenging due to limited understanding of the underlying mechanisms. Here, we present a free-energy model that quantitatively predicts both the activation energy and critical nucleus size of bubbles at given supersaturation, temperature, pressure, and surface wettability. We find that the activation energy $ΔG_{max}$ decreases with increasing supersaturation $ζ$, following a power-law scaling of $ΔG_{max} \sim ζ^{-2}$, while the critical nucleus radius $R_c$ scales as $R_c\sim ζ^{-1}$. Our theoretical predictions for the critical nucleus radius of hydrogen, oxygen and nitrogen bubbles are in quantitative agreement with experimental measurements. Finally, we present a simple model that couples gas diffusion and electrochemical reaction kinetics to determine the maximum gas supersaturation at a given current density. Our results advance the fundamental understanding of bubble nucleation at catalyst surfaces and provide practical guidelines for catalyst layer design to improve the performance of electrolyzers.

Keywords: bubble nucleation, electrocatalytic surfaces, free-energy model, activation energy, critical nucleus size, electrolyzers, gas supersaturation, catalyst layer design
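
The two scaling laws quoted in the abstract can be illustrated with classical nucleation theory for a spherical bubble, used here as a textbook stand-in rather than the paper's free-energy model. The sketch assumes the excess pressure in the critical bubble is ζ·p₀ (Henry's law, vapor pressure neglected) and ignores surface wettability; the surface tension and pressure values are illustrative.

```python
import math

GAMMA = 0.072      # surface tension of water near room temperature, N/m (assumed)
P0 = 101_325.0     # ambient pressure, Pa (assumed)


def critical_radius(zeta, gamma=GAMMA, p0=P0):
    """Laplace balance 2*gamma/R_c = zeta*p0, so R_c scales as zeta**-1."""
    return 2.0 * gamma / (zeta * p0)


def activation_energy(zeta, gamma=GAMMA, p0=P0):
    """Homogeneous CNT barrier 16*pi*gamma^3 / (3*(zeta*p0)^2), i.e. ~ zeta**-2."""
    return 16.0 * math.pi * gamma**3 / (3.0 * (zeta * p0) ** 2)


def log_slope(f, x1, x2):
    """Slope of log f versus log x between two points (the scaling exponent)."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))


for zeta in (50.0, 100.0, 200.0):
    print(f"zeta={zeta:6.1f}  R_c={critical_radius(zeta):.3e} m  "
          f"dG_max={activation_energy(zeta):.3e} J")

print("exponent of R_c:   ", round(log_slope(critical_radius, 50.0, 200.0), 3))
print("exponent of dG_max:", round(log_slope(activation_energy, 50.0, 200.0), 3))
```

In this simplified picture, a contact-angle factor for heterogeneous nucleation would rescale the barrier without changing either exponent; how the paper's model treats wettability is not specified in the abstract.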

297. ❌ Quantum Field Approaches to Chemical Systems

Authors: Reza Karimpour, Matteo Gori, Alexandre Tkatchenko Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

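Scoring rationale: The paper is a review of quantum field theory applied to chemical systems, focusing on the theoretical development of quantum-field methods for molecular interactions, chemical reactions, and environmental effects. Its content lies entirely within theoretical chemistry and quantum physics and involves no large models, deep learning, artificial intelligence techniques, or related methods. All scoring keywords concern large-model techniques, AI applications, or related methods and have no connection to the paper's topic, so the relevance of every keyword is 0.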

!!! tip deepseek-chat TL;DR

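This review examines the application of quantum field theory to chemical systems. It addresses the limitations of conventional quantum-matter theory for large molecular systems and quantum-field effects, and shows how quantum field theory offers new insights for chemical theory and reveals new scaling laws for atomic and molecular properties.

Abstract (translated)

Quantum-matter theory, based on the Schrödinger or Dirac equation, is firmly established in the study of intra- and intermolecular interactions. However, the theory has two key problems. First, the high computational cost needed for high accuracy hinders its application to large molecular complexes. Second, fields are themselves quantum objects and can produce many interesting effects on molecular systems that lie beyond standard quantum-matter approaches. This review focuses on recent progress in quantum-field-theory methods for studying covalent and non-covalent interactions of molecules in vacuum and in environments such as cavities and solvents. Quantum field theory offers a rich platform for novel chemical theories and insights. For example, chemical reactions and van der Waals interactions can be controlled through cavities, boundaries, and optical excitations; new kinds of interaction arise when molecules interact with quantized fields; coarse-grained quantum-field formalisms may soon handle systems with millions of atoms; and unexpected scaling laws for atomic and molecular properties may emerge when quantum field theory is applied to sets of chemical systems. This review sets the stage for an exciting, quantum-field-driven path toward the further development of chemical theory.

Abstract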

Quantum-matter theory (QMT), based on the Schrödinger or Dirac equations, is firmly established for both intra- and intermolecular interactions. However, there are two key issues with QMT. First, its applicability to large molecular complexes is hindered by the relatively high computational cost of the calculations required to achieve high accuracy. Second, fields are also quantum objects that produce many intriguing effects beyond standard QMT approaches to molecular systems. This review focuses on recent developments in quantum-field theory (QFT) approaches to both covalent and non-covalent interactions for molecules in vacuum and subject to environments such as cavities and solvents. QFT provides a rich playground for novel chemical theories and insights. For example, chemical reactions and van der Waals interactions can be manipulated by cavities, boundaries, and optical excitations; novel interactions emerge when molecules interact with quantized fields; systems with millions of atoms could soon be treated with coarse-grained QFT formalisms; and unexpected scaling laws for atomic and molecular properties can emerge when QFT is applied to sets of chemical systems. This review sets the stage for an exciting QFT-driven path for further development of chemical theory.

Keywords: Quantum Field Theory, Chemical Systems, Molecular Interactions, Scaling Laws, Quantum-Matter Theory, Cavity Effects, Coarse-grained Formalisms, Chemical Reactions

298. ❌ Analysis of molecular dynamics simulation data via statistical distances between covariance matrices

Authors: Yusuke Ono, Takumi Sato, Kenji Yasuoka, Linyu Peng Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

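Scoring rationale: The paper focuses on statistical analysis of molecular dynamics simulation data, proposing a framework based on statistical distances between covariance matrices to analyze particle data distributions and differences between system states. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to deep learning and large language model technology. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper falls within computational chemistry and molecular simulation, a domain the scorer associates with AI for science (specifically chemistry and materials science). However, the paper does not explicitly use deep learning or large models and relies instead on conventional statistical analysis and dimensionality reduction, so its relevance is low and it receives 5 points (some connection).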

!!! tip deepseek-chat TL;DR

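This study proposes a framework for analyzing molecular dynamics simulation data based on statistical distances between covariance matrices. It extracts features of system dynamics through dimensionality reduction and, on Lennard-Jones particle and water systems, shows that the method links local statistical information to global physical properties (such as the diffusion coefficient) and distinguishes different phases.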

Abstract (Chinese translation)

分子动力学模拟是一种强大的工具，能够从微观原子行为中阐明材料的宏观物理性质。然而，分子动力学模拟产生的大规模、高维数据集对分析构成了重大挑战，需要高效的降维和特征提取技术。尽管目前已应用主成分分析和无监督学习等方法，但在数据效率和计算成本方面仍存在问题。在本研究中，我们提出一个统计分析框架，重点通过粒子数据的协方差矩阵（对应于分子动力学轨迹数据的二阶矩）来分析其分布。系统状态之间的差异通过计算这些协方差矩阵之间的统计距离来量化。通过对所得距离矩阵进行降维处理，我们提取出表征系统动力学的低维特征。我们使用不同温度条件下的伦纳德-琼斯粒子系统，以及独立的冰与液态水块体系统，对所提方法进行了验证。伦纳德-琼斯粒子的结果表明，通过对距离矩阵降维得到的第一主成分与扩散系数之间存在近似线性相关性。这表明，可以从局部统计信息（如协方差矩阵）有效推断全局物理性质，为分析复杂分子系统提供了一种数据高效的替代方案。此外，在独立的冰与液态水块体系统的案例中，该方法成功区分了两种物相，凸显了其在表征分子系统相变和结构差异方面的潜力。

Abstract

Molecular dynamics (MD) simulations are powerful tools for elucidating the macroscopic physical properties of materials from microscopic atomic behaviors. However, the massive, high-dimensional datasets generated by MD simulations pose a significant challenge for analysis, necessitating efficient dimensionality reduction and feature extraction techniques. While existing methods such as principal component analysis and unsupervised learning have been utilized, issues regarding data efficiency and computational cost remain. In this study, we propose a statistical analysis framework focusing on the analysis of the particle data distributions through their covariance matrices, corresponding to the second-order moments of MD trajectory data. Discrepancies between system states are quantified using statistical distances between these covariance matrices. By applying dimensionality reduction to the resulting distance matrix, we extract lower-dimensional features that characterize the systems’ dynamics. We validate the proposed method using Lennard-Jones (LJ) particle systems under different temperature conditions, as well as separate bulk systems of ice and liquid water. The results of LJ particles demonstrate an approximately linear correlation between the first principal component obtained through dimensionality reduction of the distance matrix and the diffusion coefficient. This suggests that global physical properties can be effectively inferred from local statistical information, such as covariance matrices, offering a data-efficient alternative for analyzing complex molecular systems. Furthermore, in the case of separate bulk systems of ice and liquid water, the method successfully distinguishes between the two phases, highlighting its potential for characterizing phase transitions and structural differences in molecular systems.

Keywords: Molecular dynamics simulation, Covariance matrices, Statistical distances, Dimensionality reduction, Diffusion coefficient, Phase transition, Lennard-Jones particles, Water systems
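
The pipeline outlined in the abstract above (one covariance matrix per system state, pairwise statistical distances, then dimensionality reduction of the distance matrix) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which statistical distance or which reduction method the authors use, so the log-Euclidean distance and classical multidimensional scaling here are stand-ins, and the Gaussian samples are synthetic placeholders for MD trajectory data. All function names are hypothetical.

```python
import numpy as np


def covariance_matrices(trajectories):
    """Return one covariance matrix per state.

    Each trajectory is an array shaped (n_samples, n_features).
    """
    return [np.cov(x, rowvar=False) for x in trajectories]


def log_euclidean_distance(a, b):
    """Log-Euclidean distance between two symmetric positive-definite matrices.

    Assumption: chosen for illustration, not taken from the paper.
    """
    def spd_log(m):
        w, v = np.linalg.eigh(m)
        return (v * np.log(w)) @ v.T  # V diag(log w) V^T

    return np.linalg.norm(spd_log(a) - spd_log(b), ord="fro")


def classical_mds(dist, n_components=2):
    """Embed a distance matrix with classical multidimensional scaling."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    w, v = np.linalg.eigh(gram)
    order = np.argsort(w)[::-1][:n_components]
    return v[:, order] * np.sqrt(np.clip(w[order], 0.0, None))


# Synthetic stand-in for MD data: velocity-like samples whose variance
# grows with a temperature-like parameter.
rng = np.random.default_rng(0)
temperatures = [0.5, 1.0, 1.5, 2.0]
states = [rng.normal(scale=np.sqrt(t), size=(5000, 3)) for t in temperatures]

covs = covariance_matrices(states)
n_states = len(covs)
dist = np.zeros((n_states, n_states))
for i in range(n_states):
    for j in range(i + 1, n_states):
        dist[i, j] = dist[j, i] = log_euclidean_distance(covs[i], covs[j])

# First component of the low-dimensional embedding, one value per state.
first_component = classical_mds(dist, n_components=1)[:, 0]
print(first_component)
```

On this toy data the first component orders the states by temperature, which loosely mirrors the reported correlation with the diffusion coefficient; the sign of the component is arbitrary.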

299. ❌ Ultrafast laser-driven quantum dynamics in positronium chloride

Authors: Einar Aurbakken, Håkon Emil Kristiansen, Simen Kvaal, Antoine Camper, Thomas Bondo Pedersen Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.17203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

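Scoring rationale: The paper is a computational study of laser-driven quantum dynamics, which belongs to computational physics and chemistry and is unrelated to the vast majority of the large-model and deep-learning keywords. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is an application of computational science. However, the paper does not explicitly use AI or machine learning methods and is based on a conventional quantum chemistry method (time-dependent Hartree-Fock theory), so it receives a low relevance score (5 points).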

!!! tip deepseek-chat TL;DR

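Using time-dependent Hartree-Fock theory and a spherical polar pseudospectral representation, this paper computationally studies the laser-driven quantum dynamics of positronium (Ps), PsH, and PsCl. It finds slightly enhanced electron ionization in PsCl and a positronic response faster than the electronic one, and proposes observing PsCl formation directly through photopositron spectra in the multiphoton regime.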

Abstract (Chinese translation)

我们在含时哈特里-福克理论层面，对激光驱动的正电子素（Ps）、PsH与PsCl的量子动力学进行了计算研究。为消除有限基组效应并准确描述连续态动力学，我们采用球极坐标伪谱表示方法。本文详细阐述了多组分理论及其数值实现。研究发现，在PsH中正电子的存在会延迟电子电离，而在PsCl中则观察到电子电离的轻微增强。两种体系中正电子响应均快于电子响应。我们提出，在多光子区域可通过光致正电子谱直接观测PsCl的形成：PsCl的谱峰能量预计约为Ps峰值的两倍，从而可与Ps清晰区分。然而在隧穿区域，仅当Ps含量足够低时，光致正电子再散射峰才可能被有效区分。

Abstract

We present a computational study of the laser-driven quantum dynamics of positronium (Ps), PsH, and PsCl at the time-dependent Hartree-Fock level of theory. To eliminate finite-basis effects and to properly capture continuum dynamics, we use a spherical polar pseudospectral representation. The multicomponent theory and its implementation are described in detail. We find that while the presence of the positron delays electron ionization in PsH, a slight enhancement of electron ionization is observed in PsCl. In both cases, the positronic response is faster than that of the electrons. We propose that the formation of PsCl may be directly observed through photopositron spectra in the multiphoton regime, where PsCl peaks are expected at roughly twice the energy of Ps peaks, making PsCl clearly distinguishable from Ps. In the tunneling regime, however, photopositron rescattering peaks may only be distinguishable if the amount of Ps is sufficiently low.

Keywords: quantum dynamics, positronium chloride, time-dependent Hartree-Fock, laser-driven, pseudospectral representation, electron ionization, photopositron spectra, multicomponent theory

300. ❌ Engineering strong coupling with molecular coatings in optical nanocavities

Authors: Athul S. Rema, Adrián E. Rubio López, Felipe Herrera Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

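Scoring rationale: The paper studies strong coupling between quantum emitters and molecular coatings in optical nanocavities, which belongs to quantum optics, nanophotonics, and quantum electrodynamics. Every scoring keyword concerns large models, deep learning, or related techniques, none of which the paper touches, so every keyword receives a relevance of 0.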

!!! tip deepseek-chat TL;DR

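Through theoretical calculations, this paper shows that coating silver nanoparticles with a molecular J-aggregate layer restructures the local electromagnetic vacuum, allowing quantum dot emitters to reach strong coupling and undergo Rabi oscillations at dipole-mode frequencies, which makes coherent quantum dynamics observable in optical nanocavities.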

Abstract (Chinese translation)

银纳米颗粒表面附近的量子发射器因与非辐射近场多极模式的强耦合作用，其电子布居动力学呈现拉比振荡。低频纳米颗粒偶极模式虽具有辐射性，但与量子发射器的耦合强度不足。这些特性限制了强耦合的观测。通过采用宏观量子电动力学理论，并对非马尔可夫相互作用核采用洛伦兹伪模式近似，我们证明：在球形银纳米颗粒表面包覆一层分子J聚集体薄层后，所形成的核壳结构等离激子共振能够重构偶极模式频率处的局域电磁真空，从而使原本仅发生指数式布居衰减的量子发射器能够产生拉比振荡。具体而言，我们以半径为20纳米的银纳米球近场中的量子点发射器为例，证明使用2纳米厚的J聚集体壳层即可诱导弱耦合到强耦合的转变。本研究揭示了分子聚集体在实现真空场的深亚波长结构重构方面的潜力，为在光学纳米腔中观测相干量子动力学提供了新途径。

Abstract

Quantum emitters near the surface of silver nanoparticles undergo Rabi oscillations in electronic population dynamics due to strong coupling with near-field multipole modes that are not radiative. Low-frequency nanoparticle dipole modes are radiative but do not couple strong enough to quantum emitters. These features limit the observation of strong coupling. Using macroscopic quantum electrodynamics theory within a Lorentzian pseudo-mode approximation for the non-Markovian interaction kernel, we demonstrate that by coating spherical silver nanoparticles with a thin molecular J-aggregate layer, the resulting core-shell plexciton resonance restructures the local electromagnetic vacuum at dipole-mode frequencies to enable Rabi oscillations for quantum emitters that otherwise would only undergo exponential population decay. Specifically, we show for quantum dot emitters in the near field of silver nanospheres of 20 nm radius, that weak-to-strong coupling crossovers can be induced using 2 nm J-aggregate shells. Our work demonstrates the potential of molecular aggregates to enable deep sub-wavelength structuring of the vacuum field for the observation of coherent quantum dynamics in optical nanocavities.

Keywords: strong coupling, optical nanocavities, molecular coatings, quantum emitters, Rabi oscillations, J-aggregate shells, quantum electrodynamics, vacuum field structuring

301. ❌ Extended Lagrangian molecular dynamics on vibronic surfaces in the nuclear-electronic orbital framework

Authors: Joseph A. Dickinson, Mathew Chow, Eno Paenurk, Sharon Hammes-Schiffer Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

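Scoring rationale: The paper is in computational chemistry. It proposes a new method (NEO-ELMD) for simulating proton transfer dynamics within the nuclear-electronic orbital framework and introduces density matrix extrapolation and purification to accelerate the calculations. All keywords relate directly to large language models, deep learning principles, or AI applications, whereas the paper is conventional quantum chemistry simulation methodology and involves no large models, deep learning, or modern AI techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to computational chemistry (related to cheminformatics); since it uses no AI methods, it receives 5 points (some connection). The other keywords are unrelated and score 0.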

!!! tip deepseek-chat TL;DR

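This study develops a new method (NEO-ELMD) for simulating proton transfer dynamics within the nuclear-electronic orbital framework, accelerates the calculations with density matrix extrapolation and purification, demonstrates its fidelity and efficiency on malonaldehyde, and applies it to larger benzimidazole-phenol systems.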

Abstract (Chinese translation)

质子转移是众多化学过程的核心。模拟质子转移动力学需要包含核量子效应，例如零点能、核离域与隧穿效应。本文在核-电子轨道（NEO）框架内引入相关方法，其中特定原子核与电子在同一量子力学层面上被处理，用于模拟质子转移动力学。具体而言，采用NEO密度泛函理论对转移质子进行量子力学处理，而其他原子核则在绝热电子振动基态势能面上进行经典传播。我们构建了一种NEO扩展拉格朗日分子动力学（NEO-ELMD）方法，以在此类模拟中纳入核基函数中心的运动。通过引入密度矩阵外推与纯化技术，减少每一步收敛所需的迭代次数，从而加速每个时间步的NEO自洽场计算过程。通过将其与丙二醛分子内质子转移的相关动力学方法进行比较，我们验证了NEO-ELMD方法的准确性与效率。我们还应用这些加速技术模拟了更大规模苯并咪唑-苯酚体系中质子耦合电子转移的非平衡单质子与双质子转移动力学。此项工作为未来在NEO-DFT框架内高效模拟质子转移动力学，同时纳入绝热电子振动态间非绝热效应的新方法奠定了基础。

Abstract

Proton transfer is central to many processes of chemical interest. The simulation of proton transfer dynamics requires the inclusion of nuclear quantum effects, such as zero-point energy, nuclear delocalization, and tunneling. Herein, we introduce methods within the nuclear-electronic orbital (NEO) framework, where specified nuclei are treated quantum mechanically on the same level as the electrons, for the simulation of proton transfer dynamics. Specifically, NEO density functional theory is used to treat the transferring protons quantum mechanically, and the other nuclei are propagated classically on the adiabatic vibronic ground-state surface. We formulate a NEO extended Lagrangian molecular dynamics (NEO-ELMD) approach to incorporate the motion of the nuclear basis function centers during such simulations. Density matrix extrapolation and purification are introduced as a means to accelerate the NEO self-consistent field procedure at each time step by reducing the number of iterations required for convergence. We demonstrate the fidelity and efficiency of NEO-ELMD by comparison to related dynamics methods for intramolecular proton transfer in malonaldehyde. We also use these accelerated techniques to simulate the nonequilibrium single and double proton transfer dynamics of proton-coupled electron transfer in much larger benzimidazole-phenol systems. This work provides a foundation for future methodologies to efficiently simulate proton transfer dynamics within the NEO-DFT framework while incorporating nonadiabatic effects between adiabatic vibronic states.

Keywords: proton transfer, nuclear-electronic orbital, molecular dynamics, density functional theory, nonadiabatic effects, quantum mechanics, simulation acceleration, chemical dynamics
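
The abstract above mentions density matrix purification as a way to cut self-consistent field iterations but does not say which scheme is used. As a hedged illustration of the general idea only, the sketch below applies McWeeny purification, a standard textbook iteration P ← 3P² − 2P³, to a slightly perturbed projector in an orthonormal basis. The choice of McWeeny's scheme, the orthonormal basis, and every name in the code are assumptions for illustration; nothing here is taken from the NEO-ELMD implementation.

```python
import numpy as np


def mcweeny_purify(p, max_iter=50, tol=1e-10):
    """Drive a nearly idempotent symmetric matrix toward P @ P == P.

    Assumes an orthonormal basis, so idempotency is simply P^2 = P.
    """
    for _ in range(max_iter):
        p2 = p @ p
        if np.linalg.norm(p2 - p) < tol:
            break
        p = 3.0 * p2 - 2.0 * p2 @ p  # McWeeny step: 3P^2 - 2P^3
    return p


# Build an exact rank-2 projector from two orthonormal "occupied" vectors.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
occupied = q[:, :2]
p_exact = occupied @ occupied.T

# Perturb it symmetrically, as an extrapolated guess might be.
noise = rng.normal(scale=1e-2, size=(6, 6))
p_guess = p_exact + 0.5 * (noise + noise.T)

p_pure = mcweeny_purify(p_guess)
print(np.linalg.norm(p_pure @ p_pure - p_pure), np.trace(p_pure))
```

The iteration converges when the eigenvalues of the starting guess lie close enough to 0 or 1, which is why a purification step pairs naturally with an extrapolated guess from previous time steps.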

302. ❌ Comment on “Efficient implementation of the superposition of atomic potentials initial guess for electronic structure calculations in Gaussian basis sets”

Authors: Kshitijkumar A. Surjuse, Zhihao Deng, Andrey Asadchev, Edward F. Valeev Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

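Scoring rationale: The paper concerns a technical improvement to the initial guess for electronic structure calculations in Gaussian basis sets, which belongs to computational chemistry. It is unrelated to the vast majority of the large-model and deep-learning keywords, which concern artificial intelligence, machine learning, and large language models. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because computational chemistry is adjacent to cheminformatics and AI for science. However, the paper uses conventional numerical methods rather than AI, so it receives 5 points (some connection).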

!!! tip deepseek-chat TL;DR

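This comment shows that the Gaussian atomic-orbital representation of the superposition of atomic potentials (SAP) initial guess can be evaluated through a nearly trivial modification of one-electron nuclear attraction integrals, rather than through the two-electron integrals used in the original proposal.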

Abstract (Chinese translation)

在《化学物理杂志》第152卷第144105页（2020年）中，Lehtola等人提出了原子势叠加（Superposition of Atomic Potentials，简称SAP）的高斯基函数高效表示方法，该方法“可通过双电子积分在任何高斯基组的量子化学代码中轻松实现”。本文中，我们证明通过对单电子核吸引积分进行近乎微小的修改，即可实现SAP的高斯原子轨道表示计算。

Abstract

In J. Chem. Phys. 152, 144105 (2020) Lehtola et al introduced the efficient Gaussian-basis representation of Superposition of Atomic Potentials (SAP) which “can be easily implemented in any Gaussian-basis quantum chemistry code in terms of two-electron integrals”. Here we demonstrate that it is possible to evaluate Gaussian AO representation of SAP by nearly trivial modification of one-electron nuclear attraction integrals.

Keywords: Superposition of Atomic Potentials, Gaussian basis sets, electronic structure calculations, one-electron nuclear attraction integrals, quantum chemistry, initial guess, SAP, computational efficiency

Token usage statistics

Total: 929,415 tokens (input 619,128 / output 310,287)

Model	Input	Output	Total
deepseek-chat	537,910	298,242	836,152
glm-4.7	81,218	12,045	93,263
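
As a quick consistency check on the table above, the per-model and overall totals can be recomputed from the input and output counts. This is a minimal sketch; the dictionary layout and variable names are illustrative and not part of the report generator.

```python
# Token counts copied from the table above.
usage = {
    "deepseek-chat": {"input": 537_910, "output": 298_242},
    "glm-4.7": {"input": 81_218, "output": 12_045},
}

# Per-model totals (the "Total" column).
per_model_total = {
    model: counts["input"] + counts["output"] for model, counts in usage.items()
}

# Overall totals (the "Total" line above the table).
total_input = sum(counts["input"] for counts in usage.values())
total_output = sum(counts["output"] for counts in usage.values())
grand_total = total_input + total_output

print(per_model_total, total_input, total_output, grand_total)
```

The recomputed values agree with the figures the report states.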

📊 ArXiv 研究报告 (2026-03-20)

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

评分设置

📈 论文统计

⭐ 及格论文详细分析

1. CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

CODMAS：用于结构化RTL优化的辩证多智能体协作框架

2. Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

Sensi：一次学一件事——基于课程的大模型游戏智能体测试时学习

3. Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarit

4. EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

5. Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

6. Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text

构建即匿名：一种用于隐私保护文本的LLM驱动框架

7. Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

通过领域基础分层检索缓解大模型幻觉

8. Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

9. On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

📋 所有论文列表

1. ✅ CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

2. ✅ Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

3. ✅ Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search

4. ✅ EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

5. ✅ Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

6. ✅ Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text

7. ✅ Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

8. ✅ Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

9. ✅ On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

10. ❌ UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

11. ❌ Ruyi2.5 Technical Report

12. ❌ Text-to-Stage: Spatial Layouts from Long-form Narratives

13. ❌ Discovering Decoupled Functional Modules in Large Language Models

14. ❌ Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

15. ❌ Topology-Guided Biomechanical Profiling: A White-Box Framework for Opportunistic Screening of Spinal Instability on Routine CT

16. ❌ Evaluating Ill-Defined Tasks in Large Language Models

17. ❌ AI-Assisted Goal Setting Improves Goal Progress Through Social Accountability

18. ❌ A Contextual Help Browser Extension to Assist Digital Illiterate Internet Users

19. ❌ The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

20. ❌ AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

21. ❌ Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

22. ❌ Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

23. ❌ Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

24. ❌ Specification-Aware Distribution Shaping for Robotics Foundation Models

25. ❌ TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

26. ❌ VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

27. ❌ IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

28. ❌ CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

29. ❌ Differential Privacy in Generative AI Agents: Analysis and Optimal Tradeoffs

30. ❌ scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

31. ❌ RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

32. ❌ Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

33. ❌ Procedural Generation of Algorithm Discovery Tasks in Machine Learning

34. ❌ How do LLMs Compute Verbal Confidence

35. ❌ Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

36. ❌ RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy

37. ❌ CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

38. ❌ FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

39. ❌ ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

40. ❌ Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference

41. ❌ Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

42. ❌ RangeAD: Fast On-Model Anomaly Detection

43. ❌ Governed Memory: A Production Architecture for Multi-Agent Workflows

44. ❌ A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks

45. ❌ Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory

46. ❌ CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

47. ❌ Attention Sinks Induce Gradient Sinks

48. ❌ Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

49. ❌ Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation

50. ❌ SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

51. ❌ Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)

52. ❌ From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving

53. ❌ MALLES: A Multi-agent LLMs-based Economic Sandbox with Consumer Preference Alignment

54. ❌ Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization

55. ❌ Objective Mispricing Detection for Shortlisting Undervalued Football Players via Market Dynamics and News Signals

56. ❌ WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models

57. ❌ Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models

58. ❌ Inhibitory normalization of error signals improves learning in neural circuits

59. ❌ Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

60. ❌ FINER: MLLMs Hallucinate under Fine-grained Negative Queries